I have this nifty little script that does a nice job of manipulating some data files for me…first it strips out unwanted data after the first semicolon, then it changes the data into a Unicode string, then removes any newline chars, and finally shuffles it into two mixed files (a and b) that I need to use.
It works beautifully with small files, but I’m now dealing with a file so large that sed appears to hang. Or perhaps that’s what’s happening…I don’t know exactly. Can anyone suggest how to (maybe?) buffer this or otherwise prevent it from hanging? I’ve got 16 GB of RAM and the file is…1707772 (k? I’m “ls -la”ing)…is that too large? I’m seeing 100% CPU usage that never goes away…only killing the process makes the window usable again.
Here’s the code:
#!/bin/bash
a="a";
b="b";
echo "Input Filename:";
read ifilename;
echo "Output Filename:";
read ofilename;
awk '{
#dbg print "$0=" $0
sub(/;.*$/, "")
len=length($0)
if (len == 4) {print "&#x0" $0 ";"}
else if (len == 5) {print "&#x" $0 ";"}
else {print "error in input: found len=" len " in XX" $0 "xx"}
}' "/home/myhome/$ifilename" > temp.txt;
tr -d "\n" < temp.txt > temp_nolfs.txt;
sed -r 's/(.[^;]*;)/ \1 /g' temp_nolfs.txt | tr " " "\n" | shuf | tr -d "\n" > "$ofilename$a.txt";
sed -r 's/(.[^;]*;)/ \1 /g' temp_nolfs.txt | tr " " "\n" | shuf | tr -d "\n" > "$ofilename$b.txt";
rm temp.txt;
rm temp_nolfs.txt;
echo "Done!";
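For comparison, the temp files and the sed re-splitting step look redundant: awk already emits one entity per line, so those lines can probably be shuffled directly and joined afterwards. Here is a minimal sketch, assuming GNU shuf and assuming the 4-character branch is meant to emit a zero-padded "&#x0" prefix. The function name encode_and_shuffle and its two arguments are made up for illustration, and sending the error lines to stderr instead of into the output is a deliberate change from the script above.

```shell
#!/bin/bash
# Sketch only: encode_and_shuffle is a hypothetical helper, not part of the original script.
# $1 = input file with lines like "HEX;anything", $2 = output file.
encode_and_shuffle() {
    local infile=$1 outfile=$2
    awk '{
        sub(/;.*$/, "")                 # drop everything from the first semicolon on
        len = length($0)
        if (len == 4)      print "&#x0" $0 ";"   # assumed zero padding for 4-char codes
        else if (len == 5) print "&#x" $0 ";"
        else print "error in input: found len=" len " in XX" $0 "xx" > "/dev/stderr"
    }' "$infile" | shuf | tr -d '\n' > "$outfile"   # shuffle lines, then join them
}
```

Calling it twice, once for the "a" file and once for the "b" file, should give two independently shuffled outputs without ever building one giant line for sed to scan.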
Thanks for any and all suggestions!
Many thanks for the helpful suggestions; however, the issue wasn’t sed at all…I had been feeding it data with NO semicolons, which meant it was looking forever for something that didn’t exist. It worked fine, redundancies notwithstanding, once I fed it properly structured data.
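In case it helps anyone else, a quick check up front would have caught this before the pipeline ever started. Here is a minimal sketch, assuming every valid input line must contain at least one semicolon; check_input is a hypothetical helper name, and the per-line rule is stricter than what actually went wrong in my case (a file with no semicolons at all).

```shell
#!/bin/bash
# Sketch only: check_input is a hypothetical helper name.
# Returns 0 if every line of $1 contains a semicolon, non-zero otherwise.
check_input() {
    local bad
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    # grep -v selects lines WITHOUT a semicolon, -c counts them.
    # grep exits 1 when nothing is selected, hence the "|| true".
    bad=$(grep -vc ';' -- "$1" || true)
    if [ "$bad" -gt 0 ]; then
        echo "$bad line(s) in $1 have no semicolon" >&2
        return 1
    fi
}
```

Something like `check_input "/home/myhome/$ifilename" || exit 1` placed before the awk step would stop the script with a message instead of leaving sed spinning at 100% CPU.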