I have a 5MB file of specific strings, and I need to grep those same strings (along with the other information on the matching lines) out of a big 27GB file.
To speed up the analysis I split the 27GB file into 1GB files and then applied the following script (written with the help of some people here). However, it is not very efficient: producing a 180KB output file takes 30 hours!
Here’s the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?
#!/bin/bash
NR_CPUS=4
count=0
for z in `echo {a..z}` ;
do
  for x in `echo {a..z}` ;
  do
    for y in `echo {a..z}` ;
    do
      for ids in $(cat input.sam|awk '{print $1}');
      do
        grep $ids sample_"$z""$x""$y"|awk '{print $1" "$10" "$11}' >> output.txt &
        let count+=1
        [[ $((count%NR_CPUS)) -eq 0 ]] && wait
      done
    done #&
  done
done
A few things you can try:
1) You are reading input.sam multiple times. It only needs to be read once before your first loop starts. Save the ids to a temporary file which will be read by grep.
2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.
3) Use fgrep because you’re searching for a fixed string, not a regular expression.
4) Use -f to make grep read patterns from a file, rather than using a loop.
5) Don’t write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.
After making those changes, this is what your script would become:
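A minimal sketch of those changes, not a tested drop-in replacement. It assumes the ids are the first whitespace-separated field of input.sam and that the chunks are named sample_aaa through sample_zzz as in the question. The names ids.txt and filter_chunks are hypothetical, chosen for illustration, and grep -F is used as the equivalent of fgrep.

```shell
#!/bin/bash
# Sketch only: ids.txt and filter_chunks are illustrative names.

# Print columns 1, 10 and 11 of every line in the given chunk files
# that contains one of the fixed strings listed in the ids file.
filter_chunks() {
    local ids_file=$1
    shift
    # -F: fixed strings, -f: read patterns from a file,
    # -h: do not prefix matches with the file name.
    LC_ALL=C grep -h -F -f "$ids_file" -- "$@" | awk '{print $1" "$10" "$11}'
}

# Run only when input.sam is present in the current directory.
if [ -f input.sam ]; then
    # Read input.sam once; NF skips blank lines, because an empty
    # pattern would make grep match every line.
    awk 'NF {print $1}' input.sam | sort -u > ids.txt
    # A single writer to output.txt, so lines cannot interleave.
    filter_chunks ids.txt sample_??? > output.txt
fi
```

One behavioural difference to keep in mind: the original loop printed a line once for every id it matched, whereas grep -f prints each matching line once. As in the original, an id is matched anywhere on the line, not only in the first column.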
Also, check out GNU Parallel which is designed to help you run jobs in parallel.
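If a single grep process turns out to be the bottleneck, the grep -F -f approach can be spread across the chunks with GNU Parallel. This is a rough sketch, assuming GNU Parallel (not the moreutils tool of the same name) is installed and that the file paths contain no spaces. The names ids.txt and parallel_filter are hypothetical. -j sets the number of simultaneous jobs and -k keeps the output in the order of the input files. GNU Parallel groups each job's output by default, so lines from different chunks are not interleaved.

```shell
#!/bin/bash
# Sketch only: ids.txt and parallel_filter are illustrative names.
NR_CPUS=4

# Run one fixed-string grep per chunk file, NR_CPUS at a time, and
# print columns 1, 10 and 11 of the matching lines.
parallel_filter() {
    local ids_file=$1
    shift
    # {} is replaced by one chunk file name per job.
    parallel -k -j "$NR_CPUS" LC_ALL=C grep -h -F -f "$ids_file" -- {} ::: "$@" |
        awk '{print $1" "$10" "$11}'
}

# Run only when input.sam is present in the current directory.
if [ -f input.sam ]; then
    awk 'NF {print $1}' input.sam | sort -u > ids.txt
    parallel_filter ids.txt sample_??? > output.txt
fi
```

Whether this beats a single grep depends on whether the job is limited by CPU or by disk reads, so it is worth timing both on one or two chunks first.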