I am trying to execute a command like this:
find ./ -name "*.gz" -print -exec ./extract.sh {} \;
The gz files themselves are small. Currently my extract.sh contains the following:
# Start delimiter
echo "#####" "$1" >> Info
zcat "$1" > temp
# Series of greps to extract some useful information
grep -o -P "..." temp >> Info
grep -o -P "..." temp >> Info
rm temp
echo "####" >> Info
Obviously, this is not parallelizable because if I run multiple extract.sh instances, they all write to the same file. What is a smart way of doing this?
I have 80K gz files on a machine with plenty of horsepower: 32 cores.
I would create a temporary directory, then have each grep write to its own output file (named after the file it processed). On systems where /tmp is a RAM disk (tmpfs), files created under /tmp will not thrash your hard drive with lots of writes. You can then either cat it all together at the end, or have each grep signal another process when it has finished, so that process can begin catting the files immediately (and removing them when done).
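To make the first variant (write per-file outputs, cat them together at the end) concrete, here is a rough sketch. It assumes GNU find, xargs, sort and grep are available. The function name parallel_extract and the .tmp/.info suffixes are made up for illustration, and "..." is still the placeholder for the patterns left out of the question.

```shell
#!/bin/sh
# parallel_extract ROOT OUTFILE [JOBS]
# Hypothetical helper; assumes GNU find, xargs, sort and grep.
parallel_extract() {
    root=$1
    outfile=$2
    jobs=${3:-32}

    # Scratch directory holding one result file per archive.
    out_dir=$(mktemp -d) || return 1

    # Each worker handles one archive and writes only to its own files,
    # so no two processes ever share an output file.
    find "$root" -name '*.gz' -print0 |
        xargs -0 -r -n 1 -P "$jobs" sh -c '
            out_dir=$1
            src=$2
            # Flatten the path into a file name so archives with the same
            # base name in different directories do not collide.
            name=$(printf "%s" "$src" | tr "/" "_")
            tmp="$out_dir/$name.tmp"
            zcat "$src" > "$tmp"
            {
                echo "##### $src"
                # "..." stands for the patterns elided in the question.
                grep -o -P "..." "$tmp"
                grep -o -P "..." "$tmp"
                echo "####"
            } > "$out_dir/$name.info"
            rm "$tmp"
        ' sh "$out_dir"

    # Concatenate the per-file results in a stable order, then clean up.
    find "$out_dir" -name '*.info' -print0 | sort -z | xargs -0 -r cat >> "$outfile"
    rm -r "$out_dir"
}
```

Something like `parallel_extract . Info 32` would then stand in for the original find command. The final concatenation goes through find and xargs rather than a shell glob because 80K file names may not fit on a single command line.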
Example:
extract.sh