Question about code performance: I’m trying to run ~25 regex rules against a ~20 GB text file. The script should write matches to text files, with each regex rule generating its own file. See the pseudocode below:
regex_rules=~/Documents/rulesfiles/regexrulefile.txt
for tmp in *.unique20gbfile.suffix; do
    # Each $line in the looped-through file contains a regex rule, e.g.,
    #   egrep -i '(^| )justin ?bieber|(^| )selena ?gomez'
    # $rname is a unique rule name generated by a separate bash function
    # exported to the current shell.
    while IFS= read -r line; do
        # Build the full command, then run it in the background via eval.
        cmd="$line $tmp > ~/outputdir/$tmp.$rname.filter.piped &"
        eval "$cmd"
    done < "$regex_rules"
done
Couple thoughts:
- Is there a way to loop the text file just once, evaluating all rules and splitting to individual files in one go? Would this be faster?
- Is there a different tool I should be using for this job?
Thanks.
This is the reason grep has a -f option. Reduce your regexrulefile.txt to just the regexps, one per line, and run grep once over the big file with -f pointing at that list. This produces all the matches in a single output stream, but you can do your loop thing on it afterward to separate them out. Assuming the combined list of matches isn’t huge, this will be a performance win.
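As a rough sketch of that two-pass idea (not a drop-in replacement for your script): the file names, the sample data, and the rule1/rule2 output names below are all made up for illustration, and I'm assuming every rule wants egrep -i semantics as in your example.

```shell
# Hypothetical demo data; in practice these would be your real
# pattern list and the 20 GB input file.
workdir=$(mktemp -d)
printf '%s\n' '(^| )justin ?bieber' '(^| )selena ?gomez' > "$workdir/patterns.txt"
printf '%s\n' 'Justin Bieber tour dates' 'weather today' 'selenagomez new album' > "$workdir/big.txt"

# Pass 1: scan the big file once against every pattern at the same time.
grep -E -i -f "$workdir/patterns.txt" "$workdir/big.txt" > "$workdir/combined.txt"

# Pass 2: split the (hopefully much smaller) combined output per rule.
# rule$n stands in for whatever produces $rname in the original script.
n=0
while IFS= read -r pattern; do
    n=$((n + 1))
    grep -E -i -e "$pattern" "$workdir/combined.txt" > "$workdir/rule$n.txt"
done < "$workdir/patterns.txt"
```

A line that matches more than one rule ends up in each of those rules' files, which is the same behavior you get from running the rules separately. The second pass still runs one grep per rule, but only over the combined matches rather than the full 20 GB.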