I have millions of files in a nested folder structure. I need to scan those files for a value (say LINE_TXT) and print the lines containing it. Earlier I processed each file with sed, but that took 45 minutes. My earlier solution was something like this:
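FILES=$(find "$1" -type f -name 'filename.txt')
for f in $FILES
do
  # Read each file line by line and print the lines containing LINE_TXT
  while read -r LINE
  do
    if [[ "$LINE" == *LINE_TXT* ]]; then
      echo "$LINE"
    fi
  done < "$f"
done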
I figured out that a pipemill is the best way to achieve this. My current solution is something like this:
mkfifo mypipe
# Stream the contents of every matching file into the named pipe in the background
find "$1" -type f -name 'filename.txt' | xargs cat > mypipe &
while read -r LINE
do
  if [[ "$LINE" == *LINE_TXT* ]]; then
    echo "$LINE"
  fi
done < mypipe
The run time is now around 1 minute. Can I improve on this further?
Seems to me that less script overhead would make things faster.
Just let grep do its own recursion through your directories with -r. If you don't want its output to include the filename, include the -h option. You can pipe its output through whatever you need for post-processing.
If you want to search only for specific filenames, grep's -r option has options of its own: --include and --exclude, mentioned on its man page.
While the find command is excellent, and invaluable in certain situations, if you can use options built in to a single tool like grep, you will incur less overhead. The find command doesn't look inside files, so it would still have to launch grep for each one of them.
Using find does have the benefit of giving you access to find's directory searching capabilities, but if all you want to do is look for a particularly named file in your directory tree, grep's -r --include is probably sufficient and is sure to run faster.
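The commands that originally accompanied this answer are not preserved here, so the following is only a sketch of what the two approaches might look like, not the answerer's exact commands. It assumes GNU grep (for --include), reuses filename.txt and LINE_TXT from the question, and wraps each variant in a hypothetical helper function (search_with_grep, search_with_find) that takes the top-level directory as its argument. The -F flag is an addition of mine so that LINE_TXT is matched as a literal string rather than a regular expression.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; 'filename.txt' and 'LINE_TXT' come from the question.

# Variant 1: grep walks the directory tree itself.
#   -r         recurse into subdirectories
#   -h         do not prefix matches with the file name
#   -F         treat the pattern as a fixed string (assumption: no regex needed)
#   --include  only read files whose name matches the glob (GNU grep)
search_with_grep() {
    grep -rhF --include='filename.txt' 'LINE_TXT' "$1"
}

# Variant 2: find selects the files and hands them to grep.
#   -exec ... {} +  passes many paths to each grep invocation,
#                   rather than starting one grep per file
search_with_find() {
    find "$1" -type f -name 'filename.txt' -exec grep -hF 'LINE_TXT' {} +
}
```

Either function can be called as, for instance, search_with_grep "$1" from a script and piped into further post-processing. Ending the -exec clause with + rather than \; is a design choice to keep the number of grep processes small; with \; find would start a separate grep for every file.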