I’ve got about 400’000 files that need some text to be replaced.
I tried the following Perl script:
@files = <*.html>;
foreach $file (@files) {
`perl -0777 -i -pe 's{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;' $file`;
`perl -0777 -i -pe 's{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;' $file`;
`perl -0777 -i -pe 's{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;' $file`;
`perl -p -i -e 's/.css.html/.css/g;' $file`;
}
I don’t have deep Perl knowledge, but the script runs far too slowly (it updates only about 180 files per day).
Is there a way to speed it up?
Thank you in advance!
PS: When I tested it on a smaller number of files, I noticed much better performance…
First off, if you load 400,000 file names into memory, that’s going to use up a fair amount of memory. You can just as easily iterate through the files one at a time, for example with:
- File::Find
- opendir + while (readdir($dh)) (does not load the entire list)
Second, using backticks spawns a new process through the shell every time, which is very inefficient. Your loop does that four times per file. You could just open the files normally, slurp them, and then reprint to the same file name.
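Here is a minimal sketch of how the two suggestions above (iterating with opendir/readdir, and slurping each file inside a single Perl process) might fit together. The sub names clean_html and process_dir are made up for illustration. The regexes are copied from the question, except that the dots in ".css.html" are escaped on the assumption that literal dots were intended. Treat it as a starting point rather than tested production code, and try it on a copy of the files first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply all four substitutions to one document held in memory.
# Patterns are taken from the question; only the last one is changed
# (dots escaped, assuming literal dots were meant).
sub clean_html {
    my ($html) = @_;
    $html =~ s{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;
    $html =~ s{<div[^>]+?class="generic"[^>]*>[^\s]*<small>[^\s]*Author.*?</div>.*?</div>.*?</div>.*?</div>.*?</div>}{}gsmi;
    $html =~ s{<script[^>]+?src="javascript.*?"[^>]*>.*?</script>}{}gsmi;
    $html =~ s/\.css\.html/.css/g;
    return $html;
}

# Walk a directory one entry at a time instead of building a
# 400,000-element list, and rewrite each .html file in place.
sub process_dir {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    while (defined(my $name = readdir($dh))) {
        next unless $name =~ /\.html\z/;
        my $path = "$dir/$name";
        next unless -f $path;

        # Slurp the whole file into a scalar.
        open(my $in, '<', $path) or die "Cannot read $path: $!";
        my $html = do { local $/; <$in> };
        close($in);
        $html = '' unless defined $html;

        my $cleaned = clean_html($html);
        next if $cleaned eq $html;    # nothing changed, skip the write

        # Overwrite the same file name with the cleaned text.
        open(my $out, '>', $path) or die "Cannot write $path: $!";
        print {$out} $cleaned;
        close($out) or die "Cannot close $path: $!";
    }
    closedir($dh);
}

# Run only when a directory is given on the command line.
process_dir($ARGV[0]) if @ARGV;
```

Saved as, say, clean.pl (a hypothetical name), it would be run as: perl clean.pl /path/to/html/files. Each file is read and written at most once, and no extra processes are started, which is where most of the time in the original loop is likely going.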
Lastly, using regexes to edit HTML is not ideal. It might work in your case, but it could be worthwhile to invest some time in learning an HTML parser. I’m not sure how suitable one would be for this particular task, but it is worth looking into to make your code more robust.