I have an application that fires several processes. Each process loads an HTML file and tries to find whether a pattern appears in it, something like this:
OUTER:
while (my ($prov, $arr_ref) = each(%{$self->{TAGS}})) {
    foreach my $tag (@{$arr_ref}) {
        if ($html =~ m/\Q$tag\E/i) {
            $provider = $prov;
            last OUTER;
        }
    }
}
Each key in $self->{TAGS} is a pattern name, and each value is a reference to an array of strings (scalars).
I was profiling the program, and found that this part:
$html =~ m/\Q$tag\E/i
makes my CPU jump to 100%. If I remove it, it barely gets to 10%.
I have only one approach in mind: turning all the scalars (strings) inside each array ref into compiled regexes (qr/.../). I doubt it will improve things much, since I suspect the real cost is the regex searching through each entire HTML page, which can be hundreds of bytes in size.
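For illustration, the precompiling idea might look roughly like the sketch below. It goes one step further than compiling each string separately and joins every array into a single alternation per pattern name, so each page is scanned once per key rather than once per tag. The %tags data, the %compiled hash, and the find_provider_qr name are all hypothetical stand-ins for $self->{TAGS} and the loop above, and the keys are sorted only to make the result deterministic.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for $self->{TAGS}.
my %tags = (
    acme   => [ 'acme-widget.js', 'Powered by ACME' ],
    globex => [ 'globex.min.css' ],
);

# Build one case-insensitive regex per pattern name, compiled once.
# quotemeta plays the role of \Q...\E in the original loop.
# Note: an empty tag list would yield an empty pattern that matches anything.
my %compiled;
for my $prov (keys %tags) {
    my $alternation = join '|', map { quotemeta } @{ $tags{$prov} };
    $compiled{$prov} = qr/$alternation/i;
}

# Return the first pattern name (in sorted order) whose regex matches,
# or undef when nothing matches.
sub find_provider_qr {
    my ($html) = @_;
    for my $prov (sort keys %compiled) {
        return $prov if $html =~ $compiled{$prov};
    }
    return;
}
```

Whether this actually helps depends on the data, so it is worth measuring against the original loop before relying on it.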
What can I do to improve this section?
SUB-QUESTION: Based on the answers below and some testing I did, let me sharpen my question. The issue is NOT the regex: I had already tried the index approach before asking, and I also tried compiled regexes with qr//. The issue is the size of the HTML files. $html holds HTML text that is sometimes small and sometimes big, so the real question is: WHAT IS THE BEST WAY (RESOURCE-WISE) TO FIND WHETHER A STRING APPEARS INSIDE A LARGER STRING (SAY, 1 MB IN SIZE)?
Thanks.
Using index should increase performance, since you'll get rid of all the overhead of using regular expressions. Please, do a benchmark! If you'd like to increase it even more, you should store all your $tags as lowercase strings so that you don't have to lc the same string multiple times.
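A minimal sketch of this suggestion, assuming the tags are already stored in lowercase and the page is lowercased once per call. The %tags data and the find_provider_index name are hypothetical stand-ins for $self->{TAGS} and the loop in the question. Note that comparing lc'd strings is not guaranteed to behave exactly like /i for every Unicode character, so treat this as an approximation for mostly-ASCII tags.

```perl
use strict;
use warnings;

# Hypothetical sample data; every tag is stored lowercase up front.
my %tags = (
    acme   => [ 'acme-widget.js', 'powered by acme' ],
    globex => [ 'globex.min.css' ],
);

# Return the first pattern name (in sorted order) with a tag found in the
# page, or undef when no tag is present.
sub find_provider_index {
    my ($html) = @_;
    my $lc_html = lc $html;    # lowercase the page once, not once per tag
    for my $prov (sort keys %tags) {
        for my $tag (@{ $tags{$prov} }) {
            # index returns -1 when the substring is absent
            return $prov if index($lc_html, $tag) >= 0;
        }
    }
    return;
}

print find_provider_index('<script src="ACME-Widget.js"></script>') // 'none', "\n";
# prints "acme"
```

To check whether this beats the regex loop on your own pages, the core Benchmark module's cmpthese function can time both versions side by side on a representative large $html.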