Executive Summary:
preg_replace() ran faster than string comparisons. Why? Shouldn’t regular expressions be slower?
In a recent question about detecting any of an array of disallowed substrings within a given input, I suggested comparing the result of a preg_replace() call to the original input, since preg_replace() can take an array of patterns as input. Thus my method for this could be a single if whereas the other solutions required one (or many) loops.
I’m not interested in debating my answer, because really it is less readable/maintainable than the loops. My answer there still holds a -1, and I’ll accept that for readability/ease of maintenance, but the biggest fault pointed out with my method was a lack of efficiency. That got me curious, and led me to do some testing. My results were a bit surprising to me: with all other factors held equal, preg_replace() was faster than any of the other methods.
Can you explain why this was the case?
My code for these tests can be found below, along with the results:
$input = "In a recent question about detecting any of an array of disallowed substrings within a given input, I suggested comparing the result of a `preg_replace()` call to the original input, since `preg_replace()` can take an array of patterns as input. Thus my method for this could be a single `if` whereas the other solutions required one (or many) loops. I'm not interested in debating my answer, because really it is less readable/maintainable than the loops. However, the biggest fault pointed out with my method was a lack of efficiency. That got me curious, and led me to do some testing. My results were a bit surprising to me: with all other factors held equal, `preg_replace()` was **faster** than any of the other methods. Can you explain why this was the case?";
$input2 = "Short sentence - no matches";
$input3 = "Word";
$input4 = "Short sentence - matches loop";
// Note: every *_check function returns true when NO rejected string is found,
// so the *_matches counters below actually count the inputs that passed.

// Benchmark 1: str_check (one stripos() call per rejected word).
$start1 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$p_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (str_check($rejectedStrs, $input)) $p_matches++;
    if (str_check($rejectedStrs, $input2)) $p_matches++;
    if (str_check($rejectedStrs, $input3)) $p_matches++;
    if (str_check($rejectedStrs, $input4)) $p_matches++;
}

// Benchmark 2: loop_check (nested loops over input words and rejected words).
$start2 = microtime(true);
$rejectedStrs = array("loop", "efficiency", "explain");
$l_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (loop_check($rejectedStrs, $input)) $l_matches++;
    if (loop_check($rejectedStrs, $input2)) $l_matches++;
    if (loop_check($rejectedStrs, $input3)) $l_matches++;
    if (loop_check($rejectedStrs, $input4)) $l_matches++;
}

// Benchmark 3: preg_check (a single preg_replace() call with an array of patterns).
$start3 = microtime(true);
$rejectedStrs = array("/loop/", "/efficiency/", "/explain/");
$s_matches = 0;
for ($i = 0; $i < 10000; $i++) {
    if (preg_check($rejectedStrs, $input)) $s_matches++;
    if (preg_check($rejectedStrs, $input2)) $s_matches++;
    if (preg_check($rejectedStrs, $input3)) $s_matches++;
    if (preg_check($rejectedStrs, $input4)) $s_matches++;
}
$end = microtime(true);

// Report the counters, then start time, end time and elapsed seconds per method.
echo $p_matches." ".$l_matches." ".$s_matches."\n";
echo "str_match: ".$start1." ".$start2."= ".($start2-$start1)."\nloop_match: ".$start2." ".$start3."=".($start3-$start2)."\npreg_match: ".$start3." ".$end."=".($end-$start3);
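// Returns true if none of the patterns changed the input, i.e. nothing matched.
function preg_check($rejectedStrs, $input) {
    if ($input == preg_replace($rejectedStrs, "", $input))
        return true;
    return false;
}

// Returns false as soon as any word of the input starts with a bad word
// (case-insensitive); returns true otherwise.
function loop_check($badwords, $string) {
    foreach (str_word_count($string, 1) as $word) {
        foreach ($badwords as $bw) {
            if (stripos($word, $bw) === 0) {
                return false;
            }
        }
    }
    return true;
}

// Returns false as soon as a bad word occurs as a space-delimited word
// (case-insensitive); returns true otherwise.
function str_check($badwords, $str) {
    foreach ($badwords as $word) {
        if (stripos(" $str ", " $word ") !== false) {
            return false;
        }
    }
    return true;
}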
Results
20000 20000 20000
str_match: 1282270516.6934 1282270518.5881= 1.894730091095
loop_match: 1282270518.5881 1282270523.0943=4.5061857700348
preg_match: 1282270523.0943 1282270523.6191=0.52475500106812
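Before reading too much into the timings, it may be worth checking whether the three functions actually accept and reject the same inputs. The following is only a rough sketch: the sample strings and the compare_checks() helper are made up for illustration, the three checkers are condensed copies of the ones in the question, and PHP 7 or later is assumed.

```php
<?php
// Condensed copies of the question's checkers (true means "no bad word found").
function preg_check(array $patterns, string $input): bool {
    return $input == preg_replace($patterns, "", $input);
}
function loop_check(array $badwords, string $string): bool {
    foreach (str_word_count($string, 1) as $word) {
        foreach ($badwords as $bw) {
            if (stripos($word, $bw) === 0) return false;
        }
    }
    return true;
}
function str_check(array $badwords, string $str): bool {
    foreach ($badwords as $word) {
        if (stripos(" $str ", " $word ") !== false) return false;
    }
    return true;
}

// Hypothetical helper: runs all three checkers against the single bad word "loop".
function compare_checks(string $sample): array {
    return [
        "str"  => str_check(["loop"], $sample),
        "loop" => loop_check(["loop"], $sample),
        "preg" => preg_check(["/loop/"], $sample),
    ];
}

// Made-up samples: an exact word, a capitalised prefix, and an inner substring.
foreach (["a loop", "Loopy", "sloop"] as $sample) {
    $r = compare_checks($sample);
    printf("%-8s str=%s loop=%s preg=%s\n", $sample,
        var_export($r["str"], true),
        var_export($r["loop"], true),
        var_export($r["preg"], true));
}
```

With these samples, "a loop" is rejected by all three, "Loopy" is rejected only by loop_check (case-insensitive prefix match), and "sloop" is rejected only by preg_check (case-sensitive substring match). The benchmark inputs in the question happen to produce the same counts for all three, but the functions are not doing strictly identical work.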
Let’s first look at preg_check and str_check. Both of them have to traverse the entire string, and they have to check each of the individual words in each traversal. So their behavior will be at least O(n*m), where n is the length of the string and m the number of bad words. You can test this by running the algorithm with increasing values of n and m and plotting the results as 3D graphs (however, you may or may not have to run it with very high values of n and m to see this behavior).

loop_check is more (asymptotically) efficient here. The reason is that the number of words a string has is not proportional to its length; I seem to recall it typically follows a logarithmic function. It probably uses a hash table to store the words it finds along the way, which is done in average constant time (if we ignore that we may have to rebuild the hash table from time to time to accommodate more elements).

Therefore loop_check will have an asymptotic behavior that follows something like n + m * log(n), which is better than n*m.

Now, this refers to the asymptotic behavior of the algorithms, i.e., when m and n grow very (and it may require “very very”) large. For small values of m and n the constants play a big part. In particular, executing PHP opcodes and PHP function calls is more costly than the same task implemented in C, just one function call away. This doesn’t make the regex algorithm faster in general; it just makes it faster for small values of m and n.
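The suggestion to rerun the checks with increasing values of n and m could be sketched roughly as below. This is only an illustrative harness, not the original benchmark: make_input(), make_badwords(), time_check() and run_scaling_benchmark() are hypothetical helpers, the filler text and the sizes are arbitrary assumptions, the three checkers are condensed copies of the ones in the question, and PHP 7.4 or later is assumed. Note that n is counted in words here, so the string length is proportional to n rather than equal to it.

```php
<?php
// Condensed copies of the question's checkers (true means "no bad word found").
function preg_check(array $patterns, string $input): bool {
    return $input == preg_replace($patterns, "", $input);
}
function loop_check(array $badwords, string $string): bool {
    foreach (str_word_count($string, 1) as $word) {
        foreach ($badwords as $bw) {
            if (stripos($word, $bw) === 0) return false;
        }
    }
    return true;
}
function str_check(array $badwords, string $str): bool {
    foreach ($badwords as $word) {
        if (stripos(" $str ", " $word ") !== false) return false;
    }
    return true;
}

// Hypothetical helper: $n filler words, none of which is a bad word.
function make_input(int $n): string {
    return implode(" ", array_fill(0, $n, "lorem"));
}

// Hypothetical helper: $m distinct bad words that never occur in the filler text,
// so every checker has to do its full amount of work.
function make_badwords(int $m): array {
    $words = [];
    for ($i = 0; $i < $m; $i++) {
        $words[] = "badword" . $i;
    }
    return $words;
}

// Times $reps calls of one checker and returns the elapsed seconds.
function time_check(callable $check, array $badwords, string $input, int $reps): float {
    $start = microtime(true);
    for ($i = 0; $i < $reps; $i++) {
        $check($badwords, $input);
    }
    return microtime(true) - $start;
}

// Measures every (n, m) combination and returns one row of timings per combination.
function run_scaling_benchmark(array $sizes, array $counts, int $reps = 20): array {
    $rows = [];
    foreach ($sizes as $n) {
        $input = make_input($n);
        foreach ($counts as $m) {
            $plain = make_badwords($m);
            // preg_check needs delimited patterns; preg_quote() escapes any metacharacters.
            $patterns = array_map(fn($w) => "/" . preg_quote($w, "/") . "/", $plain);
            $rows[] = [
                "n"    => $n,
                "m"    => $m,
                "str"  => time_check("str_check", $plain, $input, $reps),
                "loop" => time_check("loop_check", $plain, $input, $reps),
                "preg" => time_check("preg_check", $patterns, $input, $reps),
            ];
        }
    }
    return $rows;
}

foreach (run_scaling_benchmark([100, 1000, 5000], [1, 10, 50]) as $r) {
    printf("n=%5d words, m=%3d: str=%.4fs loop=%.4fs preg=%.4fs\n",
        $r["n"], $r["m"], $r["str"], $r["loop"], $r["preg"]);
}
```

The rows can be plotted against n and m to see how each method scales. The absolute numbers will depend on the PHP version and machine, so only the trends are meaningful.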