After seeing several threads rubbishing the regexp method of finding a term to match within an HTML document, I’ve used the Simple HTML DOM PHP parser (http://simplehtmldom.sourceforge.net/) to get the bits of text I’m after, but I want to know if my code is optimal. It feels like I’m looping too many times. Is there a way to optimise the following loop?
//Get the HTML and look at the text nodes
$html = str_get_html($buffer);
//First we match the <body> tag as we don't want to change the <head> items
foreach($html->find('body') as $body) {
//Then we get the text nodes, rather than any HTML
foreach($body->find('text') as $text) {
//Then we match each term
foreach ($terms as $term) {
//Match to the terms within the text nodes
$text->outertext = str_replace($term, '<span class="highlight">'.$term.'</span>', $text->outertext);
}
}
}
For example, would it make a difference to determine check if I have any matches before I start the loop maybe?
You don’t need the outer foreach loop; there’s generally only one body tag in a well-formed document. Instead, just use
$body = $html->find('body',0);.However, since a loop with only a single iteration is essentially equivalent in run time to not looping at all, it probably won’t have much performance impact either way. So in reality, you really just have 2 nested loops even in your original code, not 3.