I have a regular expression that searches HTML content for certain keywords. It used to work, but now it fails and I don’t understand why. (The regular expression came from this thread.)
$find = '/(?![^<]+>)(?<!\w)(' . preg_quote($t['label']) . ')\b/s';
$text = preg_replace_callback($find, 'replaceCallback', $text);
function replaceCallback($match) {
    if (is_array($match)) {
        $htmlVersion = $match[1];
        $urlVersion = urlencode($htmlVersion);
        return '<a class="tag" rel="tag-definition" title="Click to know more about ' . $htmlVersion . '" href="?tag=' . $urlVersion . '">' . $htmlVersion . '</a>';
    }
    return $match;
}
The error message points to the preg_replace_callback() call and says:
Warning: preg_replace_callback() [function.preg-replace-callback]: Unknown modifier 't' in /frontend.functions.php on line 43
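A likely cause of the warning, assuming one of the labels contains a forward slash: preg_quote() only escapes the pattern delimiter when it is passed as the second argument. Without it, a "/" inside the label ends the pattern early, and whatever follows is read as modifiers, hence "Unknown modifier 't'". The sketch below uses a hypothetical label, html/text, and a simplified callback. It is an illustration of the delimiter issue only, not a claim that the regex is otherwise safe for HTML.

```php
<?php
// Hypothetical label containing the delimiter character '/'.
$label = 'html/text';

// Passing '/' as the second argument makes preg_quote() escape it as well,
// so the label can no longer terminate the pattern prematurely.
$find = '/(?![^<]+>)(?<!\w)(' . preg_quote($label, '/') . ')\b/s';

// Simplified stand-in for replaceCallback(): wrap the match in brackets.
$text = '<p>Serve html/text files.</p>';
$result = preg_replace_callback($find, function ($match) {
    return '[' . $match[1] . ']';
}, $text);

echo $result, "\n"; // prints <p>Serve [html/text] files.</p>
```

With preg_quote($label) alone, the same label would produce the pattern /(?![^<]+>)(?<!\w)(html/text)\b/s, where PHP treats everything after the second "/" as modifiers and stops at the "t".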
Please note: this is not an attempt to provide a fix for the regex. It is just here to show how difficult it is (dare I say impossible) to create a regex that will successfully parse HTML. Even well-structured XHTML would be nightmarishly difficult, but poorly structured HTML is a no-go for regular expressions.
I agree 100% that using regular expressions to attempt HTML parsing is a very bad idea. The following code uses the supplied function to parse some simple HTML tags. It trips up on its second attempt, when it finds the nested HTML tag <em>Test<em>: