I expected to find this in SO already… but haven’t so far
I’m talking about a regex which looks at an HTML-encoded string, e.g. something like:
blip &diams; trout&#39;s mouth
Have I covered all the bases with &\w+; and &#[0-9]+;?
$encoded_string = htmlspecialchars($_GET["searchterms"]);
echo "<b>Search results for submitted string: \"$encoded_string\"</b><br><br>";
$html_special_chars_pattern = "!(&\\w+;|&#[0-9]+;)!";
$non_html_tokens = preg_split( $html_special_chars_pattern, $encoded_string, -1, PREG_SPLIT_DELIM_CAPTURE );
You are missing the &#xH; or &#XH; numeric character references (hexadecimal form). That is, &#[xX][a-fA-F0-9]+; as a regular expression.
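As a rough sketch of how the hexadecimal branch could be folded into the pattern from the question: the sample input below is hypothetical, and the `/` delimiter is just a stylistic choice in place of the original `!`.

```php
<?php
// One capture group with three alternatives:
// named (&name;), decimal (&#123;), and hexadecimal (&#x1F; or &#X1F;).
$pattern = '/(&\w+;|&#[0-9]+;|&#[xX][0-9a-fA-F]+;)/';

// Hypothetical sample input containing one reference of each kind.
$encoded_string = 'blip &diams; trout&#39;s mouth &#x2666; end';

// PREG_SPLIT_DELIM_CAPTURE keeps the matched references in the result,
// interleaved with the plain-text pieces between them.
$tokens = preg_split($pattern, $encoded_string, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($tokens);
```

With this input, the references should come back as separate array elements between the plain-text pieces. Note that `&\w+;` is looser than the actual set of named references, so it will also match things like `&foo;` that are not defined entities.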