I am having issues with character encoding. I have simplified it to the script below:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<?php
$string = 'Stan&#146;s';
echo $string.'<br><br>'; // Stan's
echo html_entity_decode($string).'<br><br>'; // Stan's
echo html_entity_decode($string, ENT_QUOTES, 'UTF-8'); // Stans
?>
</body>
</html>
I would like to use the last echo. However, it removes the apostrophe. Why?
Update
I have tried all three options (ENT_COMPAT, ENT_QUOTES, ENT_NOQUOTES), and it removes the apostrophe in all cases.
The problem is that &#146; decodes to the Unicode character U+0092 (UTF-8: C2 92), known as PRIVATE USE TWO. I.e., this doesn't decode to a usual apostrophe.
html_entity_decode($string) works because it doesn't actually decode the entity: the default target charset is latin-1 (ISO-8859-1, the default before PHP 5.4), which cannot represent this character. If you specify UTF-8 as the target charset, the entity is actually decoded.
The intended target of that entity is the Windows-1252 charset, where byte 0x92 is the right single quotation mark (’):
Quoting Wikipedia:
So you're dealing with legacy HTML entities here, which PHP apparently doesn't handle the same way “some” browsers do. You may want to check whether the numeric entities fall in the range specified above (128 to 159, i.e. 0x80 to 0x9F) and, if so, reinterpret them as Windows-1252 bytes and convert them to UTF-8. Alternatively, require your users to pass valid HTML.
This function should handle both legacy and regular HTML entities:
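A minimal sketch of such a function, assuming PHP 7 or later. The name decode_entities_full, its signature, and the approach are assumptions for illustration: numeric entities in the legacy range 128 to 159 are rewritten to the Unicode code points that Windows-1252 assigns to those bytes, and everything is then decoded once as UTF-8.

```php
<?php
// Hypothetical helper: the name and signature are illustrative assumptions.
function decode_entities_full(string $string, int $flags = ENT_QUOTES): string
{
    // Windows-1252 bytes 0x80-0x9F mapped to the Unicode code points they represent.
    // 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in Windows-1252 and are left untouched.
    $cp1252 = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];

    // Rewrite legacy numeric entities (decimal or hex) to their proper code points.
    $fixed = preg_replace_callback(
        '/&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));/',
        function (array $m) use ($cp1252): string {
            $code = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            return isset($cp1252[$code]) ? '&#' . $cp1252[$code] . ';' : $m[0];
        },
        $string
    );

    // Fall back to the unmodified input if the regex engine reported an error.
    if ($fixed === null) {
        $fixed = $string;
    }

    // Decode all entities (named and numeric) in a single pass, targeting UTF-8.
    return html_entity_decode($fixed, $flags, 'UTF-8');
}

echo decode_entities_full('Stan&#146;s'), "\n"; // Stan’s
```

Rewriting the entity before decoding, rather than transcoding afterwards, keeps the sketch free of the mbstring and iconv extensions and avoids decoding anything twice: an escaped input such as &amp;#146; still comes out as the literal text &#146;.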