In the htmlspecialchars function, if you set the ENT_SUBSTITUTE flag, it is supposed to replace some invalid characters.
What characters are replaced? And what is the mapping between the invalid characters and the ones that are used to replace it?
Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.
Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.
Lost your password? Please enter your email address. You will receive a link and will create a new password via email.
Please briefly explain why you feel this question should be reported.
Please briefly explain why you feel this answer should be reported.
Please briefly explain why you feel this user should be reported.
There is only one, universal replacement character: U+FFFD. If you are writing out UTF-8, then this codepoint is appropriately encoded. If not, you get the corresponding character reference
�instead.There is no reversible mapping. By definition, the original byte sequence was invalid, i.e. it does not have a value (valid = has a value).
Bytes (not really “characters”) that are replaced are those that are not valid in the assumed source encoding. For example, if your source encoding was UTF-16 and you had a lone surrogate, that would be “invalid” (though technically any text processor is supposed to abort fatally in that situation). As a better example, if the source encoding is ASCII, then any value above 127 is an invalid character.