Can the Unicode ligature character fi (U+FB01) have more than one representation in UTF-8? If so, which ones, and what is the result under each normalization form?
This depends on the meaning of “character,” which is rather obscure. In Unicode, “character” usually means a codepoint assigned to a character, and this does not exactly match the intuitive concept of “character.”
A single codepoint, such as U+FB01, has only one representation in UTF-8, because UTF-8 defines an unambiguous algorithm for generating the encoded form.
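As a quick check of this point, the following minimal sketch encodes U+FB01 with Python's built-in UTF-8 codec. The variable names are illustrative only, and `bytes.hex` with a separator assumes Python 3.8 or later.

```python
# Encode the single codepoint U+FB01 (LATIN SMALL LIGATURE FI) as UTF-8.
ligature = "\uFB01"
encoded = ligature.encode("utf-8")

# A codepoint in the range U+0800..U+FFFF always takes three bytes in UTF-8.
encoded_hex = encoded.hex(" ")
print(encoded_hex)  # ef ac 81

# Decoding those bytes gives back exactly the same codepoint.
round_trip = encoded.decode("utf-8")
print(round_trip == ligature)  # True
```

The encoder has no freedom here: the same codepoint always yields the same three bytes.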
An intuitive character, such as the fi ligature, may have different representations as a codepoint or as a sequence of codepoints, which each have UTF-8 representations. Unicode normalization rules define, in part, mappings between such alternatives.
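But the compatibility mapping for U+FB01 (to U+0066 U+0069, i.e. “f” followed by “i”) does not preserve the identity of an intuitive character: the ligature is mapped to two normal letters. This mapping applies only in the compatibility forms NFKC and NFKD; U+FB01 has no canonical decomposition, so NFC and NFD leave it unchanged.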
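To see what each normalization form does with U+FB01, here is a small sketch using Python's standard `unicodedata` module. The helper `utf8_hex` and the `results` dictionary are names invented for this illustration.

```python
import unicodedata

ligature = "\uFB01"  # LATIN SMALL LIGATURE FI


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex (illustrative helper)."""
    return text.encode("utf-8").hex(" ")


# Normalize the ligature under each of the four normalization forms.
results = {
    form: unicodedata.normalize(form, ligature)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in results.items():
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{form}: {codepoints} -> {utf8_hex(text)}")

# NFC: U+FB01 -> ef ac 81
# NFD: U+FB01 -> ef ac 81
# NFKC: U+0066 U+0069 -> 66 69
# NFKD: U+0066 U+0069 -> 66 69
```

Only the compatibility forms replace the ligature, and what they produce is the two ordinary letters, not another encoding of the ligature itself.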
On the other hand, you can request, or suggest, ligature behavior by inserting U+200D ZERO WIDTH JOINER (ZWJ) between two letters, such as “f” and “i”. In a sense, the sequence U+0066 U+200D U+0069 is an alternative representation of the fi ligature, but this is not a formal property of the characters, and whether ZWJ has any effect depends on the rendering software.
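The ZWJ sequence can be inspected the same way. This sketch, again with illustrative variable names, shows its UTF-8 bytes and checks that normalization treats it as unrelated to U+FB01. Whether it is displayed as a ligature cannot be tested here, since that is up to the renderer.

```python
import unicodedata

# "f", ZERO WIDTH JOINER, "i": a request for ligature rendering.
zwj_sequence = "f\u200Di"
zwj_hex = zwj_sequence.encode("utf-8").hex(" ")
print(zwj_hex)  # 66 e2 80 8d 69

# ZWJ has no decomposition mapping, so every normalization form keeps the
# sequence as it is; none of them turns it into U+FB01.
normalized = {
    form: unicodedata.normalize(form, zwj_sequence)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
unchanged = all(text == zwj_sequence for text in normalized.values())
print(unchanged)  # True
```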