How can I convert decomposed Unicode character sequences such as “LATIN SMALL LETTER E” + “COMBINING ACUTE ACCENT” (U+0065 + U+0301) into the precomposed form “LATIN SMALL LETTER E WITH ACUTE” (U+00E9) using native Python 2.5+ functions?
If it matters, I am on Mac OS X (10.6.4), and I have seen the question Converting to Precomposed Unicode String using Python-AppKit-ObjectiveC. Unfortunately, the OS X native CoreFoundation function CFStringNormalize described there neither fails nor halts the script; it just doesn’t do anything.
And by that I don’t mean that it doesn’t return anything (its return type is void – it mutates in place). I have also tried all possible values for the constant parameter that specifies precomposing or decomposing in either canonical or non-canonical forms.
That is why I am searching for a Python native method of handling this case.
Thank you very much for reading!
André
‘NFC’ tells ud.normalize to apply the canonical decomposition (‘NFD’) first, then recompose the result into precomposed characters wherever a canonical composition exists:
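A minimal sketch of that call, assuming ud is an alias for the standard-library unicodedata module and using the e + acute accent pair from the question (the variable names are illustrative, not taken from the original answer):

```python
import unicodedata as ud  # assumption: "ud" is an alias for unicodedata

# Decomposed input: LATIN SMALL LETTER E (U+0065) + COMBINING ACUTE ACCENT (U+0301)
decomposed = u"\u0065\u0301"

# NFC decomposes canonically, then composes into U+00E9 where possible
composed = ud.normalize("NFC", decomposed)

# The decomposed string has two code points, the composed one has one
print(len(decomposed))  # 2
print(len(composed))    # 1
```

The u"" prefix is required on Python 2 and is accepted again from Python 3.3 onward, so the sketch should run on either.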
They both print the same:
But their reprs are different:
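Because the exact repr output differs between Python 2 and Python 3, the sketch below lists the code points instead, which shows the same difference in a version-independent way. It reuses the illustrative e + acute accent example from the question; the helper name code_points is hypothetical.

```python
import unicodedata as ud  # assumption: "ud" is an alias for unicodedata

decomposed = u"\u0065\u0301"                # e + combining acute accent
composed = ud.normalize("NFC", decomposed)  # precomposed e with acute


def code_points(text):
    """Return the code points of text as hex strings (hypothetical helper)."""
    return [hex(ord(ch)) for ch in text]


print(code_points(decomposed))  # ['0x65', '0x301']
print(code_points(composed))    # ['0xe9']

# The strings render identically but do not compare equal
print(decomposed == composed)   # False
```

This is also why comparing user-visible strings usually calls for normalizing both sides to the same form first.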
And their encodings, in say utf_8, are (not surprisingly) different too:
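As a sketch of that difference, again assuming the e + acute accent pair from the question rather than the answer's original sample string:

```python
import unicodedata as ud  # assumption: "ud" is an alias for unicodedata

decomposed = u"\u0065\u0301"                # e + combining acute accent
composed = ud.normalize("NFC", decomposed)  # precomposed e with acute

# UTF-8 encodes each code point separately, so the byte sequences differ
decomposed_bytes = decomposed.encode("utf_8")  # 3 bytes: 0x65 0xCC 0x81
composed_bytes = composed.encode("utf_8")      # 2 bytes: 0xC3 0xA9

print(len(decomposed_bytes))  # 3
print(len(composed_bytes))    # 2
```

The decomposed form takes one byte for the base letter plus two for the combining accent, while the precomposed character fits in two bytes.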