Once again, I am very confused by a Unicode question. I can’t figure out how to use unicodedata.normalize to convert non-ASCII characters the way I expect. For instance, I want to convert the string
u"Cœur"
to
u"Coeur"
I am pretty sure that unicodedata.normalize is the way to do this, but I can’t get it to work. It just leaves the string unchanged.
>>> import unicodedata
>>> s = u"Cœur"
>>> unicodedata.normalize('NFKD', s) == s
True
What am I doing wrong?
Your problem does not seem to be with Python itself. The character you are trying to decompose (u'\u0153', “œ”) is not a composed character, so normalize has nothing to split it into.
Your code does work on a string containing ordinary precomposed characters such as “ç” and “ã”.
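A minimal sketch of that comparison, with the characters written as escapes so it does not depend on the source file encoding. The variable names are just for illustration.

```python
import unicodedata

# Precomposed letters: NFKD splits each one into a base letter
# followed by a combining mark.
composed = u"\u00e7\u00e3"  # "çã"
decomposed = unicodedata.normalize("NFKD", composed)
changed = decomposed != composed  # True: 2 code points became 4

# The ligature has no decomposition mapping, so NFKD hands it back as-is.
ligature = u"C\u0153ur"  # "Cœur"
unchanged = unicodedata.normalize("NFKD", ligature) == ligature  # True
```

Here decomposed is u"c\u0327a\u0303", that is, "c" plus a combining cedilla and "a" plus a combining tilde.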
If you then check the Unicode reference for both characters (yours and c with cedilla), you will see that the latter has a “decomposition” entry that the former lacks:
http://www.fileformat.info/info/unicode/char/153/index.htm
http://www.fileformat.info/info/unicode/char/00e7/index.htm
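You can also check this from Python instead of the web pages. As a small sketch, unicodedata.decomposition returns the decomposition mapping of a character as a string of hex code points, or an empty string when there is none:

```python
import unicodedata

# "ç" maps to "c" (U+0063) followed by a combining cedilla (U+0327).
cedilla_mapping = unicodedata.decomposition(u"\u00e7")   # '0063 0327'

# "œ" has no decomposition mapping at all, hence the empty string.
ligature_mapping = unicodedata.decomposition(u"\u0153")  # ''
```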
It seems that “œ” is not formally equivalent to “oe” (at least not for the people who defined this part of Unicode), so the way to normalize text containing it is to manually replace the character with the two-letter sequence using unicode.replace, as hacky as that sounds.
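A rough sketch of that manual replacement. The LIGATURES table and the expand_ligatures helper are names made up for this example, not part of any library, and the uppercase mapping to "OE" is an assumption you may want to change (for instance to "Oe").

```python
# Hypothetical replacement table: extend it with whatever characters
# your data actually contains.
LIGATURES = {
    u"\u0153": u"oe",  # œ
    u"\u0152": u"OE",  # Œ (assumed mapping)
}

def expand_ligatures(text):
    """Replace each ligature in the table with its plain-letter sequence."""
    for ligature, replacement in LIGATURES.items():
        text = text.replace(ligature, replacement)
    return text

result = expand_ligatures(u"C\u0153ur")  # u"Coeur"
```

If the text also contains accented letters, you could run this before or after unicodedata.normalize, since the two steps touch different characters.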