I understand that unicodedata.normalize converts diacritics to their non-diacritic counterparts:
import unicodedata
''.join(c for c in unicodedata.normalize('NFD', u'B\u0153uf')
        if unicodedata.category(c) != 'Mn')
My question is (as can be seen in this example): does unicodedata have a way to replace combined characters with their counterparts? (u'œ' becomes 'oe')
If not, I assume I will have to put a hit out for these, but then I might as well compile my own dict with all uchars and their counterparts and forget about unicodedata altogether…
There’s a bit of confusion about terminology in your question. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term combining character.) What normalize('NFD', ...) does is to convert precomposed characters into their components.
Anyway, the answer is that œ is not a precomposed character. It’s a typographic ligature: its Unicode name is LATIN SMALL LIGATURE OE.
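One way to see the difference, as a quick sketch assuming Python 3: a precomposed character such as é has a decomposition mapping in the Unicode database, while œ has none, so normalization leaves it alone. The variable names here are only for illustration.

```python
import unicodedata

oe = u'\u0153'       # œ
e_acute = u'\u00e9'  # é

# é is precomposed: it decomposes into e (U+0065) + combining acute (U+0301).
e_acute_decomp = unicodedata.decomposition(e_acute)

# œ has no decomposition mapping at all, so NFD has nothing to split.
oe_decomp = unicodedata.decomposition(oe)

print(unicodedata.name(oe))                    # LATIN SMALL LIGATURE OE
print(repr(e_acute_decomp), repr(oe_decomp))   # '0065 0301' ''
print(unicodedata.normalize('NFD', oe) == oe)  # True
```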
The unicodedata module provides no method for splitting ligatures into their parts. But the data is there in the character names, which can be parsed to recover the component letters.
(Of course you wouldn’t do it like this in practice: you’d preprocess the Unicode database to generate a lookup table as you suggest in your question. There aren’t all that many ligatures in Unicode.)
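Here is a minimal sketch of both ideas, assuming Python 3. The names untie, split_ligatures, LIGATURES and split_ligatures_fast are made up for this example and are not part of unicodedata. The regular expression only handles Latin ligatures whose name ends in a plain run of letters (so something like LATIN SMALL LIGATURE LONG S T is deliberately left untouched).

```python
import re
import sys
import unicodedata

# Matches names such as "LATIN SMALL LIGATURE OE" or "LATIN CAPITAL LIGATURE IJ".
# The trailing $ rejects multi-word names like "LATIN SMALL LIGATURE LONG S T".
_LIGATURE_RE = re.compile(r'LATIN (CAPITAL|SMALL) LIGATURE ([A-Z]{2,})$')

def untie(char):
    """Return the component letters of a Latin ligature, or char unchanged."""
    # name() takes a default, so unnamed code points do not raise ValueError.
    m = _LIGATURE_RE.match(unicodedata.name(char, ''))
    if m is None:
        return char
    letters = m.group(2)
    return letters if m.group(1) == 'CAPITAL' else letters.lower()

def split_ligatures(s):
    """Split ligatures by parsing character names on every call (slow)."""
    return ''.join(untie(c) for c in s)

# Preprocessing step: scan the database once and keep only the ligatures.
LIGATURES = {}
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    parts = untie(ch)
    if parts != ch:
        LIGATURES[cp] = parts

def split_ligatures_fast(s):
    """Split ligatures using the precomputed table."""
    return s.translate(LIGATURES)

print(split_ligatures_fast(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00'))
# Boeuf IJsselmeer ffotograff
```

Building the table once and using str.translate avoids running a regex over a character name for every character of every string.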