I understand that unicodedata.normalize converts diacritics to their non-diacritic counterparts

Question


Asked: June 11, 2026 (2026-06-11T06:23:48+00:00)


I understand that unicodedata.normalize converts diacritics to their non-diacritic counterparts:

import unicodedata
''.join(c for c in unicodedata.normalize('NFD', u'B\u0153uf')
        if unicodedata.category(c) != 'Mn')

My question is (as this example shows): does unicodedata have a way to replace combined-character diacritics with their counterparts (u'œ' becomes 'oe')?

If not, I assume I will have to handle these by hand, but then I might as well compile my own dict of all Unicode characters and their counterparts and forget about unicodedata altogether…


1 Answer

Editorial Team · Answer 1 · 2026-06-11T06:23:49+00:00

There’s a bit of confusion about terminology in your question. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term combining character.) What normalize('NFD', ...) does is to convert precomposed characters into their components.
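To illustrate that decomposition, here is a minimal sketch assuming Python 3. The sample character é and the variable names are illustrative choices, not taken from the question.

```python
import unicodedata

# U+00E9 (é) is a precomposed character: NFD splits it into a base
# letter followed by a combining mark.
decomposed = unicodedata.normalize('NFD', '\u00e9')

# List the name and general category of each resulting code point.
parts = [(unicodedata.name(c), unicodedata.category(c)) for c in decomposed]
print(parts)

# Dropping the 'Mn' (nonspacing mark) code points leaves the base letter.
stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
print(stripped)
```

The filter on category 'Mn' in the question's snippet is what actually removes the accent; normalize on its own only separates it from the base letter.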

Anyway, the answer is that œ is not a precomposed character. It’s a typographic ligature:

>>> unicodedata.name(u'\u0153')
'LATIN SMALL LIGATURE OE'
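One way to check this, as a hedged sketch assuming Python 3: unicodedata.decomposition returns the decomposition mapping recorded for a character, or an empty string when there is none. The comparison with é is an illustrative addition.

```python
import unicodedata

# é has a canonical decomposition: base letter plus combining acute accent.
e_acute = unicodedata.decomposition('\u00e9')

# œ has no decomposition mapping, so the call returns an empty string.
oe = unicodedata.decomposition('\u0153')

# Consequently NFD leaves a string containing œ unchanged.
unchanged = unicodedata.normalize('NFD', 'B\u0153uf') == 'B\u0153uf'
print(repr(e_acute), repr(oe), unchanged)
```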

The unicodedata module provides no method for splitting ligatures into their parts. But the data is there in the character names:

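import re
import unicodedata

# Matches names such as 'LATIN SMALL LIGATURE OE'. The trailing $ rejects
# multi-word names such as 'LATIN SMALL LIGATURE LONG S T'.
_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})$')

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters.
    """
    def untie(c):
        # Unnamed characters (e.g. control characters) yield '' here
        # instead of raising ValueError.
        m = _ligature_re.match(unicodedata.name(c, ''))
        if not m:
            return c
        if m.group(1):  # CAPITAL ligature: keep the letters uppercase
            return m.group(2)
        return m.group(2).lower()
    return ''.join(untie(c) for c in s)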

>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'

(Of course you wouldn’t do it like this in practice: you’d preprocess the Unicode database to generate a lookup table as you suggest in your question. There aren’t all that many ligatures in Unicode.)
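A minimal sketch of that preprocessing step, assuming Python 3 and the same character-name pattern as above. The names build_ligature_table and LIGATURES are hypothetical, the trailing $ in the regex is an addition here so that multi-word names such as 'LATIN SMALL LIGATURE LONG S T' are skipped, and the table only covers characters whose names the regex matches.

```python
import re
import sys
import unicodedata

_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})$')

def build_ligature_table():
    """Scan every code point once and map each ligature to its letters."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points yield '' and simply fail to match.
        m = _ligature_re.match(unicodedata.name(chr(cp), ''))
        if m:
            letters = m.group(2)
            table[cp] = letters if m.group(1) else letters.lower()
    return table

# Built once at import time (hypothetical module-level name).
LIGATURES = build_ligature_table()

def split_ligatures(s):
    # str.translate accepts a dict mapping code points to replacement strings.
    return s.translate(LIGATURES)

print(split_ligatures('B\u0153uf \u0132sselmeer \ufb00otogra\ufb00'))
```

Building the table costs one pass over the code point range; after that, each call is a single str.translate lookup rather than a regex match per character.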
