I’m looking for a fast and, if possible, convenient way in Python 3 to translate strings with non-ASCII letters to words with only ASCII letters.
Examples!
żółw => zolw
móżdżek => mozdzek
łódź => lodz
and so on…
There are many letters in national alphabets that can be turned into ASCII letters (like ń to n). I can do it manually for my language (Polish), by specifying how to translate each letter. But is there any automated way to do that? Or some library which would do what I need?
Python’s str.encode() won’t do, because "żółw".encode('ascii', 'replace') == b"???w" and "żółw".encode('ascii', 'ignore') == b"w"…
I can do such a translation for Polish letters, but I don’t want to do it for every other language:
>>> utf8_letters = ['ą','ę','ć','ź','ż','ó','ł','ń','ś']
>>> ascii_letters = ['a','e','c','z','z','o','l','n','s']
>>> trans_dict = dict(zip(utf8_letters,ascii_letters))
>>> turtle = "żółw"
>>> out = []
>>> for l in turtle:
...     out.append(trans_dict[l] if l in trans_dict else l)
...
>>> result = ''.join(out)
>>> result
'zolw'
The above code does what I want with Polish letters, but it’s ugly :< What is the best way to do this?
Of course such translations will change the meanings of some words, but that’s OK.
The unicodedata module can be used for this.
It has functions to manipulate Unicode character names: name and lookup.
Now let’s take a closer look at them.
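For instance, a short sketch that collects the Unicode names of the letters in żółw and then looks one character up by name (the variable names here are only illustrative):

```python
import unicodedata

# Map each letter of the sample word to its official Unicode name.
names = {ch: unicodedata.name(ch) for ch in "żółw"}
for ch, name in names.items():
    print(ch, "->", name)
# ż -> LATIN SMALL LETTER Z WITH DOT ABOVE
# ó -> LATIN SMALL LETTER O WITH ACUTE
# ł -> LATIN SMALL LETTER L WITH STROKE
# w -> LATIN SMALL LETTER W

# lookup goes the other way: from a name back to the character.
base = unicodedata.lookup("LATIN SMALL LETTER Z")
print(base)  # z
```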
See a pattern? Let’s make a function that utilizes it:
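A minimal sketch of such a function, assuming the name remove_accent (chosen here for illustration, not necessarily the name used originally):

```python
import unicodedata

def remove_accent(char):
    """Return the base letter of char, or char itself if none is found."""
    try:
        name = unicodedata.name(char)  # ValueError if char has no name
        # str.index raises ValueError when ' WITH' is absent from the name.
        base_name = name[:name.index(" WITH")]
        # lookup raises KeyError when no character has the shortened name.
        return unicodedata.lookup(base_name)
    except (ValueError, KeyError):
        return char

print(remove_accent("ż"))  # z
print(remove_accent("ł"))  # l
print(remove_accent("w"))  # w
```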
It looks for the word WITH in the character name, removes everything that goes after it and feeds the result back to the lookup function.
If there is no ‘WITH’, ValueError is raised, and when there is no character with such a name, KeyError is raised, so the function returns the character unchanged.
And here is a function that "translates" a string based on the previous function:
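One way this could look, as a sketch: the per-character helper is repeated so the snippet runs on its own, and both function names are assumptions made for illustration.

```python
import unicodedata

def remove_accent(char):
    """Strip the ' WITH ...' part of the character name and look it up."""
    try:
        name = unicodedata.name(char)
        return unicodedata.lookup(name[:name.index(" WITH")])
    except (ValueError, KeyError):
        # No name, no ' WITH' in the name, or no such base character.
        return char

def remove_accents(text):
    """Apply remove_accent to every character of text."""
    return "".join(remove_accent(char) for char in text)

print(remove_accents("żółw"))     # zolw
print(remove_accents("móżdżek"))  # mozdzek
print(remove_accents("łódź"))     # lodz
```

Characters that are not accented letters (digits, spaces, punctuation) pass through unchanged, because their names contain no ' WITH'.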
So this solution is obviously very good, but I’ll leave the previous ones below.
The unicodedata module also has a function that promises similar results: normalize with the 'NFKD' parameter (compatibility decomposition). However, it misses characters that have no decomposition, such as ł.
If you have the character data, the code you provided can be improved.
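To illustrate both points, here is a rough sketch that combines a small hand-written table with NFKD normalization. The names EXTRA and strip_accents are made up for this example, and the table only covers ł/Ł; other letters without a decomposition would need their own entries.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit mapping.
EXTRA = str.maketrans({"ł": "l", "Ł": "L"})

def strip_accents(text):
    """Translate the special cases, decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.translate(EXTRA))
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# NFKD alone loses ł when the result is encoded to ASCII:
nfkd_only = unicodedata.normalize("NFKD", "żółw").encode("ascii", "ignore").decode()
print(nfkd_only)              # zow
print(strip_accents("żółw"))  # zolw
```

str.translate with a table built by str.maketrans also replaces the explicit loop from the question with a single call.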
Here is a nice table with character data. This is JavaScript but can be used easily for Python.
And if you don’t mind using external libraries, you might want to try Unidecode. It was made just for this.