Do you know if there are any Linux programs out there to remove accents from lists of foreign words (in UTF-8)? Like Spanish, Czech, French. For instance:
administrátoři (Czech) administratori
français (French) francais
niñez (Spanish) ninez etc.
I know I could do it manually with sed, but it’s relatively time-consuming considering that I’m working on a lot of languages. I thought a program that could do just that might exist already.
What you want is called Unicode decomposition, the reverse process of Unicode composition (where you combine a base character with a diacritic). Once the text is decomposed, each diacritic is a separate combining character that can be dropped. There are a number of related SO questions using:
which you can use as a starting point.
The Python standard library has unicodedata.decomposition, which returns a character's decomposition mapping. Your system probably also has iconv, and with suitable normalization it may get you there too!
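To make the decomposition idea concrete, here is a minimal sketch in Python. It uses unicodedata.normalize and unicodedata.combining from the standard library rather than unicodedata.decomposition directly. The helper name strip_accents is just something I made up for illustration. Treat it as a starting point, not a complete solution.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritics removed (hypothetical helper)."""
    # NFD splits each precomposed letter into a base letter followed by
    # one or more combining marks, e.g. "é" becomes "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks,
    # so keeping only the zero-class characters drops the accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The sample words from the question.
for word in ["administrátoři", "français", "niñez"]:
    print(word, "->", strip_accents(word))
# administrátoři -> administratori
# français -> francais
# niñez -> ninez
```

One caveat: this only works for letters that actually have a decomposition. Characters such as ø or ł are single code points with no combining mark, so they pass through unchanged and would need an explicit replacement table. To process a word list, you could apply the same function to each line of the file.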