I’m searching for a method to remove diacritics and other letter marks in a

Asked: June 15, 2026 (2026-06-15T06:27:50+00:00)

I’m searching for a method to remove diacritics and other letter marks from a text and simplify it so that it is a good fit for a text search index.

For removing the diacritics, I already found these:

questions for PHP: 1, 2
question for Java: 1, related: 2
question for Bash: 1
questions for .NET: 1, 2
question for JavaScript: 1
question for Python: 1

I am wondering whether there is a generic, language-independent solution. (Also, this reference list might be useful for some.)

Removing the diacritics works for äöüò, etc. But I also want:

ø → o
Я → R
Ł → L
ɲ → n
æ → a (it could also be “ae” but in my case, “a” makes more sense because I also want to replace “ae” by “a”)

For example, I want to index the name Røyksopp, which sometimes also appears as Röyksopp, under the single simplified name Royksopp. Likewise, KoЯn should become KoRn.
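To make the requirement concrete, here is a minimal Python sketch of one possible approach, using only the standard library: decompose the text with NFKD, drop the combining marks, then apply a small hand-written table for letters that have no Unicode decomposition. The names `simplify` and `EXTRA_FOLDS` are hypothetical, and the table covers only the examples listed above; it is an illustration of the desired behavior, not a complete solution.

```python
import unicodedata

# Hypothetical, hand-maintained table for letters that Unicode does not
# decompose into "base letter + combining mark". Only the examples from
# the question are covered; a real index would need a larger table.
EXTRA_FOLDS = {
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "ɲ": "n",
    "æ": "a", "Æ": "A",   # the question prefers "a" over "ae"
    "Я": "R",             # visual match requested in the question
}

def simplify(text: str) -> str:
    """Fold text to a simplified form for a search index (sketch)."""
    # Step 1: split characters such as "ö" into "o" + combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: drop the combining marks (non-zero combining class).
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Step 3: replace the remaining special letters from the table.
    return "".join(EXTRA_FOLDS.get(c, c) for c in stripped)

print(simplify("Røyksopp"), simplify("Röyksopp"), simplify("KoЯn"))
```

Both spellings of Røyksopp fold to the same key, which is what the index needs. The obvious weakness is that the table must be maintained by hand.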

1 Answer

Editorial Team · Answer 1 · 2026-06-15T06:27:51+00:00

Some ICU magic:

echo "ë ö ø Я Ł ɲ æ å ñ 開 당" | uconv -x any-name | perl -wpne 's/ WITH [^}]+//g;' | uconv -x name-any | uconv -x any-latin -t iso-8859-1 -c | uconv -f iso-8859-1 -t ascii -x latin-ascii -c

yields

e o o A L n ae a n ki dang

This uses the command-line tool uconv, but the same can be done with ICU’s Java, C, or C++ API, and ICU has bindings for almost any language.

Note that Я becomes A rather than R, because that is the correct behavior. What you want is not how Unicode defines that character; blame KoЯn for abusing it.
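The first three stages of the pipeline (any-name, the Perl substitution, name-any) can be approximated without ICU. The sketch below uses Python's standard `unicodedata` module to look up each character's Unicode name, cut everything from " WITH " onward, and look the shortened name up again. The function names are hypothetical, and the any-latin and latin-ascii transliteration stages are not reproduced, so æ, Я, 開 and 당 pass through unchanged here.

```python
import unicodedata

def strip_with_suffix(ch: str) -> str:
    """Map e.g. 'LATIN SMALL LETTER O WITH STROKE' to 'LATIN SMALL LETTER O'."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return ch  # unnamed character (e.g. a control code): leave as-is
    base_name, sep, _ = name.partition(" WITH ")
    if not sep:
        return ch  # name has no " WITH ..." part: nothing to strip
    try:
        return unicodedata.lookup(base_name)
    except KeyError:
        return ch  # the shortened name is not a real character: keep original

def strip_marks_by_name(text: str) -> str:
    return "".join(strip_with_suffix(c) for c in text)

print(strip_marks_by_name("ë ö ø Ł ɲ å ñ"))
```

Unlike plain decomposition, this also handles letters such as ø, Ł and ɲ, whose stroke or hook is part of the name but not a separable combining mark.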
