(this is NOT duplicate of How to detect the language of a string? )

Asked: June 14, 2026

(this is NOT duplicate of How to detect the language of a string?)

I need to determine the alphabet of a given string (a single word) from characters that are specific to a language or alphabet.
For example, if the string contains:

‘Ü’ it should be recognized as German,
‘ش’ as Arabic,
‘Φ’ as Greek, and so on.

I’m looking for a list of alphabet-specific characters, organized by language or alphabet. Because the input is a single non-dictionary word, the Google Translate API and other dictionary-based solutions won’t work.

(Although the question isn’t programming language specific, the actual code is written in C#)


1 Answer

Editorial Team · Answer 1 · 2026-06-14T01:52:03+00:00

You could start with the Unicode name of each character. For example (in Python):

>>> import unicodedata
>>> unicodedata.name(u'Φ')
'GREEK CAPITAL LETTER PHI'
>>> unicodedata.name(u'ش')
'ARABIC LETTER SHEEN'
>>> unicodedata.name(u'Ü')
'LATIN CAPITAL LETTER U WITH DIAERESIS'
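
Building on the names above, here is a minimal sketch that takes the first word of each character's Unicode name as a rough script label. The helper names script_of and scripts_in are made up for this illustration, and the first-word rule is only a heuristic: it works for names like GREEK, ARABIC, LATIN and CYRILLIC, but not every Unicode name starts with a script.

```python
import unicodedata

def script_of(ch):
    """Return the first word of the character's Unicode name, or None if it has no name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Raised for characters without a name, e.g. control characters.
        return None

def scripts_in(word):
    """Collect the rough script labels of all alphabetic characters in a word."""
    labels = {script_of(ch) for ch in word if ch.isalpha()}
    labels.discard(None)
    return labels
```

For a single word you would typically expect one label back; a mixed result such as both LATIN and CYRILLIC is itself a useful signal that the input needs a closer look.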

You might have to special-case the Latin characters, since Unicode doesn’t assign them to particular language-specific alphabets. Most of them appear in several languages that use Latin-based alphabets, but if you’re somehow confident that your data will contain Ü only if it is German, then you can identify that character as German for your purposes. There are only a few dozen Latin characters to worry about.
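
If you do decide to special-case Latin characters, one way to sketch it is a hand-maintained lookup table. Everything in LATIN_HINTS below is a hypothetical, deliberately incomplete assumption for illustration (Ü, for instance, also appears in Turkish and other languages), so treat the result as a hint rather than a detection.

```python
import unicodedata

# Hypothetical hints keyed by lowercase character; adjust to what your data really contains.
LATIN_HINTS = {
    "ü": "German",
    "ß": "German",
    "ñ": "Spanish",
    "ø": "Danish or Norwegian",
}

def guess_from_latin(word):
    """Return the hint for the first special-cased character found, or None."""
    # NFC composes sequences such as 'U' + COMBINING DIAERESIS into a single 'Ü'.
    for ch in unicodedata.normalize("NFC", word):
        hint = LATIN_HINTS.get(ch.lower())
        if hint is not None:
            return hint
    return None
```

Normalizing first matters because the same visible letter can arrive either as one code point or as a base letter plus a combining mark.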

Similarly, many languages use the Unicode CYRILLIC letters, so in most cases their presence doesn’t tell you the language. Some are described by Unicode as belonging to particular languages: CYRILLIC SMALL LETTER YI has the note “Ukrainian” in http://www.unicode.org/charts/PDF/U0400.pdf. I don’t know whether those notes are exhaustive, i.e. whether Ukrainian is the only language that uses that character. And I’m certain that plenty of Ukrainian words don’t contain that character. Fundamentally, you cannot distinguish Ukrainian words from Russian words solely by the presence or absence of Ukrainian-specific letters.

I expect the same is true of other alphabets in Unicode. If you’re really lucky you might find a Unicode database that includes any such notes on each character, so you can mine it for mention of particular languages.
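
As a rough sketch of that mining step: the Unicode Character Database ships a plain-text file, NamesList.txt, which carries the chart annotations. The parser below assumes a layout where an entry line is "hex code point, tab, name" and the note lines that follow start with a tab and "* ". SAMPLE is a hand-written excerpt in that assumed layout, not a copy of the real file, and notes_by_char is a made-up helper name, so check the format against the actual file before relying on it.

```python
# Hand-written excerpt in the assumed NamesList.txt layout (not copied from the real file).
SAMPLE = (
    "0430\tCYRILLIC SMALL LETTER A\n"
    "0457\tCYRILLIC SMALL LETTER YI\n"
    "\t* Ukrainian\n"
)

def notes_by_char(text):
    """Map each character to the list of '* ' notes attached to its entry."""
    notes = {}
    current = None
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith((";", "@")):
            # Comment or heading line: it does not belong to any character entry.
            current = None
            continue
        if line[0] != "\t":
            # Entry line: the code point comes before the first tab.
            code = line.partition("\t")[0]
            try:
                current = chr(int(code, 16))
            except ValueError:
                current = None
            continue
        body = line.strip()
        if current is not None and body.startswith("* "):
            notes.setdefault(current, []).append(body[2:])
    return notes
```

From the resulting dictionary you could then search the note text for language names, keeping in mind that the notes may not be exhaustive.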
