(This is NOT a duplicate of "How to detect the language of a string?")
I need to be able to determine the alphabet of a given string (a single word) from its language/alphabet-specific characters.
For example, if the string contains:
- ‘Ü’ it should be recognized as German,
- ‘ش’ as Arabic,
- ‘Φ’ as Greek, and so on.
I’m looking for a list of alphabet-specific characters, grouped by language/alphabet. Because the input is a single non-dictionary word, the Google Translate API or other dictionary-based solutions won’t work.
(Although the question isn’t programming language specific, the actual code is written in C#)
You could start with the Unicode name of each character. For example (in Python):
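A minimal sketch of that idea, assuming the first word of a character's Unicode name is a good enough stand-in for its script. The helper names `script_guess` and `scripts_in_word` are made up for illustration; only `unicodedata.name` comes from the standard library.

```python
import unicodedata

def script_guess(ch):
    """Return the first word of the character's Unicode name, e.g. 'GREEK'."""
    name = unicodedata.name(ch, "")  # "" when the code point has no name
    return name.split(" ")[0] if name else "UNKNOWN"

def scripts_in_word(word):
    """Collect the guessed script of every character in the word."""
    return {script_guess(ch) for ch in word}

for ch in "ÜشΦ":
    print(ch, unicodedata.name(ch), script_guess(ch))
```

This is crude: digits come back as `DIGIT`, punctuation gets all sorts of first words, and every Latin letter is just `LATIN`, which is why the Latin characters need the extra handling described next.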
You might have to special-case the Latin characters, since Unicode doesn’t assign them to particular language-specific alphabets. Most of them appear in several languages that use Latin-based alphabets, but if you’re somehow confident that your data will contain Ü only if it is German, then you can identify that character as German for your purposes. There are only a few dozen Latin characters to worry about.
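One way to do that special-casing is a hand-made lookup table that is consulted before falling back to the Unicode name. The table below (`LATIN_HINTS`) and the function `guess_alphabet` are hypothetical, and the entries are assumptions about your data, not facts about the languages: Ü, for instance, also appears in Turkish and several other languages.

```python
import unicodedata

# Hypothetical hints: only valid if you trust your data this much.
LATIN_HINTS = {
    "ü": "German",
    "ß": "German",
    "ñ": "Spanish",
}

def guess_alphabet(word):
    """Check the hint table first, then fall back to the Unicode name."""
    # NFC turns "U" + combining diaeresis into the single character "Ü".
    word = unicodedata.normalize("NFC", word)
    for ch in word:
        hint = LATIN_HINTS.get(ch.lower())
        if hint:
            return hint
    for ch in word:
        name = unicodedata.name(ch, "")
        if name:
            return name.split(" ")[0].title()  # e.g. "Greek", "Latin"
    return "Unknown"

print(guess_alphabet("Über"))   # hint table wins
print(guess_alphabet("Φως"))    # falls back to the Unicode name
```

The fallback only looks at the first named character, so a mixed-script word gets whatever script comes first.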
Similarly, loads of languages use the Unicode CYRILLIC letters, so in most cases their presence doesn’t tell you the language. Some are described by Unicode as belonging to particular languages. CYRILLIC SMALL LETTER YI has the note “Ukrainian” in http://www.unicode.org/charts/PDF/U0400.pdf. I don’t know whether or not those notes are exhaustive, i.e. whether or not Ukrainian is the only language that uses that character. And I’m certain that there are plenty of Ukrainian words that don’t have that character in them. Fundamentally, you cannot distinguish Ukrainian words from Russian words solely by the presence or absence of Ukrainian-specific letters.

I expect the same is true of other alphabets in Unicode. If you’re really lucky you might find a Unicode database that includes any such notes on each character, so you can mine it for mentions of particular languages.
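As far as I know, the Unicode Character Database includes a NamesList.txt file that carries the same annotations as the code charts. Here is a rough sketch of mining it, assuming the layout is "hex code point, tab, name" for a character line and "tab, asterisk, space, note" for an annotation line. `SAMPLE` is a hand-written excerpt imitating that layout, not a copy of the real file, and `parse_notes` and `chars_mentioning` are names I made up.

```python
import re

# Hand-written excerpt in the assumed NamesList.txt layout.
SAMPLE = (
    "0430\tCYRILLIC SMALL LETTER A\n"
    "0457\tCYRILLIC SMALL LETTER YI\n"
    "\t* Ukrainian\n"
)

CHAR_LINE = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")

def parse_notes(text):
    """Map each character to the list of '*' notes that follow its entry."""
    notes = {}
    current = None
    for line in text.splitlines():
        m = CHAR_LINE.match(line)
        if m:
            current = chr(int(m.group(1), 16))
            notes[current] = []
        elif line.startswith("\t* ") and current is not None:
            notes[current].append(line[3:].strip())
    return notes

def chars_mentioning(notes, language):
    """List the characters whose notes mention the given language name."""
    return [ch for ch, ns in notes.items()
            if any(language in n for n in ns)]

notes = parse_notes(SAMPLE)
print(chars_mentioning(notes, "Ukrainian"))
```

Check the real file's header for the exact format before relying on this, and remember that a missing note does not mean the language never uses the character.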