Is it possible to determine whether data is in English or Chinese?
Yes, this is possible, for example using statistical methods. English has a very distinctive distribution of which characters appear at all, and an equally distinctive distribution of which character follows another (that would be called a first-order, or bigram, model).
If 'e' is the most common character, the text is very likely written in a language of European origin.
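A rough sketch of the frequency idea in Python. The function names `letter_frequencies` and `following_distribution` are made up for this illustration, and a real detector would compare these counts against reference tables for each candidate language rather than just looking at the top letter:

```python
from collections import Counter

def letter_frequencies(text):
    """Count how often each letter occurs, ignoring case and non-letters."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

def following_distribution(text, first):
    """Count which letters directly follow `first` (a first-order model)."""
    lowered = text.lower()
    return Counter(
        b for a, b in zip(lowered, lowered[1:])
        if a == first and b.isalpha()
    )

sample = "These seven sentences were never intended to be elegant"
# Most common letter in the sample
print(letter_frequencies(sample).most_common(1))
# Letters that follow 't' in a short English snippet
print(following_distribution("the then there", "t"))
```

On very short inputs the counts are noisy, so this only becomes meaningful with a reasonable amount of text.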
It may also be possible, rather trivially though maybe not 100% reliably, to make the distinction by looking at Unicode code points (converting from other character sets first if necessary). If the text contains characters with a code point greater than 127, i.e. outside ASCII, English is somewhat unlikely (note that symbols like € also fall outside ASCII, though).
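If there are many characters with code points in the thousands or tens of thousands, East Asian languages become more and more likely; the common Chinese characters sit in the CJK Unified Ideographs block, U+4E00 to U+9FFF. Code points above 65535 are not a guarantee of Chinese, though: that range holds rarer CJK ideographs, but also emoji and various other scripts. Keep in mind that Han characters are used in Japanese text as well.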
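The code point check could look roughly like this in Python. This is a sketch under assumptions: `HAN_RANGES` is a deliberately partial list of ideograph blocks, the 0.5 threshold is arbitrary, and the label "english" really only means "mostly ASCII", which would also match many other Latin-script languages:

```python
from collections import Counter

# Partial list of Han ideograph blocks, chosen for illustration.
HAN_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)

def classify_char(ch):
    """Bucket a character as 'ascii', 'han', or 'other' by code point."""
    cp = ord(ch)
    if cp < 128:
        return "ascii"
    if any(lo <= cp <= hi for lo, hi in HAN_RANGES):
        return "han"
    return "other"

def guess_language(text, threshold=0.5):
    """Guess 'english', 'chinese', or 'unknown' from code point buckets."""
    counts = Counter(classify_char(ch) for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    if counts["han"] / total >= threshold:
        return "chinese"
    if counts["ascii"] / total >= threshold:
        return "english"
    return "unknown"

for sample in ("Hello, world", "你好，世界", "Price: 5 €", "😀😀"):
    print(repr(sample), guess_language(sample))
```

The emoji sample shows why a large code point alone proves nothing: those characters are above 65535 but are neither English nor Chinese.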