I need to implement a very crude language identification algorithm. In my world, there are only two languages: English and not-English. I have an ArrayList<String> and I need to determine whether each String is likely in English or in the other language, whose characters fall within a certain Unicode range. So what I want to do is check each String against this range using some type of “presence” test. If it passes the test, I say the String is not English; otherwise it’s English. I want to try two types of tests:
- TEST-ANY: If any char in the string falls within the range, the string passes the test
- TEST-ALL: If all chars in the string fall within the range, the string passes the test
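To make the two tests concrete, here is a minimal sketch using plain loops. The class name `RangeTests`, the method names, and the example range (U+0590 to U+05FF) are placeholders for illustration only; the real range would be whatever the other language uses. Treating an empty string as failing TEST-ALL is also an assumption.

```java
public final class RangeTests {
    private RangeTests() {}

    // TEST-ANY: true if at least one char lies in [lo, hi] (inclusive).
    public static boolean testAny(String s, char lo, char hi) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= lo && c <= hi) {
                return true; // stop at the first hit
            }
        }
        return false;
    }

    // TEST-ALL: true if every char lies in [lo, hi] (inclusive).
    // Assumption: an empty string does not pass.
    public static boolean testAll(String s, char lo, char hi) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < lo || c > hi) {
                return false; // stop at the first miss
            }
        }
        return true;
    }
}
```

With the placeholder range, `RangeTests.testAny("résumé", '\u0590', '\u05FF')` is false, so an accented English word would still count as English. Note that this compares UTF-16 `char` values, so it only works as written for ranges inside the Basic Multilingual Plane.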
Since the array might be very long, I need to implement this very efficiently. What would be the fastest way of doing this in Java?
Thx
UPDATE: I am specifically checking for non-English by looking at a specific range of Unicode code points rather than checking whether the characters are ASCII, in part to take care of the “resume” problem mentioned below. What I am trying to figure out is whether Java provides any classes/methods that essentially implement TEST-ANY or TEST-ALL (or another similar test) as efficiently as possible. In other words, I am trying to avoid reinventing the wheel, especially if the wheel invented before me is better anyway.
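For reference, the closest built-in "any/all" primitives I am aware of are `IntStream.anyMatch` and `IntStream.allMatch`, reachable through `String.codePoints()` on Java 8 and later. The sketch below assumes that Java version; the class name `StreamRangeTests`, the constants `LO` and `HI`, and the range itself are hypothetical. I have not measured whether this is any faster than a plain loop.

```java
import java.util.List;
import java.util.stream.Collectors;

public final class StreamRangeTests {
    // Hypothetical range bounds (inclusive), as code points.
    static final int LO = 0x0590;
    static final int HI = 0x05FF;

    private StreamRangeTests() {}

    static boolean inRange(int codePoint) {
        return codePoint >= LO && codePoint <= HI;
    }

    // TEST-ANY: short-circuits on the first code point inside the range.
    public static boolean testAny(String s) {
        return s.codePoints().anyMatch(StreamRangeTests::inRange);
    }

    // TEST-ALL: allMatch returns true for an empty stream, so the
    // empty string is rejected explicitly (an assumption about intent).
    public static boolean testAll(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(StreamRangeTests::inRange);
    }

    // Applies TEST-ANY to a whole list and keeps the strings that pass.
    public static List<String> nonEnglish(List<String> input) {
        return input.stream()
                .filter(StreamRangeTests::testAny)
                .collect(Collectors.toList());
    }
}
```

Using `codePoints()` instead of `chars()` means the same code would also handle ranges above U+FFFF, at the cost of decoding surrogate pairs.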
Here’s how I ended up implementing TEST-ANY: