I need to implement a very crude language identification algorithm. In my world, there

Question

Asked: June 5, 2026

I need to implement a very crude language identification algorithm. In my world, there are only two languages: English and not-English. I have an ArrayList of Strings, and I need to determine whether each String is likely in English or in the other language, whose Unicode characters fall in a certain range. So what I want to do is check each String against this range using some type of “presence” test. If it passes the test, I say the String is not English; otherwise it’s English. I want to try two types of tests:

TEST-ANY: If any char in the string falls within the range, the string passes the test
TEST-ALL: If all chars in the string fall within the range, the string passes the test

Since the array might be very long, I need to implement this very efficiently. What would be the fastest way of doing this in Java?

Thx

UPDATE: I am specifically checking for non-English by looking at a specific range of Unicode code points rather than checking whether the characters are ASCII, in part to take care of the “resume” problem mentioned below. What I am trying to figure out is whether Java provides any classes/methods that essentially implement TEST-ANY or TEST-ALL (or another similar test) as efficiently as possible. In other words, I am trying to avoid reinventing the wheel, especially if the wheel invented before me is better anyway.

1 Answer

Editorial Team · Answer 1 · 2026-06-05T02:05:53+00:00

Here’s how I ended up implementing TEST-ANY:

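// TEST-ANY: true if any code point of str falls within [UrangeLow, UrangeHigh]
static final int UrangeLow = 1234; // can get range from e.g. http://www.utf8-chartable.de/unicode-utf8-table.pl
static final int UrangeHigh = 2345;

static boolean isNotEnglish(String str) {
   // Advance by code point rather than by char, so a surrogate pair is read only once
   for (int iLetter = 0; iLetter < str.length(); ) {
      int cp = str.codePointAt(iLetter);
      if (cp >= UrangeLow && cp <= UrangeHigh) {
         return true; // word is NOT English
      }
      iLetter += Character.charCount(cp);
   }
   return false; // word is English
}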
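
The question also asks about TEST-ALL and about what Java already provides. As a rough sketch only, both tests can be written with String.codePoints() (Java 8+) together with anyMatch and allMatch, which stop at the first deciding code point. The class name RangeTests, the helper inRange, and the constant names are made up for illustration, and the bounds are the same placeholder values used in the TEST-ANY snippet above. Rejecting the empty string in TEST-ALL is an assumption; allMatch alone would accept it.

```java
class RangeTests {
    // Placeholder bounds taken from the TEST-ANY snippet; substitute the real range.
    static final int RANGE_LOW = 1234;
    static final int RANGE_HIGH = 2345;

    // Hypothetical helper: is this code point inside the non-English range?
    static boolean inRange(int cp) {
        return cp >= RANGE_LOW && cp <= RANGE_HIGH;
    }

    // TEST-ANY: at least one code point lies in the range.
    static boolean testAny(String s) {
        return s.codePoints().anyMatch(RangeTests::inRange);
    }

    // TEST-ALL: every code point lies in the range.
    // allMatch is true for an empty stream, so empty strings are rejected explicitly (assumption).
    static boolean testAll(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(RangeTests::inRange);
    }

    public static void main(String[] args) {
        // "\u0500" is code point 1280, which lies inside the placeholder range.
        String[] words = { "resume", "\u0500\u0501", "a\u0500" };
        for (String w : words) {
            System.out.println(testAny(w) + " " + testAll(w));
        }
    }
}
```

Whether the stream version is as fast as the hand-written loop is not something this sketch establishes; for a very long list it is worth measuring both on real data.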
