In a large data set I have some data that looks like this: guide

Question


Asked: June 17, 2026 (2026-06-17T00:26:05+00:00)

In a large data set I have some data that looks like this:

"guide (but, yeah, itâ€™s okay to share it with â€˜em)."

I’ve opened the file in a hex editor and run the raw byte data through a character encoding detection algorithm (http://code.google.com/p/juniversalchardet/) and it’s positively detected as UTF-8.

It appears to me that the source of the data misinterpreted the original character set and then wrote the result out as valid UTF-8, which is what I have received.

I’d like to validate the data as best I can. Are there any heuristics or algorithms that might help me take a stab at validation?

1 Answer

Editorial Team · Answer 1 · 2026-06-17T00:26:06+00:00

You cannot do this reliably once you have the string; you have to do it while you still have the raw input. Once the bytes have been decoded into a string, there is no way to tell automatically whether â€™ was the intended input without some seriously fragile tests. For example:

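import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

// The wrapper class name is arbitrary; it only makes the snippet compile.
public class MisinterpretationCheck {

    // Convenience overload for the most common UTF-8 misinterpretation,
    // which is also the case in your question.
    public static boolean isUTF8MisInterpreted(String input) {
        return isUTF8MisInterpreted(input, "Windows-1252");
    }

    // Returns true if the string, re-encoded in the given single-byte
    // encoding, yields bytes that decode as valid UTF-8.
    // Fragile: pure-ASCII input also returns true, because ASCII bytes
    // are valid UTF-8.
    public static boolean isUTF8MisInterpreted(String input, String encoding) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
        ByteBuffer tmp;
        try {
            // Fails if the string contains characters the encoding cannot represent.
            tmp = encoder.encode(CharBuffer.wrap(input));
        } catch (CharacterCodingException e) {
            return false;
        }

        try {
            // New decoders report malformed input by throwing.
            decoder.decode(tmp);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String test = "guide (but, yeah, itâ€™s okay to share it with â€˜em).";
        String test2 = "guide (but, yeah, it’s okay to share it with ‘em).";
        System.out.println(isUTF8MisInterpreted(test));  // true
        System.out.println(isUTF8MisInterpreted(test2)); // false
    }
}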
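The same round trip can also be used to attempt a repair rather than just a yes/no check. The following is a minimal sketch, not a definitive implementation: the class name `MojibakeRepair` and the method `repair` are illustrative names, and it assumes the text was wrongly decoded with a single known encoding such as Windows-1252. It returns an empty result when the round trip fails or changes nothing, which avoids treating plain ASCII as "repaired".

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper: class and method names are illustrative only.
class MojibakeRepair {

    // Re-encodes the string in the suspected wrong encoding and decodes the
    // resulting bytes strictly as UTF-8. Returns the decoded text only if the
    // round trip succeeds and actually differs from the input.
    static Optional<String> repair(String input, String wrongEncoding) {
        try {
            ByteBuffer bytes = Charset.forName(wrongEncoding)
                    .newEncoder()
                    .encode(CharBuffer.wrap(input));
            String decoded = StandardCharsets.UTF_8
                    .newDecoder()
                    .decode(bytes)
                    .toString();
            // Pure ASCII survives the round trip unchanged: nothing to repair.
            return decoded.equals(input) ? Optional.empty() : Optional.of(decoded);
        } catch (CharacterCodingException e) {
            // Either the string has characters outside wrongEncoding, or the
            // re-encoded bytes are not valid UTF-8.
            return Optional.empty();
        }
    }
}
```

Calling `MojibakeRepair.repair(test, "windows-1252")` on the first test string above should give back the text with proper curly quotes, while the already-correct second string yields an empty result. This is still a heuristic: a string that legitimately contains a sequence like â€™ would be "repaired" incorrectly.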

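If you still have access to the raw input, you can check whether a byte array consists entirely of valid UTF-8 byte sequences with this:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public static boolean isValidUTF8(byte[] input) {
    CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();

    try {
        // Throws on the first malformed byte sequence.
        cs.decode(ByteBuffer.wrap(input));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}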

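You can also use a CharsetDecoder with streams, for example by passing it to an InputStreamReader. By default the decoder throws an exception as soon as it sees bytes that are invalid in the given encoding, whereas a reader constructed from just a charset name silently substitutes a replacement character.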
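A rough sketch of the stream variant follows. The class name `StrictUtf8Stream` and the `InputStream` overload of `isValidUTF8` are illustrative, not part of any library. The explicit `CodingErrorAction.REPORT` calls only restate the decoder's default behavior to make the intent visible.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: class and method names are illustrative only.
class StrictUtf8Stream {

    // Reads the whole stream through a strict UTF-8 decoder and reports
    // whether every byte sequence decoded cleanly.
    static boolean isValidUTF8(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buffer = new char[8192];
            // Drain the stream; decoding errors surface as exceptions from read().
            while (reader.read(buffer) != -1) {
                // Nothing to do with the characters; only validity matters here.
            }
            return true;
        } catch (CharacterCodingException e) {
            // Malformed or unmappable input was found.
            return false;
        } catch (IOException e) {
            // A genuine I/O failure is not a validity verdict, so rethrow it.
            throw new UncheckedIOException(e);
        }
    }
}
```

Unlike the byte-array version, this does not need the whole input in memory, which may matter for a large data set.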
