In a large data set I have some data that looks like this:
"guide (but, yeah, it’s okay to share it with ‘em)."
I’ve opened the file in a hex editor and run the raw byte data through a character encoding detection algorithm (http://code.google.com/p/juniversalchardet/) and it’s positively detected as UTF-8.
It appears to me that the source of the data mis-interpreted the original character set and wrote valid UTF-8 as the output that I have received.
I’d like to validate the data to the best I can. Are there any heuristics/algorithms out there that might help me take a stab at validation?
You cannot do that once you have the string, you have to do it while you still have the raw input. Once you have the string, there is no way to automatically tell whether
’was actually intended input without some seriously fragile tests. For example:If you still have access to raw input, you can see if a byte array amounts to fully valid utf-8 byte sequences with this:
You can also use the CharsetDecoder with streams, by default it throws exception as soon as it sees invalid bytes in the given encoding.