What’s the best way to identify whether a string is (or might be) UTF-8 encoded? The Win32 API IsTextUnicode isn’t much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And, yes, I know that only characters above the ASCII range are encoded with more than 1 byte.
chardet is the character set detection library developed by Mozilla and used in Firefox; its source code is available.
jchardet is a Java port of the source of Mozilla’s automatic charset detection algorithm.
NCharDet is a .NET (C#) port of a Java port of the C++ code used in the Mozilla and Firefox browsers.
A CodeProject C# sample uses Microsoft’s MLang for character encoding detection.
UTRAC is a command-line tool and library written in C++ to detect string encoding.
cpdetector is a Java project used for encoding detection.
chsdet is a Delphi project, and is a standalone executable module for automatic charset / encoding detection of a given text or file.
Another useful post that points to a lot of libraries to help you determine character encoding: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
You could also take a look at the related question "How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?"; it has some useful content.
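If all you need is the "is or might be UTF-8" check from the question, rather than full charset detection, a strict validity test may already be enough. The sketch below is in Python and is not taken from any of the libraries above; `looks_like_utf8` and `classify` are hypothetical names chosen for illustration. It relies on the fact that text in a legacy single-byte encoding containing non-ASCII bytes rarely happens to form valid UTF-8 sequences, so a successful strict decode is strong (but not conclusive) evidence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` is structurally valid UTF-8 (hypothetical helper)."""
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def classify(data: bytes) -> str:
    """Coarse three-way guess; the labels are illustrative assumptions."""
    if not looks_like_utf8(data):
        return "not-utf8"
    if all(b < 0x80 for b in data):
        # Pure ASCII is valid UTF-8, but it tells us nothing about
        # which encoding the author actually intended.
        return "ascii"
    # Valid multi-byte sequences are present: very likely UTF-8.
    return "probably-utf8"


if __name__ == "__main__":
    print(classify(b"hello"))
    print(classify("h\u00e9llo".encode("utf-8")))
    print(classify("h\u00e9llo".encode("latin-1")))
```

Note that this only answers "valid UTF-8 or not"; for input that fails the check, you would still need one of the detection libraries listed above to guess what the encoding actually is.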