The problem I am trying to solve is converting between Unicode encodings. As I understand it, one character in UTF-8 is represented by 1 to 4 bytes, whereas a character in UTF-16 is represented by one or two 2-byte code units. This variable length makes it a pain to convert between the two and produce something sensible in the English language.
What I am looking for is a library that would let me specify a language or locale and an encoding (UTF-8, etc.) and have it produce a more sensible result. Am I dreaming in the clouds?
Is `String.getBytes(String charsetName)` not sufficient? http://download.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#getBytes(java.lang.String)
It lets you get the raw bytes of a String in a particular encoding.
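As a rough illustration of the variable lengths mentioned in the question, the sketch below encodes a few sample strings and counts the resulting bytes. It assumes Java 7 or later for `StandardCharsets`; the `getBytes(String charsetName)` overload linked above should behave the same way but throws a checked `UnsupportedEncodingException`. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class EncodingLengths {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16, big-endian.
    // UTF_16BE is used because it does not prepend a byte-order mark,
    // so the count is exactly 2 bytes per 16-bit code unit.
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+0041, U+00E9, U+20AC, and U+1F600 (a surrogate pair in Java).
        String[] samples = {"A", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println("UTF-8: " + utf8Length(s)
                    + " bytes, UTF-16: " + utf16Length(s) + " bytes");
        }
    }
}
```

The four samples take 1, 2, 3, and 4 bytes in UTF-8 and 2, 2, 2, and 4 bytes in UTF-16.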
String has a [constructor][2] that will take a byte array and charset name as well, so you can use that for decoding.
[2]: <http://download.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#String(byte%5B%5D, java.lang.String)>
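Putting the two calls together, converting between encodings comes down to decoding the bytes into a `String` and encoding that `String` again. The following is a minimal sketch, not a definitive implementation: the `transcode` helper is a hypothetical name, and it assumes Java 7 or later for `StandardCharsets` and that the input bytes are valid in the source encoding (malformed input is silently replaced by the decoder).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Transcode {
    // Hypothetical helper: decode the bytes using the source charset,
    // then re-encode the resulting String using the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        // The euro sign (U+20AC) encoded as UTF-8.
        byte[] utf8 = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};

        byte[] utf16 = transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(utf16)); // prints [32, -84], i.e. 0x20 0xAC

        byte[] back = transcode(utf16, StandardCharsets.UTF_16BE, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(utf8, back)); // prints true
    }
}
```

No locale is involved in this step: the mapping between UTF-8 and UTF-16 is defined per code point, independent of language.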