Due to some awkward legacy code, I need to pass some non-English text around as ANSI/ASCII strings that carry the text in a visibly encoded UTF-8 form. For the most part, this is working all right (I’m using URLEncoder). However, I now need to output different encoded forms in different circumstances, and I don’t know how to do that.
For example, this character can be UTF-8 encoded these ways:
大
%u5927
&#22823;
%E5%A4%A7
But nothing seems to talk about the different versions, as though there is no difference. I know URLEncoder does not do the second version, because the & is a reserved character, but the second one is what I need in some instances. How can I convert the text to the specific version I want?
Specifically, it’s being passed to a .jsp that contains a library called displaytag, which handles the data and displays a table without much developer input, but it doesn’t seem to have any options for setting the encoding. I do know that the second encoding in the above list (passed as ANSI/ASCII) displays correctly without changing the .jsp, which is the safest option for me. I just need to get the text into that form.
The first is the Unicode code point in hex, written in a URL-encoded style; the second is the same code point in decimal, written as an HTML/XML numeric entity.
I’ve never used it for your purpose, but I think StringEscapeUtils (Apache Commons Lang) escapeHtml or escapeXml should give you the second form.
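If pulling in a library is not an option, the second form can also be produced by hand. The following is only a minimal sketch: the class name NumericEntities and its two methods are made up for illustration, and it deliberately leaves ASCII alone, so it does not escape &, < or > the way a real HTML escaper would.

```java
final class NumericEntities {

    /** Replaces every non-ASCII code point with a decimal reference such as &#22823; */
    static String toDecimalEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint < 0x80) {
                out.append((char) codePoint);          // plain ASCII passes through unchanged
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);       // 2 for surrogate pairs, otherwise 1
        }
        return out.toString();
    }

    /** Same idea, but emits the hex variant such as &#x5927; */
    static String toHexEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(codePoint).toUpperCase())
                   .append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "\u5927";                       // the character from the question
        System.out.println(toDecimalEntities(sample)); // prints &#22823;
        System.out.println(toHexEntities(sample));     // prints &#x5927;
    }
}
```

Because the result is pure ASCII, it should survive being passed through the legacy code as an ANSI/ASCII string, and the browser resolves the entity when the page is rendered.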
BTW the second form also has a hex version:
&#x5927;
The third looks like a conversion by a non-UTF-8-aware function, which has separately converted the three bytes that together make up the single code point in UTF-8. In my view the third is incorrect, because you cannot tell whether it is three ASCII bytes or actually UTF-8.
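To make the byte-level view concrete, here is a rough sketch of how the first and third forms relate to the same character. URLEncoder.encode with the "UTF-8" charset name is the standard JDK call; the class name PercentForms and the percentU helper are hypothetical, and percentU only handles non-ASCII UTF-16 code units (it does not escape reserved ASCII characters).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PercentForms {

    /** Third form: each UTF-8 byte of the text is percent-encoded on its own. */
    static String percentUtf8(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** First form: each non-ASCII UTF-16 code unit is written as %uXXXX. */
    static String percentU(String text) {
        StringBuilder out = new StringBuilder();
        for (char unit : text.toCharArray()) {
            if (unit < 0x80) {
                out.append(unit);                       // ASCII left as-is in this sketch
            } else {
                out.append(String.format("%%u%04X", (int) unit));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "\u5927";
        // One code point, three bytes in UTF-8.
        System.out.println(sample.getBytes(StandardCharsets.UTF_8).length); // prints 3
        System.out.println(percentUtf8(sample));                            // prints %E5%A4%A7
        System.out.println(percentU(sample));                               // prints %u5927
    }
}
```

The receiving side has to know that the percent-encoded bytes are UTF-8 in order to reassemble them into one character, which is the ambiguity described above.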