I’m using JNI to interface between a Java program and a C++ function. The C++ function deals with multi-byte strings (CP 1252). I use this C++ code to convert the Java String to a char*:
char *arg=(char*) jEnv->GetStringUTFChars(jArg2,0);
This works fine unless I have some high-order characters. For example, if my input is:
Àlan (UTF: c2 6c 61 6e 20 4a 6f 6e 65 7e)
I can see that the resultant arg is:
c3 82 6c 61 6e
But, I would expect to see:
c0 6c 61 6e
Seeing that GetStringUTFChars() is supposed to return UTF strings, I tried obtaining the Unicode string with GetStringChars() and converting it via WideCharToMultiByte():
const jchar *str=jEnv->GetStringChars(jArg2,0);
WideCharToMultiByte(CP_UTF8,0,(LPCWSTR) str,jEnv->GetStringLength(jArg2),str,szStr,0,0);
(you can assume that I’ve allocated str and set szStr properly). In this situation, I see this in the resultant str:
c3 82 6c 61 6e
I’ve tried other CP_ values for the first parameter to WideCharToMultiByte; none yield useful results (they either return the above or substitute a ‘?’ for the ‘À’).
I would expect that somehow I could get this resultant str:
c0 6c 61 6e
But so far, I’ve had no luck.
Java uses a modified version of UTF-8. Here is a quote from Java’s documentation:
The byte sequence c2 6c 61 6e 20 4a 6f 6e 65 7e is not valid under standard UTF-8. In cp1252, that same byte sequence would be the string Âlan Jone~ (notice Â instead of À).
Under standard UTF-8, the string Àlan Jone~ would be the byte sequence c3 80 6c 61 6e 20 4a 6f 6e 65 7e (notice c3 80 6c instead of c2 6c).
All Java strings are natively UTF-16, so you don’t need to retrieve the string as UTF-8. Use GetStringChars() to get the original UTF-16 encoded characters and pass them as-is to WideCharToMultiByte(), specifying 1252 as the codepage (note, in your example you are using str for both the UTF-16 input buffer and the cp1252 output buffer – don’t get your variables confused!), e.g.:
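What follows is a minimal sketch of that conversion, not the original poster's code. The helper name Utf16ToCp1252 is hypothetical. It calls WideCharToMultiByte() twice, once to size the output and once to convert into a separate buffer. The non-Windows branch is only a rough stand-in (an assumption added so the sketch compiles elsewhere) and handles just the code points that cp1252 shares with Latin-1.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: converts `len` UTF-16 code units to a cp1252 byte string.
std::string Utf16ToCp1252(const char16_t *src, int len) {
    std::string out;
    if (src == nullptr || len <= 0) return out;
#ifdef _WIN32
    // On Windows, wchar_t is a 16-bit UTF-16 code unit, like jchar.
    const wchar_t *wide = reinterpret_cast<const wchar_t *>(src);
    // First call: ask how many bytes the cp1252 output needs.
    int needed = WideCharToMultiByte(1252, 0, wide, len, nullptr, 0, nullptr, nullptr);
    if (needed <= 0) return out;
    out.resize(needed);
    // Second call: convert into the output buffer, which is distinct from the input.
    WideCharToMultiByte(1252, 0, wide, len, &out[0], needed, nullptr, nullptr);
#else
    // Stand-in for non-Windows builds (assumption, illustration only):
    // ASCII and U+00A0..U+00FF have the same byte value in cp1252;
    // everything else becomes '?'.
    for (int i = 0; i < len; ++i) {
        const char16_t c = src[i];
        const bool direct = c < 0x80 || (c >= 0xA0 && c <= 0xFF);
        out.push_back(direct ? static_cast<char>(c) : '?');
    }
#endif
    return out;
}
```

From JNI, the input would come from jEnv->GetStringChars(jArg2, 0) and jEnv->GetStringLength(jArg2), with the jchar pointer cast to const char16_t* (jchar is a 16-bit unsigned type). Call jEnv->ReleaseStringChars(jArg2, chars) once the conversion is done.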