I have a C++ code snippet that uses MultiByteToWideChar to convert UTF-8 string to UTF-16
For C++, if input is “Hôtel”, the output is “Hôtel” which is correct
For C#, if input is “Hôtel”, the output is “Hôtel” which is not correct.
The C# code to convert from UTF8 to UTF16 looks like
Encoding.Unicode.GetString(
Encoding.Convert(
Encoding.UTF8,
Encoding.Unicode,
Encoding.UTF8.GetBytes(utf8)));
In C++ the conversion code looks like
MultiByteToWideChar(
CP_UTF8, // convert from UTF-8
0, // default flags
utf8.data(), // source UTF-8 string
utf8.length(), // length (in chars) of source UTF-8 string
&utf16[0], // destination buffer
utf16.length() // size of destination buffer, in wchar_t's
)
I want to have the same results in C# that I am getting in C++. Is there anything wrong with the C# code ?
It appears you want to treat string characters as Windows-1252 (Often mislabeled as ANSI) code points, and have those code points decoded as UTF-8 bytes, where
Windows-1252 code point == UTF-8 byte value.The reason the accepted answer doesn’t work is that it treats the string characters as unicode code points, rather than
Windows-1252. It can get away with most characters because Windows-1252 maps them exactly the same as unicode, but input with characters
like
–,€,™,‘,’,”,•etc.. will fail because Windows-1252 maps those differently than unicode in this sense.So what you want is simply this:
Then: