While developing a program the other day, I had to convert an ASCII string into a Unicode string. I’m working on Windows with Visual Studio 2012, by the way. I noticed some strange behaviour with the Win32 function MultiByteToWideChar which I couldn’t sort out. I wrote some test code, below:
int main()
{
    /* Create const test string */
    char str[] = "test string";

    /* Create empty wchar_t buffer to hold Unicode form of above string, and initialize (zero) it */
    wchar_t *buffer = (wchar_t*) LocalAlloc(LMEM_ZEROINIT, sizeof(wchar_t) * strlen(str));

    /* Convert str to Unicode and store in buffer */
    int result = MultiByteToWideChar(CP_UTF8, NULL, str, strlen(str), buffer, strlen(str));
    if (result == 0)
        printf("GetLastError result: %d\n", GetLastError());

    /* Print MultiByteToWideChar result, str's length, and buffer's length */
    printf_s(
        "MultiByteToWideChar result: %d\n"
        "'str' length: %d\n"
        "'buffer' length: %d\n",
        result, strlen(str), wcslen(buffer));

    /* Create a message box to display the Unicode string */
    MessageBoxW(NULL, buffer, L"'buffer' contents", MB_OK);

    /* Also write buffer to file, raw */
    FILE *stream = NULL;
    fopen_s(&stream, "c:\\test.dat", "wb");
    fwrite(buffer, sizeof(wchar_t), wcslen(buffer), stream);
    fclose(stream);

    return 0;
}
As you can see, it simply takes an ordinary character string, creates a buffer to store the Unicode string in, puts that converted Unicode string into the buffer, and shows me some results, also writing the buffer to a file.
The output:
MultiByteToWideChar result: 11
'str' length: 11
'buffer' length: 16
Already weird. The function is processing the correct number of characters in the C string, but wcslen is reporting the output buffer to be longer than the C string! I’m pretty sure I allocated the buffer correctly, too.
I’ve tried using different sized string lengths, but there’s always junk at the end, and wcslen always reports the buffer’s length to be a multiple of 4.
Finally, for this particular string ("test string"), here’s the raw buffer that was printed to file:
74 00 65 00 73 00 74 00 20 00 73 00 74 00 72 00 t.e.s.t. .s.t.r.
69 00 6E 00 67 00 AB AB AB AB AB AB AB AB EE FE i.n.g...........
(That’s 32 bytes, or 16 Unicode characters.)
The 10 bytes at the end are five characters; four U+ABAB, and one U+FEEE, which are meaningless to me.
In different amounts, they occur every time I try converting a string.
I’m kinda out of ideas. Anyone?
Thanks in advance!
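The buffer allocation is really where the problem started. strlen(str) counts bytes, which says little about how many wide characters the conversion produces, especially when the input string is encoded in UTF-8. You tend to get away with it by accident because it usually creates a buffer that’s too long, not counting the off-by-one bug: there is no room for the terminating zero, and because you passed an explicit length, MultiByteToWideChar doesn’t write one anyway. wcslen then runs past the end of the buffer until it happens to hit a zero, which is the junk you’re seeing.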
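To make the bytes-versus-characters point concrete, here is a small portable sketch. The helper name utf16_units_for_utf8 is made up for illustration, and it assumes the input is valid UTF-8; it is not a substitute for asking MultiByteToWideChar itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of UTF-16 code units (wchar_t on Windows)
   needed for a valid, zero-terminated UTF-8 string, not counting the
   terminator. Assumes the input is well-formed UTF-8. */
size_t utf16_units_for_utf8(const char *s)
{
    size_t units = 0;
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if ((c & 0xC0) == 0x80)
            continue;                 /* continuation byte, already counted */
        units += (c >= 0xF0) ? 2 : 1; /* 4-byte sequences need a surrogate pair */
    }
    return units;
}
```

For plain ASCII such as "test string" both strlen and this helper give 11. For "caf\xC3\xA9" (an é encoded as two bytes) strlen gives 5 but only 4 wide characters are needed. So strlen is never too small for the characters themselves, but it never accounts for the terminator either.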
But you would also have easily avoided that bug by doing it the Right Way. You must call the function twice. The first time, pass 0 for the last argument (cchWideChar). The function then returns the required size of the buffer (in characters, not bytes). If you also pass -1 for cbMultiByte, so the function processes the whole zero-terminated string, that size includes the terminating zero. That is enough to allocate the buffer and to pass the correct value the second time you call the function.