Sometimes manipulating character strings at the character level is unavoidable.
Here I have a function written for ANSI/ASCII based character strings that replaces CR/LF sequences with LF only, and also replaces CR with LF. We use this because incoming text files often have goofy line endings due to various text or email programs that have made a mess of them, and I need them to be in a consistent format to make parsing / processing / output work properly down the road.
Here’s a fairly efficient implementation of this compression from various line-endings to LF only, for single byte per character implementations:
// returns the in-place conversion of a Mac or PC style string to a Unix style string (i.e. no CR/LF or CR only, but rather LF only)
char * AnsiToUnix(char * pszAnsi, size_t cchBuffer)
{
size_t i, j;
for (i = 0, j = 0; pszAnsi[i]; ++i, ++j)
{
// bounds checking
ASSERT(i < cchBuffer);
ASSERT(j <= i);
switch (pszAnsi[i])
{
case '\n':
if (pszAnsi[i + 1] == '\r')
++i;
break;
case '\r':
if (pszAnsi[i + 1] == '\n')
++i;
pszAnsi[j] = '\n';
break;
default:
if (j != i)
pszAnsi[j] = pszAnsi[i];
}
}
// append null terminator if we changed the length of the string buffer
if (j != i)
pszAnsi[j] = '\0';
// bounds checking
ASSERT(pszAnsi[j] == 0);
return pszAnsi;
}
I’m trying to transform this into something that will work correctly with multibyte/unicode strings, where the size of the next character can be multible bytes wide.
So:
- I need to look at a character only at a valid character-point (not in the middle of a character)
- I need to copy over the portion of the character that is part of the rejected piece properly (i.e. copy whole characters, not just bytes)
I understand that _mbsinc() will give me the address of the next start of a real character. But what is the equivalent for Unicode (UTF16), and are there already primitives to be able to copy a full character (e.g. length_character(wsz))?
One of the beautiful things about UTF-8 is that if you only care about the ASCII subset, your code doesn’t need to change at all. The non-ASCII characters get encoded to multi-byte sequences where all of the bytes have the upper bit set, keeping them out of the ASCII range themselves. Your CR/LF replacement should work without modification.
UTF-16 has the same property. Characters that can be encoded as a single 16-bit entity will never conflict with the characters that require multiple entities.