I am trying to convert NSStrings to byte arrays and then back to NSStrings. I have tried both NSUnicodeEncoding and NSUTF8StringEncoding. My question is why I see different data as I iterate over the byte arrays.
The only change between the two versions of this code is that I replace NSUTF8StringEncoding with NSUnicodeEncoding and add dataLength += 2 to account for the BOM.
NSString *message = @"testing";
NSUInteger dataLength = [message lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
void *byteData = malloc( dataLength );
// Out-parameters for getBytes:... (declarations were missing from the snippet)
NSUInteger actualLength = 0;
NSRange remain;
NSRange range = NSMakeRange(0, [message length]);
BOOL result = [message getBytes:byteData maxLength:dataLength usedLength:&actualLength encoding:NSUTF8StringEncoding options:0 range:range remainingRange:&remain];
for( NSUInteger x = 0; x < dataLength; x++ )
{
    // %s logs everything from the current pointer onward as a C string
    NSLog( @"byte data: %s", (char *)byteData);
    int t = (int)*(char *)byteData;
    byteData++;
}
The difference is in the NSLog output.
With NSUTF8StringEncoding I see:
- testing`
- esting`
- sting`
- ting`
- …
With NSUnicodeEncoding I see:
- null
- t
- null
- e
- …
The int t value is correct for the given character, but I don’t understand why the byteData is so different. I would expect them both to act like the NSUnicodeEncoding.
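In UTF-8, the letter F is represented by a single byte, 0x46 (the ASCII code for F). The string “FU” is represented by an ASCII F byte followed by an ASCII U byte. In Unicode as used here (NSUnicodeEncoding is UTF-16), each of these characters occupies two bytes. A standard ASCII character is stored as its ASCII byte paired with a zero byte, which comes first in big-endian order and second in little-endian order.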
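The behavior you see is exactly what should be expected. In UTF-8, standard ASCII characters occupy one byte; in your Unicode encoding, they occupy two, so the byte arrays certainly won’t be the same. The %s format prints a C string that starts at the pointer and stops at the first zero byte, which is why each step prints the rest of the string in UTF-8 but at most one character in the two-byte encoding.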