I’m stuck on stoopid today as I can’t convert a simple piece of ObjC code to its Cpp equivalent. I have this:
const UInt8 *myBuffer = [(NSString*)aRequest UTF8String];
And I’m trying to replace it with this:
const UInt8 *myBuffer = (const UInt8 *)CFStringGetCStringPtr(aRequest, kCFStringEncodingUTF8);
This is all in a tight unit test that writes an example HTTP request over a socket with CFNetwork APIs. I have working ObjC code that I’m trying to port to C++. I’m gradually replacing NS API calls with their toll-free bridged equivalents. Everything has been one for one so far until this last line. This is the last piece that needs to be completed.
This is one of those things where Cocoa does all the messy stuff behind the scenes, and you never really appreciate just how complicated things can be until you have to roll up your sleeves and do it yourself.
The simple answer for why it’s not ‘simple’ is that NSString (and CFString) deal with all the complicated details of handling multiple character sets, Unicode, etc., while presenting a simple, uniform API for manipulating strings. It’s object orientation at its best: the details of ‘how’ (NS|CF)String deals with strings that have different string encodings (UTF8, MacRoman, UTF16, ISO 2022 Japanese, etc.) are a private implementation detail. It all ‘just works’.
It helps to understand how [@"..." UTF8String] works. This is a private implementation detail, so this isn’t gospel, but it is based on observed behavior. When you send a string a UTF8String message, the string does something approximating (not actually tested, so consider it pseudo-code, and there are actually simpler ways to do the exact same thing, so this is overly verbose):
You don’t have to worry about the memory management issues of dealing with the buffer that -UTF8String returns because the NSMutableData is autoreleased.
A string object is free to keep the contents of the string in whatever form it wants, so there’s no guarantee that its internal representation is the one that would be most convenient for your needs (in this case, UTF8). If you’re using just plain C, you’re going to have to deal with managing some memory to hold any string conversions that might be required. What was once a simple -UTF8String method call is now much, much more complicated.
Most of NSString is actually implemented in/with CoreFoundation / CFString, so there’s obviously a path from a CFStringRef -> -UTF8String. It’s just not as neat and simple as NSString’s -UTF8String. Most of the complication is with memory management. Here’s how I’ve tackled it in the past:
NOTE: I haven’t tested this code, but it is modified from working code. So, aside from obvious errors, I believe it should work.
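As a portable illustration of this borrow-or-allocate pattern, and explicitly not the CoreFoundation code referred to above, here is a minimal sketch in plain C. Everything in it is a hypothetical stand-in invented for the example: ToyString, toy_get_utf8_ptr(), toy_get_utf8() and send_as_utf8() are not real APIs, and the conversion only handles 16-bit code units without surrogate pairs. It just shows the memory management shape: use the internal buffer when it is already usable as UTF8, otherwise malloc() a worst-case buffer, convert into it, and free it afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string object that keeps its contents either
   as 8-bit ASCII or as 16-bit code units. NOT a CoreFoundation type. */
typedef struct {
    const char     *bytes8;   /* NUL-terminated ASCII, or NULL */
    const uint16_t *units16;  /* 16-bit code units, or NULL */
    size_t          length;   /* number of characters / code units */
} ToyString;

/* Fast path: hand back the internal buffer if it is already usable as UTF8,
   otherwise NULL. */
static const char *toy_get_utf8_ptr(const ToyString *s) {
    return s->bytes8;
}

/* Slow path: transcode the 16-bit storage into a caller-supplied buffer.
   Assumption: no surrogate pairs, so at most 3 bytes per code unit.
   Returns 1 on success, 0 on failure. */
static int toy_get_utf8(const ToyString *s, char *buf, size_t bufSize) {
    size_t o = 0;
    if (s->units16 == NULL || bufSize == 0) return 0;
    for (size_t i = 0; i < s->length; i++) {
        uint16_t u = s->units16[i];
        if (o + 4 > bufSize) return 0;  /* need up to 3 bytes plus the NUL */
        if (u < 0x80) {
            buf[o++] = (char)u;
        } else if (u < 0x800) {
            buf[o++] = (char)(0xC0 | (u >> 6));
            buf[o++] = (char)(0x80 | (u & 0x3F));
        } else {
            buf[o++] = (char)(0xE0 | (u >> 12));
            buf[o++] = (char)(0x80 | ((u >> 6) & 0x3F));
            buf[o++] = (char)(0x80 | (u & 0x3F));
        }
    }
    buf[o] = '\0';
    return 1;
}

/* Copies the UTF8 form of s into sink (standing in for a socket write).
   Returns the number of bytes copied, or -1 on error. */
static long send_as_utf8(const ToyString *s, char *sink, size_t sinkSize) {
    assert(s != NULL && sink != NULL);
    char *allocated = NULL;
    const char *utf8 = toy_get_utf8_ptr(s);      /* best case: no copy */
    if (utf8 == NULL) {
        size_t maxSize = s->length * 3 + 1;      /* worst case plus NUL */
        allocated = malloc(maxSize);
        if (allocated == NULL) return -1;
        if (!toy_get_utf8(s, allocated, maxSize)) {
            free(allocated);
            return -1;
        }
        utf8 = allocated;
    }
    long result = -1;
    size_t n = strlen(utf8);
    if (n < sinkSize) {
        memcpy(sink, utf8, n + 1);
        result = (long)n;
    }
    if (allocated != NULL) free(allocated);      /* only free what we allocated */
    return result;
}
```

In CoreFoundation terms, the fast path corresponds to CFStringGetCStringPtr(), and the slow path would typically call CFStringGetCString() with a buffer sized via CFStringGetMaximumSizeForEncoding().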
This approach first tries to get the pointer to the buffer that CFString uses to store the contents of the string. If CFString happens to have the string contents encoded in UTF8 (or a suitably compatible encoding, such as ASCII), then it’s likely CFStringGetCStringPtr() will return non-NULL. This is obviously the best, and fastest, case. If it can’t get that pointer for some reason, say if CFString has its contents encoded in UTF16, then it allocates a buffer with malloc() that is large enough to contain the entire string when it is transcoded to UTF8. Then, at the end of the function, it checks to see if memory was allocated and free()s it if necessary.
And now for a few tips and tricks…
CFString ‘tends to’ (and this is a private implementation detail, so it can and does change between releases) keep ‘simple’ strings encoded as MacRoman, which is an 8-bit wide encoding. MacRoman, like UTF8, is a superset of ASCII, such that all characters < 128 are equivalent to their ASCII counterparts (or, in other words, any character < 128 is ASCII). In MacRoman, characters >= 128 are ‘special’ characters. They all have Unicode equivalents, and tend to be things like extra currency symbols and ‘extended western’ characters. See the Wikipedia article on MacRoman for more info. But just because a CFString says it’s MacRoman (CFString encoding value of kCFStringEncodingMacRoman, NSString encoding value of NSMacOSRomanStringEncoding) doesn’t mean that it has characters >= 128 in it. If a kCFStringEncodingMacRoman encoded string returned by CFStringGetCStringPtr() is composed entirely of characters < 128, then it is exactly equivalent to its ASCII (kCFStringEncodingASCII) encoded representation, which is also exactly equivalent to the string’s UTF8 (kCFStringEncodingUTF8) encoded representation.
Depending on your requirements, you may be able to ‘get by’ using kCFStringEncodingMacRoman instead of kCFStringEncodingUTF8 when calling CFStringGetCStringPtr(). Things ‘may’ (probably) be faster if, even though you require strict UTF8 encoding for your strings, you ask for kCFStringEncodingMacRoman and then check that the string returned by CFStringGetCStringPtr(string, kCFStringEncodingMacRoman) only contains characters that are < 128. If there are characters >= 128 in the string, then go the slow route by malloc()ing a buffer to hold the converted results. Example:
Like I said, you don’t really appreciate just how much work Cocoa does for you automatically until you have to do it all yourself. 🙂
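The ‘all characters < 128’ check described above can be sketched in plain C as follows. The helper name is_all_ascii() is made up for this illustration and is not part of CoreFoundation; it only assumes a NUL-terminated 8-bit string such as the one CFStringGetCStringPtr() hands back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when every byte of the NUL-terminated
   string is < 128. Such a string reads the same whether it is interpreted
   as MacRoman, ASCII, or UTF8. */
static bool is_all_ascii(const char *s) {
    assert(s != NULL);
    /* Walk the bytes as unsigned so values >= 128 are not seen as negative. */
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if (*p >= 128) {
            return false;  /* a 'special' character: take the slow route */
        }
    }
    return true;
}
```

If the pointer returned for kCFStringEncodingMacRoman is non-NULL and is_all_ascii() reports true, the bytes can be used directly as UTF8; otherwise fall back to converting into a malloc()ed buffer.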