I’m using CFStringTokenizer to break a large amount of text into words, but I’m having difficulty bridging between whatever encoding CFString uses and UTF-8. Consider this:
NSString *theString = @"Lorem ipsum dolor sit amet!";
const char *theCString = [theString cStringUsingEncoding:NSUTF8StringEncoding];
tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                    (__bridge CFStringRef)theString,
                                    CFRangeMake(0, [theString length]),
                                    kCFStringTokenizerUnitWordBoundary,
                                    locale);
while ((tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)) != kCFStringTokenizerTokenNone) {
    tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
    memcpy(resultPtr, theCString + tokenRange.location, tokenRange.length);
}
Unfortunately, once any non-ASCII characters have been encountered, the range reported by the tokenizer no longer lines up with the C string. How can I get the correct range from the tokenizer so that I can pull the correct chars from my C string?
To clarify, the memcpy logic is a bit more complex than shown above, and it is necessary for performance on my target device, the iPhone. So I can’t do anything like create a CFString substring and convert that; I need the range in the C string. Is there any way to do that without reimplementing word-boundary detection for every locale I need to support? (That is as many as possible, so unfortunately I can’t just scan for ' '.)
Alec
NSStrings and CFStrings deal in UTF-16, not UTF-8, but that isn’t the real problem.
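Your code has two problems:
1. You take ranges that the tokenizer reports relative to the NSString/CFString (UTF-16) and use them to index into a UTF-8 C string.
2. You convert the entire string to a C string up front.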
#1 is the cause of the range mismatches, and #2 causes potentially high memory usage, depending on the length and content of the string. (UTF-8 can take as many as four bytes per character in some scripts, plus one more byte for the C string terminator.)
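To make the mismatch concrete, here is a minimal sketch that compares UTF-16 positions with UTF-8 byte positions for the same word. The sample string and variable names are made up for illustration; only the Foundation calls are real API.

```objc
#import <Foundation/Foundation.h>
#include <stdio.h>

int main(void) {
    @autoreleasepool {
        // Hypothetical sample text; \u00e9 is a precomposed "é",
        // which is one UTF-16 unit but two UTF-8 bytes.
        NSString *s = @"caf\u00e9 au lait";

        // NSString/CFString ranges count UTF-16 code units.
        NSRange word = [s rangeOfString:@"au"];
        NSUInteger utf16Units = [s length];

        // The same text measured in UTF-8 bytes.
        NSUInteger utf8Bytes = [s lengthOfBytesUsingEncoding:NSUTF8StringEncoding];

        // Where "au" really starts in the UTF-8 buffer: the byte length
        // of everything that precedes it.
        NSUInteger byteOffset = [[s substringToIndex:word.location]
                                    lengthOfBytesUsingEncoding:NSUTF8StringEncoding];

        printf("UTF-16 units: %lu, UTF-8 bytes: %lu\n",
               (unsigned long)utf16Units, (unsigned long)utf8Bytes);
        printf("\"au\" starts at UTF-16 index %lu but UTF-8 byte offset %lu\n",
               (unsigned long)word.location, (unsigned long)byteOffset);
    }
    return 0;
}
```

Every non-ASCII character before a token pushes its byte offset further from its UTF-16 index, so a memcpy that uses the tokenizer's range directly reads from the wrong place.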
You can solve both of these problems in a single change.
Create an NSMutableData to hold the output. For each token, set the data’s length so that it can hold the token’s bytes (the range’s length counts UTF-16 units, and each unit can need up to three bytes in UTF-8); then tell the string to get bytes within the desired range in the desired encoding and store them in the data’s mutableBytes buffer. NSString has a method with a very long selector (briefly, getBytes:::::::; in full, getBytes:maxLength:usedLength:encoding:options:range:remainingRange:) that you will want to use for this. Since you use the range that is relative to the string exclusively with the string, there is no index/range mismatch, and each token will be output correctly.
If you really need a C string, you can make the data one byte longer, then set the byte just after the token’s bytes to '\0' with a separate assignment after getting the token bytes. (Without the separate assignment, the byte may hold a previous value.)
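The approach described above might look roughly like the following. This is a sketch, not a drop-in implementation: the sample string, the variable names, and the printf standing in for the real memcpy work are all assumptions. The buffer is sized for the worst case of three UTF-8 bytes per UTF-16 unit, plus one byte for the terminator.

```objc
#import <Foundation/Foundation.h>
#include <stdio.h>

int main(void) {
    @autoreleasepool {
        NSString *theString = @"Cr\u00e8me br\u00fbl\u00e9e for everyone!";  // sample text (assumption)
        CFStringRef cfString = (__bridge CFStringRef)theString;
        CFLocaleRef locale = CFLocaleCopyCurrent();

        CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(
            kCFAllocatorDefault, cfString,
            CFRangeMake(0, CFStringGetLength(cfString)),
            kCFStringTokenizerUnitWordBoundary, locale);

        // One buffer, reused for every token, instead of a C string
        // for the whole text.
        NSMutableData *tokenData = [NSMutableData data];

        while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
            CFRange cfRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
            NSRange range = NSMakeRange((NSUInteger)cfRange.location,
                                        (NSUInteger)cfRange.length);

            // Worst case: 3 UTF-8 bytes per UTF-16 unit, +1 for '\0'.
            NSUInteger maxBytes = range.length * 3;
            [tokenData setLength:maxBytes + 1];
            char *buffer = [tokenData mutableBytes];

            // The UTF-16 range is used only with the string it came from;
            // the string reports how many UTF-8 bytes it produced.
            NSUInteger usedBytes = 0;
            BOOL ok = [theString getBytes:buffer
                                maxLength:maxBytes
                               usedLength:&usedBytes
                                 encoding:NSUTF8StringEncoding
                                  options:0
                                    range:range
                           remainingRange:NULL];
            if (!ok) {
                continue;  // nothing was converted for this token
            }

            // Terminate explicitly; the byte may hold an earlier token's data.
            buffer[usedBytes] = '\0';
            printf("[%s]\n", buffer);  // placeholder for the real per-token work
        }

        CFRelease(tokenizer);
        CFRelease(locale);
    }
    return 0;
}
```

With kCFStringTokenizerUnitWordBoundary the tokenizer also reports the segments between words, such as spaces and punctuation, so real code would likely filter those out before copying.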