I’m using CFStringTokenizer to break a load of text into words, but I’m having

Question

Asked: May 27, 2026

I’m using CFStringTokenizer to break a load of text into words, but I’m having difficulty bridging whatever encoding CFString is using and UTF8. Consider this:

NSString *theString = @"Lorem ipsum dolor sit amet!";

const char *theCString = [theString cStringUsingEncoding:NSUTF8StringEncoding];

tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault, 
                                    (__bridge CFStringRef)theString, 
                                    CFRangeMake(0, [theString length]), 
                                    kCFStringTokenizerUnitWordBoundary, 
                                    locale);

while ((tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)) != kCFStringTokenizerTokenNone) {
    tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
    memcpy(resultPtr, theCString+tokenRange.location, tokenRange.length);
}

Unfortunately, the range reported by the tokenizer is wrong for reading from the C string once any non-ASCII characters have been encountered. How can I get the correct range from the tokenizer so that I can pull the right bytes out of my C string?

To clarify, the memcpy stuff is a tad more complex than above, and is necessary for performance on my target device, the iPhone. So I can’t even do anything like create a CFString substring and convert that, I need the range in the C string. Is there any way to do that without reimplementing various word boundary libraries to get it working for the various different locales I need it to work with? (which is as many as possible, so I can’t just iterate through looking for ‘ ‘ unfortunately..)

Alec


1 Answer

Editorial Team · Answer 1 · 2026-05-27T20:12:32+00:00

NSStrings and CFStrings deal in UTF-16, not UTF-8, but that isn’t the real problem.

Your code has two problems:

1. You’re assuming that the C string’s indexes correspond to the source string’s indexes.
2. You’re copying and converting the entire string to a UTF-8 C string at once.

#1 is the cause of the range mismatches, and #2 causes potentially high memory usage, depending on the length and content of the string. (UTF-8 takes up to three bytes for a character in the Basic Multilingual Plane and four for a character outside it, plus one more byte for the C string terminator.)
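To make problem #1 concrete, here is a small sketch of how the two index spaces drift apart. The helper name UTF8OffsetForIndex and the sample string are made up for illustration; the helper allocates a substring, so it is only a demonstration, not something to use in a hot loop.

```objc
#import <Foundation/Foundation.h>

// Illustrative helper (not part of any framework): the UTF-8 byte offset
// that corresponds to a UTF-16 index in `string`.
static NSUInteger UTF8OffsetForIndex(NSString *string, NSUInteger index) {
    return [[string substringToIndex:index]
            lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
}

static void DemonstrateMismatch(void) {
    NSString *s = @"na\u00EFve caf\u00E9";          // "naïve café"
    NSRange word = [s rangeOfString:@"caf\u00E9"];  // range in UTF-16 units

    // UTF-16 view: location 6, length 4
    NSLog(@"UTF-16 range: %lu, %lu",
          (unsigned long)word.location, (unsigned long)word.length);

    // UTF-8 view: the same word starts at byte 7 and is 5 bytes long,
    // because each of the two accented letters takes two bytes.
    NSLog(@"UTF-8 offset: %lu",
          (unsigned long)UTF8OffsetForIndex(s, word.location));
    NSLog(@"UTF-8 length: %lu",
          (unsigned long)[[s substringWithRange:word]
                          lengthOfBytesUsingEncoding:NSUTF8StringEncoding]);
}
```

Applying the UTF-16 range directly to the UTF-8 buffer would therefore start one byte early and stop two bytes short of the end of the word.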

You can solve both of these problems in a single change.

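Create an NSMutableData to hold the output. For each token, set the data’s length to the largest number of bytes the token could need in the target encoding (the range’s length counts UTF-16 code units, not UTF-8 bytes, so on its own it can be too small for non-ASCII text); then tell the string to get the bytes within the token’s range in the desired encoding and store them in the data’s mutableBytes buffer. The NSString method for this has a very long selector, getBytes:maxLength:usedLength:encoding:options:range:remainingRange:, and it also reports how many bytes it actually wrote.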

Since you use the range that is relative to the string exclusively with the string, there is no index/range mismatch, and each token will be output correctly.

If you really need a C string, make the buffer one byte longer than the token’s bytes require, then set the byte just after the last byte written to '\0' with a separate assignment after getting the token bytes. (Without the separate assignment, that byte may hold a previous value.)
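A minimal sketch of that approach, assuming ARC and the current locale. The names EnumerateUTF8Tokens and TokenHandler are invented for this example, and the factor of 3 is the worst-case number of UTF-8 bytes per UTF-16 code unit. One buffer is reused for every token, so nothing is allocated per token beyond any growth of that buffer.

```objc
#import <Foundation/Foundation.h>

// Hypothetical callback type: receives one NUL-terminated UTF-8 token.
typedef void (^TokenHandler)(const char *utf8, NSUInteger byteLength);

// Hypothetical helper: tokenizes `text` and hands each token to `handler`
// as UTF-8, converting only the token's own range each time.
static void EnumerateUTF8Tokens(NSString *text, TokenHandler handler) {
    CFStringRef cfText = (__bridge CFStringRef)text;
    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault, cfText,
                                CFRangeMake(0, CFStringGetLength(cfText)),
                                kCFStringTokenizerUnitWordBoundary, locale);
    if (tokenizer == NULL) {
        CFRelease(locale);
        return;
    }

    NSMutableData *buffer = [NSMutableData data];
    while (CFStringTokenizerAdvanceToNextToken(tokenizer)
           != kCFStringTokenizerTokenNone) {
        CFRange cfRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        NSRange range = NSMakeRange((NSUInteger)cfRange.location,
                                    (NSUInteger)cfRange.length);

        // Worst case for UTF-8: 3 bytes per UTF-16 unit, plus 1 for '\0'.
        NSUInteger capacity = range.length * 3;
        [buffer setLength:capacity + 1];

        NSUInteger used = 0;
        BOOL ok = [text getBytes:[buffer mutableBytes]
                       maxLength:capacity
                      usedLength:&used
                        encoding:NSUTF8StringEncoding
                         options:0
                           range:range   // the tokenizer's range, used as-is
                  remainingRange:NULL];
        if (!ok) {
            continue;  // skip tokens that cannot be converted
        }

        // Terminate explicitly; the byte may hold data from an earlier token.
        char *bytes = (char *)[buffer mutableBytes];
        bytes[used] = '\0';
        handler(bytes, used);
    }

    CFRelease(tokenizer);
    CFRelease(locale);
}
```

The pointer passed to the handler is only valid during that call, because the buffer is overwritten by the next token. With kCFStringTokenizerUnitWordBoundary the tokenizer may also report the segments between words, so the handler should expect whitespace and punctuation tokens as well.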
