I am pulling data from a website via NSURLConnection and stashing the received data away in an instance of NSMutableData. In the connectionDidFinishLoading delegate method the data is convert into a string with a call to NSString’s appropriate method:
NSString *result = [[NSString alloc] initWithData:data
encoding:NSUTF8StringEncoding]
The resulting string turns out to be a null. If I use the NSASCIIStringEncoding, however, I do obtain the appropriate string, albeit with unicode characters garbled up as expected. The server’s Content-Type header does not specify the UTF-8 encoding, but I have attempted a number of different websites with a similar scenario, and there string conversion happens just fine. It seems like the problem only pertains to the given web service but I have no clue why.
On a side note, is pulling web pages and data from an API good practice, i.e. buffering the data, converting into a string, and manipulating the string afterwards?
Much appreciated!
You say that it “is definitely UTF-8”, but without a Content-Type header, you don’t really know that. (And even if you did have a header saying that, it could still be wrong.)
My guess is that your data is usually ASCII, which always parses correctly as UTF-8, but you sometimes are trying to parse data that’s actually encoded in ISO 8859-1 or Windows codepage 1252. Such data will generally be mostly ASCII, but with some bytes outside the 0–127 range ASCII defines. UTF-8 would expect such bytes to form a sequence of code units within a specified sequence of ranges, but in other encodings, any byte, regardless of value, is a complete character on its own. Trying to interpret non-ASCII non-UTF-8 data as UTF-8 will almost always get you either wrong results (wrong characters) or no results at all (cannot decode; decoder returns
nil), because the data was never encoded in UTF-8 in the first place.You should try UTF-8 first, and if it fails, use ISO 8859-1. If you’re letting the user retrieve any web page, you should let them change the encoding you use to decode the data, in case they discover that it was actually 8859-9 or codepage-1252 or some other 8-bit encoding.
If you’re downloading the data from a specific server, and especially if you have influence on what runs on that server, you should make it serve up an accurate Content-Type header and/or fix whatever bug is causing it to serve up text that isn’t in UTF-8.