I’m trying to parse a response from the Google Weather API, but I get a "not well-formed" error. As far as I can tell, the response is well formed.
Here’s the relevant code:
f = urllib.urlopen(WEATHERPATH + sys.argv[1])
parser = make_parser()
parser.setContentHandler(GoogleWeatherHandler())
parser.parse(f)
XML:
<?xml version="1.0"?>
<xml_api_reply version="1">
<weather module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0" >
<forecast_information>
<city data="Ciudad Juárez, Chihuahua"/><postal_code data="Juarez"/>
<latitude_e6 data=""/>
<longitude_e6 data=""/>
<forecast_date data="2012-08-14"/>
<current_date_time data="2012-08-15 02:51:00 +0000"/>
<unit_system data="US"/></forecast_information>
<current_conditions>
<condition data="Cloudy"/>
<temp_f data="91"/>
<temp_c data="33"/>
<humidity data="Humidity: 22%"/>
<icon data="/ig/images/weather/cloudy.gif"/>
<wind_condition data="Wind: SE at 6 mph"/>
</current_conditions>
// similar markup
</weather>
</xml_api_reply>
and the error:
Traceback (most recent call last):
File "weather.py", line 34, in <module>
main()
File "weather.py", line 30, in main
parser.parse(f)
File "c:\Python26\lib\xml\sax\expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "c:\Python26\lib\xml\sax\xmlreader.py", line 123, in parse
self.feed(buffer)
File "c:\Python26\lib\xml\sax\expatreader.py", line 211, in feed
self._err_handler.fatalError(exc)
File "c:\Python26\lib\xml\sax\handler.py", line 38, in fatalError
raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:179: not well-formed (invalid token)
All imports are already in place. I trust the interpreter, but I can’t find the error in the XML. Second, it would be helpful to know what <unknown>:1:179 means.
Thanks.
Looks like the accented á character in Juárez is the problem. You haven’t told the parser what the encoding is, so it has defaulted to one, probably UTF-8, in which that byte value is invalid; i.e. it’s expecting UTF-8 and your actual encoding is probably ISO-8859-1.
Configure the parser to expect ISO-8859-1 and your problem should go away.
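One way to configure the parser, as a minimal sketch: wrap the byte stream in an `InputSource` and call `setEncoding()` on it before handing it to `parse()`. The `CityHandler` class, the `parse_with_encoding` helper and the `SAMPLE` bytes are made up for illustration and stand in for your `GoogleWeatherHandler` and the real response. It is written for Python 3; the same `InputSource` methods exist in Python 2.6, but I have not run it there.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource


class CityHandler(ContentHandler):
    """Hypothetical handler: remembers the data attribute of <city>."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.city = None

    def startElement(self, name, attrs):
        if name == "city":
            self.city = attrs.get("data")


def parse_with_encoding(raw_bytes, encoding="ISO-8859-1"):
    """Parse raw_bytes, telling the SAX reader which encoding to assume."""
    source = InputSource()
    source.setByteStream(io.BytesIO(raw_bytes))
    source.setEncoding(encoding)  # overrides the parser's UTF-8 default

    handler = CityHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler.city


# Cut-down stand-in for the response; \xe1 is the ISO-8859-1 byte for "á".
SAMPLE = (b'<?xml version="1.0"?><xml_api_reply version="1">'
          b'<city data="Ciudad Ju\xe1rez, Chihuahua"/></xml_api_reply>')

city = parse_with_encoding(SAMPLE)
```

In your code the equivalent would be passing the `urlopen` result to `setByteStream()` instead of a `BytesIO` object.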
If you can modify the XML, change the header so that it declares the encoding, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>.
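If the response is under your control only after download, a rough equivalent is to patch the declaration in the bytes before parsing. The sketch below assumes the document starts with exactly `<?xml version="1.0"?>`, as in the question; `declare_latin1` and `DataCollector` are hypothetical names, and the sample bytes are a cut-down stand-in for the real response (Python 3).

```python
import xml.sax
from xml.sax.handler import ContentHandler

BARE_DECL = b'<?xml version="1.0"?>'
LATIN1_DECL = b'<?xml version="1.0" encoding="ISO-8859-1"?>'


def declare_latin1(raw_bytes):
    """Swap a bare XML declaration for one that names ISO-8859-1."""
    if raw_bytes.startswith(BARE_DECL):
        return LATIN1_DECL + raw_bytes[len(BARE_DECL):]
    return raw_bytes  # anything else is left untouched


class DataCollector(ContentHandler):
    """Hypothetical handler: collects every data="..." attribute value."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.values = []

    def startElement(self, name, attrs):
        value = attrs.get("data")
        if value is not None:
            self.values.append(value)


raw = b'<?xml version="1.0"?><city data="Ciudad Ju\xe1rez, Chihuahua"/>'

collector = DataCollector()
xml.sax.parseString(declare_latin1(raw), collector)
values = collector.values
```

Parsing `raw` without the patch fails with the same "not well-formed (invalid token)" error as in the question.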
Unicode is the standard that defines the character set: an abstract assignment of a unique number to every character in all known languages.
UTF-8 is just one of several possible ways to encode those characters in 8-bit bytes. Since UTF-8 has to encode more than 256 characters, it uses 2-, 3- and 4-byte sequences. To avoid ambiguity, those sequences must begin with byte values that cannot otherwise be used, so a set of high-order bit patterns (and thus certain sets of byte values) is reserved to mark the beginning of these multi-byte sequences. The encoding used in ISO-8859-1 (a different way to encode characters) for á happens to conflict with the byte values reserved in UTF-8 to mark multi-byte sequences.
Part of the confusion over these issues stems from the fact that, for character codes 0x20 through 0x7F, all the different encoding methods are the same (a single byte) for backwards compatibility. When you venture into characters that are not part of standard ASCII, things diverge depending on the encoding.
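The reserved bit patterns can be sketched in a few lines of Python. The function name `utf8_lead_byte_length` is made up for this illustration, and it only looks at the high-order bits; it does not check the handful of byte values (such as 0xC0 or 0xF5 and above) that UTF-8 forbids for other reasons.

```python
def utf8_lead_byte_length(byte_value):
    """Return the sequence length announced by a byte's high-order bits.

    Returns 0 for a continuation byte (10xxxxxx), which can never start
    a sequence, and None for 11111xxx, which UTF-8 does not use at all.
    """
    if (byte_value & 0x80) == 0x00:   # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if (byte_value & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
        return 0
    if (byte_value & 0xE0) == 0xC0:   # 110xxxxx: starts a 2-byte sequence
        return 2
    if (byte_value & 0xF0) == 0xE0:   # 1110xxxx: starts a 3-byte sequence
        return 3
    if (byte_value & 0xF8) == 0xF0:   # 11110xxx: starts a 4-byte sequence
        return 4
    return None


# 0xE1 is how ISO-8859-1 stores "á"; read as UTF-8 it announces a
# 3-byte sequence, but the bytes that follow it in "Juárez" are plain
# ASCII letters rather than continuation bytes.
length_for_e1 = utf8_lead_byte_length(0xE1)
```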
To get more specific:
What happened here is that historically (before Unicode) á was already assigned the value 0xE1 in various computer standards (Windows-1252, for example). When Unicode was devised, they kept this code, but when it came time to encode this value in UTF-8, the rules specify that it becomes the 2-byte sequence 0xC3 0xA1. The single byte value 0xE1 is not permitted to occur by itself in UTF-8: its high-order bits (1110) mark the start of a 3-byte sequence, so it must be followed by two continuation bytes.
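This also covers the second part of the question. In a `SAXParseException` message the three fields are the system identifier of the input (printed as `<unknown>` when none is set), the line number and the column number. If the response arrived as a single line, counting through the XML as posted puts offset 179 on the á, which fits the explanation above. Here is a small sketch (Python 3) showing both the byte values and how to read the position from the exception; `error_position` and `SAMPLE` are hypothetical names, and the sample is much shorter than the real response, so its column differs.

```python
import xml.sax
from xml.sax.handler import ContentHandler

A_ACUTE = "\u00e1"  # the character "á"

latin1_bytes = A_ACUTE.encode("iso-8859-1")  # b'\xe1' (one byte)
utf8_bytes = A_ACUTE.encode("utf-8")         # b'\xc3\xa1' (two bytes)


def error_position(raw_bytes):
    """Return (line, column) of a fatal parse error, or None if it parses."""
    try:
        xml.sax.parseString(raw_bytes, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getColumnNumber()
    return None


# No encoding declared, so the parser assumes UTF-8 and trips on \xe1.
SAMPLE = b'<?xml version="1.0"?><city data="Ciudad Ju\xe1rez"/>'

position = error_position(SAMPLE)
if position is not None:
    line, column = position
    # The column is a zero-based offset into the line, so it can be
    # used directly as an index to inspect the offending byte.
    offending_byte = SAMPLE[column]
```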