I have a Python script that is parsing an XML file and is returning the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 614617: character maps to <undefined>
I’m pretty sure the error occurs because there are some illegal characters in the XML document I am trying to parse. However, I don’t have access to fix this particular XML file directly.
Is there a way to keep these characters from tripping up my script, so that it can keep parsing without error?
This is the part of the script that reads the XML and decodes it:
def ReadXML(self, path):
    self.logger.info("Reading XML from %s" % path)
    codec = "Windows-1252"
    xmlReader = open(path, "r")
    return xmlReader.read().decode(codec)
When you call decode(), you can pass the optional errors argument. By default it is set to "strict" (which raises an error if it finds something it can’t parse), but you can also set it to "replace" (which replaces the problematic character with \ufffd) or "ignore" (which just leaves the problematic character out).
So it would be:
    return xmlReader.read().decode(codec, errors="replace")
or whatever level you choose.
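To compare the three errors settings side by side, here is a small sketch. It assumes Python 3, where decode() is called on a bytes object (in the Python 2 script from the question it is called on the str returned by read()). The sample bytes and the decode_lenient helper are made up for illustration.

```python
# Sample input: 0x9d has no mapping in Windows-1252 (cp1252),
# which is the byte reported in the question's traceback.
raw = b"caf\xe9 \x9d end"


def decode_lenient(data, codec="cp1252", errors="replace"):
    """Hypothetical helper: decode bytes using the given error handler."""
    return data.decode(codec, errors)


# "strict" (the default) raises on the undecodable byte.
try:
    decode_lenient(raw, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for the bad byte.
replaced = decode_lenient(raw, errors="replace")

# "ignore" silently drops the bad byte.
ignored = decode_lenient(raw, errors="ignore")
```

Note that "ignore" loses data without leaving a trace, while "replace" at least leaves a visible marker where the bad byte was.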
More info can be found in the Python Unicode HOWTO.
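For reference, on Python 3 the same errors setting can be passed to open() itself, so the file is decoded as it is read. This is a minimal sketch assuming Python 3. The read_xml_text function and the throwaway demo file are hypothetical and not part of the original script.

```python
import os
import tempfile


def read_xml_text(path, codec="cp1252", errors="replace"):
    """Hypothetical helper: read a file as text, tolerating bad bytes."""
    # open() accepts the same errors values as decode().
    with open(path, "r", encoding=codec, errors=errors) as xml_file:
        return xml_file.read()


# Demo: write a throwaway file containing the undecodable byte 0x9d.
with tempfile.NamedTemporaryFile(delete=False, suffix=".xml") as tmp:
    tmp.write(b"<note>bad \x9d byte</note>")
    demo_path = tmp.name

try:
    text = read_xml_text(demo_path)
finally:
    os.remove(demo_path)  # clean up the demo file
```

Using a with block also closes the file, which the original ReadXML leaves to the garbage collector.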