I'm parsing XML files encoded in UTF-16 using the ElementTree.parse function.
The program fails when a file contains certain characters such as ♀ or ♂, and the error "xml.parsers.expat.ExpatError: not well-formed (invalid token)" is raised.
How can I avoid this error? Is there a way to simply ignore these not well-formed characters? Thanks! Below is my code:
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
root = tree.parse(xml_file)
xml_file is a file encoded in UTF-16.
The error message reports the line and column number of the not well-formed character.
Since xml.parsers.expat.ParserCreate supports only four encodings, I would try them all. Those encodings are: UTF-8, UTF-16, ISO-8859-1 (Latin1), and ASCII. You can now run ElementTree.parse with the encoding like:
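A minimal sketch of trying each encoding in turn, assuming Python 3, where a failed parse surfaces as xml.etree.ElementTree.ParseError. The helper name parse_with_fallback and the EXPAT_ENCODINGS tuple are hypothetical, introduced only for illustration. The encoding passed to XMLParser overrides whatever the XML declaration in the file says.

```python
import xml.etree.ElementTree as ET

# The four encodings Expat supports natively (hypothetical constant name).
EXPAT_ENCODINGS = ("utf-8", "utf-16", "iso-8859-1", "ascii")


def parse_with_fallback(xml_file):
    """Try each Expat encoding in turn; return (tree, encoding) on success.

    Hypothetical helper: re-raises the last ParseError if every attempt fails.
    """
    last_error = None
    for encoding in EXPAT_ENCODINGS:
        # A parser can only be used once, so build a fresh one per attempt.
        parser = ET.XMLParser(encoding=encoding)
        try:
            tree = ET.parse(xml_file, parser=parser)
        except ET.ParseError as exc:
            # exc.position holds the (line, column) of the offending token.
            last_error = exc
            continue
        return tree, encoding
    raise last_error
```

Calling tree, used = parse_with_fallback(xml_file) returns the parsed tree together with the encoding that worked. A parse that succeeds under the wrong encoding (ISO-8859-1 accepts any byte sequence) can still produce garbled text, so it is worth inspecting the result rather than trusting the first success.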