I’m using minidom to parse an XML file, and it threw an error indicating that the data is not well-formed. I figured out that some of the pages contain characters like ไà¸à¹€à¸Ÿà¸¥ &, which cause the parser to hiccup. Is there an easy way to clean the file before I start parsing it? Right now I’m using a regular expression to throw away anything that isn’t an alphanumeric character or one of the </> characters, but it isn’t quite working.
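Try a regular expression substitution that removes every character outside the printable ASCII range.
It will get rid of everything except the 0x20-0x7F range.
You can start the range from \x01 instead if you want to keep control characters such as tabs and line breaks.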
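The snippet that originally accompanied this answer is not preserved here, so the following is only a minimal sketch of the idea, assuming Python 3 and the standard `re` module. The names `NON_ASCII`, `NON_ASCII_KEEP_WS`, and `clean_xml_text` are hypothetical and chosen for illustration; the second pattern is my own variant, not something from the answer.

```python
import re
from xml.dom import minidom

# Matches any run of characters outside the 0x20-0x7F range named above.
NON_ASCII = re.compile(r'[^\x20-\x7F]+')

# Assumed variant: additionally keeps tab, line feed, and carriage return.
NON_ASCII_KEEP_WS = re.compile(r'[^\x09\x0A\x0D\x20-\x7F]+')


def clean_xml_text(text, keep_whitespace=True):
    """Return text with every character outside the allowed range removed."""
    pattern = NON_ASCII_KEEP_WS if keep_whitespace else NON_ASCII
    return pattern.sub('', text)


# Made-up sample input containing an accented letter and Thai characters.
raw = '<page>\n\t<title>caf\u00e9 \u0e44\u0e1f\u0e25\u0e4c</title>\n</page>'
cleaned = clean_xml_text(raw)

# The cleaned string is plain ASCII and can be handed to minidom.
doc = minidom.parseString(cleaned)
title = doc.getElementsByTagName('title')[0].firstChild.data
```

One caveat on widening the range down to \x01: XML 1.0 only permits tab, line feed, and carriage return among the characters below 0x20, so keeping the other control characters may still leave the file not well-formed. That is why the sketch whitelists just those three.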