I have an XML file that I need to convert to UTF-8.
Unfortunately, the text contains HTML entities like this:
/mytext,
I’m using the codec library to convert files to UTF-8, but it doesn’t handle HTML entities.
Is there an easy way to get rid of the HTML encoding?
Thanks
You can pass the text of the file through an unescape function before passing it to the XML parser.
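For instance, here is a minimal sketch of that approach in Python, assuming the standard library's `html.unescape` is acceptable as the unescape function. The helper name `unescape_html_entities` and the sample string are made up for illustration. It deliberately leaves the five entities that XML itself defines (and numeric references, which the XML parser already understands) untouched, since turning `&amp;` or `&lt;` into raw characters before parsing could make the document malformed.

```python
import html
import re
import xml.etree.ElementTree as ET

# Entities that XML defines itself; these must stay escaped until parsing.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entities only, e.g. &eacute; or &nbsp;
NAMED_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def unescape_html_entities(text):
    """Replace HTML named entities with their characters, keeping XML's own."""
    def replace(match):
        if match.group(1) in XML_PREDEFINED:
            return match.group(0)  # leave &amp;, &lt;, ... as they are
        # html.unescape returns unknown names unchanged
        return html.unescape(match.group(0))
    return NAMED_ENTITY_RE.sub(replace, text)


# Hypothetical sample input, not taken from the original file
sample = "<p>caf&eacute; &amp; bar&nbsp;&#47;</p>"
cleaned = unescape_html_entities(sample)
root = ET.fromstring(cleaned)
utf8_bytes = ET.tostring(root, encoding="utf-8")
```

After this step `root.text` holds ordinary Unicode characters, so writing the tree out as UTF-8 should no longer involve any HTML entities.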
Alternatively, if you’re only parsing HTML, lxml’s HTML parser does this for you:
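A small sketch of what that could look like, assuming lxml is installed and that `lxml.html.fromstring` is a suitable entry point for your input. The sample markup is invented for illustration.

```python
import lxml.html

# Hypothetical sample with named and numeric HTML entities
sample = "<p>caf&eacute;&nbsp;&#47;mytext&#44;</p>"

# The HTML parser resolves the entities while building the tree
doc = lxml.html.fromstring(sample)
text = doc.text_content()

# Plain Unicode text can then be encoded as UTF-8
utf8_bytes = text.encode("utf-8")
```

Note that this parses the input as HTML rather than XML, so it is only appropriate if HTML parsing rules are fine for your file.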