This is somewhat related to my question here.
I process tons of text (mainly HTML and XML) fetched via HTTP. I'm looking for a Python library that can do smart encoding detection based on different strategies and convert the text to Unicode using the best possible guess at the character encoding.
I found that chardet does auto-detection extremely well. However, auto-detecting everything is the problem: it is SLOW and goes against the standards. As the chardet FAQ advises, I don't want to disregard the standards.
From the same FAQ, here is the list of places where I want to look for the encoding:
- charset parameter in the HTTP Content-type header.
- <meta http-equiv="content-type"> element in the <head> of a web page for HTML documents.
- encoding attribute in the XML prolog for XML documents.
- Auto-detect the character encoding as a last resort.
Basically, I want to be able to look in all of those places and also deal with conflicting information automatically.
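To make the lookup order concrete, here is a rough sketch of what I have in mind, using only the standard library. The names (`http_charset`, `declared_charset`, `to_unicode`), the regular expressions, and the optional `detector` callback are my own placeholders, not an existing API. Resolving conflicts by taking the first candidate that decodes cleanly is just one possible policy, and sniffing the declaration like this only works for ASCII-compatible encodings.

```python
import codecs
import re
from email.message import Message

# Rough patterns for in-document declarations (assumptions, not a full parser).
META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.I)
XML_RE = re.compile(rb"^\s*<\?xml[^>]*encoding\s*=\s*[\"']([A-Za-z0-9_.:-]+)", re.I)


def http_charset(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def declared_charset(pattern, data):
    """Look for a declared encoding near the start of the raw bytes."""
    match = pattern.search(data[:2048])
    return match.group(1).decode("ascii") if match else None


def to_unicode(data, content_type=None, detector=None):
    """Try each source in priority order; return (text, encoding_used)."""
    candidates = [
        http_charset(content_type),       # 1. HTTP header
        declared_charset(META_RE, data),  # 2. <meta> element (HTML)
        declared_charset(XML_RE, data),   # 3. XML prolog
    ]
    if detector is not None:
        candidates.append(detector(data))  # 4. auto-detection, last resort
    for name in candidates:
        if not name:
            continue
        try:
            codecs.lookup(name)  # skip names Python does not know
            return data.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue  # conflicting or wrong declaration: try the next source
    # Nothing worked: decode as UTF-8 and replace undecodable bytes.
    return data.decode("utf-8", errors="replace"), "utf-8"
```

The last-resort step could then be something like `detector=lambda b: chardet.detect(b)["encoding"]`, so chardet only runs when the declared encodings are missing or fail.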
Is there such a library out there, or do I need to write it myself?
BeautifulSoup (the HTML parser) includes a class called UnicodeDammit that does just that. Have a look and see if you like it.
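A minimal usage sketch, assuming the current `bs4` package (Beautiful Soup 4) is installed; older releases exposed the class under a different module and attribute names, so check the documentation for your version. The sample bytes and the choice of `latin-1` as a suggested encoding are made up for illustration.

```python
from bs4 import UnicodeDammit

# Sample input: bytes in an encoding other than UTF-8 (illustrative only).
raw = "<p>Sacr\xe9 bleu!</p>".encode("latin-1")

# Encodings passed in the list are tried before any guessing, so this is
# where a charset taken from the HTTP Content-Type header could go.
dammit = UnicodeDammit(raw, ["latin-1"])

print(dammit.unicode_markup)     # the document decoded to a str
print(dammit.original_encoding)  # the encoding that was finally used
```

If no suggested encoding works, UnicodeDammit falls back to encodings declared in the document and then to guessing.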