I have a list of HTML pages that may contain certain encoded characters. Some examples:
<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
I would like to decode (or unescape; I'm unsure of the correct terminology) these strings to:
<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
Note that the HTML pages are held as strings. Also, I DO NOT want to use any external library such as BeautifulSoup or lxml; only native Python libraries are OK.
Edit:
The solution below isn't perfect. Unescaping with HTMLParser and unquoting with urllib2 throws this error in some cases:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)
You need to do two things: unescape the HTML entities, and URL-unquote the percent-encoded text.
The standard library has HTMLParser and urllib2 to help with those tasks.
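A minimal sketch of the two steps, assuming Python 3, where the same functionality is exposed as html.unescape and urllib.parse.unquote rather than through HTMLParser and urllib2. The helper name decode_markup is made up for illustration, and the entity-encoded sample inputs are hypothetical stand-ins for the strings in the question.

```python
from html import unescape          # Python 3 function for resolving HTML entities
from urllib.parse import unquote   # Python 3 function for resolving %XX escapes


def decode_markup(text):
    """Resolve HTML entities first, then percent-escapes, in a str."""
    return unquote(unescape(text))


# Hypothetical inputs: the '@' is written as a numeric character reference.
samples = [
    '<a href="mailto:lad%20at%20maestro%20dot%20com">',
    '<em>ada&#x40;graphics.maestro.com</em>',
    '<em>mel&#64;graphics.maestro.com</em>',
]
for s in samples:
    print(decode_markup(s))
```

Note that this unquotes every %XX sequence in the string, not only those inside href attributes, so it may also alter literal text that happens to look like a percent-escape.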
Edit:
If your pages can contain non-ASCII characters, you’ll need to take care to decode on input and encode on output.
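As a rough sketch of "decode on input, encode on output", assuming Python 3 and a page held as raw bytes: the function name decode_page_bytes, its encoding parameter, and the sample bytes are all hypothetical. Byte 0x94, the one named in the reported error, is a closing curly quote in cp1252.

```python
from html import unescape
from urllib.parse import unquote


def decode_page_bytes(raw, encoding):
    """Decode bytes to text, unescape and unquote, then encode again."""
    text = raw.decode(encoding)          # bytes -> str on input
    cleaned = unquote(unescape(text))    # all processing happens on str
    return cleaned.encode(encoding)      # str -> bytes on output


# Hypothetical page fragment with cp1252 curly quotes (0x93 and 0x94).
raw = b'<em>\x93ada&#64;maestro.com\x94</em>'
print(decode_page_bytes(raw, "cp1252"))
```

The final encode step can still raise UnicodeEncodeError if an entity resolves to a character the target encoding cannot represent, so in practice you may prefer to write the output as UTF-8.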
The sample file you uploaded has its charset set to cp-1252, so try decoding from that to Unicode before unescaping.
Edit2:
If you don’t care about the non-ASCII characters you can simplify a bit:
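One possible reading of that simplification, sketched for Python 3: throw away every non-ASCII byte up front so no particular charset is needed. The helper name decode_ascii_only and the sample bytes are hypothetical, and discarding bytes is an assumption about what "simplify" means here.

```python
from html import unescape
from urllib.parse import unquote


def decode_ascii_only(raw):
    """Drop non-ASCII bytes, then unescape entities and unquote."""
    # errors="ignore" silently discards every byte >= 0x80.
    text = raw.decode("ascii", errors="ignore")
    return unquote(unescape(text))


# Hypothetical fragment: the curly-quote bytes 0x93 and 0x94 are dropped.
print(decode_ascii_only(b'<em>\x93mel&#64;graphics.maestro.com\x94</em>'))
```

This loses data by design: accented letters and typographic quotes stored as raw bytes simply vanish from the output.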