I’m parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn’t automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string
>>> print text
&pound;682m
How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?
Python 3.4+
Use html.unescape():
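A minimal sketch of that call, using the entity string from the question as sample input (the variable names are only for illustration):

```python
import html

# Entity-encoded text, as returned by Beautiful Soup 3 in the question
text = "&pound;682m"

# html.unescape() converts named and numeric character references to Unicode
decoded = html.unescape(text)
print(decoded)  # £682m
```

Numeric references such as "&#163;" should decode the same way.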
html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It was eventually removed in Python 3.9.
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
For Python 2.6-2.7 it's in the HTMLParser module
For Python 3 it's in the html.parser module
You can also use the six compatibility library to simplify the import:
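The following is a sketch of both import routes, not the answer's original code. It assumes six may or may not be installed, so it falls back to the standard-library modules. The final fallback to html.unescape is an addition for interpreters where HTMLParser.unescape no longer exists.

```python
try:
    # six exposes the right module for either Python version (third-party package)
    from six.moves.html_parser import HTMLParser
except ImportError:
    try:
        from html.parser import HTMLParser  # Python 3 standard library
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2.6-2.7 standard library

parser = HTMLParser()
if hasattr(parser, "unescape"):
    # Python 2.6-3.8: the method still exists (deprecated since 3.4)
    unescape = parser.unescape
else:
    # Added fallback for newer interpreters that no longer have the method
    from html import unescape

decoded = unescape("&pound;682m")
print(decoded)
```

On Python 3.4 and later, calling html.unescape directly is simpler. The fallback chain is mainly useful for code that must still run on Python 2.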