Question

Asked: May 14, 2026 (2026-05-14T07:05:41+00:00)

There are so many HTML and XML libraries built into Python that it’s hard to believe there’s no support for real-world HTML parsing.

I’ve found plenty of great third-party libraries for this task, but this question is about the Python standard library.

Requirements:

Use only Python standard library components (any 2.x version)
DOM support
Handle HTML entities (&nbsp;)
Handle partial documents (like: Hello, <i>World</i>!)

Bonus points:

XPATH support
Handle unclosed/malformed tags (<big>does anyone here know <html ???)

Here’s my 90% solution, as requested. It works for the limited set of HTML I’ve tried, but as everyone can plainly see, it isn’t exactly robust. Since I got this far with 15 minutes of staring at the docs and one line of code, I thought I could ask the Stack Overflow community for a similar but better solution…

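from xml.etree.ElementTree import fromstring
# `html` holds the fragment to parse. XML does not define &nbsp;, so swap in
# the numeric reference, then wrap the fragment in a single root element.
DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))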
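The one-liner above only covers &nbsp;. A slightly more general sketch of the same idea rewrites every known named entity as a numeric character reference before handing the wrapped fragment to ElementTree. It assumes Python 3, where the 2.x module htmlentitydefs is named html.entities; the helper names numeric_entities and parse_fragment are made up for illustration. It still raises on unclosed or malformed tags, so it does not meet the bonus requirements.

```python
import re
from html.entities import name2codepoint
from xml.etree.ElementTree import fromstring

# XML already defines these five entities, so leave them for the XML parser.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def numeric_entities(text):
    """Replace named HTML entities (e.g. &nbsp;) with numeric references."""
    def repl(match):
        name = match.group(1)
        if name in name2codepoint and name not in XML_PREDEFINED:
            return "&#%d;" % name2codepoint[name]
        return match.group(0)  # unknown or XML-predefined: leave untouched
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

def parse_fragment(html):
    """Wrap a partial document in a root element and parse it as XML."""
    return fromstring("<html>%s</html>" % numeric_entities(html))

dom = parse_fragment("Hello,&nbsp;<i>World</i>&copy;!")
print(repr(dom.text))        # 'Hello,\xa0'
print(dom.find("i").text)    # World
```

Wrapping the fragment in a single root element is what lets a partial document such as Hello, <i>World</i>! parse at all, since XML requires exactly one top-level element.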

1 Answer

Editorial Team · Answer 1 · 2026-05-14T07:05:41+00:00

Parsing HTML reliably is a relatively modern development (weird though that may seem). As a result, there is definitely nothing in the standard library that does it. HTMLParser may appear to be a way to handle HTML, but it’s not: it fails on lots of very common HTML, and though you can work around those failures, there will always be another case you haven’t thought of (if you actually succeed at handling every failure, you’ll have basically recreated BeautifulSoup).
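As a rough illustration of how strict standard-library parsing breaks on everyday markup, here is a small sketch using xml.etree.ElementTree, the module from the question rather than HTMLParser. The helper try_parse is hypothetical. It shows that a well-formed fragment parses, while the question's malformed example and an ordinary unclosed <br> are both rejected outright.

```python
from xml.etree.ElementTree import fromstring, ParseError

def try_parse(fragment):
    """Return the parsed root element, or None if the XML parser rejects it."""
    try:
        return fromstring("<html>%s</html>" % fragment)
    except ParseError:
        return None

well_formed = try_parse("Hello, <i>World</i>!")
unclosed = try_parse("<big>does anyone here know <html ???")
void_tag = try_parse("line one<br>line two")

print(well_formed is not None)  # True
print(unclosed is None)         # True
print(void_tag is None)         # True
```

An XML parser has no recovery rules: the first well-formedness error aborts the parse, which is exactly the behavior a real-world HTML parser has to avoid.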

There are really only three reasonable ways to parse HTML as it is found on the web: lxml.html, BeautifulSoup, and html5lib. All three do a respectable job on broken HTML, and they parse pretty-good HTML the same way. lxml is the fastest by far, but can be a bit tricky to install (and impossible in an environment like App Engine). html5lib is based on how HTML 5 specifies parsing; though similar in practice to the other two, it is perhaps more “correct” in how it parses broken HTML. BeautifulSoup can be convenient, though I find its API unnecessarily quirky.
