I’m trying to parse some html in Python. There were some methods that actually

Question

Asked: May 17, 2026 (2026-05-17T23:57:39+00:00)

I’m trying to parse some HTML in Python. Some methods used to work, but nowadays there’s nothing I can use without workarounds.

BeautifulSoup has problems now that SGMLParser has gone away
html5lib cannot parse half of what’s “out there”
lxml tries to be “too correct” for typical HTML: attributes and tags cannot contain unknown namespaces, or an exception is thrown, which means almost no page using Facebook Connect can be parsed

What other options are there these days? (If they support XPath, that would be great.)

1 Answer

Editorial Team · Answer 1 · 2026-05-17T23:57:40+00:00

Make sure that you use the html module when you parse HTML with lxml:

>>> from lxml import html
>>> doc = """<html>
... <head>
...   <title> Meh
... </head>
... <body>
... Look at this interesting use of <p>
... rather than using <br /> tags as line breaks <p>
... </body>"""
>>> html.document_fromstring(doc)
<Element html at ...>

All the errors and exceptions will melt away, and you’ll be left with an amazingly fast parser that often deals with HTML soup better than BeautifulSoup.
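Since the question also asks about XPath, here is a minimal sketch of querying a tree parsed with lxml.html. The sample markup, the "/next" link, and the variable names are made up for illustration; only html.document_fromstring, xpath and text_content come from lxml itself.

```python
from lxml import html

# Deliberately sloppy markup (illustrative only): the <p> tags are never
# closed and the closing </html> tag is missing.
doc = """<html>
<head><title>Meh</title></head>
<body>
<p>First paragraph
<p>Second paragraph with a <a href="/next">link</a>
</body>"""

root = html.document_fromstring(doc)

# XPath string() returns the text of the first matching node.
title = root.xpath("string(//title)").strip()

# Each unclosed <p> is closed implicitly when the next one starts,
# so the two paragraphs end up as separate elements.
paragraphs = [p.text_content().strip() for p in root.xpath("//p")]

# Attribute values can be selected directly with XPath.
links = root.xpath("//a/@href")

print(title)
print(paragraphs)
print(links)
```

The same xpath calls work on any element in the tree, so a query can be narrowed by first selecting a container element and then calling xpath on it with a relative expression.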
