I’m trying to parse some HTML in Python. Some methods used to work… but nowadays there’s nothing I can use without workarounds.
- BeautifulSoup has problems now that SGMLParser has gone away
- html5lib cannot parse half of what’s “out there”
- lxml tries to be “too correct” for typical HTML (attributes and tags cannot contain unknown namespaces, or an exception is thrown, which means almost no page with Facebook Connect can be parsed)
What other options are there these days? (If they support XPath, that would be great.)
Make sure that you use the html module when you parse HTML with lxml.
All the errors and exceptions will melt away, and you’ll be left with an amazingly fast parser that often deals with HTML soup better than BeautifulSoup.
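A minimal sketch of what that might look like, assuming lxml is installed. The markup, the fb:like tag and the URLs are made up for illustration; the only lxml calls used are html.document_fromstring and the xpath method on the resulting element:

```python
from lxml import html

# Deliberately sloppy, made-up markup: an unclosed <p>, an unclosed <a>,
# and a Facebook-style prefixed tag with no namespace declaration.
doc = """<html>
<head>
  <title>Meh</title>
</head>
<body>
Look at this use of <p>
instead of a proper line break.
<fb:like href="http://example.com/"></fb:like>
<a href="/one">first</a> <a href="/two">second
</body>
</html>"""

# lxml.html uses a lenient HTML parser, so this should not raise
# the way a strict XML parse of the same text would.
root = html.document_fromstring(doc)

# XPath works on the parsed tree as usual.
links = root.xpath("//a/@href")
print(links)
```

If you only have a fragment rather than a whole page, html.fromstring is the usual alternative to document_fromstring.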