Generally I use lxml for my HTML parsing needs, but that isn’t available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results.
Which pure-Python HTML parser have you found handles malformed HTML best? Robustness with bad markup matters more to me than speed.
This is no longer a problem. lxml is now supported on Google App Engine; see the list of available libraries:
https://developers.google.com/appengine/docs/python/tools/libraries27
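If lxml is not available in some environment, one pure-Python fallback is the standard library's `html.parser`, which keeps going when tags are left unclosed. The following is only a rough sketch, assuming Python 3; `LinkCollector`, `extract_links`, and the sample markup are hypothetical names made up for illustration, and I have not compared its tolerance against BeautifulSoup or libxml2dom.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, whether or not the tags are closed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Feed a (possibly malformed) HTML string to the parser and return the hrefs found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately malformed sample: unclosed <p>, <b> and <a> tags, no </html>.
BAD_HTML = '<html><body><p>Hello <b>world<p><a href="/one">one<a href="/two">two</body>'
links = extract_links(BAD_HTML)
```

This is event-based rather than tree-based, so it does not repair the document the way lxml does; it just reports the tags it sees in order.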