I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM, for example missing closing tags or malformed tag attributes. Is there a way to clean up these errors in Python, either natively or with a third-party module I could install?
I would suggest BeautifulSoup. Its parser deals with malformed tags quite gracefully. Once you've parsed the whole document into a tree, you can simply write the tree back out as cleaned-up markup.
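As a rough sketch of that parse-then-output approach (assuming the `beautifulsoup4` package is installed and using Python's built-in `html.parser` backend; the `clean_html` helper and the sample markup are made up for illustration):

```python
from bs4 import BeautifulSoup


def clean_html(raw_html: str) -> str:
    """Parse possibly malformed HTML and serialize the tree back to a string.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    # "html.parser" is the standard-library backend; lxml or html5lib
    # can be swapped in and may repair the same input differently.
    soup = BeautifulSoup(raw_html, "html.parser")
    # Serializing the tree emits a closing tag for every element that was opened.
    return str(soup)


# Sample input with unclosed <p> and <b> tags and an unquoted attribute value.
broken = "<div class=note><p>First paragraph<p>Second <b>bold text</div>"
cleaned = clean_html(broken)
print(cleaned)
```

The exact repaired structure depends on the parser backend, so it is worth checking the output against a few of your real documents before settling on one.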
I've used this many times and it works well. If you only need to pull the data out of bad HTML, that is where BeautifulSoup really shines.
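For the data-extraction case, something along these lines should work (again assuming `beautifulsoup4` with the `html.parser` backend; the sample list markup is invented for the example):

```python
from bs4 import BeautifulSoup

# Sample input: the <li> elements and the <ul> are never closed.
bad_html = '<ul><li><a href="/one">One</a><li><a href="/two">Two</a>'

soup = BeautifulSoup(bad_html, "html.parser")

# Collect (href, link text) pairs without repairing the markup first.
links = [(a.get("href"), a.get_text(strip=True)) for a in soup.find_all("a")]
print(links)
```

Here the tree is only queried, never written back out, so the missing closing tags do not need to be fixed at all.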