I am looking for the algorithms and data structures one would use to fix broken HTML. I know most languages already have built-in tools or libraries for this, but I want to learn how it is done. Some approaches I can think of are:
- Using Regular Expressions – seems like a naive approach
- Creating a DOM – but how would a DOM tree get created from broken HTML?
UPDATE: I am expecting more of a general discussion, but references to tools in C, C++, Python or Java are fine by me.
thanks
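Parse the markup using the HTML5 parsing algorithm, which is designed to handle broken markup: the specification defines how a parser must recover from each kind of error, so a DOM can always be built, even from malformed input. You can then serialize that DOM back to HTML.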
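To make the idea concrete, here is a minimal sketch in Python using only the standard library. It is not the HTML5 parsing algorithm; it only illustrates the central data structure that algorithm relies on, a stack of open elements, with two simplified recovery rules that I chose for illustration: an end tag closes everything up to its nearest matching open element, and a stray end tag is ignored. The names `Node`, `TreeBuilder`, `serialize` and `repair` are hypothetical, and comments and doctypes are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have children or an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """A tree node; children are Node instances or plain strings (text)."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = list(attrs)
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree from possibly broken markup using a stack of open elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.stack = [self.root]  # stack[-1] is the current insertion point

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Search from the innermost open element outward, never popping the root.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]  # implicitly closes any unclosed descendants
                return
        # No matching open element: a stray end tag, so ignore it.

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def serialize(node):
    """Turn the tree back into well-formed markup."""
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#root":
        return inner
    attrs = "".join(
        f" {name}" if value is None else f' {name}="{escape(value)}"'
        for name, value in node.attrs
    )
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def repair(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()  # flushes any buffered trailing text
    return serialize(builder.root)


print(repair("<p><b>bold <i>both</b> text"))
```

Anything still open at the end of the input is closed automatically, because serialization walks the tree rather than the original text. The real algorithm has many more rules that this sketch lacks (for example, a new `<li>` implicitly closing the previous one, table fix-ups, and special handling of misnested formatting elements), so for actual use a library that implements the full specification, such as html5lib for Python, is the safer choice.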