I have a collection of documents that I’m attempting to parse. Like HTML, they are fairly well structured and have a complex syntax/grammar. Also like HTML, many of the documents do not fully adhere to the desired syntax.
My question is: what general strategies do browsers and HTML/XML parsing libraries use when parsing documents that don’t strictly follow the expected syntax? They seem to handle misplaced or missing tags well, and I’m sure there are other cases, such as misspelled tags or incorrect attributes, that must be dealt with rather than simply ignored.
Malformed HTML is commonly referred to as “tag soup”. Browsers have to cope with it, and each one (IE, Firefox, Chrome, etc.) has handled it in its own way. The Wikipedia article on tag soup covers the problem and some general recovery strategies:
http://en.wikipedia.org/wiki/Tag_soup
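
To make one of those strategies concrete, here is a rough sketch of a common recovery approach: keep a stack of open elements, implicitly close anything left open when an ancestor's end tag arrives, and ignore end tags that match nothing. It uses Python's standard `html.parser` module. The names `LenientTreeBuilder`, `parse_lenient`, and `VOID_TAGS`, and the dict-based tree, are my own illustrative choices, and this is far simpler than what real browsers do.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LenientTreeBuilder(HTMLParser):
    """Builds a nested dict tree while tolerating missing or stray end tags."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Search from the innermost open element outward (never pop the root).
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                # Closing this element implicitly closes everything inside it.
                del self.stack[i:]
                return
        # No matching open element: treat it as a stray end tag and ignore it.

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1]["children"].append(text)


def parse_lenient(text):
    builder = LenientTreeBuilder()
    builder.feed(text)
    builder.close()  # flush any buffered text; unclosed elements stay in the tree
    return builder.root


tree = parse_lenient("<p><b>bold</i> text</p><p>next")
print([child["tag"] for child in tree["children"]])
```

In the sample input, the stray `</i>` is dropped, the unclosed `<b>` is closed when `</p>` arrives, and the final `<p>` is kept even though it is never closed. The same idea should carry over to a custom document format: decide per error class whether to skip the token, insert a missing one, or close open constructs up to a known recovery point.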