This is an extension of this question. I’m trying to parse HTML snippets embedded in an XML backup of a Blogger blog and retag them with InDesign tags.
Blogger doesn’t standardize the HTML for any of its posts, and the posts can be written in Word, Windows Live Writer, the native Blogger interface, or text editors, resulting in many different flavors of HTML. Some posts don’t mark paragraphs at all and only use double <br>s between them; others use actual <p> tags.
What’s the best way to parse this nonstandard conglomeration of tags?
Additionally, each post is not a complete HTML file, just a snippet that gets inserted into a template, so there is no overall HTML structure to parse (<html><body></body></html>, etc.). Does that have any effect on XML/HTML parsing?
Here are some potential examples. This one is mostly standard HTML, but with no paragraph tags:
This is a section of a blog post. It has <a href="#">links</a> and lists and stuff. Weee....
<br>
<br>
Here's a list
<br/>
<br />
<ul><li>Item 1</li><li>Item 2</li><ul>
And another paragraph here...
<br>
<br/>
Etc.
The Word HTML looks like this – http://www.timeatlas.com/mos/images/stories/word_html_tags.png
The HTML generated by Word is relatively easy to deal with. I would just get rid of all the tag attributes (unless you care about styles). That would leave you with fairly plain HTML, which you can then style.
HTML::TokeParser::Simple can help make that relatively painless.
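To illustrate the attribute-stripping idea, here is a minimal sketch in Python using the standard library's html.parser, not the Perl module named above. The names AttributeStripper and strip_attributes are made up for this example, and keeping href is an assumption (so links survive); adjust the keep list to taste. Because html.parser is a tokenizer rather than a tree builder, it does not need a surrounding <html><body> wrapper, so bare snippets are fine.

```python
from html import escape
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit HTML with every attribute removed except those in `keep`.

    Hypothetical helper for illustration. Comments and declarations
    (for example Word's conditional comments) are silently dropped
    because their handlers are not overridden.
    """

    def __init__(self, keep=("href",)):
        # convert_charrefs=False so entities are passed through as-is
        super().__init__(convert_charrefs=False)
        self.keep = set(keep)
        self.out = []

    def _attrs(self, attrs):
        # Rebuild only the attributes we were asked to keep.
        parts = []
        for name, value in attrs:
            if name not in self.keep:
                continue
            if value is None:
                parts.append(f" {name}")
            else:
                parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_attributes(html, keep=("href",)):
    parser = AttributeStripper(keep)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

For example, strip_attributes('<p class="MsoNormal" style="margin:0">Hi <a href="#" target="_blank">x</a></p>') gives '<p>Hi <a href="#">x</a></p>'. Note that tag names are lowercased and self-closing tags are normalized to the <br/> form.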
As for the other stuff, that will take some trial and error. I am going to think more about that and post later if I can think of something clever.
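As one possible starting point for the double-<br> posts, here is a rough Python sketch (not the code from the update below) that treats any run of two or more <br> tags as a paragraph break. The function name br_runs_to_paragraphs and the list of block-level tags are assumptions made for this example, and a regex pass like this is only a first approximation: it will not untangle text that directly follows a list without a break.

```python
import re

# A run of two or more <br>, <br/> or <br /> tags, optionally
# separated by whitespace or newlines.
BR_RUN = re.compile(r"<br\s*/?>(?:\s*<br\s*/?>)+", re.IGNORECASE)

# Chunks that already start with a block-level tag are left unwrapped.
# This tag list is an assumption; extend it as needed.
BLOCK_START = re.compile(
    r"<(?:p|ul|ol|div|table|blockquote|h[1-6])\b", re.IGNORECASE
)


def br_runs_to_paragraphs(snippet):
    """Wrap text separated by runs of 2+ <br> tags in <p> elements."""
    paragraphs = []
    for chunk in BR_RUN.split(snippet):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces at the start or end
        if BLOCK_START.match(chunk):
            paragraphs.append(chunk)
        else:
            paragraphs.append(f"<p>{chunk}</p>")
    return "\n".join(paragraphs)
```

Single <br> tags are kept as line breaks inside a paragraph; only runs of two or more are treated as separators.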
Later Update:
Well, here is something that makes me cringe a little but it seems to work:
Output: