I maintain a database of articles with HTML formatting. Unfortunately, the editors who wrote the articles didn’t know proper HTML, so they often wrote things like:
<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>
I tried using HTML::TreeBuilder to parse this HTML, but when I dump the resulting tree, everything inside <div class="highlight">...</div> is gone. I’m left with just <div class="highlight"></div>.
The editors have also often done things like:
<div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div>
Parsing this with HTML::TreeBuilder again results in an empty <div class="article"></div>.
Any ideas on how to approach this broken HTML and actually make sense of it?
I would first run it through HTML::Tidy, which is designed to repair malformed markup such as misplaced or unclosed tags.
You can control the output by setting various configuration options.
Then, feed the cleaned HTML through a parser.
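As a rough sketch of that two-step approach, the snippet below cleans the first broken fragment from the question with HTML::Tidy and then hands the result to HTML::TreeBuilder. The two configuration options shown (tidy_mark, output_xhtml) are only examples of what can be set, and the variable names are my own. What the cleaned markup looks like depends on the installed tidy library, so inspect the dump rather than relying on a particular shape.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use HTML::Tidy;
use HTML::TreeBuilder;

# First broken fragment from the question.
my $broken = '<div class="highlight"><html><head></head><body>'
           . '<p>Note that ...</p></html></div>';

# Options are passed through to the underlying tidy library.
# These two are illustrative choices, not requirements.
my $tidy = HTML::Tidy->new({
    tidy_mark    => 0,   # do not add a "generator" meta tag
    output_xhtml => 1,   # emit XHTML-style markup
});

# clean() returns the repaired document as a string.
my $clean = $tidy->clean($broken);

# Build a tree from the cleaned markup instead of the raw input.
my $tree = HTML::TreeBuilder->new;
$tree->parse($clean);
$tree->eof;

$tree->dump;     # print the tree structure for inspection
$tree->delete;   # release the tree when done
```

If the paragraph text still goes missing, print $clean before parsing it to see whether the content was lost by Tidy or by the tree builder.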
Otherwise, you can try building a tree one step at a time using HTML::TokeParser::Simple or even just HTML::Parser, but I believe that way lies insanity.
Keep in mind that a parser that tries to build a tree representation will be stricter than a stream parser that just recognizes various elements as it sees them.
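To illustrate the stream-parser side of that trade-off, here is a minimal sketch using HTML::TokeParser::Simple on the second fragment from the question. It does not build a tree. It just walks the tokens in order and collects text that is not inside a <style> element. The $in_style flag and the $text accumulator are names I made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

# Second broken fragment from the question.
my $broken = '<div class="article"><style>@font-face { font-family: '
           . '"Cambria"; }</style>Article starts here</div>';

my $parser = HTML::TokeParser::Simple->new(string => $broken);

my $in_style = 0;    # true while we are between <style> and </style>
my $text     = '';   # accumulated article text

while (my $token = $parser->get_token) {
    if ($token->is_start_tag('style')) { $in_style = 1; next; }
    if ($token->is_end_tag('style'))   { $in_style = 0; next; }
    next if $in_style;                 # skip stylesheet content

    # Keep only text tokens; tags are simply ignored.
    $text .= $token->as_is if $token->is_text;
}

print "$text\n";
```

Because the loop never checks whether tags are properly nested, stray <html> or <body> tags do not cause content to be dropped. The price is that you have to track any structure you care about yourself.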