I maintain a database of articles with HTML formatting. Unfortunately the editors who wrote

Question

Asked: June 7, 2026

I maintain a database of articles with HTML formatting. Unfortunately, the editors who wrote the articles didn’t know proper HTML, so they often wrote markup like:

<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>

I tried using HTML::TreeBuilder to parse this HTML, but when I dump the resulting tree, all the elements inside <div class="highlight">...</div> are gone. I’m left with just <div class="highlight"></div>.

The editors have also often done things like:

<div class="article"><style>@font-face {   font-family: "Cambria"; }</style>Article starts here</div>

Parsing this with HTML::TreeBuilder again results in an empty <div class="article"></div>.

Any ideas how to approach this broken HTML and actually make sense out of it?

1 Answer

Editorial Team · Answer 1 · 2026-06-07T04:26:07+00:00

I would first run it through HTML::Tidy:

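#!/usr/bin/env perl

use strict; use warnings;
use HTML::Tidy;

# The broken markup from the question. The heredoc is single-quoted
# so that nothing in the HTML is interpolated by Perl.
my $html = <<'EO_HTML';
<div class="highlight"><html><head></head>
<body><p>Note that ...</p></html>
</div>
EO_HTML

my $tidy = HTML::Tidy->new;

# clean() returns the repaired document as a string.
print $tidy->clean( $html );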

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta name="generator" content="tidyp for Windows (v1.04), see www.w3.org">
<title></title>
</head>
<body>
<div class="highlight">
<p>Note that ...</p>
</div>
</body>
</html>

You can control the output by passing tidy configuration options to HTML::Tidy->new, either directly as a hash reference or through a configuration file.
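As a rough sketch of what setting options can look like: HTML::Tidy->new accepts a hash reference of tidy options, with underscores standing in for the hyphens in tidy's own option names. The two options used here (tidy_mark and show_body_only) are only my pick for illustration, so check which options your tidy build supports and what output you get.

```perl
#!/usr/bin/env perl

use strict; use warnings;
use HTML::Tidy;

# Single-quoted heredoc: nothing in the HTML is interpolated.
my $html = <<'EO_HTML';
<div class="highlight"><html><head></head>
<body><p>Note that ...</p></html>
</div>
EO_HTML

# Illustrative option choice, not a recommendation from the answer.
my $tidy = HTML::Tidy->new( {
    tidy_mark      => 0,    # leave out the <meta name="generator"> element
    show_body_only => 1,    # emit only what ends up inside <body>
} );

my $clean = $tidy->clean( $html );

print $clean;
```

Emitting only the body is handy when the cleaned fragment goes back into a database field rather than being served as a complete page.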

Then, feed the cleaned HTML through a parser.
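A minimal sketch of that second step, assuming HTML::TreeBuilder is the parser you want to keep using (as in the question). The variable names are mine, and I have not verified the exact text you will get back, so treat it as a starting point.

```perl
#!/usr/bin/env perl

use strict; use warnings;
use HTML::Tidy;
use HTML::TreeBuilder;

my $html = <<'EO_HTML';
<div class="highlight"><html><head></head>
<body><p>Note that ...</p></html>
</div>
EO_HTML

# Step 1: let tidy repair the markup.
my $clean = HTML::Tidy->new->clean( $html );

# Step 2: build a tree from the repaired document.
my $tree = HTML::TreeBuilder->new_from_content( $clean );

# Find the first <div class="highlight"> and pull out its text.
my $div  = $tree->look_down( _tag => 'div', class => 'highlight' );
my $text = $div ? $div->as_text : '';

print "highlight text: $text\n";

$tree->delete;    # release the tree explicitly when done
```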

Otherwise, you can try building a tree one step at a time using HTML::TokeParser::Simple or even just HTML::Parser, but I believe that way lies insanity.

Keep in mind that a parser that tries to build a tree representation will be stricter than a stream parser that just recognizes various elements as it sees them.
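To illustrate the stream approach, here is a small sketch using HTML::TokeParser::Simple on the second snippet from the question. It does not build a tree at all. It just walks the tokens and keeps text that is not inside a style element. The $in_style flag and the decision to drop style content are my assumptions about what you want to extract.

```perl
#!/usr/bin/env perl

use strict; use warnings;
use HTML::TokeParser::Simple;

# Single-quoted so that @font is not treated as a Perl array.
my $html = '<div class="article"><style>@font-face {   font-family: "Cambria"; }</style>Article starts here</div>';

my $parser = HTML::TokeParser::Simple->new( string => $html );

my $in_style = 0;     # true while we are between <style> and </style>
my $text     = '';

while ( my $token = $parser->get_token ) {
    if    ( $token->is_start_tag('style') ) { $in_style = 1 }
    elsif ( $token->is_end_tag('style') )   { $in_style = 0 }
    elsif ( $token->is_text and not $in_style ) {
        $text .= $token->as_is;    # keep the text exactly as it appeared
    }
}

print "$text\n";
```

Because the tokens are handled one at a time, a misplaced html or style tag is just another token to skip rather than something that derails the whole document structure.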
