This is an extension of this question. I’m trying to parse HTML snippets

Question

Asked: May 11, 2026 (2026-05-11T20:08:38+00:00)

This is an extension of this question. I’m trying to parse HTML snippets embedded in an XML backup of a Blogger blog and retag them with InDesign tags.

Blogger doesn’t standardize the HTML for any of its posts, and the posts can be written in Word, Windows Live Writer, the native Blogger interface, or text editors, resulting in many different forms of HTML. Some posts don’t mark paragraphs and only use double <br>s between paragraphs; others use actual <p> tags.

What’s the best way to parse this nonstandard conglomeration of tags?

Additionally, each post is not a complete HTML file, just a snippet that gets inserted into a template, so there is no overall HTML structure to parse (<html><body></body></html>, etc.). Does that have any effect on XML/HTML parsing?

Here are some potential examples (mostly standard HTML, missing paragraph tags):

This is a section of a blog post. It has <a href="#">links</a> and lists and stuff. Weee....
<br>
<br>
Here's a list
<br/>
<br />
<ul><li>Item 1</li><li>Item 2</li><ul>
And another paragraph here...
<br>
<br/>
Etc.

The Word HTML looks like this: http://www.timeatlas.com/mos/images/stories/word_html_tags.png


1 Answer

Editorial Team · Answer 1 · 2026-05-11T20:08:38+00:00

The HTML generated by Word is relatively easy to deal with. I would just get rid of all the tag attributes (unless you care about styles). That would leave you with fairly plain HTML which you can then style.

HTML::TokeParser::Simple can help make that relatively painless.
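As a rough illustration of the attribute stripping described above, here is a minimal sketch. The helper name strip_attributes and the idea of an allow-list (so that href survives on links) are my own assumptions, not something from the module's documentation. It relies on get_token, is_start_tag, get_attrseq, delete_attr, and as_is from HTML::TokeParser::Simple, so a reasonably recent version of the module is assumed.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical helper: remove every attribute from every start tag,
# except the attribute names passed in @keep (for example 'href').
sub strip_attributes {
    my ( $html, @keep ) = @_;
    my %keep = map { lc($_) => 1 } @keep;

    my $parser = HTML::TokeParser::Simple->new( \$html );
    my $out    = '';

    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag ) {
            # Copy the names first, since delete_attr edits the list in place.
            my @names = @{ $token->get_attrseq };
            for my $name (@names) {
                next if $name eq '/';    # self-closing marker, not a real attribute
                $token->delete_attr($name) unless $keep{$name};
            }
        }
        # Text, end tags, and comments pass through unchanged.
        $out .= $token->as_is;
    }
    return $out;
}

# Made-up sample in the style of Word-generated markup.
my $word_html =
  '<p class="MsoNormal" style="margin:0in">See <a href="#" target="_blank">this</a>.</p>';
print strip_attributes( $word_html, 'href' ), "\n";
```

Because the parser works token by token, it should not matter that the snippet has no surrounding <html> or <body> element.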

As for the other stuff, that will take some trial and error. I am going to think more about that and post later if I can think of something clever.

Later Update:

Well, here is something that makes me cringe a little but it seems to work:

#!/usr/bin/perl

use strict;
use warnings;

use File::Slurp;
use Text::Markdown qw( markdown );

my $html = read_file \*DATA;

# Treat every <br>, <br/>, or <br /> as a paragraph break
$html =~ s{<br\s*/?>}{\n\n}gi;

print markdown( $html );

__DATA__
This is a section of a blog post. It has <a href="#">links</a> and lists and stuff. Weee....
<br>
<br>
Here's a list
<br/>
<br />
<ul><li>Item 1</li><li>Item 2</li></ul>
And another paragraph here...
<br>
<br/>

Output:

<p>This is a section of a blog post. It has <a href="#">links</a> and lists and
stuff. Weee....</p>

<p>Here's a list</p>

<ul><li>Item 1</li><li>Item 2</li></ul>

<p>And another paragraph here...</p>
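If pulling in a Markdown module feels like too much of a hack, the same idea can be sketched with core Perl only. This is an alternative under my own assumptions, not the approach above: the helper name br_to_paragraphs is hypothetical, it treats two or more consecutive <br> tags as a paragraph break, leaves a single <br> alone as a line break, and skips wrapping for chunks that already start with a block-level tag.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: turn runs of two or more <br> tags into <p> blocks.
sub br_to_paragraphs {
    my ($html) = @_;

    # Split wherever two or more <br> variants appear back to back,
    # allowing whitespace and newlines between them.
    my @chunks = split m{(?:\s*<br\s*/?>\s*){2,}}i, $html;

    my @out;
    for my $chunk (@chunks) {
        $chunk =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next unless length $chunk;

        # Chunks that already start with a block-level tag are left alone.
        if ( $chunk =~ m{^<(?:p|ul|ol|div|h[1-6]|blockquote|table|pre)\b}i ) {
            push @out, $chunk;
        }
        else {
            push @out, "<p>$chunk</p>";
        }
    }
    return join "\n\n", @out;
}

# Made-up sample input.
my $snippet = "First paragraph.<br>\n<br>\nSecond one<br />with a line break.";
print br_to_paragraphs($snippet), "\n";
```

One known limitation of this sketch: text that follows a closing block tag such as </ul> without a <br> pair in between stays attached to that chunk and is not wrapped, so posts like the sample above would still need a cleanup pass.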
