I have a project where my input files used to be XML. I’m now

Asked: May 20, 2026 (2026-05-20T07:21:22+00:00)

I have a project where my input files used to be XML. I’m now being asked to start processing HTML with embedded CSS instead, and I’d like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we’re moving to HTML with CSS, I’m thinking I’ll need to move to something else. That said, before I dig myself knee deep into silly decisions I’ll likely regret, I wanted to ask here: what do you guys use for this kind of task?

The structures of the old XML and the new HTML input files are pretty similar, and both hold the same information. The HTML uses divs in place of the XML’s text nodes, and holds its style information in style tags and style attributes instead of separate XML attributes.

An example of the old XML is:

<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
      h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
      o_size="11.04" o_cs="4.6">
Some text
</text>

An example of the new HTML is:

<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
  <span class="ft19" >
    Some text
  </span></nobr>
</div>

where “ft19” refers to a CSS rule defined at the top of the page, in this format:

.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
       font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
       x-pdf-letter-spacing:0.83px;}

Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:

my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node's bold value is: " . $text_node->getAttribute('bold');

as I’m able to do with the XML. Does anything like that exist for parsing HTML? I’d really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I’m trying to do.

Ideas?

1 Answer

Editorial Team · Answer 1 · 2026-05-20T07:21:23+00:00

The basic one I am aware of is HTML::Parser.

There is also a project that works with it, Marpa::HTML. It comes out of the larger Marpa parser project, which parses any language that can be described in BNF. Marpa is documented on the author’s blog, which is very interesting, but it is much newer and still experimental.

I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn uses HTML::PullParser, so those are options too.

If you need something even more generic (and evil), you can look into “writing” your own using something like Text::Balanced (which has some nice methods for tags, though I’m not sure about tag properties) or even Regexp::Grammars. That means reinventing the wheel somewhat, so I would only choose these routes if the modules above don’t do what you need.

Perhaps I haven’t helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.

Edit: one more parser for you that seems like it might do what you need: HTML::Tree. Look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
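
For what it's worth, whichever parser you pick will most likely hand you the style attribute as one raw string (with HTML::Element that would be `$node->attr('style')`) and the contents of the style tag as plain text, so you would still have to break those into name/value pairs yourself. Here is a minimal sketch of that step using only core Perl. The helper names `parse_declarations` and `parse_class_rules` are made up for illustration, and the regex only copes with simple `.class{...}` rules like the one in the question, not general CSS.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a CSS declaration list ("a:b;c:d;")
# into a hash reference of property => value.
sub parse_declarations {
    my ($text) = @_;
    my %props;
    for my $decl (split /;/, $text // '') {
        # Split on the first colon only, so colons inside a value survive.
        my ($name, $value) = split /:/, $decl, 2;
        next unless defined $value;
        s/^\s+|\s+$//g for $name, $value;    # trim surrounding whitespace
        $props{lc $name} = $value if length $name;
    }
    return \%props;
}

# Hypothetical helper: collect simple ".class{...}" rules from the text
# of a style tag into a hash reference of class name => properties.
sub parse_class_rules {
    my ($css) = @_;
    my %rules;
    while ($css =~ /\.([\w-]+)\s*\{([^}]*)\}/g) {
        my ($class, $body) = ($1, $2);
        $rules{$class} = parse_declarations($body);
    }
    return \%rules;
}

# Sample data copied from the question.
my $css = '.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
       font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
       x-pdf-letter-spacing:0.83px;}';
my $inline = 'position:absolute;top:145;left:89;x-pdf-top:744;'
           . 'x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;';

my $rules        = parse_class_rules($css);
my $inline_style = parse_declarations($inline);

print "font-style: $rules->{ft19}{'font-style'}\n";
print "x-pdf-left: $inline_style->{'x-pdf-left'}\n";
```

With the sample data this prints `font-style: italic` and `x-pdf-left: 60`. There is no `bold` attribute any more, so a check like the one in the question would have to look at properties such as `font-style` or `font-family` in the class rule instead.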
