I have an XML document like this:
<article>
<author>Smith</author>
<date>2011-10-10</date>
<description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>
</article>
I need to parse this in Perl and then add new tags around some words or phrases (eg to link to definitions). I want to tag only the first instance of a target word and narrow my search to just what’s in a given tag (eg description tag only).
I can parse with XML::Twig and set a “twig_handler” for the description tag. But when I call $node->text I get the text with intervening tags removed. Really what I want to do is traverse down the (very small) tree so that existing tags are preserved and not broken. The final XML output should therefore look like this:
<article>
<author>Smith</author>
<date>2011-10-10</date>
<description>Article about <b><a href="dictionary.html#frobnitz">frobnitz</a></b>, <a href="dictionary.html#crulps">crulps</a> and <a href="dictionary.html#furtikurty">furtikurty</a>'s. Mainly frobnitz</description>
</article>
I also have XML::LibXML available on the target environment but I’m not sure how to start there…
Here’s my minimal test case so far. Appreciate any help!
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my %dictionary = (
frobnitz => 'dictionary.html#frobnitz',
crulps => 'dictionary.html#crulps',
furtykurty => 'dictionary.html#furtykurty',
);
sub markup_plain_text {
my ( $text ) = @_;
foreach my $k ( keys %dictionary ) {
$text =~ s/(^|\W)($k)(\W|$)}/$1<a href="$dictionary{$k}">$2<\/a>$3/si;
}
return $text;
}
sub convert {
my( $t, $node ) = @_;
warn "convert: TEXT=[" . $node->text . "]\n";
$node->set_text( markup_plain_text($node->text) );
return 1;
}
sub markup {
my ( $text ) = @_;
my $t = XML::Twig->new(
twig_handlers => { description => \&convert },
pretty_print => 'indented',
);
$t->parse( $text );
return $t->flush;
}
my $orig = <<END_XML;
<article>
<author>Smith</author>
<date>2011-10-10</date>
<description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz's</description>
</article>
END_XML
;
markup($orig);
It’s a slightly tricky one, but XML::Twig is designed (and I use it heavily) to this kind of processing. So there is a specific method, called
mark, that takes a regexp and tags the matches.In this case the regexp will likely be quite big. I used Regexp::Assempble to build it, so it gets optimized. Then an other problem is that
markdoesn’t let you use the text of the match to set an attribute (I might work on this in the next version of the module, that would be useful), so I has to mark first, then go back and set thehrefattribute in a second pass (in any case the second pass is needed to “un-link” words that have already been linked).One last word: I nearly gave up on writing the solution, because your example data has a few typos. There is nothing worse than getting the code right, just to see the test still fail because you use ‘dictionary’ in the code and ‘definitions’ in the data, or ‘furtykurtle’, ‘furtikurty’ and ‘furtijurty’ where it should all be the same word. So please, before posting, make sure your data is right. Thankfully I was writing the code as a test.