Editorial Team
Asked: May 22, 2026

I have an XML document like this:

<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>
</article>

I need to parse this in Perl and then add new tags around some words or phrases (e.g. to link to definitions). I want to tag only the first instance of each target word, and to restrict the search to the content of a given tag (e.g. the description tag only).

I can parse with XML::Twig and set a handler in “twig_handlers” for the description tag. But when I call $node->text I get the text with the intervening tags removed. What I really want is to traverse down the (very small) tree so that existing tags are preserved and not broken. The final XML output should therefore look like this:

<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b><a href="dictionary.html#frobnitz">frobnitz</a></b>, <a href="dictionary.html#crulps">crulps</a> and <a href="dictionary.html#furtikurty">furtikurty</a>'s. Mainly frobnitz</description>
</article>

I also have XML::LibXML available on the target environment but I’m not sure how to start there…

Here’s my minimal test case so far. Appreciate any help!

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my %dictionary = (
    frobnitz    => 'dictionary.html#frobnitz',
    crulps      => 'dictionary.html#crulps',
    furtykurty  => 'dictionary.html#furtykurty',
    );

sub markup_plain_text { 
    my ( $text ) = @_;

    foreach my $k ( keys %dictionary ) {
        $text =~ s/(^|\W)($k)(\W|$)/$1<a href="$dictionary{$k}">$2<\/a>$3/si;
    }

    return $text;
}

sub convert {
    my( $t, $node ) = @_;
    warn "convert: TEXT=[" . $node->text . "]\n";
    $node->set_text( markup_plain_text($node->text) );
    return 1;
}

sub markup {
    my ( $text ) = @_;

    my $t = XML::Twig->new(
        twig_handlers => { description => \&convert },
        pretty_print  => 'indented',
        );
    $t->parse( $text );

    return $t->flush;
}


my $orig = <<END_XML;
<article>
    <author>Smith</author>
    <date>2011-10-10</date>
    <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz's</description>
</article>
END_XML
;

markup($orig);
1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 1:27 am

    It’s a slightly tricky one, but XML::Twig is designed for this kind of processing (and I use it heavily for that). There is a specific method, called mark, that takes a regexp and wraps the matches in a tag.
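
As a minimal sketch of mark in isolation (the one-line input document and the two-word regexp below are assumptions made up for illustration, not part of the full solution), the following wraps each match in an a element, including the match inside the existing b element:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical one-element document, used only for this illustration.
my $twig = XML::Twig->new;
$twig->parse('<description>Article about <b>frobnitz</b>, crulps and more</description>');

my $description = $twig->root;

# The capturing group decides which part of each match gets wrapped in <a>.
# mark works on the text inside child elements too, so <b> is left intact.
my @marked = $description->mark( qr/\b(frobnitz|crulps)\b/, 'a' );

print scalar(@marked), " element(s) created\n";
print $description->sprint, "\n";
```

The returned list holds only the newly created elements, which is what makes a second pass over them possible.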

    In this case the regexp will likely be quite big, so I used Regexp::Assemble to build it, which also optimizes it. Another problem is that mark doesn’t let you use the text of the match to set an attribute (I might work on this in the next version of the module; that would be useful), so I had to mark first, then go back and set the href attribute in a second pass (in any case the second pass is needed to “un-link” words that have already been linked).
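
If Regexp::Assemble is not available, a similar regexp can be built with core Perl only. This is a swapped-in alternative, not the approach described above: it is a plain alternation with no optimization, and the sample sentence is an assumption for illustration.

```perl
use strict;
use warnings;

my %dictionary = (
    frobnitz   => 'definitions.html#frobnitz',
    crulps     => 'definitions.html#crulps',
    furtikurty => 'definitions.html#furtikurty',
);

# Escape each key and put longer keys first, so that a longer term is
# tried before a shorter term sharing the same prefix.
my $alternation = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %dictionary;

# One capturing group around the whole alternation, anchored on word boundaries.
my $match_defs = qr/\b($alternation)\b/;

# Hypothetical sample text, only to show what the regexp captures.
my $text  = "Article about frobnitz, crulps and furtikurty's. Mainly frobnitz";
my @found = $text =~ /$match_defs/g;
print join( ', ', @found ), "\n";
```

Every occurrence is matched, including the repeated one, which is why a later pass still has to decide which matches keep their link.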

    One last word: I nearly gave up on writing the solution, because your example data has a few typos. There is nothing worse than getting the code right, just to see the test still fail because you use ‘dictionary’ in the code and ‘definitions’ in the data, or ‘furtykurtle’, ‘furtikurty’ and ‘furtijurty’ where it should all be the same word. So please, before posting, make sure your data is right. Thankfully I was writing the code as a test.

    #!/usr/bin/perl 
    
    use strict;
    use warnings;
    
    use XML::Twig;
    use Regexp::Assemble;
    
    use Test::More tests => 1; 
    use autodie qw(open);
    
    my %dictionary = (
        frobnitz    => 'definitions.html#frobnitz',
        crulps      => 'definitions.html#crulps',
        furtikurty  => 'definitions.html#furtikurty',
        );
    
    my $match_defs= Regexp::Assemble->new()
                                    ->add( keys %dictionary)
                                    ->anchor_word
                                    ->as_string;
    # I am not familiar enough with Regexp::Assemble to know a cleaner
    # way to get the capturing parentheses in the regexp
    $match_defs= qr/($match_defs)/; 
    
    my $in       = data_para(); 
    my $expected = data_para();
    my $out;
    open( my $out_fh, '>', \$out);
    
    
    XML::Twig->new( twig_roots => { 'description' => sub { tag_defs( @_, $out_fh, $match_defs, \%dictionary); } },
                    twig_print_outside_roots => $out_fh, 
                  )
             ->parse( $in);
    
    is( $out, $expected, 'base test');
    exit;
    
    sub tag_defs
      { my( $t, $description, $out_fh, $match_defs, $dictionary)= @_;
    
        my @a= $description->mark( $match_defs, 'a' );
    
        # word => 1 when already used in this description
        # this might need to have a different scope if you need to tag
        # only the first time the word appears in a section or whatever
        my $tagged_in_description; 
    
        foreach my $a (@a) 
          { my $word= $a->text;
            warn "checking a: ", $a->sprint, "\n";
    
            if( $tagged_in_description->{$word})
              { $a->erase; } # we did not need to tag it after all
            else
              { $a->set_att( href => $dictionary->{$word}); }
            $tagged_in_description->{$word}++;
          }
    
        $t->flush( $out_fh); }
    
    
    sub def_href
      { my( $word)= @_;
        return $dictionary{$word};
      }
    
    sub data_para
      { local $/="\n\n";
        my $para= <DATA>;
        return $para;
      }
    
    __DATA__
    <article>
      <author>Smith</author>
      <date>2011-10-10</date>
      <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>
    </article>
    
    <article>
      <author>Smith</author>
      <date>2011-10-10</date>
      <description>Article about <b><a href="definitions.html#frobnitz">frobnitz</a></b>, <a href="definitions.html#crulps">crulps</a> and <a href="definitions.html#furtikurty">furtikurty</a>'s. Mainly frobnitz</description>
    </article>
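
Since XML::LibXML is also available on the target environment, here is a rough sketch of the same idea with that module. It is not part of the XML::Twig solution above and makes several assumptions: the regexp is a plain alternation, the “first instance” scope is one description element, and words that already sit inside an existing a element are not treated specially.

```perl
use strict;
use warnings;
use XML::LibXML;

my %dictionary = (
    frobnitz   => 'definitions.html#frobnitz',
    crulps     => 'definitions.html#crulps',
    furtikurty => 'definitions.html#furtikurty',
);
my $alternation = join '|', map { quotemeta } sort keys %dictionary;
my $match_defs  = qr/\b($alternation)\b/;

my $xml = <<'END_XML';
<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>
</article>
END_XML

my $doc = XML::LibXML->load_xml( string => $xml );

for my $description ( $doc->findnodes('//description') ) {
    my %seen;    # words already linked in this description

    # findnodes returns a list up front, so nodes can be replaced in the loop
    for my $text_node ( $description->findnodes('.//text()') ) {

        # with a capturing group, split keeps the matches at the odd indexes
        my @parts = split /$match_defs/, $text_node->data;
        next if @parts < 2;    # no dictionary word in this text node

        my $parent = $text_node->parentNode;
        for my $i ( 0 .. $#parts ) {
            my $part = $parts[$i];
            next if $part eq '';

            my $new;
            if ( $i % 2 && !$seen{$part}++ ) {    # first occurrence: link it
                $new = $doc->createElement('a');
                $new->setAttribute( href => $dictionary{$part} );
                $new->appendText($part);
            }
            else {                                # plain text or repeated word
                $new = $doc->createTextNode($part);
            }
            $parent->insertBefore( $new, $text_node );
        }
        $parent->removeChild($text_node);
    }
}

print $doc->toString;
```

Working one text node at a time is what keeps the existing b element intact: only text nodes are ever replaced, never their parent elements.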
    