Im pretty sure this is really basic. However I have no knowledge of Perl

Question

Asked: June 1, 2026 (2026-06-01T18:49:05+00:00)

I'm pretty sure this is really basic. However, I have no knowledge of Perl and only need to use it this once, so I appreciate your patience.

I am trying to remove unwanted text from a single line below which is in HTML:

    <a target="_blank"          href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>

All I want to be left with is Run Printable TCI List (Revised), which is the text at the end before the </a>. I have around 500 of these lines, and since they could change in the future it makes sense to write a program. Below is my Perl code so far:

open (SEARK, 'C:\\HTMLsorter\\sources.txt');
open (OUTSEARK, '>C:\\HTMLsorter\\outseark.txt');
while(<SEARK>) {
  chomp;

  if ($_=~/<a target/) {
    $_ =~ s/\<i>//g;
    $_ =~ s/\<\/i>//g;
    @itemsa = split(/>/);
    @itemsb = split(/</, $itemsa[1]);
    print OUTSEARK ("$itemsb[0]\n");
  }
}
close (SEARK);
close (OUTSEARK);

I'm sure you can read this, but just to explain: I am opening a file called sources.txt, which holds the 500 lines to be sorted. The output file will be outseark.txt. So far it outputs this:

Run Printable TCI List (Revised)

This is obviously because the split acts on every angle bracket on the line. Any ideas how I keep the italics inside the brackets? I want to be left with:

Run Printable TCI List (<i>Revised<i>)

Thanks for looking.

1 Answer

Editorial Team · Answer 1 · 2026-06-01T18:49:07+00:00

You should use a proper HTML parser, such as HTML::TreeBuilder. The code is no more complex, as this program demonstrates:

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

print $_->as_text, "\n" for $tree->look_down(_tag => 'a', target => qr/./);

__DATA__
    <a target="_blank"          href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>

output

Run Printable TCI List (Revised)

Edit

To use this technique on the files in your example, the code looks like this:

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file('C:\HTMLsorter\sources.txt');

open my $out, '>', 'C:\HTMLsorter\outseark.txt' or die $!;

print $out $_->as_text, "\n" for $tree->look_down(_tag => 'a', target => qr/./);

Edit 2

Now that I understand better what you need I can offer this alternative solution. It uses the HTML::DOM module to access the Document Object Model of an HTML document, as getting the result you needed with HTML::TreeBuilder is relatively difficult.

I've also noticed that your sample HTML contains <i>Revised<i>, which clearly should be <i>Revised</i>, and I have corrected it for this sample test. Regardless, the parser tries to handle bad HTML as a browser would, and even with the error the output is usable.

use strict;
use warnings;

use HTML::DOM;

my $dom = HTML::DOM->new;
$dom->parse_file('C:\HTMLsorter\sources.txt') or die $!;

open my $out, '>', 'C:\HTMLsorter\outseark.txt' or die $!;
print $out $_->innerHTML, "\n" for grep $_->attr('target'), $dom->getElementsByTagName('a');

output

(With tags corrected)

Run Printable TCI List (<i>Revised</i>)

(With original tags)

Run Printable TCI List (<i>Revised<i>)</i></i>
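
The stray </i></i> in the second output comes from the malformed <i>Revised<i> pair in the source. If correcting around 500 lines by hand is impractical, the tags could be normalised before parsing. What follows is only a minimal sketch in core Perl, not part of either module: fix_italic_tags is a hypothetical helper name, and it assumes that italics are never nested on purpose and that each pair sits on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite a line so that <i> tags alternate open/close.
# A genuine </i> always closes; a second <i> seen while one is still open
# is treated as a mistyped </i>. Assumes italics are never nested on purpose.
sub fix_italic_tags {
    my ($line) = @_;
    my $open = 0;    # true while an <i> is waiting for its closing tag
    $line =~ s{<(/?)i>}{
        my $closing = $1 || $open;
        $open = !$closing;
        $closing ? '</i>' : '<i>';
    }ge;
    return $line;
}

print fix_italic_tags('Run Printable TCI List (<i>Revised<i>)'), "\n";
# prints: Run Printable TCI List (<i>Revised</i>)
```

Each line of sources.txt could be passed through a helper like this before the text is handed to the parser, so that innerHTML returns the corrected form shown above.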
