I'm pretty sure this is really basic. However, I have no knowledge of Perl and only need to use it this once, so I appreciate your patience.
I am trying to remove unwanted text from the single line of HTML below:
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>
All I want to be left with is Run Printable TCI List (<i>Revised</i>) which is the text at the end before the </a>. I have around 500 of these lines and since they could be changed in the future it makes sense to create a program. Below is my Perl code so far:
open (SEARK, 'C:\\HTMLsorter\\sources.txt');
open (OUTSEARK, '>C:\\HTMLsorter\\outseark.txt');
while(<SEARK>) {
chomp;
if ($_=~/<a target/) {
$_ =~ s/\<i>//g;
$_ =~ s/\<\/i>//g;
@itemsa = split(/>/);
@itemsb = split(/</, $itemsa[1]);
print OUTSEARK ("$itemsb[0]\n");
}
}
close (SEARK);
close (OUTSEARK);
I’m sure you can read this, but just to explain: I am opening a file called sources.txt, which holds the 500 lines to be sorted. The output file will be outseark.txt. So far it outputs this:
Run Printable TCI List (Revised)
This is obviously due to the split acting on everything in and around the angle brackets. Any ideas how I can keep the italics inside the brackets? I want to be left with:
Run Printable TCI List (<i>Revised<i>)
Thanks for looking.
You should use a proper HTML parser, such as HTML::TreeBuilder. The code is no more complex, as this program demonstrates.
output
Edit
To use this technique on the files in your example, the code looks like this
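A minimal sketch of this approach, assuming HTML::TreeBuilder is installed from CPAN. It is not necessarily the original program, and the sub names link_text and extract_links are made up for illustration. Note that as_text returns only the text of the link, so the <i> tags are dropped, much like the output the question already gets.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the plain text of the first <a> element in a line of HTML,
# or undef if the line contains no link.
sub link_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $link = $tree->look_down(_tag => 'a');    # first match in scalar context
    my $text = defined $link ? $link->as_text : undef;
    $tree->delete;                               # release the parse tree
    return $text;
}

# Hypothetical helper: write the link text of every line that contains
# a link in $in_path to $out_path, one result per line.
sub extract_links {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    while (my $line = <$in>) {
        next unless $line =~ /<a\s/i;            # skip lines without a link
        my $text = link_text($line);
        print {$out} "$text\n" if defined $text;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
}

# Only runs when two file names are given on the command line.
extract_links(@ARGV) if @ARGV == 2;
```

Saved under a name of your choice (say extract.pl), this could be run as perl extract.pl C:\HTMLsorter\sources.txt C:\HTMLsorter\outseark.txt. Lexical file handles and the three-argument open with an error check replace the bareword handles in the question, so a missing file is reported instead of silently producing no output.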
Edit 2
Now that I understand better what you need I can offer this alternative solution. It uses the HTML::DOM module to access the Document Object Model of an HTML document, as getting the result you needed with HTML::TreeBuilder is relatively difficult.
I’ve also noticed that your sample HTML contains <i>Revised<i>, which clearly should be <i>Revised</i>, and I have corrected it for this sample test. Regardless, the parser tries to handle bad HTML as a browser would, and even with the error the output is usable.
output
(With tags corrected)
(With original tags)
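A rough sketch of the HTML::DOM approach, assuming the module is installed from CPAN. It is not necessarily the original program, and the sub name link_inner_html is hypothetical. It relies on the innerHTML method of the element, which returns the markup inside the link rather than just its text, so the italics survive. The sample line keeps the href from the question but drops the other attributes for brevity.

```perl
use strict;
use warnings;
use HTML::DOM;

# Return the markup inside the first <a> element of a line of HTML,
# or undef if the line contains no link.
sub link_inner_html {
    my ($html) = @_;
    my $dom = HTML::DOM->new;
    $dom->write($html);      # feed the HTML to the parser
    $dom->close;             # signal the end of the document
    my ($link) = $dom->getElementsByTagName('a');    # list context: take the first
    return defined $link ? $link->innerHTML : undef;
}

# Sample line with the <i> tag corrected to close properly.
my $sample = '<a target="_blank" '
    . 'href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf">'
    . 'Run Printable TCI List (<i>Revised</i>)</a>';

my $inner = link_inner_html($sample);
print "$inner\n" if defined $inner;
```

To process the 500 lines, the same sub could be called once per line inside the while loop from the question, printing each result to the output file.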