I need to ignore or remove all text in between all HTML elements so I can generate a blank template from a given web page.
I am parsing with the Perl modules HTML::TreeBuilder and HTML::Element.
I have tried the ignore_text method noted in the documentation, but it doesn't give correct results.
I have also tried DOMXPath with PHP to do the same thing, and the results seemed too cumbersome to manage. Regexes might work, but they are a last resort for me.
This is part of my current code, very basic. The bottom is just output to a file. All the code is functional; I just need the formatting to work so I can generate template files.
my $url= "http://www.example.com";
my $page = get($url) or die $!;
my $tree = HTML::TreeBuilder->new_from_content($page);
$tree->parse_file($page);
$tree->ignore_text;
$tree->elementify;
open OUTPUT, "+>".$body;
my $output = $tree->as_HTML;
print OUTPUT $output;
close OUTPUT;
Thanks in advance for the help!
EDIT: I found the problem: ignore_text only works when you parse from a physical file. I had to save the page as a temp file, parse that, and output it the way I wanted with no text; then I called unlink($tmp) at the bottom to delete the file. My script has since grown much more complicated, reading from and writing to a database, and each time I need to create this temp file, which is kind of annoying…
Thanks for the reply below!
You are very close.
It looks like you need to set ignore_text with a true value, $tree->ignore_text(1), and then make sure it's set before calling parse_file.
Sorry this is a bit long, but I hope it helps.
Here is a quick pass at the new code; it is hard to test without an example page:
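A minimal sketch of that ordering, under some assumptions: the helper name blank_template and the sample markup are hypothetical, and the tree is built with new() plus parse()/eof() rather than new_from_content() so that ignore_text(1) is already in effect when parsing starts. The empty hash passed to as_HTML asks it to emit every end tag.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns the markup of $html with all text nodes dropped.
sub blank_template {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;   # empty tree, nothing parsed yet
    $tree->ignore_text(1);               # must be set before any parsing happens
    $tree->parse($html);                 # parse from a string, no temp file needed
    $tree->eof;                          # tell the parser the input is complete
    # undef: default entity encoding; '  ': indent string; {}: emit all end tags
    my $out = $tree->as_HTML(undef, '  ', {});
    $tree->delete;                       # release the tree
    return $out;
}

# Illustrative input only; any HTML string can be passed the same way.
my $sample = '<html><head><title>Demo</title></head>'
           . '<body><h1>Hello</h1><p class="intro">Some text</p></body></html>';
my $template = blank_template($sample);
print $template;
```

Because the helper takes a string, the page content returned by get($url) could be passed to it directly, which should avoid the temp file mentioned in the question.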
Here is my quick test script using a local file:
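A rough sketch of such a file-based test, assuming a throwaway file created with File::Temp stands in for test.html and that its contents are made up for illustration. The point is only that ignore_text(1) is called on a fresh tree before parse_file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Write an illustrative page to a temporary file (stand-in for test.html).
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} '<html><head><title>Test page</title></head>'
          . '<body><div id="main"><p>Remove me</p></div></body></html>';
close $fh or die "close failed: $!";

my $file_tree = HTML::TreeBuilder->new;
$file_tree->ignore_text(1);                         # set first ...
$file_tree->parse_file($filename)                   # ... then parse the file
    or die "cannot parse $filename: $!";

# {} as the third argument asks as_HTML to write every end tag.
my $file_output = $file_tree->as_HTML(undef, '  ', {});
$file_tree->delete;
print $file_output;
```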
Input test.html:
And output:
Good luck