I need to ignore or remove all text in between all HTML elements so

Asked: May 23, 2026 (2026-05-23T02:18:39+00:00)

I need to ignore or remove all text in between all HTML elements so I can generate a blank template from a given web page.

I am parsing with the Perl modules HTML::TreeBuilder and HTML::Element.

I have tried the ignore_text method described in the documentation, but it doesn’t produce the expected results.

I have also tried doing the same thing with DOMXPath in PHP, but the results were too cumbersome to manage. Regexes might work, but they are a last resort for me.

This is part of my current code; it is very basic. The bottom part just writes the output to a file. All of the code is functional; I just need the formatting to work so I can generate template files.

my $url= "http://www.example.com";


my $page = get($url) or die $!;
my $tree = HTML::TreeBuilder->new_from_content($page);

$tree->parse_file($page);

$tree->ignore_text;
$tree->elementify;

open OUTPUT, "+>".$body;
my $output = $tree->as_HTML;
print OUTPUT $output;
close OUTPUT;

Thanks in advance for the help!

EDIT: I found the problem: ignore_text only works when you parse from a physical file. I had to save the page as a temp file, parse it, and output it the way I wanted with no text; then I call unlink($tmp) at the bottom to delete the file. My script has since grown much more complicated, reading from and writing to a database, and each time I need to create this temp file, which is kind of annoying…

Thanks for the reply below!

1 Answer

Editorial Team · Answer 1 · 2026-05-23T02:18:39+00:00

You are very close.

It looks like you need to call ignore_text with a true value, $tree->ignore_text(1), and make sure it is set before calling parse_file.

Sorry this is a bit long, but I hope it helps.

Here is a quick pass at the new code; it is hard to test without an example page:

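my $tree = HTML::TreeBuilder->new;

$tree->ignore_text(1);       # must be set before parsing starts
$tree->parse_file( $page );  # $page must be a file name or filehandle here
$tree->elementify;           # call after parsing: it turns the tree into a plain HTML::Element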

Here is my quick test script using a local file:

use strict;
use warnings;

use HTML::TreeBuilder;

my $page = 'test.html';
my $tree = HTML::TreeBuilder->new();

$tree->ignore_text(1);
$tree->parse_file($page);
$tree->elementify;

print $tree->as_HTML;

Input test.html:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>title text</title>
</head>
<body>
  <h1>Heading 1</h1>
  <p>paragraph text</p>
</body>
</html>

And output:

<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head><body><h1></h1><p></body></html>
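
If the temp file is the annoying part, parsing straight from the string returned by get() should also work, as long as ignore_text is set before any parsing happens. My guess is that new_from_content parses the content inside the constructor, so a later ignore_text call comes too late. Here is a minimal, untested-against-your-page sketch; blank_template is just a hypothetical helper name I made up, and it uses the parse and eof methods instead of parse_file:

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: returns the markup of $html with all text nodes dropped.
sub blank_template {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_text(1);   # set before parsing, otherwise text is kept
    $tree->parse($html);     # parse from an in-memory string, no temp file
    $tree->eof;              # tell the parser the input is complete

    my $out = $tree->as_HTML;
    $tree->delete;           # free the tree once the output is built
    return $out;
}

# Example with inline markup; in the real script this would be the get($url) result.
print blank_template(
    '<html><head><title>title text</title></head>'
  . '<body><h1>Heading 1</h1><p>paragraph text</p></body></html>'
), "\n";
```

The helper returns a string, so it can be written to a file or a database without touching the disk in between.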

Good luck
