The Archive Base

Editorial Team
Asked: May 23, 2026

I need to ignore or remove all text in between all HTML elements so I can generate a blank template from a given web page.

I am parsing using the Perl modules HTML::TreeBuilder and HTML::Element.

I have tried the ignore_text method noted in the documentation but that doesn’t provide correct results.

I have also tried using DOMXPath with PHP to do the same thing, and the results seemed too cumbersome to manage. Regexes might work but are a last resort to me.

This is part of my current code, very basic. Bottom is just output to file. All code is functional I just need formatting to work so I can generate template files.


my $url= "http://www.example.com";

my $page = get($url) or die $!;
my $tree = HTML::TreeBuilder->new_from_content($page);

$tree->parse_file($page);

$tree->ignore_text;
$tree->elementify;

open OUTPUT, "+>".$body;
my $output = $tree->as_HTML;
print OUTPUT $output;
close OUTPUT;

Thanks in advance for the help!

EDIT: I found the problem – ignore_text only works when you parse from a physical file. I had to save the page as a temp file, parse it, and output it the way I wanted with no text; then I call unlink($tmp) at the bottom to delete the file. My script has since grown much more complicated, reading from and writing to a database, and each time I need to create this temp file, which is kind of annoying…

Thanks for the reply below!

1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 2:18 am

    You are very close.

    It looks like you need to call ignore_text with a true value, $tree->ignore_text(1), and make sure it is set before calling parse_file.

    Sorry this is a bit long, but I hope it helps.

    Here is a quick pass at the new code; it is hard to test without an example page:

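    my $tree = HTML::TreeBuilder->new;
    
    $tree->ignore_text(1);        # must be set before parsing
    $tree->parse_file( $page );   # $page has to be a file name here, not the page content
    $tree->elementify;            # only after parsing: this turns the tree into a plain HTML::Element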
    

    Here is my quick test script using a local file:

    use strict;
    use warnings;
    
    use HTML::TreeBuilder;
    
    my $page = 'test.html';
    my $tree = HTML::TreeBuilder->new();
    
    $tree->ignore_text(1);
    $tree->parse_file($page);
    $tree->elementify;
    
    print $tree->as_HTML;
    

    Input test.html:

    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>title text</title>
    </head>
    <body>
      <h1>Heading 1</h1>
      <p>paragraph text</p>
    </body>
    </html>
    

    And output:

    <html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head><body><h1></h1><p></body></html>
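
If writing a temp file is the annoying part, the same idea should also work on content held in memory. The catch is that new_from_content parses right away, so an ignore_text call made afterwards comes too late. Below is a minimal sketch that builds an empty tree, sets ignore_text(1), and only then feeds it the string with parse and eof. The helper name blank_template and the sample markup are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper (not from the original post): returns the markup
# of $html with the text between elements dropped.
sub blank_template {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;   # empty tree, nothing parsed yet
    $tree->ignore_text(1);               # has to be set before any parsing
    $tree->parse($html);                 # parse from a string in memory
    $tree->eof;                          # tell the parser the input is complete

    my $out = $tree->as_HTML;
    $tree->delete;                       # release the tree's memory
    return $out;
}

# In the real script this would be the string returned by get($url).
my $page = '<html><head><title>title text</title></head>'
         . '<body><h1>Heading 1</h1><p>paragraph text</p></body></html>';

print blank_template($page);
```

In a script that fetches many pages, calling the helper with each downloaded string would avoid the temp file and the unlink step.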
    

    Good luck
