The Archive Base

Editorial Team
Asked: May 29, 2026 (2026-05-29T03:53:31+00:00)

This question has been rephrased. I am using the CPAN Perl modules WWW::Mechanize to navigate a website and HTML::TreeBuilder::XPath to capture the content, and Xacobeo to test my XPath code on the HTML/XML. The goal is to call this Perl script from a PHP-based website and upload the scraped contents into a database. Therefore, if content is “missing” it still needs to be accounted for.

Below is tested, reduced sample code depicting my challenge. Note:

  1. This page is dynamically filled and contains various ITEMS output for different stores; each store has a different number of Products, and each product listing may or may not have an itemized table underneath it.
  2. The captured data has to be in arrays, and the association of any itemized list (if it exists) with its Product listing has to be maintained.

Below, the example XML changes per store (as described above), but for brevity I show only one “type” of output. I realize that all the data can be captured into one array and regexes then used to decipher the content for upload into a database. I am seeking better knowledge of XPath to help streamline this (and future) solutions.

<!DOCTYPE XHTML>
<table id="8jd9c_ITEMS">
<tr><th style="color:red">The Products we have in stock!</th></tr>

<tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
<tr><td>
    <!--Table may or may not exist  -->
           <table>                                  
      <tr><td style="color:blue;text-indent:10px">Almonds</td></tr>
      <tr><td style="color:blue;text-indent:10px">Cashews</td></tr>
      <tr></tr>
    </table>
</td></tr>

<tr><td><span id="Product_VEGGIES">We have veggies!</span></td></tr>
<tr><td>
    <!--Table may or may not exist -->
    <table>
      <tr><td style="color:blue;text-indent:10px">Carrots</td></tr>
      <tr><td style="color:blue;text-indent:10px">Celery</td></tr>
      <tr></tr>
    </table>
</td></tr>

<tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
    <!--In this case, the table does not exist -->
</table>

An XPath statement of:

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text()'

would find:

We have nuts!
We have veggies!
We have booze!

And an XPath statement of:

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'

would find:

Almonds
Cashews
Carrots
Celery

The two XPath statements can be combined:

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text() | //table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'

To find:

We have nuts!
Almonds
Cashews
We have veggies!
Carrots
Celery
We have booze!

Again, the above array can be deciphered (in the real code) for its product-to-list association using regex. But can the array be built using XPath in a manner that would keep that association?

For example (pseudo-speak, this does not work):

'//table[contains(@id, "ITEMS")]/tr[position()>1]/td/span/text() | 
if exists('//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table') 
then ("TableRef") else ("NoTable") | 
Save this result into @TableRef ('//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()')'

Perl has no multi-dimensional arrays in the traditional sense; nested structures are built from references (see perldoc perlref). But hopefully a solution similar to the above could create something like:

$ITEMS[0] => We have nuts!
$ITEMS[1] => nutsREF     <-- say, the last word of the span value + REF
$ITEMS[2] => We have veggies!
$ITEMS[3] => veggiesREF  <-- say, the last word of the span value + REF
$ITEMS[4] => We have booze!
$ITEMS[5] => NoTable     <-- value accounts for the missing info

$nutsREF[0] => Almonds
$nutsREF[1] => Cashews

$veggiesREF[0] => Carrots
$veggiesREF[1] => Celery

In the real code the Products are known, so my @veggiesREF and my @nutsREF can be defined in anticipation of the XPath output.
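
The layout above can be sketched with plain Perl references. This is only an illustration, assuming the scraped data has already been collected into a hypothetical @scraped list of [heading, [items]] pairs; the names @scraped and %lists are not part of any module.

```perl
use strict;
use warnings;

# Hypothetical scraped result: one [heading, [items]] pair per product, in page order.
my @scraped = (
    [ 'We have nuts!',    [ 'Almonds', 'Cashews' ] ],
    [ 'We have veggies!', [ 'Carrots', 'Celery'  ] ],
    [ 'We have booze!',   [ ] ],
);

my @ITEMS;   # flat list: heading, then a reference name or 'NoTable'
my %lists;   # reference name => array reference holding the itemized list

for my $pair (@scraped) {
    my ($heading, $items) = @$pair;
    push @ITEMS, $heading;

    if (@$items) {
        # Last word of the heading, without trailing punctuation, plus 'REF'.
        my ($last_word) = $heading =~ /(\w+)\W*$/;
        my $name = $last_word . 'REF';
        $lists{$name} = $items;
        push @ITEMS, $name;
    }
    else {
        # No itemized table: record a placeholder so the gap is accounted for.
        push @ITEMS, 'NoTable';
    }
}

print "$_\n" for @ITEMS;
print "nutsREF: @{ $lists{nutsREF} }\n";
```

Keeping the lists in one hash keyed by name avoids having to declare a separate array per product in advance.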

I realize that if/then/else is an XPath 2.0 feature. I am on an Ubuntu system and working locally, but I am still not clear on whether my Apache2 server is using version 2.0 or 1.0. How do I check that?
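
One way to approach the version question is to probe the XPath engine itself, since the engine belongs to the Perl module doing the parsing rather than to the web server. The sketch below assumes XML::LibXML (which wraps libxml2, an XPath 1.0 implementation) and simply checks whether an XPath 2.0 conditional is accepted. The helper name probe_xpath_conditional is made up for this example.

```perl
use strict;
use warnings;

# Returns a short verdict string describing what the probe found.
sub probe_xpath_conditional {
    # Load XML::LibXML at run time so the probe degrades gracefully without it.
    return 'XML::LibXML is not installed'
        unless eval { require XML::LibXML; 1 };

    my $doc = XML::LibXML->load_xml(string => '<r><a>1</a></r>');

    # "if ... then ... else" is XPath 2.0 syntax; an engine that only
    # implements XPath 1.0 is expected to throw an error on it.
    my $accepted = eval { $doc->findvalue('if (/r/a) then "yes" else "no"'); 1 };

    return $accepted ? 'conditional accepted' : 'conditional rejected';
}

my $verdict = probe_xpath_conditional();
print "$verdict\n";
```

The same kind of probe could be pointed at an HTML::TreeBuilder::XPath tree by calling its findvalue method inside the eval instead.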

Finally, if you can show how to call a Perl script from a PHP form submit AND how to pass a Perl array back to the calling PHP function, then that would go a long way to getting the bounty. 🙂
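
One common way to hand a Perl data structure back to a caller in another language is to print it as JSON. Below is a minimal sketch of the Perl side only, using the core JSON::PP module; the %cat_items hash is hypothetical stand-in data, not output from the real scraper.

```perl
use strict;
use warnings;
use JSON::PP;   # ships with Perl since 5.14

# Hypothetical data standing in for the scraped results.
my %cat_items = (
    'We have nuts!'    => [ 'Almonds', 'Cashews' ],
    'We have veggies!' => [ 'Carrots', 'Celery'  ],
    'We have booze!'   => [ ],
);

# canonical() sorts hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode(\%cat_items);

# Whatever the script prints to STDOUT is what the calling process receives.
print "$json\n";
```

On the PHP side, the script's output could then be captured with shell_exec() and converted into a PHP array with json_decode($output, true). The empty list for the booze entry survives the round trip, so the "missing" table is still accounted for.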

Thanks!

FINAL EDIT:

Comments immediately below this post were directed at an initial post that was too vague. ikegami responded to the subsequent re-post (and bounty) with a very creative approach that solved the pseudo problem, but it proved difficult for me to grasp and reuse in my real application, which entails multiple uses on various HTML pages. Around the 18th comment in our dialog I finally understood his use of my ($cat): a list assignment, a piece of Perl syntax I had not found documented. For new readers, understanding that syntax makes it possible to understand (and adapt) his solution. His post certainly meets the basic requirements sought in the OP but does not use HTML::TreeBuilder::XPath to do it.
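
The ($cat) syntax mentioned above is Perl's list assignment (covered in perldoc perldata): parentheses around the left-hand variable put the right-hand side in list context and keep its first element. A small sketch with made-up data showing the difference:

```perl
use strict;
use warnings;

my @words = ('nuts', 'veggies', 'booze');

# Parentheses on the left make this a list assignment:
# $first receives the first element and the rest are discarded.
my ($first) = @words;

# Without parentheses the right side is evaluated in scalar context,
# where an array yields its element count.
my $count = @words;

# The same rule applies to map: list context returns the mapped values,
# scalar context returns how many values were produced.
my ($first_upper) = map uc($_), @words;
my $how_many      = map uc($_), @words;

print "$first $count $first_upper $how_many\n";   # prints "nuts 3 NUTS 3"
```

So in the answer's code, my ($cat) = map ... stores the text of the first matching span, whereas my $cat = map ... would store the number of spans found.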

jpalecek uses the HTML::TreeBuilder::XPath but does not place the captured data into arrays for passing back to a PHP function and uploading into a database.

I have learned from both responders and hope this post helps others who are new to Perl, like myself. Any final contributions would be greatly appreciated.

1 Answer

  1. Editorial Team
    Added an answer on May 29, 2026 at 3:53 am (2026-05-29T03:53:32+00:00)

    If I were to guess, your question is: “How do I get the following from the provided input?”

    my $categorized_items = {
       'We have nuts!'    => [ 'Almonds', 'Cashews' ],
       'We have veggies!' => [ 'Carrots', 'Celery' ],
       'We have booze!'   => [ ],
    };
    

    If so, here’s how I’d do it:

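    use strict;
    use warnings;
    use Data::Dumper qw( Dumper );
    use XML::LibXML  qw( );
    
    # Parse the document that follows __DATA__ and start from its root element.
    my $root = XML::LibXML->load_xml(IO => \*DATA)->documentElement;
    
    my %cat_items;
    # Visit every row of the ITEMS table that holds a product heading (a span).
    for my $cat_tr ($root->findnodes('//table[contains(@id, "ITEMS")]/tr[td/span]')) {
       # List assignment: keep the text of the first (only) span in this row.
       my ($cat) = map $_->textContent(),
          $cat_tr->findnodes('td/span');
    
       # The items sit in the row right after the heading row, if it has a nested table.
       my @items = map $_->textContent(),
          $cat_tr->findnodes('following-sibling::tr[position()=1]/td/table/tr/td');
    
       # An empty array reference records a product with no itemized table.
       $cat_items{$cat} = \@items;
    }
    
    print(Dumper(\%cat_items));
    
    __DATA__
    ...xml...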
    

    PS – What you have there isn’t valid HTML.

    1. A TABLE element cannot be placed directly inside a TR element. There’s a missing TD element.
    2. A TR element cannot be empty. It must have at least one TH or TD element.
© 2021 The Archive Base. All Rights Reserved