This question has been rephrased. I am using CPAN Perl modules WWW::Mechanize to navigate

Question

Asked: May 29, 2026 (2026-05-29T03:53:31+00:00)

This question has been rephrased. I am using the CPAN Perl modules WWW::Mechanize to navigate a website and HTML::TreeBuilder::XPath to capture the content, plus xacobeo to test my XPath code on the HTML/XML. The goal is to call this Perl script from a PHP-based website and upload the scraped contents into a database. Therefore, if content is “missing” it still needs to be accounted for.

Below is a tested, reduced sample depicting my challenge. Note:

This page is dynamically filled and contains various ITEMS output for different stores; a different number of Products will exist for each store, and each product listing may or may not have an itemized table underneath it.
The captured data has to be in arrays, and the association of any itemized list (if it exists) with its Product listing has to be maintained.

Below, the example XML changes per store (as described above), but for brevity I only show one “type” of output. I realize that all the data can be captured into one array and regexes then used to decipher the content for uploading into a database. I am seeking a better knowledge of XPath to help streamline this (and future) solution(s).

<!DOCTYPE XHTML>
<table id="8jd9c_ITEMS">
<tr><th style="color:red">The Products we have in stock!</th></tr>

<tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
<tr><td>
    <!--Table may or may not exist  -->
    <table>
      <tr><td style="color:blue;text-indent:10px">Almonds</td></tr>
      <tr><td style="color:blue;text-indent:10px">Cashews</td></tr>
      <tr></tr>
    </table>
</td></tr>

<tr><td><span id="Product_VEGGIES">We have veggies!</span></td></tr>
<tr><td>
    <!--Table may or may not exist -->
    <table>
      <tr><td style="color:blue;text-indent:10px">Carrots</td></tr>
      <tr><td style="color:blue;text-indent:10px">Celery</td></tr>
      <tr></tr>
    </table>
</td></tr>

<tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
    <!--In this case, the table does not exist -->
</table>

An XPath statement of:

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text()'

would find:

We have nuts!
We have veggies!
We have booze!

And an XPath statement of:

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'

would find:

Almonds
Cashews
Carrots
Celery

The two XPath statements can be combined:

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text() | //table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'

To find:

We have nuts!
Almonds
Cashews
We have veggies!
Carrots
Celery
We have booze!

Again, the above array can be deciphered (in the real code) for its product-to-list association using regex. But can the array be built using XPath in a manner that would keep that association?

For example (pseudo-speak, this does not work):

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text() |
if exists('//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table')
then ("TableRef") else ("NoTable") |
Save this result into @TableRef ('//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()')'

It is not possible to build multi-dimensional arrays (in the traditional sense) in Perl; nested structures are built from references instead (see perldoc perlref). But hopefully a solution similar to the above could create something like:

$ITEMS[0] => We have nuts!
$ITEMS[1] => nutsREF     <-- say, the last word of the span value + REF
$ITEMS[2] => We have veggies!
$ITEMS[3] => veggiesREF  <-- say, the last word of the span value + REF
$ITEMS[4] => We have booze!
$ITEMS[5] => NoTable     <-- value accounts for the missing info

$nutsREF[0] => Almonds
$nutsREF[1] => Cashews

$veggiesREF[0] => Carrots
$veggiesREF[1] => Celery

In the real code the Products are known, so my @veggiesREF and my @nutsREF can be defined in anticipation of the XPath output.
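
The layout described above can also be held in a single nested structure. The following is a minimal sketch, not code from the question: it assumes the scraped values are already available, and the names `@items`, `product` and `list` are hypothetical. Each record keeps a product together with a reference to its (possibly empty) item list, so the association is preserved and a missing table shows up as an empty list.

```perl
use strict;
use warnings;

# Hypothetical data, copied by hand from the sample page in the question.
my @items = (
    { product => 'We have nuts!',    list => [ 'Almonds', 'Cashews' ] },
    { product => 'We have veggies!', list => [ 'Carrots', 'Celery'  ] },
    { product => 'We have booze!',   list => [ ] },   # no itemized table
);

# Build one summary line per product; an empty list is reported as "NoTable".
my @summary;
for my $rec (@items) {
    my @list = @{ $rec->{list} };
    push @summary, sprintf '%s => %s',
        $rec->{product}, @list ? join(', ', @list) : 'NoTable';
}

print "$_\n" for @summary;
```

Because the array keeps insertion order, the records come out in page order, which a plain hash would not guarantee.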

I realize the XPath if/then/else functionality is in XPath 2.0. I am on an Ubuntu system and working locally, but I am still not clear on whether my Apache2 server is using that or the 1.0 version. How do I check that?
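
The XPath version is a property of the library that evaluates the expression rather than of the web server. As far as I know, both XML::XPathEngine (used by HTML::TreeBuilder::XPath) and libxml2 (used by XML::LibXML) implement XPath 1.0. One way to check empirically is to hand the engine a 2.0-only expression and see whether it is rejected. The sketch below assumes XML::LibXML is installed and that it raises an exception for an expression it cannot compile; the variable name `$xpath2_ok` is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML ();

# A trivial document is enough; only the expression syntax is being probed.
my $doc = XML::LibXML->load_xml(string => '<r/>');

# "if ... then ... else" exists only in XPath 2.0 and later, so an
# XPath 1.0 engine is expected to refuse to compile this expression.
my $xpath2_ok = eval {
    $doc->findvalue('if (true()) then "yes" else "no"');
    1;
} ? 1 : 0;

print $xpath2_ok
    ? "engine accepted XPath 2.0 syntax\n"
    : "engine rejected XPath 2.0 syntax (1.0 only)\n";
```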

Finally, if you can show how to call a Perl script from a PHP form submit AND how to pass back a Perl array to the calling PHP function, then that would go a long way to getting the bounty. 🙂
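
One common way to hand a Perl data structure back to PHP is to print it as JSON on standard output. A minimal sketch of the Perl side follows, assuming the core JSON::PP module is available and using hand-written data in a hypothetical `%cat_items` hash. On the PHP side, something like `shell_exec` followed by `json_decode` could read the output, though that half is not shown here and the script path would be your own.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical scraped result: product text => list of items (possibly empty).
my %cat_items = (
    'We have nuts!'    => [ 'Almonds', 'Cashews' ],
    'We have veggies!' => [ 'Carrots', 'Celery'  ],
    'We have booze!'   => [ ],
);

# canonical() sorts the hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode(\%cat_items);

# Whatever is printed here is what the calling process receives.
print "$json\n";
```

An empty array is encoded as `[]`, so a product without a table is still present in the output rather than silently dropped.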

Thanks!

FINAL EDIT:

Comments immediately below this post were directed at an initial post that was too vague. The subsequent re-post (and bounty) was responded to by ikegami with a very creative approach which solved the pseudo problem, but which was proving difficult for me to grasp and reuse in my real application, which entails multiple uses on various HTML pages. In about the 18th comment in our dialog I finally discovered his meaning and use of ($cat), a Perl list-assignment syntax that I had not been able to find documented. For new readers, understanding that syntax makes it possible to understand (and reformat) his intelligent solution to the problem. His post certainly meets the basic requirements sought in the OP but does not use HTML::TreeBuilder::XPath to do it.

jpalecek uses HTML::TreeBuilder::XPath but does not place the captured data into arrays for passing back to a PHP function and uploading into a database.

I have learned from both responders and hope this post helps others who are new to Perl, like myself. Any final contributions would be greatly appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-05-29T03:53:32+00:00

If I were to guess, your question is: “How do I get the following from the provided input?”

my $categorized_items = {
   'We have nuts!'    => [ 'Almonds', 'Cashews' ],
   'We have veggies!' => [ 'Carrots', 'Celery' ],
   'We have booze!'   => [ ],
};

If so, here’s how I’d do it:

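use strict;
use warnings;

use Data::Dumper qw( Dumper );
use XML::LibXML  qw( );

# Parse the sample document that follows __DATA__.
my $root = XML::LibXML->load_xml(IO=>\*DATA)->documentElement;

my %cat_items;
# Visit each category row, i.e. each TR that holds a TD/SPAN.
for my $cat_tr ($root->findnodes('//table[contains(@id, "ITEMS")]/tr[td/span]')) {
   # List assignment: $cat receives the text of the first matching span.
   my ($cat) = map $_->textContent(),
      $cat_tr->findnodes('td/span');

   # Items sit in the nested table of the very next row, if there is one;
   # otherwise @items is simply empty.
   my @items = map $_->textContent(),
      $cat_tr->findnodes('following-sibling::tr[position()=1]/td/table/tr/td');

   $cat_items{$cat} = \@items;
}

print(Dumper(\%cat_items));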

__DATA__
...xml...

PS – What you have there isn’t valid HTML.

A TABLE element cannot be placed directly inside a TR element. There’s a missing TD element.
A TR element cannot be empty. It must have at least one TH or TD element.
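
The `my ($cat) = ...` line in the code above relies on list assignment: the parentheses put the right-hand side in list context, and `$cat` receives the first element. Without the parentheses, an array on the right-hand side would be evaluated in scalar context and yield its element count instead. A small core-Perl sketch of the difference, with a hypothetical `@spans` array standing in for the `findnodes` result:

```perl
use strict;
use warnings;

my @spans = ('We have nuts!');   # stand-in for the matched span texts

my ($first) = @spans;   # list assignment: first element of the list
my $count   = @spans;   # scalar assignment: number of elements

my @none = ();
my ($missing) = @none;  # stays undef when nothing matched

print "first=$first count=$count\n";
print "missing is ", (defined $missing ? 'defined' : 'undef'), "\n";
```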
