The Archive Base


Editorial Team
Asked: June 12, 2026

I am using Curl, XPath and PHP in order to scrape product names and prices from HTML source code. Here is a sample similar to the source code I am examining:

<div class="Gamesdb">
  <p class="media-title">
    <a href="/Games/Console/4-/105/Bluetooth-Headset/">Bluetooth Headset</a>
  </p>
  <p class="sub-title"> Console </p>
  <p class="rating star-50">
    <a href="/Games/Console/4-/105/Bluetooth-Headset/ProductReviews.html">(1)</a>
  </p>
  <p class="mt5">
    <span class="price-preffix">
      <a href="/Games/Console/4-/105/Bluetooth-Headset/">1 New</a>
      from 
    </span>
    <a class="wt-link" href="/Games/Console/4-/105/Bluetooth-Headset/">
      <span class="price">
        <em>£34</em>
        .99
      </span>
      <span class="free-delivery"> FREE delivery</span>
    </a>
  </p>
  <p class="mt10">
    <a class="primary button" href="/Games/Console/4-/105/Bluetooth-Headset/">
      Product Details
      <span style="color: rgb(255, 255, 255); margin-left: 6px; font-size: 16px;">»</span>
    </a>
  </p>
</div>

I want to extract the media title i.e:

<p class="media-title">
    <a href="/Games/Console/4-/105/Bluetooth-Headset/">Bluetooth Headset</a>
    </p>

Only when the following price class is also present:

<span class="price">
    <em>£34</em>
    .99
    </span>

Many of the other products listed don’t include it.
I need to extract both the product name and price or nothing at all and move on to the next product.

Here is a sample of the code I am currently using. It gets all the results, regardless of any other conditions:

$results=file_get_contents('SCRAPEDHTML.txt');

$html = new DOMDocument();
@$html->loadHtml($results);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query('//p[@class="media-title"]|//span[@class="price"]');

foreach ($nodelist as $n) {
    $results2[] = $n->nodeValue;
}

I believe this is possible using the correct xpath query but have so far been unable to achieve it. Many thanks in advance.


1 Answer

  1. Editorial Team
    Added an answer on June 12, 2026 at 11:30 pm

    I am assuming there is only one “item” per div.Gamesdb. If not, there may not be enough structure in the source HTML to use XPath alone; you would probably have to index the product names and look for the price nearest each matching name.

    You can do this with a single giant XPath, but I recommend you use multiple XPaths. I’ll show both ways.

    First, create your DOMXPath and register a helper to match class names.

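    // This helper is the equivalent of the XPath:
    // contains(concat(' ', normalize-space(@attr), ' '), ' $token ')
    // It's not necessary, but it's a bit easier to read and more
    // robust than @attr="token", which fails on multi-class attributes.
    function has_token($attr, $token)
    {
        // php:function passes a node-set as an array of DOMAttr nodes;
        // the array is empty when the element lacks the attribute.
        if (empty($attr)) {
            return false;
        }
        $regex = '/(?:^|\s)'.preg_quote($token, '/').'(?:\s|$)/Su';
        return (bool) preg_match($regex, $attr[0]->value);
    }
    
    // $results holds the scraped HTML, as in the question.
    $d = new DOMDocument();
    @$d->loadHTML($results);
    $xp = new DOMXPath($d);
    $xp->registerNamespace("php", "http://php.net/xpath");
    $xp->registerPHPFunctions("has_token");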
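
    For comparison, the same token match can be written in plain XPath 1.0, with no PHP callback registered. The following is only a sketch: the helper name xp_has_token and the sample markup are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// builds the plain XPath 1.0 predicate that has_token() emulates.
function xp_has_token(string $attr, string $token): string
{
    return "contains(concat(' ', normalize-space(@{$attr}), ' '), ' {$token} ')";
}

// Made-up sample markup for illustration only.
$html = '<div class="box Gamesdb"><span class="price-preffix">from</span>'
      . '<span class="price big">1.99</span></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Matches class="price big" but not class="price-preffix",
// because the token is compared as a whole space-separated word.
$spans = $xpath->query('//span[' . xp_has_token('class', 'price') . ']');

$matched = [];
foreach ($spans as $span) {
    $matched[] = $span->textContent;
}
```

    The trade-off is readability: the pure XPath form avoids registerPHPFunctions() but makes long expressions harder to scan.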
    

    You can then use a giant XPath:

    $xp_container = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
    $xp_title = 'p[php:function("has_token", @class, "media-title")]';
    // The leading "." keeps the search inside the current div; a bare "//"
    // would search the whole document.
    $xp_price = './/span[php:function("has_token", @class, "price")]';
    
    // Only containers that have both a title and a price.
    $xp_both = "{$xp_container}[{$xp_title}][{$xp_price}]";
    $xp_titles_prices = "{$xp_both}/{$xp_title} | {$xp_both}/{$xp_price}";
    
    $nodes = $xp->query($xp_titles_prices);
    
    $items = array();
    
    $i = 0; // enumerator
    foreach ($nodes as $node) {
        $key = ($node->nodeName==='p') ? 'title' : 'price';
        $value = '';
        switch ($key) {
            case 'price':
                // remove inner whitespace
                $value = preg_replace('/\s+/Su', '', trim($node->textContent));
                break;
            case 'title':
                $value = preg_replace('/\s+/Su', ' ', trim($node->textContent));
                break;
        }
        $items[(int) floor($i/2)][$key] = $value;
        $i += 1;
    }
    

    However, the overall code is brittle and unclear. The XPath union operator (|) returns nodes in document order, so titles and prices come back interleaved and we can’t simply split the list in half. The PHP code must walk through every node in the list and work out from the DOM which field each one holds. Think about the changes you would have to make if you wanted to extend the code to collect a third field (e.g., the rating). Now imagine making those changes three months from now, when this code is no longer fresh in your mind.
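
    To see the document-order behaviour concretely, here is a small sketch on made-up markup (the class names and values are illustrative only). Even though the union names the prices first, the nodes come back in the order they appear in the document:

```php
<?php
// Made-up sample markup: two containers, each with a title and a price.
$html = '<div><p class="t">A</p><span class="p">1</span></div>'
      . '<div><p class="t">B</p><span class="p">2</span></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Prices are listed first in the union, titles second.
$nodes = $xpath->query('//span[@class="p"] | //p[@class="t"]');

$order = [];
foreach ($nodes as $node) {
    $order[] = $node->nodeName . ':' . $node->textContent;
}
// $order is interleaved per div, not "all prices, then all titles".
```

    This is why the loop above has to inspect each node's name to decide whether it is a title or a price.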

    I recommend you use multiple XPath calls instead and do the “do we have data for both price and title” check in PHP rather than XPath:

    $xpitems = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
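    // The expressions below are evaluated with each $xpitems node as context.
    // The leading "." in $xpprice keeps the search inside that node; a bare
    // "//span" would match the first price in the whole document.
    $xptitle = 'normalize-space(p[php:function("has_token", @class, "media-title")])';
    $xpprice = 'normalize-space(.//span[php:function("has_token", @class, "price")])';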
    
    $nodeitems = $xp->query($xpitems);
    
    $items = array();
    foreach ($nodeitems as $nodeitem) {
        $item = array(
            'title' => $xp->evaluate($xptitle, $nodeitem),
            'price' => str_replace(' ', '', $xp->evaluate($xpprice, $nodeitem)),
        );
        // Only add this item if we have data for *all* fields:
        if (count(array_filter($item)) === count($item)) {
            $items[] = $item;
        }
    }
    

    This is much easier to read and understand, and much easier to extend in the future.
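
    As a self-contained illustration of the multiple-XPath approach, the sketch below runs it end to end on made-up markup in which the second product has no price. It assumes a plain XPath 1.0 token match (the $token closure, a hypothetical stand-in for has_token) so that it runs without registering any PHP functions:

```php
<?php
// Made-up sample: the second product has no price and should be skipped.
$html = <<<HTML
<div class="Gamesdb">
  <p class="media-title"><a href="/a/">Bluetooth Headset</a></p>
  <p class="mt5"><a href="/a/"><span class="price"><em>&pound;34</em> .99</span></a></p>
</div>
<div class="Gamesdb">
  <p class="media-title"><a href="/b/">Charging Dock</a></p>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical stand-in for has_token(): a plain XPath 1.0 class-token test.
$token = function (string $t): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' {$t} ')";
};

$xptitle = 'normalize-space(p[' . $token('media-title') . '])';
// Leading "." restricts the search to the current container.
$xpprice = 'normalize-space(.//span[' . $token('price') . '])';

$items = [];
foreach ($xpath->query('//div[' . $token('Gamesdb') . ']') as $div) {
    $item = [
        'title' => $xpath->evaluate($xptitle, $div),
        'price' => str_replace(' ', '', $xpath->evaluate($xpprice, $div)),
    ];
    // Keep the item only when every field is non-empty.
    if (count(array_filter($item)) === count($item)) {
        $items[] = $item;
    }
}
```

    Adding a third field here would mean one more expression and one more array key, with no change to the loop logic.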
