I am using Curl, XPath and PHP in order to scrape product names and

Asked: June 12, 2026 (2026-06-12T23:30:19+00:00)

I am using Curl, XPath and PHP in order to scrape product names and prices from HTML source code. Here is a sample similar to the source code I am examining:

<div class="Gamesdb">
  <p class="media-title">
    <a href="/Games/Console/4-/105/Bluetooth-Headset/">Bluetooth Headset</a>
  </p>
  <p class="sub-title"> Console </p>
  <p class="rating star-50">
    <a href="/Games/Console/4-/105/Bluetooth-Headset/ProductReviews.html">(1)</a>
  </p>
  <p class="mt5">
    <span class="price-preffix">
      <a href="/Games/Console/4-/105/Bluetooth-Headset/">1 New</a>
      from 
    </span>
    <a class="wt-link" href="/Games/Console/4-/105/Bluetooth-Headset/">
      <span class="price">
        <em>£34</em>
        .99
      </span>
      <span class="free-delivery"> FREE delivery</span>
    </a>
  </p>
  <p class="mt10">
    <a class="primary button" href="/Games/Console/4-/105/Bluetooth-Headset/">
      Product Details
      <span style="color: rgb(255, 255, 255); margin-left: 6px; font-size: 16px;">»</span>
    </a>
  </p>
</div>

I want to extract the media title, i.e.:

<p class="media-title">
  <a href="/Games/Console/4-/105/Bluetooth-Headset/">Bluetooth Headset</a>
</p>

But only when the following price element is also present:

<span class="price">
  <em>£34</em>
  .99
</span>

Many of the other products listed don't include it.
I need to extract both the product name and the price, or nothing at all, and then move on to the next product.

Here is a sample of the code I am currently using. It returns every title and every price, regardless of whether the other one is present:

$results=file_get_contents('SCRAPEDHTML.txt');

$html = new DOMDocument();
@$html->loadHtml($results);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query('//p[@class="media-title"]|//span[@class="price"]');

foreach ($nodelist as $n) {
    $results2[] = $n->nodeValue;
}

I believe this is possible with the correct XPath query, but I have so far been unable to achieve it. Many thanks in advance.

1 Answer

Editorial Team · Answer 1 · 2026-06-12T23:30:21+00:00

I am assuming there is only one “item” per div.Gamesdb. If not, there may not be enough structure in the source HTML to use XPath alone, and you will probably have to index the product names and look for prices near the matching names.

You can do this with a single giant XPath, but I recommend you use multiple XPaths. I’ll show both ways.

First create your DOMXPath and register a helper to match class names.

// This helper is equivalent to the XPath:
// contains(concat(' ',normalize-space(@attr),' '), ' $token ')
// It's not necessary, but it's a bit easier to read and more
// bulletproof than @attr="token"
function has_token($attr, $token)
{
    // XPath passes a node-set as an array of DOMAttr objects; the array
    // is empty when the element does not have the attribute at all.
    if (empty($attr)) {
        return false;
    }
    $regex = '/(?:^|\s)'.preg_quote($token,'/').'(?:\s|$)/Su';
    return (bool) preg_match($regex, $attr[0]->value);
}

// $d is the DOMDocument that holds the loaded HTML
$xp = new DOMXPath($d);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions("has_token");
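
If you would rather not register a PHP function, the XPath from the comment above can be used directly. A minimal sketch, assuming a hypothetical class_token() helper (it is not part of DOMXPath and only builds the expression string) and some made-up sample markup:

```php
<?php
// Hypothetical helper: builds the pure XPath 1.0 class-token test.
function class_token(string $token): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $token ')";
}

// Made-up markup: only the first div carries the exact token "Gamesdb".
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="Gamesdb promo"><p class="media-title">A</p></div>'
    . '<div class="Gamesdb-old"><p class="media-title">B</p></div>'
);
$xpath = new DOMXPath($doc);

$query = '//div[' . class_token('Gamesdb') . ']'
       . '/p[' . class_token('media-title') . ']';

$titles = [];
foreach ($xpath->query($query) as $p) {
    $titles[] = $p->textContent;
}
// $titles holds only the titles found inside divs whose class list
// contains "Gamesdb" as a whole token.
```

Unlike @class="Gamesdb", this still matches when the element carries extra classes, and unlike contains(@class, "Gamesdb") it does not match look-alikes such as Gamesdb-old.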

You can then use a giant XPath:

$xp_container = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
$xp_title = 'p[php:function("has_token", @class, "media-title")]';
// The leading "." keeps the search inside the current container
$xp_price = './/span[php:function("has_token", @class, "price")]';

// Containers that have both a title and a price
$xp_item = "{$xp_container}[{$xp_title}][{$xp_price}]";
$xp_titles_prices = "{$xp_item}/{$xp_title} | {$xp_item}/{$xp_price}";

$nodes = $xp->query($xp_titles_prices);

$items = array();

$i = 0; // enumerator
foreach ($nodes as $node) {
    $key = ($node->nodeName==='p') ? 'title' : 'price';
    $value = '';
    switch ($key) {
        case 'price':
            // remove inner whitespace
            $value = preg_replace('/\s+/Su', '', trim($node->textContent));
            break;
        case 'title':
            $value = preg_replace('/\s+/Su', ' ', trim($node->textContent));
            break;
    }
    $items[(int) floor($i/2)][$key] = $value;
    $i += 1;
}

However, the overall code is brittle and unclear. The XPath union operator (|) returns nodes in document order, so titles and prices come back interleaved and we can't simply split the list in half. The PHP code must walk through every node in the list and work out from the DOM which field it holds. Think about the changes you would have to make if you wanted to extend the code to collect a third field (e.g., the rating). Now imagine making those changes three months from now, when this code is no longer fresh in your mind.
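
To see the interleaving for yourself, here is a small sketch on made-up markup (the class names t and p are placeholders). With PHP's libxml-backed DOMXPath, the union should come back in document order regardless of the order of its operands:

```php
<?php
// Made-up markup: two containers, each with a title <p> and a price <span>.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div><p class="t">A</p><span class="p">1</span></div>'
    . '<div><p class="t">B</p><span class="p">2</span></div>'
);
$xpath = new DOMXPath($doc);

// The spans are listed first in the query on purpose.
$order = [];
foreach ($xpath->query('//span[@class="p"] | //p[@class="t"]') as $node) {
    $order[] = $node->nodeName . ':' . $node->textContent;
}
// $order alternates between <p> and <span> entries, following the markup,
// rather than listing all spans first.
```

This is why the loop above has to inspect nodeName and keep a running counter instead of treating the result as "all titles, then all prices".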

I recommend you use multiple XPath calls instead and do the “do we have data for both price and title” check in PHP rather than XPath:

$xpitems = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
// The expressions below are evaluated with each $xpitems node as context;
// the leading "." keeps the price lookup inside that node:
$xptitle = 'normalize-space(p[php:function("has_token", @class, "media-title")])';
$xpprice = 'normalize-space(.//span[php:function("has_token", @class, "price")])';

$nodeitems = $xp->query($xpitems);

$items = array();
foreach ($nodeitems as $nodeitem) {
    $item = array(
        'title' => $xp->evaluate($xptitle, $nodeitem),
        'price' => str_replace(' ', '', $xp->evaluate($xpprice, $nodeitem)),
    );
    // Only add this item if we have data for *all* fields:
    if (count(array_filter($item)) === count($item)) {
        $items[] = $item;
    }
}

This is much easier to read and understand, and much easier to extend in the future.
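
For reference, here is a self-contained sketch of the multiple-XPath approach run against a cut-down version of the sample markup. The second product (Charging Dock) is invented to show an item without a price being skipped, the href values are placeholders, and &pound; is used instead of a literal pound sign only to sidestep input-encoding issues in loadHTML():

```php
<?php
// Class-token helper; returns false when the attribute is missing.
function has_token($attr, $token)
{
    if (empty($attr)) {
        return false;
    }
    $regex = '/(?:^|\s)' . preg_quote($token, '/') . '(?:\s|$)/Su';
    return (bool) preg_match($regex, $attr[0]->value);
}

// Hypothetical sample: the second product has no price span.
$html = <<<'HTML'
<html><body>
<div class="Gamesdb">
  <p class="media-title"><a href="#">Bluetooth Headset</a></p>
  <p class="mt5"><a class="wt-link" href="#"><span class="price"><em>&pound;34</em> .99 </span></a></p>
</div>
<div class="Gamesdb">
  <p class="media-title"><a href="#">Charging Dock</a></p>
</div>
</body></html>
HTML;

$d = new DOMDocument();
$d->loadHTML($html);

$xp = new DOMXPath($d);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('has_token');

$xpitems = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
// Relative expressions, evaluated once per container node:
$xptitle = 'normalize-space(p[php:function("has_token", @class, "media-title")])';
$xpprice = 'normalize-space(.//span[php:function("has_token", @class, "price")])';

$items = array();
foreach ($xp->query($xpitems) as $nodeitem) {
    $item = array(
        'title' => $xp->evaluate($xptitle, $nodeitem),
        'price' => str_replace(' ', '', $xp->evaluate($xpprice, $nodeitem)),
    );
    // Keep the item only if every field is non-empty.
    if (count(array_filter($item)) === count($item)) {
        $items[] = $item;
    }
}
```

Collecting another field should then only need one more relative expression and one more key in the $item array.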
