I am using cURL, XPath and PHP to scrape product names and prices from HTML source code. Here is a sample similar to the source code I am examining:
<div class="Gamesdb">
<p class="media-title">
<a href="/Games/Console/4-/105/Bluetooth-Headset/">Bluetooth Headset</a>
</p>
<p class="sub-title"> Console </p>
<p class="rating star-50">
<a href="/Games/Console/4-/105/Bluetooth-Headset/ProductReviews.html">(1)</a>
</p>
<p class="mt5">
<span class="price-preffix">
<a href="/Games/Console/4-/105/Bluetooth-Headset/">1 New</a>
from
</span>
<a class="wt-link" href="/Games/Console/4-/105/Bluetooth-Headset/">
<span class="price">
<em>£34</em>
.99
</span>
<span class="free-delivery"> FREE delivery</span>
</a>
</p>
<p class="mt10">
<a class="primary button" href="/Games/Console/4-/105/Bluetooth-Headset/">
Product Details
<span style="color: rgb(255, 255, 255); margin-left: 6px; font-size: 16px;">»</span>
</a>
</p>
</div>
I want to extract the media title, i.e.:
<p class="media-title">
<a href="/Games/Console/4-/105/Bluetooth-Headset/">Bluetooth Headset</a>
</p>
Only when the following price class is also present:
<span class="price">
<em>£34</em>
.99
</span>
Many of the other products listed don’t include it.
I need to extract both the product name and the price, or nothing at all, and then move on to the next product.
Here is a sample of the code I am currently using, which gets all the results regardless of any other conditions:
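$results = file_get_contents('SCRAPEDHTML.txt');
$html = new DOMDocument();
// @ suppresses the warnings that malformed real-world HTML triggers
@$html->loadHTML($results);
$xpath = new DOMXPath($html);
// Selects every title and every price, whether or not they belong to the same product
$nodelist = $xpath->query('//p[@class="media-title"]|//span[@class="price"]');
$results2 = [];
foreach ($nodelist as $n) {
    $results2[] = $n->nodeValue;
}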
I believe this is possible using the correct XPath query, but I have so far been unable to achieve it. Many thanks in advance.
I am assuming there is only one “item” per div.Gamesdb. If not, there may not be enough structure in the source HTML to use XPath alone. You will probably have to index product names and look for prices near matching product names.
You can do this with a single giant XPath, but I recommend you use multiple XPaths. I’ll show both ways.
First, create your DOMXPath and register a helper to match class names.
You can then use a giant XPath:
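A minimal sketch of that setup and the single-query approach, assuming PHP 7 or later with the DOM extension. The helper name has_class, the inline sample markup (standing in for SCRAPEDHTML.txt) and the second product are illustrative assumptions, not taken from the real page. The pairing loop also assumes each div.Gamesdb holds one title followed by one price.

```php
<?php
// Hypothetical helper: true when the space-separated class attribute
// contains $name as a whole token (so "price" does not match "price-preffix").
function has_class(string $classAttr, string $name): bool
{
    return in_array($name, preg_split('/\s+/', trim($classAttr)), true);
}

// Inline sample standing in for SCRAPEDHTML.txt; the second product has no price.
$source = <<<'HTML'
<div class="Gamesdb">
<p class="media-title"><a href="/a/">Bluetooth Headset</a></p>
<p class="mt5"><a class="wt-link" href="/a/"><span class="price"><em>£34</em>.99</span></a></p>
</div>
<div class="Gamesdb">
<p class="media-title"><a href="/b/">Charging Dock</a></p>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML('<?xml encoding="UTF-8">' . $source); // prefix hints that the input is UTF-8
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('has_class'); // expose only this function to XPath

// Reusable node tests built on the registered helper.
$div   = 'div[php:functionString("has_class", @class, "Gamesdb")]';
$title = 'p[php:functionString("has_class", @class, "media-title")]';
$price = 'span[php:functionString("has_class", @class, "price")]';

// Only divs that contain both a title and a price, then the union of both fields.
$complete = "//{$div}[.//{$title}][.//{$price}]";
$giant    = "{$complete}//{$title} | {$complete}//{$price}";

$products = [];
foreach ($xpath->query($giant) as $node) {
    if ($node->nodeName === 'p') {
        // A title starts a new product record.
        $products[] = [
            'title' => trim(preg_replace('/\s+/u', ' ', $node->nodeValue)),
            'price' => null,
        ];
    } else {
        // A price belongs to the most recently seen title.
        $products[count($products) - 1]['price'] = preg_replace('/\s+/u', '', $node->nodeValue);
    }
}

print_r($products);
```

Note how the loop has to inspect each node's element name to work out which field it is looking at.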
However, the overall code is brittle and unclear. The XPath union operator (|) returns nodes in document order, so we can’t bisect the list. The PHP code must walk through every item in the node list and use the DOM to determine which field each node corresponds to. Think about the changes you would have to make if you wanted to extend the code to collect a third item (e.g., the rating). Now imagine making those changes three months from now, when this code is no longer fresh in your mind.
I recommend you use multiple XPath calls instead and do the “do we have data for both price and title” check in PHP rather than XPath:
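One way this could look, again as a sketch under the same assumptions (PHP 7 or later, a hypothetical has_class helper, and inline sample markup with an invented second product in place of SCRAPEDHTML.txt). Each product div is selected first, and the title and price are then queried relative to it.

```php
<?php
// Hypothetical helper: true when the class attribute contains $name as a whole token.
function has_class(string $classAttr, string $name): bool
{
    return in_array($name, preg_split('/\s+/', trim($classAttr)), true);
}

// Inline sample standing in for SCRAPEDHTML.txt; the second product has no price.
$source = <<<'HTML'
<div class="Gamesdb">
<p class="media-title"><a href="/a/">Bluetooth Headset</a></p>
<p class="mt5"><a class="wt-link" href="/a/">
<span class="price">
<em>£34</em>
.99
</span>
</a></p>
</div>
<div class="Gamesdb">
<p class="media-title"><a href="/b/">Charging Dock</a></p>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML('<?xml encoding="UTF-8">' . $source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('has_class');

$itemQuery  = '//div[php:functionString("has_class", @class, "Gamesdb")]';
$titleQuery = './/p[php:functionString("has_class", @class, "media-title")]';
$priceQuery = './/span[php:functionString("has_class", @class, "price")]';

$products = [];
foreach ($xpath->query($itemQuery) as $item) {
    // Passing $item as the context node makes the ".//" queries relative to this div.
    $titles = $xpath->query($titleQuery, $item);
    $prices = $xpath->query($priceQuery, $item);

    // The "both or nothing" rule lives in plain PHP.
    if ($titles->length === 0 || $prices->length === 0) {
        continue;
    }

    $products[] = [
        'title' => trim(preg_replace('/\s+/u', ' ', $titles->item(0)->textContent)),
        'price' => preg_replace('/\s+/u', '', $prices->item(0)->textContent),
    ];
}

print_r($products);
```

Collecting another field would mean adding one more relative query and one more array key, without touching the existing ones.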
This is much easier to read and understand, and much easier to extend in the future.