Please see the edit at the bottom:
I’m using XPath to scrape some data from a site. I’m wondering whether I’m using too many foreach() loops, and whether I could traverse the hierarchy in a simpler way. I suspect I’m running too many queries, and that there may be a better way using just one.
The hierarchy looks something like this.
<ul class='item-list'>
    <li class='item' id='12345'>
        <div class='this-section'>
            <a href='http://www.thissite.com'>
                <img src='http://www.thisimage.com/image.png' attribute-one='4567' attribute-two='some-words' />
            </a>
        </div>
        <small class='sale-count'>Some Number</small>
    </li>
    <li class='item' id='34567'>
    <li class='item' id='48359'>
    <li class='item' id='43289'>
</ul>
So I did the following:
$dom = new DOMDocument;
@$dom->loadHTMLFile($file);
$xpath = new DOMXPath($dom);
$item = array();
$list = $xpath->query("//ul[@class='item-list']/li");
foreach ($list as $list_item)
{
    $item['item_id'][] = $list_item->getAttribute('id');
    // Links inside this <li> whose href contains 'item'
    $links = $xpath->query("div[@class='this-section']//a[contains(@href, 'item')]", $list_item);
    foreach ($links as $address)
    {
        // Keep only the part of the URL before the query string
        $href = $address->getAttribute('href');
        $item['link'][] = substr($href, 0, strpos($href, '?'));
    }
    // Any element inside this <li> that carries the custom attributes
    $other_data = $xpath->query("div[@class='this-section']//*[@attribute-one]", $list_item);
    foreach ($other_data as $element)
    {
        $item['cost'][] = $element->getAttribute('attribute-one');
        $item['category'][] = $element->getAttribute('attribute-two');
        $item['name'][] = $element->getAttribute('attribute-three');
    }
    // Sale count: the text up to the first space
    $sales = $xpath->query(".//small[@class='sale-count']", $list_item);
    foreach ($sales as $sale)
    {
        $item['sale'][] = substr($sale->textContent, 0, strpos($sale->textContent, ' '));
    }
}
Do I need to constantly re-query to work my way down the hierarchy, or is there a simpler way to accomplish this?
EDIT
So it seems I am indeed using too many foreach loops. For every one I take out, I save a ton of memory. So my question becomes:
Once I have the parent element (in this case the <li>), is there a way to pick out elements and attributes without re-querying and looping through the results? I need to eliminate as many of these XPath subqueries and foreach loops as I can.
Sure, you could use DOMElement::getElementsByTagName() instead.
As for which is more efficient, you’d have to benchmark it. You would be comparing the speed of a relative XPath query against a preorder traversal of the <li>’s node tree.
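As a rough sketch of the getElementsByTagName() approach (not a drop-in replacement): the markup below is hypothetical sample data modeled on the question, and the code assumes each <li> contains at most one <a>, one <img> and one <small>. Note that getElementsByTagName() matches on tag name only, so the class and href filters from the original XPath expressions are not applied here.

```php
<?php
// Hypothetical sample markup, modeled on the structure in the question.
$html = <<<HTML
<ul class='item-list'>
  <li class='item' id='12345'>
    <div class='this-section'>
      <a href='http://www.thissite.com/item/1?ref=abc'>
        <img src='http://www.thisimage.com/image.png' attribute-one='4567' attribute-two='some-words' />
      </a>
    </div>
    <small class='sale-count'>42 sold</small>
  </li>
</ul>
HTML;

// Collect parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$items = [];
// One XPath query for the list items; no further queries inside the loop.
foreach ($xpath->query("//ul[@class='item-list']/li") as $li) {
    // item(0) is the first matching descendant, or null if there is none.
    $a     = $li->getElementsByTagName('a')->item(0);
    $img   = $li->getElementsByTagName('img')->item(0);
    $small = $li->getElementsByTagName('small')->item(0);

    $href = $a ? $a->getAttribute('href') : '';
    $text = $small ? trim($small->textContent) : '';

    $items[] = [
        'item_id'  => $li->getAttribute('id'),
        // Everything before the first '?' (the whole URL if there is none).
        'link'     => explode('?', $href, 2)[0],
        'cost'     => $img ? $img->getAttribute('attribute-one') : null,
        'category' => $img ? $img->getAttribute('attribute-two') : null,
        // Everything before the first space in the sale-count text.
        'sale'     => explode(' ', $text, 2)[0],
    ];
}

print_r($items);
```

This also groups the values per item rather than in parallel arrays, so a missing element in one <li> cannot shift the other columns out of alignment. Whether it is actually faster or lighter on memory than the relative XPath queries would still need to be measured on the real pages.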