I’m scraping a website that’s mostly table-based. Each <tr> tag represents a category, and the <td> tags inside it represent properties of that category.
Using XPath I can select the <tr> elements fine, but all the <td> content inside each one comes back bunched together as one string:
$html_string = file_get_contents('testpage.html');
$dom = new DOMDocument();
$dom->loadHTML($html_string);
$xpath = new DOMXPath($dom);
$context_nodes = $xpath->query('//table[@id="category"]/tr[not(starts-with(@id, "category"))]');
And I can get each <td> fine, but with no reference back to its category, with:
$context_nodes = $xpath->query('//table[@id="category"]/tr[not(starts-with(@id, "category"))]/td');
What I would like to do later is be able to reference the properties of each category. I presumed I could do so with $context_nodes[2] etc., thinking that the array it created was a multidimensional string array. This doesn’t seem to be the case.
How would I go about creating an array from the XPath results where I can grab a property of a category by identifying the specific category I want, e.g. train[1][2]?
Your second attempt is on the right lines. PHP (or, rather, libxml) keeps each node you select attached to the document tree it came from, so you can still reach its parent and siblings, which lets you do precisely what you need in your case.
XML
PHP
Notice how we navigate back up the tree, from the prop node, to reference the parent category. I'm not sure if this is exactly what you meant, but I hope it helps.
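As a minimal sketch of that upward navigation applied to the <tr>/<td> layout from the question: the markup below is hypothetical (the row ids, header row, and cell values are assumptions for illustration, not taken from the real page), and the grouping key is assumed to be each row's id attribute. It selects every <td>, then uses parentNode to find the row it belongs to and files the cell text under that row's id.

```php
<?php
// Hypothetical markup standing in for the scraped page; the ids and
// cell values are assumptions made up for this example.
$html = <<<HTML
<table id="category">
  <tr id="category-header"><th>Name</th><th>Speed</th></tr>
  <tr id="train"><td>Train</td><td>Fast</td></tr>
  <tr id="bus"><td>Bus</td><td>Slow</td></tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Same shape of query as in the question: every <td> in a row whose id
// does not start with "category".
$cells = $xpath->query('//table[@id="category"]/tr[not(starts-with(@id, "category"))]/td');

$categories = [];
foreach ($cells as $td) {
    // Walk back up from the cell to its parent <tr> to find the category.
    $row = $td->parentNode;
    $key = $row->getAttribute('id');

    // Group the cell text under its category, giving a nested array.
    $categories[$key][] = trim($td->textContent);
}

echo $categories['train'][1], "\n"; // prints "Fast"
```

If the real rows have no usable id, the first cell's text or a running row counter could serve as the key instead. Another option is to loop over the <tr> nodes and pass each one as the second argument to $xpath->query('td', $row), which evaluates the expression relative to that row.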