Probably a stupid question but so far I can’t figure this out…
I have an XHTML document as a string in $temp. So far so good. I want to do two things: first delete the microdata attributes, then select all meta tags in the body (they are there because they are used in conjunction with microdata) and delete them.
$xml=new DOMDocument();
$xml->loadXML($temp);
$xpath = new DOMXPath($xml);
$attr = $xpath->query("//@itemscope|//@itemprop|//@itemtype|//@itemid|//@itemref");
foreach ($attr as $entry)
$entry->parentNode->removeAttribute($entry->nodeName);
That works. But I can’t manage to select any element nodes with XPath.
$xpath = new DOMXPath($xml); // thought I had to update this after changing the XML
echo $xpath->query("//body")->length; // => 0
echo $xml->getElementsByTagName("body")->length; // => 1
So Question No. 1: How do I select nodes with XPath? Why doesn’t the query above work?
This works to get the node list though:
$node = $xml->getElementsByTagName("body")->item(0)->getElementsByTagName("meta");
I figured I’d remove the nodes like this (similar to removing the attributes above):
foreach ($node as $entry)
{
$entry->parentNode->removeChild($entry);
}
But the nodes remain.
So there is Question No. 2: How do I remove nodes from an XML document?
Specifically, meta nodes anywhere inside any body node.
Thanks.
UPDATE
Let me add an HTML test case:
$temp='<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
<head>
<meta charset="utf-8"/>
</head>
<body id="dok" itemscope="itemscope" itemtype="http://schema.org/WebPage" >
<div><div><div><meta itemprop="dummy" content="something"/></div></div></div>
<span><meta itemprop="dummy2" content="something2"/></span>
</body>
</html>';
With the above, the XPath query that tries to select the body gives me a length of 0, and I can’t remove all meta tags from the body…
UPDATE
This works with the loadXML() method:
$xpath = new DOMXPath($xml);
$xpath->registerNamespace("x","http://www.w3.org/1999/xhtml");
echo $xpath->query("//x:body")->length;
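XPath 1.0 has no default namespace: an unprefixed name in a query only matches elements that are in no namespace, which is why the registered prefix x is needed above. If registering a prefix is undesirable, one prefix-free alternative is to compare local-name() instead. A minimal sketch, assuming a trimmed-down stand-in for $temp (the names $sample, $doc and $xp are illustrative, not from the original code):

```php
<?php
// Illustrative stand-in for $temp: a small XHTML document with a default namespace.
$sample = '<html xmlns="http://www.w3.org/1999/xhtml"><head><meta charset="utf-8"/></head>'
        . '<body><div><meta itemprop="dummy" content="something"/></div></body></html>';

$doc = new DOMDocument();
$doc->loadXML($sample);
$xp = new DOMXPath($doc);

// Unprefixed names only match elements in no namespace, so this finds nothing here.
$plain = $xp->query('//body')->length;

// local-name() compares the element name while ignoring its namespace.
$bodies = $xp->query('//*[local-name()="body"]')->length;
$metas  = $xp->query('//*[local-name()="body"]//*[local-name()="meta"]')->length;

echo "$plain $bodies $metas\n";
```

The local-name() form is more verbose and would also match same-named elements from other namespaces, so registering a prefix is usually the tidier choice when the namespace is known.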
SOLUTION without namespaces
It was about the xmlns="http://www.w3.org/1999/xhtml" namespace on the root html tag all along. //body selects only body elements that are NOT in any namespace. Since the document declares a default namespace and body is part of that namespace, //body won’t select it. I have no idea under what name to access the namespace already intrinsic to the XHTML without declaring it under a name, but if we strip it off before parsing the XML, all is fine. After we’re done we can add it back in.
// Strip the default namespace so that unprefixed XPath names match.
$temp = str_replace('xmlns="http://www.w3.org/1999/xhtml"', '', $temp);
$xml = new DOMDocument();
$xml->loadXML($temp);
$xpath = new DOMXPath($xml);
// Remove the microdata attributes.
$attr = $xpath->query("//@itemscope|//@itemprop|//@itemtype|//@itemid|//@itemref");
foreach ($attr as $entry) {
    $entry->parentNode->removeAttribute($entry->nodeName);
}
// Remove every meta element nested anywhere inside body.
$node = $xpath->query("//body//meta");
foreach ($node as $entry) {
    $entry->parentNode->removeChild($entry);
}
$temp = $xml->saveXML();
// Put the namespace declaration back on the root element.
$temp = str_replace('<html', '<html xmlns="http://www.w3.org/1999/xhtml"', $temp);
That way //body//meta works just as expected…
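One likely reason the earlier getElementsByTagName() loop left nodes behind, while the XPath loop does not, is that getElementsByTagName() returns a live list that shrinks as nodes are removed, so a foreach over it can skip entries. DOMXPath::query() returns a fixed result set instead. If the live list is used anyway, copying it to an array first avoids the problem. A minimal sketch, assuming a tiny namespace-free document (the markup and variable names are illustrative):

```php
<?php
// Illustrative document with two meta elements at different depths.
$doc = new DOMDocument();
$doc->loadXML('<body><div><meta content="a"/></div><span><meta content="b"/></span></body>');

// getElementsByTagName() returns a live list that tracks the document.
$live = $doc->getElementsByTagName('meta');

// Copy the live list into a plain array before modifying the tree.
$snapshot = iterator_to_array($live);

foreach ($snapshot as $meta) {
    $meta->parentNode->removeChild($meta);
}

// The live list reflects the removals made through the snapshot.
echo $live->length, "\n";
```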
This piece of code does the job for me:
I think the two key points are:
//body//meta – The XPath must reflect that there can be more elements between the body and the meta elements. Hence the // between body and meta.
Namespaces and XML
Thanks to the explanation by Dimitri, I now better understand the namespace issue that I had only smelled before, and could update the code to a loadXML() compatible version (only the modified lines):
This loads the document as XML. Then it registers the namespace URI from the document with the name xhtml for the XPath object. The XPath query was then modified to reflect the namespace properly in the element expressions.
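The steps described above could look roughly like the following. This is a minimal sketch and not the exact code from the answer: it assumes the test document from the question (slightly shortened) and reads the namespace URI from the root element via documentElement->namespaceURI. The names $sample, $doc, $xp and $result are illustrative.

```php
<?php
// Illustrative stand-in for $temp from the question.
$sample = '<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
<head><meta charset="utf-8"/></head>
<body id="dok" itemscope="itemscope" itemtype="http://schema.org/WebPage">
<div><meta itemprop="dummy" content="something"/></div>
<span><meta itemprop="dummy2" content="something2"/></span>
</body>
</html>';

$doc = new DOMDocument();
$doc->loadXML($sample);

$xp = new DOMXPath($doc);
// Take the namespace URI from the root element rather than hard-coding it.
$xp->registerNamespace('xhtml', $doc->documentElement->namespaceURI);

// The microdata attributes are unprefixed, so they are in no namespace and need no prefix.
foreach ($xp->query('//@itemscope|//@itemprop|//@itemtype|//@itemid|//@itemref') as $attr) {
    $attr->parentNode->removeAttribute($attr->nodeName);
}

// Every element step in the query carries the registered prefix.
foreach ($xp->query('//xhtml:body//xhtml:meta') as $meta) {
    $meta->parentNode->removeChild($meta);
}

$result = $doc->saveXML();
```

Because the document is never edited as a string, the xmlns declaration stays on the html element and nothing has to be added back after saveXML().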