We have the following code that lists the xpaths where $value is found.

Asked: June 12, 2026 (2026-06-12T12:09:22+00:00)

We have the following code that lists the xpaths where $value is found.

For a given URL (see the picture), we have detected a non-standard tag, td1, which in addition doesn’t have a closing tag. The site developers probably put it there intentionally, as you can see in the screenshot below.

This element creates problems when identifying the correct XPath for nodes.

A broken XPath example:

/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]

(as you can see, td1 is identified and chained into the XPath)

We think that removing this element would help us build the valid XPath we are after.

A valid example is:

/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]

How can we remove it before loading the document into DOMXPath? Is there some other approach?

We would like to remove all invalid tags, which may include others besides td1, such as h8, diw, etc.

private function extract($url, $value) {
        $dom = new DOMDocument();

        $file = 'content.txt';
        //$current = file_get_contents($url);
        $current = CurlTool::downloadFile($url, $file);
        //file_put_contents($file, $current);

        // @ suppresses the libxml warnings raised for invalid tags such as td1
        @$dom->loadHTMLFile($current);

        // use DOMXPath to navigate the HTML through the DOM
        $dom_xpath = new DOMXPath($dom);

        // find every element that has a text node containing $value
        // (the expression breaks if $value itself contains a single quote)
        $elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
        var_dump($elements);
        // query() returns false, not null, when the expression is invalid
        if ($elements !== false) {

            foreach ($elements as $element) {
                var_dump($element);
                echo "\n1.[" . $element->nodeName . "]\n";

                $nodes = $element->childNodes;
                foreach ($nodes as $node) {
                    // keep only child nodes whose value is exactly $value
                    if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
                        echo '2.' . $node->nodeValue . "\n";
                        // strip the trailing /text() step so the path points at the element
                        $xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
                        echo '3.' . $xpath . "\n";
                    }
                }
            }
        }
    }

[Screenshot: page source showing the unclosed td1 tag]


1 Answer

Editorial Team · Answer 1 · 2026-06-12T12:09:23+00:00

You could use XPath to find the offending nodes and remove them, while promoting their children into their place in the DOM. Then your paths will be correct.

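$dom_xpath = new DOMXPath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
    $parentNode = $invalidNode->parentNode;
    // Move the children in front of the invalid node, one at a time;
    // firstChild becomes null once every child has been moved.
    while ($invalidNode->firstChild)
    {
        $firstChild = $invalidNode->firstChild;
        $parentNode->insertBefore($firstChild, $invalidNode);
    }
    // The invalid node is now empty and can be dropped.
    $parentNode->removeChild($invalidNode);
}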
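
To see the effect on the node paths, here is a minimal, self-contained sketch of the same idea. The helper name unwrapNodes() and the sample markup are hypothetical, made up for illustration. The sample is loaded with loadXML() from a well-formed string (td1 is closed explicitly), so the result does not depend on how libxml repairs the unclosed tag on the real page.

```php
<?php
// Hypothetical helper: replace every node matched by $expression with its
// own children and return how many nodes were removed.
function unwrapNodes(DOMDocument $dom, string $expression): int
{
    $xpath = new DOMXPath($dom);
    $matches = $xpath->query($expression);
    if ($matches === false) {
        return 0; // invalid XPath expression
    }

    $removed = 0;
    foreach ($matches as $node) {
        $parent = $node->parentNode;
        if ($parent === null) {
            continue; // already detached
        }
        // insertBefore() moves a node that is already in the document,
        // so each pass takes one child out of $node.
        while ($node->firstChild !== null) {
            $parent->insertBefore($node->firstChild, $node);
        }
        $parent->removeChild($node);
        $removed++;
    }
    return $removed;
}

// Made-up sample tree with a td1 wrapper around two cells.
$dom = new DOMDocument();
$dom->loadXML('<table><tr><td1><td>a</td><td><span>b</span></td></td1></tr></table>');

$span = $dom->getElementsByTagName('span')->item(0);
$before = $span->getNodePath();          // path still contains td1
$removed = unwrapNodes($dom, '//td1');
$after = $span->getNodePath();           // td1 is gone from the path

echo $before, "\n", $after, "\n";
```

On the real page you would call the helper right after loadHTMLFile() and before running the search query, so that getNodePath() is computed on the cleaned tree.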

EDIT:

You could also build a list of offending elements by using a list of valid elements and negating it.

// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
$validTags = array();

// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
    if ($validTagsStr)
    {   $validTagsStr .= ' or ';    }
    $validTagsStr .= 'self::'.$tag;
}
// Matches every element whose name is not in $validTags.
// (Fill in $validTags first: an empty list yields the invalid expression not().)
$results = $dom_xpath->query('//*[not('.$validTagsStr.')]');
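
Putting the two ideas together might look like the sketch below. The allow-list here is deliberately tiny and hypothetical, just enough for the made-up sample; a real one would have to be filled in from the HTML spec. As above, the sample is well-formed and loaded with loadXML() so the outcome does not depend on libxml's HTML error recovery.

```php
<?php
// Deliberately incomplete, hypothetical allow-list; extend it from the spec.
$validTags = ['html', 'body', 'table', 'tr', 'td', 'p', 'span', 'u'];

// Builds: //*[not(self::html or self::body or ...)]
$predicate = implode(' or ', array_map(fn($tag) => 'self::' . $tag, $validTags));
$expression = '//*[not(' . $predicate . ')]';

// Made-up sample containing two unknown elements, td1 and diw.
$dom = new DOMDocument();
$dom->loadXML(
    '<html><body><table><tr><td1><td>x</td></td1></tr></table>'
    . '<diw><p>y</p></diw></body></html>'
);

$xpath = new DOMXPath($dom);
$invalidNames = [];
foreach ($xpath->query($expression) as $node) {
    $invalidNames[] = $node->nodeName;
    // Promote the children, then drop the now-empty unknown element.
    while ($node->firstChild !== null) {
        $node->parentNode->insertBefore($node->firstChild, $node);
    }
    $node->parentNode->removeChild($node);
}

// Collect the element names left in the tree, in document order.
$remaining = [];
foreach ($xpath->query('//*') as $node) {
    $remaining[] = $node->nodeName;
}

echo implode(', ', $invalidNames), "\n";
echo implode(', ', $remaining), "\n";
```

The trade-off of an allow-list is that any valid tag you forget to include gets unwrapped too, so it is worth logging $invalidNames and reviewing it before trusting the result.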
