I’m currently trying to parse a document with DOMDocument, and I’m having some serious problems. I wrote a script that runs fine on PHP 5.2.9, extracting content with DOMNode::nodeValue. The same script fails to get any content on PHP 5.3.3, even though it correctly navigates to the proper nodes.
Basically, the code used looks like this:
$dom = new DOMDocument();
$dom->loadHTML($data);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXPath($dom);
$nodelist = $xpath->query($query);
$value = $nodelist->item(0)->nodeValue;
I’ve checked to make sure that item(0) is in fact a node – it’s there and even of the right type, but nodeValue is empty.
The script works on some documents but not others (on 5.3.3) – on 5.2.9 it works on all documents, returning the proper nodeValue.
It turns out I had either missed something basic or hit a bug (whether in PHP or in libxml, I don’t know). The issue is fixed by making sure the data passed to loadHTML is valid UTF-8. Mind you, it wasn’t the whole document that was badly encoded: a single character in the element wasn’t valid UTF-8, and that was enough to throw off the handling of everything else in the document.
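For anyone hitting the same thing, here is a minimal sketch of one way to clean the input before calling loadHTML. It assumes the mbstring extension is available. The helper name ensureUtf8 and the sample markup are made up for illustration and are not from my actual script. Note that this replaces invalid bytes with a substitute character rather than recovering the original character, so it is a workaround, not a proper conversion.

```php
<?php
// Hypothetical helper (illustration only): return $data as valid UTF-8.
function ensureUtf8($data)
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid UTF-8, leave it untouched
    }
    // Re-encoding UTF-8 to UTF-8 replaces invalid byte sequences with the
    // mbstring substitute character ("?" unless configured otherwise).
    return mb_convert_encoding($data, 'UTF-8', 'UTF-8');
}

// Made-up sample: a UTF-8 page containing one stray Latin-1 byte (0xE9).
$data = '<html><head><meta http-equiv="Content-Type" '
      . 'content="text/html; charset=utf-8"></head>'
      . "<body><p id=\"t\">caf\xE9 au lait</p></body></html>";

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML(ensureUtf8($data));

$xpath = new DOMXPath($dom);
$value = $xpath->query('//p[@id="t"]')->item(0)->nodeValue;
```

If the real source encoding of the stray characters is known (for example ISO-8859-1), converting those parts properly would be better than substituting them, but that depends on knowing where the bad data comes from.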
What gets me is that all of the document’s content was thrown out while the structure stayed in place and worked normally, with no errors or anything else to suggest the content was seen as invalid.
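One way to look for diagnostics that might otherwise go unnoticed is to collect libxml's own error list after loading. This is only a sketch: the helper name loadHtmlWithErrors is hypothetical, and I can't promise libxml records anything for this particular encoding problem on every version.

```php
<?php
// Hypothetical helper (illustration only): load HTML into $dom and return
// whatever libxml reported during parsing as an array of strings.
function loadHtmlWithErrors(DOMDocument $dom, $data)
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom->loadHTML($data);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Usage: $data would be the markup from the script above.
$dom = new DOMDocument();
$messages = loadHtmlWithErrors($dom, '<html><body><p>test</p></body></html>');
```

An empty array here does not prove the input was clean, only that libxml did not record anything for it.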