I’m making a web scraper and this is driving me crazy!
I need to get the text of a paragraph. Simple, right?! Here’s the code.
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//div");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('class');
    echo "<br />Found it: $url";
}
It works perfectly: it grabs the class of every div on the page and echoes it out. But what I really need to do is find every <p> tag on the page and echo the text between the opening and closing tags! I have a feeling it’s simple, but I just can’t figure it out.
Edit: All it took was the following:
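$doc = new DOMDocument();
// The @ suppresses the warnings loadHTML() emits for malformed markup.
@$doc->loadHTML($html);
// item() is zero-based, so item(3) is the fourth <p> on the page.
$node = $doc->getElementsByTagName('p')->item(3);
echo $node->textContent."\n";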
What you really want is getElementsByTagName, and once you have the node, textContent gives you its text. Thanks, folks! Not sure if it will apply to everyone else’s situation, but it sure does to mine. =o
Use getElementsByTagName to retrieve all <p> elements. Then iterate over the resulting DOMNodeList and fetch the nodeValue of each item.
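Something along these lines should do it. This is only a rough sketch: the $html string is made-up sample markup standing in for whatever page you fetched, and $texts is just a name I picked for collecting the results. I also swapped the @ for libxml_use_internal_errors(), which is another way to keep the parser warnings quiet.

```php
<?php
// Made-up sample markup standing in for the scraped page (an assumption).
$html = '<html><body><div class="a"><p>First paragraph.</p></div>'
      . '<p>Second <b>paragraph</b>.</p></body></html>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// $texts is a hypothetical variable holding the text of every <p>.
$texts = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    // For an element, nodeValue is the text of the element and its descendants.
    $texts[] = trim($p->nodeValue);
}

foreach ($texts as $text) {
    echo $text, "\n";
}
```

For element nodes, nodeValue and textContent give the same string, so either one works here. Nested tags such as <b> are dropped and only their text is kept.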