I’m using domDocument to parse some HTML, and want to replace breaks with \n. However, I’m having problems identifying where a break actually occurs within the document.
Given the following snippet of HTML – from a much larger file that I’m reading using $dom->loadHTMLFile($pFilename):
<p>Multiple-line paragraph<br />that has a close tag</p>
and my code:
foreach ($dom->getElementsByTagName('*') as $domElement) {
    switch (strtolower($domElement->nodeName)) {
        case 'p' :
            $str = (string) $domElement->nodeValue;
            echo 'PARAGRAPH: ',$str,PHP_EOL;
            break;
        case 'br' :
            echo 'BREAK: ',PHP_EOL;
            break;
    }
}
I get:
PARAGRAPH: Multiple-line paragraphthat has a close tag
BREAK:
How can I identify the position of that break within the paragraph, and replace it with a \n?
Or is there a better alternative to DOMDocument for parsing HTML that may or may not be well-formed?
You can’t get the position of an element using getElementsByTagName. You should go through the childNodes of each element and process text nodes and elements separately. In the general case you’ll need recursion, like this:
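A minimal sketch of that recursive walk, under a few assumptions: the helper name nodeToText is hypothetical, the snippet from the question is loaded inline with loadHTML instead of loadHTMLFile, and only text nodes and elements are handled (comments and CDATA sections are skipped). Each br element contributes a "\n" at the exact point where it occurs among its siblings.

```php
<?php
// Hypothetical helper (not part of the DOM extension): returns the text
// under $node, with every <br> element replaced by a newline.
function nodeToText(DOMNode $node): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // Plain text: append it unchanged.
            $text .= $child->nodeValue;
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            if (strtolower($child->nodeName) === 'br') {
                // A break: emit the newline at this position.
                $text .= "\n";
            } else {
                // Any other element (b, span, ...): recurse into it.
                $text .= nodeToText($child);
            }
        }
    }
    return $text;
}

// Collect libxml parse warnings internally instead of printing them,
// which helps when the HTML is not well-formed.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Assumption: the snippet is loaded from a string here; with a real file
// you would keep $dom->loadHTMLFile($pFilename).
$dom->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');

foreach ($dom->getElementsByTagName('p') as $p) {
    echo 'PARAGRAPH: ', nodeToText($p), PHP_EOL;
}
```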
This will output: