Regex was my original idea as a solution, although it soon became apparent a

Asked: May 23, 2026

Regex was my original idea as a solution, although it soon became apparent that a DOM parser would be more appropriate. I'd like to convert spaces to &nbsp; between PRE tags within a string of HTML text. For example:

<table atrr="zxzx"><tr>
<td>adfa a   adfadfaf></td><td><br /> dfa  dfa</td>
</tr></table>
<pre class="abc" id="abc">
abc 123
<span class="abc">abc 123</span>
</pre>
<pre>123 123</pre>

into (note the space in the span tag attribute is preserved):

<table atrr="zxzx"><tr>
<td>adfa a   adfadfaf></td><td><br /> dfa  dfa</td>
</tr></table>
<pre class="abc" id="abc">
abc&nbsp;123
<span class="abc">abc&nbsp;123</span>
</pre>
<pre>123&nbsp;123</pre>

The result needs to be serialised back into string format, for use elsewhere.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T17:45:39+00:00

This is somewhat tricky when you want to insert &nbsp; entities without DOM converting the ampersand to &amp;, because entities are nodes of their own while spaces are just character data. Here is how to do it:

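$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
// Select every text node that has a pre element among its ancestors
foreach ($xp->query('//text()[ancestor::pre]') as $textNode)
{
    $remaining = $textNode;
    // Keep splitting while the remaining text still contains a space
    while (($nextSpace = strpos($remaining->nodeValue, ' ')) !== false) {
        // After the split, $remaining is the part that starts with the space
        $remaining = $remaining->splitText($nextSpace);
        // Drop that leading space
        $remaining->nodeValue = substr($remaining->nodeValue, 1);
        // Insert an nbsp entity reference where the space used to be
        $remaining->parentNode->insertBefore(
            $dom->createEntityReference('nbsp'),
            $remaining
        );
    }
}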
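
To illustrate why a plain string replacement on the text node is not enough, here is a minimal sketch. The input string is a made-up sample, not the snippet from the question. It writes the literal characters `&nbsp;` into a text node and serialises the result, and the ampersand comes back escaped.

```php
<?php
// Hypothetical sample input with a single pre element
$dom = new DOMDocument;
$dom->loadHTML('<pre>abc 123</pre>');

$pre  = $dom->getElementsByTagName('pre')->item(0);
$text = $pre->firstChild; // the DOMText node holding "abc 123"

// Naive attempt: the string "&nbsp;" is stored as plain character data
$text->nodeValue = str_replace(' ', '&nbsp;', $text->nodeValue);

// On output the serialiser escapes the ampersand of that character data
$naive = $dom->saveHTML($pre);
echo $naive, PHP_EOL;
```

The serialised markup contains `&amp;nbsp;` rather than `&nbsp;`, which is the problem the entity reference node avoids.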

Fetching all the pre elements and working with their nodeValue doesn't work here, because the nodeValue of an element contains the combined DOMText values of all its descendants, e.g. it would include the text of the span children. Setting the nodeValue on the pre element would delete those children.
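
A small sketch of that pitfall, using a made-up one-line sample instead of the question's snippet: reading nodeValue on the pre element returns the span's text as well, and writing it back removes the span.

```php
<?php
// Hypothetical sample: a pre element with one span child
$dom = new DOMDocument;
$dom->loadHTML('<pre class="abc">abc 123 <span class="abc">abc 123</span></pre>');
$pre = $dom->getElementsByTagName('pre')->item(0);

// nodeValue of an element is the concatenated text of all its descendants
$combined    = $pre->nodeValue;
$spansBefore = $pre->getElementsByTagName('span')->length;

// Assigning nodeValue replaces all children with a single text node
$pre->nodeValue = str_replace(' ', '_', $combined);
$spansAfter     = $pre->getElementsByTagName('span')->length;

printf("%s | spans before: %d, after: %d\n", $combined, $spansBefore, $spansAfter);
```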

So instead of fetching the pre nodes, we fetch all the DOMText nodes that have a pre element somewhere among their ancestors:

DOMElement pre
    DOMText "abc 123"         <-- picking this
    DOMElement span
       DOMText "abc 123"      <-- and this one
DOMElement pre
    DOMText "123 123"         <-- and this one
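
The selection can be checked with a short sketch. The markup below is a hypothetical, condensed variant of the question's snippet (one line, plus a paragraph outside any pre), and `$selected` is just an illustrative variable name.

```php
<?php
// Hypothetical input: one text node outside pre, three inside
$html = '<p>out side</p>'
      . '<pre>abc 123<span class="abc">abc 123</span></pre>'
      . '<pre>123 123</pre>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Collect "parent element: text" for every text node below a pre element
$selected = [];
foreach ($xp->query('//text()[ancestor::pre]') as $textNode) {
    $selected[] = $textNode->parentNode->nodeName . ': ' . $textNode->nodeValue;
}
print_r($selected);
```

The text inside the paragraph is not selected, while the text inside the span is, because the span itself sits below a pre element.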

We then go through each of those DOMText nodes and split them into separate DOMText nodes at each space. We remove the space and insert an nbsp entity reference node before the split-off node, so in the end you get a tree like

DOMElement pre
    DOMText "abc"
    DOMEntityReference nbsp
    DOMText "123"
    DOMElement span
       DOMText "abc"
       DOMEntityReference nbsp
       DOMText "123"
DOMElement pre
    DOMText "123"
    DOMEntityReference nbsp
    DOMText "123"

Because we only worked with the DOMText nodes, any DOMElements are left untouched and so it will preserve the span elements inside the pre element.
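
To check that resulting tree, the loop can be wrapped in a function and the node types inspected afterwards. This is only a sketch: `replaceSpacesInPre` is a hypothetical helper name, the input is a condensed sample, and the loop reads `nodeValue` of the current text node.

```php
<?php
// Hypothetical helper wrapping the split-and-insert loop
function replaceSpacesInPre(DOMDocument $dom): void
{
    $xp = new DOMXPath($dom);
    foreach ($xp->query('//text()[ancestor::pre]') as $textNode) {
        $remaining = $textNode;
        while (($pos = strpos($remaining->nodeValue, ' ')) !== false) {
            $remaining = $remaining->splitText($pos);                 // starts with the space
            $remaining->nodeValue = substr($remaining->nodeValue, 1); // drop the space
            $remaining->parentNode->insertBefore(
                $dom->createEntityReference('nbsp'),
                $remaining
            );
        }
    }
}

$dom = new DOMDocument;
$dom->loadHTML('<pre>abc 123<span class="abc">abc 123</span></pre>');
replaceSpacesInPre($dom);

// Record the node type of each direct child of the pre element
$pre = $dom->getElementsByTagName('pre')->item(0);
$childTypes = [];
foreach ($pre->childNodes as $child) {
    $childTypes[] = $child->nodeType;
}
$span   = $pre->getElementsByTagName('span')->item(0);
$markup = $dom->saveHTML($pre);
echo $markup, PHP_EOL;
```

The direct children of the pre element end up as text, entity reference, text, and the untouched span element with its class attribute intact.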

Caveat:

Your snippet is not a complete HTML document because it doesn't have a root element. When using loadHTML, libxml will add any missing structure to the DOM, which means you will get your snippet back wrapped in a DOCTYPE, html and body tag.
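
A quick way to see the added structure, sketched with a made-up one-element fragment:

```php
<?php
// Hypothetical fragment without html or body elements
$dom = new DOMDocument;
$dom->loadHTML('<pre>123 123</pre>');

// libxml has supplied the missing wrapper elements
$rootName  = $dom->documentElement->nodeName;
$bodyCount = $dom->getElementsByTagName('body')->length;

// Serialising the whole document therefore returns more than the fragment
$wrapped = $dom->saveHTML();
echo $wrapped;
```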

If you want the original snippet back, you'd have to fetch the body node with getElementsByTagName and serialise all of its children to get the innerHTML. Unfortunately, there is no innerHTML function or property in PHP's DOMDocument implementation, so we have to do that manually:

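$innerHtml = '';
// Serialise each child of body on its own, so html and body are left out
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    // Copy the child (deep) into a temporary document and dump that
    $tmp_doc = new DOMDocument();
    $tmp_doc->appendChild($tmp_doc->importNode($child,true));
    $innerHtml .= $tmp_doc->saveHTML();
}
echo $innerHtml;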
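
As a possible shortcut, saveHTML accepts an optional node argument since PHP 5.3.6, so each child of body can be serialised directly without a temporary document. The sketch below assumes such a PHP version and uses a made-up two-element fragment.

```php
<?php
// Hypothetical fragment with two top-level elements
$dom = new DOMDocument;
$dom->loadHTML('<p>one</p><pre>123 123</pre>');

$innerHtml = '';
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    // Passing a node makes saveHTML dump only that subtree
    $innerHtml .= $dom->saveHTML($child);
}
echo $innerHtml, PHP_EOL;
```

The result holds the markup of the body's children only, without the DOCTYPE, html or body wrappers.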
