I’m trying to write a regular expression using the PCRE library in PHP.
I need a regex that matches only the &, > and < chars that appear within the text content of an XML node, and not the ones that are part of the tags themselves.
Input XML:
<pnode>
<cnode>This string contains > and < and & chars.</cnode>
</pnode>
The idea is to do a search and replace on these chars and convert them to their XML entity equivalents.
If I was to convert the entire XML to entities the XML would look like this:
Entire XML converted to entities
&lt;pnode&gt;
&lt;cnode&gt;This string contains &gt; and &lt; and &amp; chars.&lt;/cnode&gt;
&lt;/pnode&gt;
I need it to look like this:
Correct XML
<pnode>
<cnode>This string contains &gt; and &lt; and &amp; chars.</cnode>
</pnode>
I have tried to write a regular expression that matches these chars using a look-ahead, but I don’t know enough to get it to work. My attempt (currently only trying to match > symbols):
/>(?=[^<]*<)/g
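One thing to note about the attempt above: PHP's preg_* functions do not accept a g modifier, because preg_replace already replaces every match. As an alternative to a look-ahead, here is a minimal sketch that splits the input on anything that looks like a tag and escapes only the pieces in between. The function name escapeTextNodes and the tag pattern are assumptions made for illustration. The pattern treats "<", an optional "/", a name, optional attributes and ">" as a tag, so text such as "a <b and c>" would be mistaken for one.

```php
<?php
// Hypothetical helper: escape &, < and > that sit outside anything shaped like a tag.
function escapeTextNodes(string $xml): string
{
    // Assumption: a tag is "<", optional "/", a name, optional attributes, optional "/", ">".
    $tagPattern = '~(</?[A-Za-z_][\w.:-]*(?:\s[^<>]*)?/?>)~';

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates: text, tag, text, tag, ...
    $parts = preg_split($tagPattern, $xml, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even indexes are text. The final "false" leaves existing entities alone.
            $parts[$i] = htmlspecialchars($part, ENT_NOQUOTES | ENT_XML1, 'UTF-8', false);
        }
    }

    return implode('', $parts);
}

$broken = "<pnode>\n<cnode>This string contains > and < and & chars.</cnode>\n</pnode>";
echo escapeTextNodes($broken), "\n";
```

Because double encoding is switched off, running the function a second time on its own output should leave the string unchanged.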
Just to make it clear, the XML I’m trying to fix comes from a 3rd party and they seem unable to fix it at their end, hence my attempt to fix it myself.
In the end I’ve opted to use the Tidy library in PHP. The code I used is shown below:
This works perfectly, correcting all the encoding errors and converting invalid characters to XML entities.
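For anyone wanting to try the same route, a minimal sketch of how Tidy might be called in XML mode follows. It is not the code referred to above. The wrapper name repairXmlWithTidy and the chosen options (input-xml, output-xml, wrap) are assumptions, and how well it repairs a given feed will depend on the Tidy version installed.

```php
<?php
// Hypothetical wrapper around the tidy extension; an illustration only.
function repairXmlWithTidy(string $xml): string
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not available.');
    }

    // Assumed configuration: treat input and output as XML, and do not wrap lines.
    $config = [
        'input-xml'  => true,
        'output-xml' => true,
        'wrap'       => 0,
    ];

    $repaired = tidy_repair_string($xml, $config, 'utf8');
    if ($repaired === false) {
        throw new RuntimeException('Tidy could not repair the document.');
    }

    return $repaired;
}
```

It is worth comparing the repaired output against a few known-bad samples from the feed before relying on it.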