I’m trying to find a regex that escapes all HTML special characters (mostly &, <, >) while keeping the HTML tags intact.
I’m getting this information from a database, so I can’t be sure that characters like < and > have already been replaced with &lt; and &gt;.
I managed to replace & and < with the following regex in PHP:
$Value = preg_replace('/<(?!#?\/?[a-zA-Z0-9]+>)/', '&lt;', $Value);
$Value = preg_replace('/&(?!#?[a-zA-Z0-9]+;)/', '&amp;', $Value);
Now I only have trouble with the > characters, because I’d have to use a lookbehind, and lookbehinds don’t allow variable-length patterns:
$Value = preg_replace('/(?<!<[a-zA-Z0-9]+)>/', '&gt;', $Value);
Any ideas?
Greetings
-Thomas
Use a DOM Parser and apply your replacements to the text nodes only.
Just parsing the fragment and serializing it again will already turn stray XML special characters in the text into their respective entities.
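As a minimal sketch of that round trip (the sample string and variable names are made up for illustration, not taken from the question), something like this should bring a bare & and > back as entities while leaving the tag alone. How a stray < is recovered depends on the underlying libxml parser, so it is worth checking that case on your own setup.

```php
<?php
// Collect libxml parser warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical database value with unescaped & and > in the text.
$fragment = '<p>Tom & Jerry: 2 > 1</p>';

$dom = new DOMDocument();
$dom->loadHTML($fragment); // the parser wraps the fragment in <html><body>

// Serialize only the <p> element; passing a node needs PHP 5.3.6 or later.
$p = $dom->getElementsByTagName('p')->item(0);
$html = $dom->saveHTML($p);

libxml_clear_errors();
echo $html, PHP_EOL;
```

Serializing just the element avoids the doctype and the html/body wrapper that loadHTML adds around a fragment.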
If you are not on PHP 5.3.6 or later, you cannot use saveHTML with a node. See How to get innerHTML of DOMNode? and How to return outer html of DOMDocument? for workarounds.
If you need to work on the text nodes, you can query them (for example with XPath) and apply your replacements to each one.
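A rough sketch of working on the text nodes only, assuming an XPath query for //text() and using strtoupper as a stand-in for whatever replacement you actually need (both the sample markup and the transformation are illustrative):

```php
<?php
// Collect libxml parser warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<p class="note">Hello <b>world</b></p>'); // hypothetical input

// //text() selects text nodes only; tag names and attribute values are skipped.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//text()') as $textNode) {
    // Placeholder transformation; swap in your own replacement here.
    $textNode->nodeValue = strtoupper($textNode->nodeValue);
}

// Serialize the <p> element again (node argument needs PHP 5.3.6 or later).
$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));

libxml_clear_errors();
echo $result, PHP_EOL;
```

Because the markup is never touched as a string, there is no need for a lookbehind to tell a tag's > from one that appears in the text.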
Also see DOMDocument in php for an introduction to how DOM works.