For example, we have this xml:
<x>
<y>some text</y>
<y>[ID] hello</y>
<y>world [/ID]</y>
<y>some text</y>
<y>some text</y>
</x>
and we need to remove the words “[ID]” and “[/ID]” and the text between them (which we don’t know in advance when parsing), of course without damaging the XML formatting.
The only solution I can think of is this:
- Find the text in the XML using a regex, for example: "/\[ID\].*?\[\/ID\]/". In our case, the result will be "[ID]hello</y><y>world[/ID]".
- In the result from the previous step, find the text outside the XML tags using this regex: "/(?<=^|>)[^><]+?(?=<|$)/", and delete this text. The result will be "</y><y>".
- Apply the change to the original XML by doing something like this: str_replace($step1string,$step2string,$xml);
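As a rough sketch, the three steps above could look like this in PHP. The one-line sample input, the `s` modifier and the `$matches`/`$result` names are assumptions added for illustration, and the sketch only covers a single, non-nested marker pair:

```php
<?php
// Sample input from the question, written on one line.
$xml = '<x><y>some text</y><y>[ID] hello</y><y>world [/ID]</y><y>some text</y></x>';

// Step 1: grab everything from [ID] to the nearest [/ID].
// The s modifier lets . match newlines as well.
preg_match('/\[ID\].*?\[\/ID\]/s', $xml, $matches);
$step1string = $matches[0];

// Step 2: delete the text that sits outside tags in that match, keeping the tags.
$step2string = preg_replace('/(?<=^|>)[^><]+?(?=<|$)/', '', $step1string);

// Step 3: put the tags-only version back into the original document.
$result = str_replace($step1string, $step2string, $xml);
echo $result, PHP_EOL;
```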
Is this the correct way to do this?
I just think that this “str_replace” approach is not the best way to edit XML, so maybe you know a better solution?
Removing the specific string is simple:
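A minimal sketch, assuming "the specific string" here means only the literal `[ID]` and `[/ID]` markers, with the text between them left alone. The sample input is shortened from the question:

```php
<?php
// Shortened sample input (an assumption, based on the question's XML).
$xml = '<x><y>[ID] hello</y><y>world [/ID]</y></x>';

// \[\/?ID\] matches both the opening [ID] and the closing [/ID] marker.
$stripped = preg_replace('/\[\/?ID\]/', '', $xml);
echo $stripped, PHP_EOL;
```

Because the pattern cannot match a `<` or `>`, the tags themselves stay untouched.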
When just removing text nodes in a specific tag, one could alter the preg_replace to these 2:
For your example, this results in:
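A sketch of what such a pair of replacements might look like, together with its result on the question's XML. The two patterns are assumptions, not the original ones: the first drops everything from `[ID]` to the end of its text node, the second drops everything from the start of a text node up to and including `[/ID]`.

```php
<?php
// Sample input from the question, written on one line.
$xml = '<x><y>some text</y><y>[ID] hello</y><y>world [/ID]</y><y>some text</y></x>';

$patterns = [
    '/\[ID\][^<]*/',   // from [ID] up to (not including) the next tag
    '/[^>]*\[\/ID\]/', // from the end of the previous tag up to and including [/ID]
];

// preg_replace applies the patterns one after the other.
$result = preg_replace($patterns, '', $xml);
echo $result, PHP_EOL;
// <x><y>some text</y><y></y><y></y><y>some text</y></x>
```

Elements that sit between the two markers are not touched by this, so any text inside them would survive.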
However, removing tags in between without damaging well-formed XML is quite tricky. Before venturing into a lot of DOM actions, how would you like to handle:
An [/ID] higher in the DOM-tree:
An [/ID] lower in the DOM-tree:
An open/close pair spanning siblings, as per your example:
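To make the three cases concrete, here is a small sketch with hypothetical fragments (none of them come from the original post). It loads each fragment and compares how deep the text nodes holding `[ID]` and `[/ID]` sit in the tree. The `depthOf` helper and the `$cases` and `$relation` names are made up for illustration.

```php
<?php
// Hypothetical fragments, one per case.
$cases = [
    'close higher' => '<x><y><z>[ID] hello</z></y><y>world [/ID]</y></x>',
    'close lower'  => '<x><y>[ID] hello</y><y><z>world [/ID]</z></y></x>',
    'siblings'     => '<x><y>[ID] hello</y><y>world [/ID]</y></x>',
];

// Number of ancestors of a node (the document node counts as one).
function depthOf(DOMNode $node): int
{
    $depth = 0;
    while ($node->parentNode !== null) {
        $node = $node->parentNode;
        $depth++;
    }
    return $depth;
}

$relation = [];
foreach ($cases as $label => $xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    $open  = $xpath->query('//text()[contains(., "[ID]")]')->item(0);
    $close = $xpath->query('//text()[contains(., "[/ID]")]')->item(0);
    // -1: closing marker is higher in the tree, 1: lower, 0: same depth
    $relation[$label] = depthOf($close) <=> depthOf($open);
    echo $label, ': ', $relation[$label], PHP_EOL;
}
```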
And a real dealbreaker of a question: is nesting possible, is that nesting well formed, and what should it do?
Without further knowledge of how these cases should be handled, there is no real answer.
Edit: well, further information was given. The actual, fail-safe solution (i.e. parse the XML, don’t use regexes) seems kind of long, but will work in 99.99% of cases (personal typos and brainfarts excluded of course 🙂 ):
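The following is not the original code but a shortened sketch of the parse-don't-regex idea, under stated assumptions: markers are never nested, every `[ID]` has a matching `[/ID]` later in document order, and elements between the markers are kept while only their text is removed. The function name `stripMarked` is hypothetical.

```php
<?php
// Walk all text nodes in document order and drop whatever lies
// between an [ID] marker and the next [/ID] marker.
function stripMarked(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    $inside = false; // true while we are between [ID] and [/ID]
    foreach ($xpath->query('//text()') as $node) {
        $text = (string) $node->nodeValue;
        $kept = '';
        while ($text !== '') {
            // Look for the marker that would flip the current state.
            $marker = $inside ? '[/ID]' : '[ID]';
            $pos = strpos($text, $marker);
            if ($pos === false) {
                // No marker left in this node: keep the rest only when outside.
                if (!$inside) {
                    $kept .= $text;
                }
                $text = '';
            } else {
                if (!$inside) {
                    $kept .= substr($text, 0, $pos);
                }
                $text = (string) substr($text, $pos + strlen($marker));
                $inside = !$inside;
            }
        }
        $node->nodeValue = $kept;
    }

    // Serialize without the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

$xml = '<x><y>some text</y><y>[ID] hello</y><y>world [/ID]</y><y>some text</y></x>';
echo stripMarked($xml), PHP_EOL;
```

Since only text node values are changed, the element structure stays exactly as parsed, which is what keeps the output well formed.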