We’re currently adding a small tag system to our software. There are just two tag styles: single ones and multiple ones.
The single ones look like this:
<<Single_Tag>>
The multiple ones look like this:
<<Multiple_Tag*>>
... stuff between tag ...
<</Multiple_Tag*>>
The RegEx to find the single ones would be:
<<\w+>>
The RegEx to find the multiple ones would be:
<<(\w+)\*{1}>>((.|\s)*)<</(\w+)\*{1}>>
Are the {1}’s required? Am I right that (.|\s)* needs to be greedy? Otherwise this RegEx would fail on:
<<multiple_tag1*>>
<<multiple_tag2*>>
<</multiple_tag2*>>
<</multiple_tag1*>>
Is there maybe an easier way with capturing groups? Excuse me if the following syntax is wrong. The last time I used RegEx was years ago:
<<(\w+)\*{1}>>((.|\s)*)<</($1)\*{1}>>
Here $1 stands for the first capturing group. I’m developing in .NET. I have already checked these on RegExr, but I remember how easy it is to overlook something when working with RegEx.
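For reference, here is a minimal sketch of the backreference idea. It is written in Python only so it can be run as-is; the pattern itself should carry over to .NET, where a backreference inside a pattern is also written \1 ($1 is the replacement-string syntax) and RegexOptions.Singleline plays the role of re.DOTALL. The names MULTI_TAG, text, outer_name and outer_body are made up for this illustration. It also shows that {1} can simply be dropped, and it uses a lazy .*? instead of the greedy (.|\s)*.

```python
import re

# \1 refers back to whatever group 1 captured (the tag name), so the
# closing tag must carry the same name as the opening tag.
# re.DOTALL lets "." match newlines, which replaces the (.|\s)* workaround.
# \* already means "exactly one asterisk", so {1} is not needed.
MULTI_TAG = re.compile(r"<<(\w+)\*>>(.*?)<</\1\*>>", re.DOTALL)

text = (
    "<<multiple_tag1*>>\n"
    "<<multiple_tag2*>>\n"
    "inner\n"
    "<</multiple_tag2*>>\n"
    "<</multiple_tag1*>>"
)

match = MULTI_TAG.search(text)
outer_name = match.group(1)   # name of the outermost tag
outer_body = match.group(2)   # everything between its open and close tags
```

With the backreference in place, the lazy quantifier skips past the closing tag of multiple_tag2 because its name differs. This still breaks when a tag is nested inside another tag of the same name, since the first matching close wins.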
See the following post about parsing HTML with regex, as it applies to this as well (my favourite Stack Overflow post ever).
RegEx match open tags except XHTML self-contained tags
Update
One way of solving this is to:
1) Build a tokenizer that splits your input into a sequence of tokens where each token is one of:
2) Call the tokenizer in a loop, and manually keep count of the opening and closing tags, making sure that they balance correctly.
Step (1) could be automated with a lexer generator. In theory step (2) could be automated by a parser generator, but this may be overkill in this case.
A common lexer and parser generator used in .NET is ANTLR.
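The two steps above could be sketched by hand roughly as follows. This is only an illustration in Python, not ANTLR output: the token kinds (OPEN, CLOSE, SINGLE, TEXT) and the function names tokenize and is_balanced are assumptions made for this example, and it uses a stack instead of a plain counter so that tag names can be checked as well.

```python
import re

# Hypothetical token kinds, chosen for this sketch only.
TOKEN_RE = re.compile(
    r"<<(?P<open>\w+)\*>>"      # <<Name*>>   opening multiple tag
    r"|<</(?P<close>\w+)\*>>"   # <</Name*>>  closing multiple tag
    r"|<<(?P<single>\w+)>>"     # <<Name>>    single tag
)

def tokenize(text):
    """Step 1: yield (kind, value) pairs; text between tags becomes TEXT."""
    pos = 0
    for m in TOKEN_RE.finditer(text):
        if m.start() > pos:
            yield ("TEXT", text[pos:m.start()])
        kind = m.lastgroup              # name of the alternative that matched
        yield (kind.upper(), m.group(kind))
        pos = m.end()
    if pos < len(text):
        yield ("TEXT", text[pos:])

def is_balanced(text):
    """Step 2: push open tags on a stack, check each close against the top."""
    stack = []
    for kind, value in tokenize(text):
        if kind == "OPEN":
            stack.append(value)
        elif kind == "CLOSE":
            if not stack or stack.pop() != value:
                return False
    return not stack                    # leftover open tags mean unbalanced
```

Calling is_balanced on the nested multiple_tag1 / multiple_tag2 input from the question should report it as balanced, while crossed or unclosed tags are rejected.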
Example
This input
Would generate the following tokens: