I have a basic text template engine that uses a syntax like this:
foo bar
%IF MY_VAR
some text
%IF OTHER_VAR
some other text
%ENDIF
%ENDIF
bar foo
I have an issue with the regular expression that I am using to parse it whereby it is not taking into account the nested IF/ENDIF blocks.
The current regex I’m using is: %IF (?<Name>[\w_]+)(?<Contents>.*?)%ENDIF
I have been reading up on balancing capture groups (a feature of .NET’s regex library) as I understand this is the recommended way of supporting “recursive” regexes in .NET.
I’ve been playing with balancing groups and have so far come up with the following:
(
(
(?'Open'%IF\s(?<Name>[\w_]+))
(?<Contents>.*?)
)+
(
(?'Close-Open'%ENDIF)(?<Remainder>.*?)
)+
)*
(?(Open)(?!))
But this is not behaving entirely how I would expect. It is for instance capturing a lot of empty groups. Help?
To capture a whole IF/ENDIF block with balanced IF statements, you can use this regex:
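A minimal sketch of a pattern along these lines, not necessarily the exact one intended here. It assumes `RegexOptions.IgnorePatternWhitespace` and `RegexOptions.Singleline` are set; the stack group name `Open` is borrowed from the question's attempt, and the class name `IfBlockDemo` is hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class IfBlockDemo
{
    // Sketch only: Open works as a stack. A nested %IF pushes, a nested %ENDIF pops.
    public const string Pattern = @"
        %IF\s+(?<Name>\w+)          # outer IF and its variable name
        (?<Contents>
            (?>                     # atomic: each step's chosen alternative is final
                (?<Open>%IF\b)      # nested IF: push onto the Open stack
              | (?<-Open>%ENDIF\b)  # nested ENDIF: pop (fails when the stack is empty)
              | .                   # any other character
            )+
            (?(Open)(?!))           # fail while a nested IF is still unclosed
        )
        %ENDIF                      # the ENDIF that closes the outer IF";

    // The template from the question.
    public const string Input =
        "foo bar\n%IF MY_VAR\nsome text\n%IF OTHER_VAR\nsome other text\n%ENDIF\n%ENDIF\nbar foo";

    public static void Main()
    {
        var regex = new Regex(Pattern,
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

        Match m = regex.Match(Input);
        Console.WriteLine(m.Groups["Name"].Value);      // MY_VAR
        // Everything between the outer %IF line and its %ENDIF, nested block included:
        Console.WriteLine(m.Groups["Contents"].Value);
    }
}
```

`Singleline` matters because the dot has to cross line breaks inside a block.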
The point here is this: a single Match reports only one value for each named group. You will only get one (?<Name>\w+) group, for example, holding the last captured value. In my regex, I kept the Name and Contents groups of your simple regex, and limited the balancing to the inside of the Contents group – the regex is still wrapped in IF and ENDIF.
It becomes interesting when your data is more complex. For example:
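A hypothetical input of that shape, run through a compact sketch of a balanced IF/ENDIF pattern. The names `OTHER_VAR` and `OTHER_VAR2`, the sample text, and the class name `TopLevelBlocksDemo` are assumptions made for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TopLevelBlocksDemo
{
    // Compact form of a balanced IF/ENDIF pattern (a sketch, not a canonical one).
    public static readonly Regex Block = new Regex(
        @"%IF\s+(?<Name>\w+)(?<Contents>(?>(?<Open>%IF\b)|(?<-Open>%ENDIF\b)|.)+(?(Open)(?!)))%ENDIF",
        RegexOptions.Singleline);

    // MY_VAR contains two nested blocks; OTHER_VAR3 is a second top-level block.
    public const string Sample =
        "%IF MY_VAR\nsome text\n" +
        "%IF OTHER_VAR\nsome other text\n%ENDIF\n" +
        "%IF OTHER_VAR2\nsome other text 2\n%ENDIF\n" +
        "%ENDIF\n" +
        "%IF OTHER_VAR3\nsome other text 3\n%ENDIF";

    public static void Main()
    {
        // Matches do not overlap, so only the outermost blocks are reported.
        foreach (Match m in Block.Matches(Sample))
            Console.WriteLine(m.Groups["Name"].Value);  // MY_VAR, then OTHER_VAR3
    }
}
```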
Here, you will get two matches, one for MY_VAR, and one for OTHER_VAR3. If you want to capture the two ifs in MY_VAR’s content, you have to rerun the regex on its Contents group (you can get around it by using a lookahead if you must – wrap the whole regex in (?=...), but you’ll need to put the results into a logical structure somehow, using positions and lengths).
Now, I won’t explain too much, because it seems you get the basics, but a short note about the contents group – I’ve used a possessive (atomic) group to avoid backtracking. Otherwise, it would be possible for the dot to eventually match whole IFs and break the balance. A lazy match on the group would behave similarly (( )+? instead of (?> )+).
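Rerunning the regex on each Contents group can be sketched as a small recursive walk. This is one possible arrangement rather than a definitive implementation: the class `NestedIfWalker`, the method `Walk`, the "depth:name" output format, and the sample input are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedIfWalker
{
    // Same style of balanced pattern as discussed above, written on one line.
    static readonly Regex Block = new Regex(
        @"%IF\s+(?<Name>\w+)(?<Contents>(?>(?<Open>%IF\b)|(?<-Open>%ENDIF\b)|.)+(?(Open)(?!)))%ENDIF",
        RegexOptions.Singleline);

    public const string Sample =
        "%IF MY_VAR\nsome text\n" +
        "%IF OTHER_VAR\nsome other text\n%ENDIF\n" +
        "%IF OTHER_VAR2\nsome other text 2\n%ENDIF\n" +
        "%ENDIF\n" +
        "%IF OTHER_VAR3\nsome other text 3\n%ENDIF";

    // Returns one "depth:name" entry per IF block, each outer block
    // listed before the blocks nested inside it.
    public static List<string> Walk(string text, int depth = 0)
    {
        var found = new List<string>();
        foreach (Match m in Block.Matches(text))
        {
            found.Add(depth + ":" + m.Groups["Name"].Value);
            // Rerun the same regex on the captured contents to reach nested blocks.
            found.AddRange(Walk(m.Groups["Contents"].Value, depth + 1));
        }
        return found;
    }

    public static void Main()
    {
        foreach (string entry in Walk(Sample))
            Console.WriteLine(entry);
        // 0:MY_VAR
        // 1:OTHER_VAR
        // 1:OTHER_VAR2
        // 0:OTHER_VAR3
    }
}
```

Because each nested call receives a substring, `Match.Index` inside it is relative to that Contents string, not to the original input. Add the parent group's `Index` if absolute positions are needed.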