I am using the regexec() function in C. I am trying to write a regular expression that captures the portions of a string to be substituted.
So for example, if I have the string “Hello $X”, then I want regexec to give me the range 6,7, as that is “$X”. But as there can be an arbitrary number of substitutions, I am using the regular expression:
"([^$]*(\\$[A-Za-z][A-Za-z0-9_]*))+"
This should match any arbitrary sequence of text + substitution patterns.
So for example, in the string “First=$X, Second=$Y” I need to know that $X occurred at offset 6-7 and $Y occurred at offset 17-18.
The actual offsets I get from regexec are:
0,19 8,19 17,19
First, I understand that the ending offset is actually one past the last character of the match. So the above offsets correspond to the following parts of the string:
First=$X, Second=$Y
, Second=$Y
$Y
Now I can see what is happening here: the first range is obviously the entire match, and the second is the first entire sub-match of the second sub-expression. But from this point on I am puzzled. Why is it only returning the first sub-match of the second sub-expression and not the first?
I suspect it has something to do with the fact that I have a repeating expression, but I’m not sure what I need to do to fix the problem. How do I get it to return the desired offsets?
Note: I am passing a 128-element regmatch_t array to regexec() (nmatch=128), so I should be able to get all matches.
You’re confused about what “first” and “second” mean. In this expression:
"([^$]*(\\$[A-Za-z][A-Za-z0-9_]*))+"
([^$]*(\\$[A-Za-z][A-Za-z0-9_]*)) is the first parenthesized subexpression and
(\\$[A-Za-z][A-Za-z0-9_]*) is the second. If a parenthesized subexpression gets used more than once as part of a *, ?, +, or {} repetition operator, it’s the last match that counts. That is why you get 8,19 and 17,19: both come from the final iteration of the +.

If you want to match an arbitrary number of instances, then rather than using the + on the end of your regex, you simply need to call regexec multiple times, using the ending offset of the previous run as your new starting point.
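
A minimal sketch of that loop, under a few assumptions: the helper name find_substitutions and its out/max_out parameters are made up for illustration, the pattern is compiled with REG_EXTENDED, and it is reduced to just the substitution token (no outer group, no trailing +). Since regexec reports offsets relative to the pointer it is handed, each match is shifted by the current starting offset before being stored.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of regex.h): stores up to max_out
 * [start, end) offsets of $NAME tokens found in text.
 * Returns the number of tokens stored, or -1 if regcomp fails. */
int find_substitutions(const char *text, regmatch_t *out, size_t max_out)
{
    regex_t re;

    /* Match only the token itself; repetition is handled by the loop. */
    if (regcomp(&re, "\\$[A-Za-z][A-Za-z0-9_]*", REG_EXTENDED) != 0)
        return -1;

    size_t count = 0;
    size_t base = 0;            /* where the current search starts in text */
    regmatch_t m;

    while (count < max_out && regexec(&re, text + base, 1, &m, 0) == 0) {
        /* Offsets are relative to text + base, so shift them back. */
        out[count].rm_so = (regoff_t)(base + (size_t)m.rm_so);
        out[count].rm_eo = (regoff_t)(base + (size_t)m.rm_eo);
        count++;

        /* Resume right after this match. A match is at least two
         * characters long, so the loop always makes progress. */
        base += (size_t)m.rm_eo;
    }

    regfree(&re);
    return (int)count;
}
```

Called on “First=$X, Second=$Y” with room for a few matches, this should store two entries, 6,8 and 17,19 (the end offset being one past the last character, as noted in the question). If the pattern used ^ or other anchors, the later calls would probably also need REG_NOTBOL, since each call sees the remaining text as a fresh string.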