I am writing a flex parser for gawk scripts. I am running into a problem differentiating between uses for a forward slash (/) character.
Obviously, a single / would be the division operator, but two slashes could delimit either a regular expression or two separate divisions. Right now, it parses
int((r-1)/3)*3+int((c-1)/3)+1
as having the regular expression
/3)*3+int((c-1)/
instead of the intended division operations. How do I get flex to recognize it as a mathematical expression?
Right now, this is my flex regular expression to recognize regular expressions in gawk:
EXT_REG_EXP "\/"("\\\/"|[^\/\n])*"\/"
and the division operator should be caught by my list of operators:
OPERATOR "+"|"-"|"*"|"/"|"%"|"^"|"!"|">"|"<"|"|"|"?"|":"|"~"|"$"|"="
But since flex regular expressions are greedy, I guess it treats the two divisions as the delimiters of a regular expression.
I don’t think it’s possible to define a simple token expression to unambiguously identify regular expressions. The Posix spec for Awk notes the ambiguity thusly:
And later:
(“ERE” stands for “extended regular expression.”) From this, I think you can safely conclude that a tokenizer for Awk has to be aware of the syntactic context, and hence there is no possible regular expression that could successfully identify regular expression tokens.
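To make "aware of the syntactic context" concrete, here is a minimal sketch in plain C of one common approximation: remember whether the previous token could end an operand, and treat / as division only in that case. This is a heuristic I am assuming for illustration, not what the Posix grammar or any Awk implementation prescribes. The function name classify_slashes and the 'D'/'R' output letters are hypothetical. It ignores string literals and comments, and a keyword such as print would also need to clear the flag.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: for each '/' that starts a token in src, append
 * 'D' (division) or 'R' (start of a regex) to out. outsz must be >= 1.
 * Limitations of this sketch: string literals, comments and keywords
 * such as "print" are not handled. */
static void classify_slashes(const char *src, char *out, size_t outsz)
{
    int prev_is_operand = 0;   /* could the previous token end an operand? */
    size_t n = 0;
    const char *p = src;

    while (*p != '\0') {
        unsigned char c = (unsigned char)*p;

        if (isspace(c)) {
            p++;                         /* whitespace does not change context */
        } else if (isalnum(c) || c == '_') {
            /* identifier or number: consume it as one operand */
            while (isalnum((unsigned char)*p) || *p == '_' || *p == '.')
                p++;
            prev_is_operand = 1;
        } else if (c == ')' || c == ']') {
            p++;                         /* a closing bracket ends an operand */
            prev_is_operand = 1;
        } else if (c == '/' && !prev_is_operand) {
            /* no operand on the left, so this must open a regex */
            if (n + 1 < outsz)
                out[n++] = 'R';
            p++;
            while (*p != '\0' && *p != '\n' && *p != '/') {
                if (*p == '\\' && p[1] != '\0')
                    p++;                 /* skip the escaped character */
                p++;
            }
            if (*p == '/')
                p++;                     /* closing delimiter */
            prev_is_operand = 1;         /* a regex literal is itself an operand */
        } else {
            /* any other operator or opening bracket; '/' here is division */
            if (c == '/' && n + 1 < outsz)
                out[n++] = 'D';
            p++;
            prev_is_operand = 0;
        }
    }
    out[n] = '\0';
}
```

Called on int((r-1)/3)*3+int((c-1)/3)+1, this reports both slashes as division, because each one follows a closing parenthesis. In a flex scanner the same flag could be kept in a global variable updated by each rule's action, but a flag like this only approximates what the grammar knows.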
It’s also worth looking at how Awk itself (or at least one of the implementations) is defined to parse regexes. In the original Awk (sometimes called the One True Awk), identifying regular expressions is the job of the parser, which explicitly sets the lexer into “regex mode” when it has figured out that it should expect to read a regex:
(startreg() is a function defined in lex.c.) The reg_expr rule itself is only ever matched in contexts where a division operator would be invalid.
Sorry to disappoint, but I hope this helps nonetheless.