I am writing a flex parser for gawk scripts. I am running into a

Asked: June 12, 2026 (2026-06-12T02:42:09+00:00)

I am writing a flex parser for gawk scripts. I am running into a problem differentiating between uses for a forward slash (/) character.

A single / is clearly the division operator, but a pair of slashes could either delimit a regular expression or be two separate divisions. Right now, my lexer parses

int((r-1)/3)*3+int((c-1)/3)+1

as having the regular expression

/3)*3+int((c-1)/

instead of the intended division operations. How do I get flex to recognize it as a mathematical expression?

Right now, this is my flex regular expression to recognize regular expressions in gawk:

EXT_REG_EXP "\/"("\\\/"|[^\/\n])*"\/"

and the division operator should be caught by my list of operators:

OPERATOR "+"|"-"|"*"|"/"|"%"|"^"|"!"|">"|"<"|"|"|"?"|":"|"~"|"$"|"="

But since flex always prefers the longest match, I guess it treats the text between two division operators as a regular expression.

1 Answer

Editorial Team · Answer 1 · 2026-06-12T02:42:10+00:00

I don’t think it’s possible to define a simple token expression that unambiguously identifies regular expressions. The POSIX spec for Awk describes the ambiguity as follows:

In some contexts, a slash ( ‘/’ ) that is used to surround an ERE
could also be the division operator. This shall be resolved in such a
way that wherever the division operator could appear, a slash is
assumed to be the division operator. (There is no unary division
operator.)

And later:

There is a lexical ambiguity between the token ERE and the tokens ‘/’
and DIV_ASSIGN. When an input sequence begins with a slash character
in any syntactic context where the token ‘/’ or DIV_ASSIGN could
appear as the next token in a valid program, the longer of those two
tokens that can be recognized shall be recognized. In any other
syntactic context where the token ERE could appear as the next token
in a valid program, the token ERE shall be recognized.

(“ERE” stands for “extended regular expression.”) From this, I think you can safely conclude that a tokenizer for Awk has to be aware of the syntactic context, and hence that no single flex pattern, applied without that context, can reliably identify regular expression tokens.
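One common way to approximate that context awareness inside the lexer is to remember what kind of token came last and to treat a slash as division only when it follows something that can end an operand. The sketch below is a hand-written C illustration of that heuristic, not a flex specification and not how any particular Awk does it. The names `TokKind`, `slash_is_division`, and `count_regex_starts` are made up for this example, and the classification is deliberately coarse: keywords, comments, and postfix `++`/`--` are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token classes; a real gawk lexer would need many more. */
typedef enum { TK_NONE, TK_OPERAND, TK_CLOSE, TK_OTHER } TokKind;

/* A '/' can be division only directly after something that ends an operand. */
bool slash_is_division(TokKind prev)
{
    return prev == TK_OPERAND || prev == TK_CLOSE;
}

/* Scan src and count the '/' characters that open an ERE instead of dividing. */
int count_regex_starts(const char *src)
{
    TokKind prev = TK_NONE;
    int regexes = 0;
    size_t i = 0;

    assert(src != NULL);
    while (src[i] != '\0') {
        unsigned char c = (unsigned char)src[i];
        if (c == ' ' || c == '\t') {
            i++;                          /* blanks do not change the context */
        } else if (c == '\n') {
            prev = TK_NONE;               /* a new line starts a new context */
            i++;
        } else if (isalnum(c) || c == '_' || c == '.') {
            /* Identifier or number: consume the whole run. */
            while (isalnum((unsigned char)src[i]) || src[i] == '_' || src[i] == '.')
                i++;
            prev = TK_OPERAND;
        } else if (c == '"') {
            /* String literal: skip to the closing quote, honoring escapes. */
            i++;
            while (src[i] != '\0' && src[i] != '"' && src[i] != '\n')
                i += (src[i] == '\\' && src[i + 1] != '\0') ? 2 : 1;
            if (src[i] == '"')
                i++;
            prev = TK_OPERAND;
        } else if (c == ')' || c == ']') {
            prev = TK_CLOSE;
            i++;
        } else if (c == '/' && !slash_is_division(prev)) {
            /* Regex literal: skip to the next unescaped '/'. */
            i++;
            while (src[i] != '\0' && src[i] != '/' && src[i] != '\n')
                i += (src[i] == '\\' && src[i + 1] != '\0') ? 2 : 1;
            if (src[i] == '/')
                i++;
            regexes++;
            prev = TK_OPERAND;            /* a regex literal acts as an operand */
        } else {
            prev = TK_OTHER;              /* operators, '(', '{', ';', ',' ... */
            i++;
        }
    }
    return regexes;
}
```

Called on the expression from the question, `count_regex_starts("int((r-1)/3)*3+int((c-1)/3)+1")` finds no regex, because each slash directly follows a closing parenthesis. In a flex scanner the same idea would usually be expressed by recording the last token returned and consulting it in the action for `/`.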

It’s also worth looking at how Awk itself (or at least one of the implementations) is defined to parse regexes. In the original Awk (sometimes called the One True Awk), identifying regular expressions is the job of the parser, which explicitly sets the lexer into “regex mode” when it has figured out that it should expect to read a regex:

reg_expr:
      '/' {startreg();} REGEXPR '/'     { $$ = $3; }
    ;

(startreg() is a function defined in lex.c.) The reg_expr rule itself is only ever matched in contexts where a division operator would be invalid.
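To make the "regex mode" idea concrete, here is a minimal C sketch of what a lexer might do once the parser has told it that a regex body follows. It is not the code from lex.c: `read_regex_body` and its parameters are hypothetical names chosen for this illustration. It assumes the opening slash has already been consumed, keeps backslash escapes verbatim, and does not treat a slash inside a bracket expression specially.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: on entry *pos is the index just past the opening '/'.
 * Copies the regex body into out (NUL-terminated) and leaves *pos just past
 * the closing '/'. Returns the body length, or -1 if the regex is
 * unterminated or does not fit into out.
 */
long read_regex_body(const char *src, size_t *pos, char *out, size_t outsz)
{
    size_t i, n = 0;

    assert(src != NULL && pos != NULL && out != NULL && outsz > 0);
    i = *pos;
    while (src[i] != '\0' && src[i] != '\n' && src[i] != '/') {
        if (src[i] == '\\' && src[i + 1] != '\0' && src[i + 1] != '\n') {
            /* Escaped character: keep the backslash, then copy what follows. */
            if (n + 1 >= outsz)
                return -1;
            out[n++] = src[i++];
        }
        if (n + 1 >= outsz)
            return -1;
        out[n++] = src[i++];
    }
    if (src[i] != '/')
        return -1;                /* hit a newline or end of input first */
    out[n] = '\0';
    *pos = i + 1;                 /* resume normal scanning after the '/' */
    return (long)n;
}
```

Because this function only runs when the grammar expects a regex, the division slashes in an arithmetic expression never reach it; they are returned as ordinary `/` tokens by the normal scanning rules.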

Sorry to disappoint, but I hope this helps nonetheless.
