In modern compiler implementation in Java by Andrew Appel he claims in an exercise

Question

Asked: May 31, 2026

In “Modern Compiler Implementation in Java” by Andrew Appel, he claims in an exercise that:

Lex has a lookahead operator / so that the regular expression abc/def matches abc only when followed by def (but def is not part of the matched string, and will be part of the next token(s)). Aho et al. [1986] describe, and Lex [Lesk 1975] uses, an incorrect algorithm for implementing lookahead (it fails on (a|ab)/ba with input aba, matching ab where it should match a). Flex [Paxson 1995] uses a better mechanism that works correctly for (a|ab)/ba but fails (with a warning message) on zx*/xy*. Design a better lookahead mechanism.

Does anyone know the solution to what he is describing?

1 Answer

Editorial Team · Answer 1 · 2026-05-31T15:36:31+00:00

“Does not work how I think it should” and “incorrect” are not always the same thing. Given the input

aba

and the pattern

(a|ab)/ba

it makes a certain amount of sense for the (a|ab) to match greedily, and then for the /ba constraint to be applied separately. You’re thinking that it should work like this regular expression:

(a|ab)(ba)

with the constraint that the part matched by (ab) is not consumed. That’s probably better because it removes some limitations, but since there weren’t any external requirements for what lex should do at the time it was written, you cannot call either behavior correct or incorrect.
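The two readings can be compared with a small Python sketch. This is only an illustration, not how lex or flex is implemented: the function names `match_separate` and `match_joint` are made up here, prefixes are tried by brute force instead of with a DFA, and the tie-break in `match_joint` (longest overall match, then longest r) is an assumption. The pattern and input are the ones from the exercise.

```python
import re

def match_separate(r, x, s):
    """Reading 1: match r on its own (longest prefix), then require x to follow."""
    # Every prefix length at which r matches the whole prefix.
    ends = [i for i in range(len(s) + 1) if re.fullmatch(r, s[:i])]
    if not ends:
        return None
    i = max(ends)  # longest match for r, chosen without looking at x
    # x is applied afterwards as an independent constraint on what follows.
    return s[:i] if re.match(f"(?:{x})", s[i:]) else None

def match_joint(r, x, s):
    """Reading 2: match r followed by x as a unit, but consume only the r part."""
    best = None  # (end of x, end of r)
    for i in range(len(s) + 1):
        if not re.fullmatch(r, s[:i]):
            continue
        for j in range(i, len(s) + 1):
            if re.fullmatch(x, s[i:j]):
                cand = (j, i)
                # Assumed tie-break: longest overall match, then longest r.
                if best is None or cand > best:
                    best = cand
    return s[:best[1]] if best else None

print(match_separate("a|ab", "ba", "aba"))
print(match_joint("a|ab", "ba", "aba"))
```

On aba the first function finds no match at all (r greedily takes ab, and what follows is a, not ba), while the second returns a. Neither reproduces the ab result that Appel attributes to Lex, so treat these as two candidate semantics and not as a model of any real implementation.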

The naive way has the merit that adding a trailing context doesn’t change the meaning of a token, but simply adds a totally separate constraint about what may follow it. But that does lead to limitations/surprises:

 {IDENT}  /* original code */

 {IDENT}/ab   /* ident, only when followed by ab */

Oops, it won’t work because “ab” is swallowed into IDENT precisely because its meaning was not changed by the trailing context. That turns into a limitation, but maybe it’s a limitation that the author was willing to live with in exchange for simplicity. (What is the use case for making it more contextual, anyway?)

How about the other way? That could have surprises also:

 {IDENT}/ab  /* input is bracadabra:123 */

Say the user wants this not to match because bracadabra is not an identifier followed by (or ending in) ab. But {IDENT}/ab will match bracad, leaving abra:123 in the input.

A user could have expectations which are foiled no matter how you pin down the semantics.
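Both surprises are easy to reproduce outside lex. The Python sketch below assumes {IDENT} means [a-z]+ (the examples above never define it), and `ident_splits` is a hypothetical helper written just for this illustration. It lists every prefix that is an identifier and says whether ab follows it.

```python
import re

IDENT = r"[a-z]+"  # assumption: the original never spells out {IDENT}

def ident_splits(s, trailing):
    """List (token, followed) for every prefix of s that IDENT matches in full.

    `followed` is True when `trailing` matches immediately after that prefix.
    """
    result = []
    for i in range(1, len(s) + 1):
        if re.fullmatch(IDENT, s[:i]):
            followed = re.match(f"(?:{trailing})", s[i:]) is not None
            result.append((s[:i], followed))
    return result

for text in ("fooab", "bracadabra:123"):
    splits = ident_splits(text, "ab")
    longest, longest_ok = splits[-1]  # prefixes are listed shortest first
    print(text)
    print("  longest IDENT:", longest, "| followed by ab?", longest_ok)
    print("  prefixes followed by ab:", [tok for tok, ok in splits if ok])
```

For fooab the longest identifier is fooab itself, which is not followed by ab, so the separate-constraint reading rejects it even though foo followed by ab is sitting right there. For bracadabra:123 the only identifier prefix followed by ab is bracad, which is what the joint reading would return.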

lex is now standardized by the Single UNIX Specification, which says this:

r/x
The regular expression r shall be matched only if it is followed by an occurrence of regular expression x (x is the instance of trailing context, further defined below). The token returned in yytext shall only match r. If the trailing portion of r matches the beginning of x, the result is unspecified. The r expression cannot include further trailing context or the ‘$’ (match-end-of-line) operator; x cannot include the ‘^’ (match-beginning-of-line) operator, nor trailing context, nor the ‘$’ operator. That is, only one occurrence of trailing context is allowed in a lex regular expression, and the ‘^’ operator only can be used at the beginning of such an expression.

So you can see that there is room for interpretation here. The r and x can be treated as separate regexes, with a match for r computed in the normal way as if it were alone, and then x applied as a special constraint.

The spec also has discussion about this very issue (you are in luck):

The following examples clarify the differences between lex regular expressions and regular expressions appearing elsewhere in this volume of IEEE Std 1003.1-2001. For regular expressions of the form “r/x”, the string matching r is always returned; confusion may arise when the beginning of x matches the trailing portion of r. For example, given the regular expression “a*b/cc” and the input “aaabcc”, yytext would contain the string “aaab” on this match. But given the regular expression “x*/xy” and the input “xxxy”, the token xxx, not xx, is returned by some implementations because xxx matches “x*”.

In the rule “ab*/bc”, the “b*” at the end of r extends r’s match into the beginning of the trailing context, so the result is unspecified. If this rule were “ab/bc”, however, the rule matches the text “ab” when it is followed by the text “bc”. In this latter case, the matching of r cannot extend into the beginning of x, so the result is specified.
As you can see there are some limitations in this feature.

Unspecified behavior means that there are some choices about what the behavior should be, none of which are more correct than the others (and don’t write patterns like that if you want your lex program to be portable). “As you can see, there are some limitations in this feature”.
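The examples from the spec can be run through two candidate policies in Python to see where they agree and where they diverge. This is a sketch under assumptions, not the algorithm of any particular lex: `token_joint` returns the r part of the longest r-then-x match, and `token_backup` is a guess at the kind of policy that yields xxx, namely find the longest r-then-x match and then back up to the last position where r alone had matched, without rechecking x. All three function names are invented for this illustration.

```python
import re

def joint_splits(r, x, s):
    """All (end_of_r, end_of_x) pairs where r matches s[:i] and x matches s[i:j]."""
    return [(i, j)
            for i in range(len(s) + 1) if re.fullmatch(r, s[:i])
            for j in range(i, len(s) + 1) if re.fullmatch(x, s[i:j])]

def token_joint(r, x, s):
    """r part of the longest r-then-x match (longest r on ties, an assumption)."""
    splits = joint_splits(r, x, s)
    if not splits:
        return None
    _, i = max((j, i) for i, j in splits)
    return s[:i]

def token_backup(r, x, s):
    """Longest r-then-x match, then back up to the last point where r matched."""
    splits = joint_splits(r, x, s)
    if not splits:
        return None
    end = max(j for _, j in splits)
    # Last position up to `end` at which r alone matches; x is not rechecked.
    k = max(k for k in range(end + 1) if re.fullmatch(r, s[:k]))
    return s[:k]

cases = [
    ("a*b", "cc", "aaabcc"),  # r cannot run into x: both policies agree
    ("x*", "xy", "xxxy"),     # r can absorb the start of x
    ("ab*", "bc", "abbc"),    # the unspecified case from the spec
    ("ab", "bc", "abbc"),     # the specified variant
    ("a|ab", "ba", "aba"),    # the exercise's example
]
for r, x, s in cases:
    print(f"{r}/{x} on {s}: joint={token_joint(r, x, s)} backup={token_backup(r, x, s)}")
```

The two policies agree on a*b/cc and ab/bc, and disagree on x*/xy (xx versus xxx), ab*/bc (ab versus abb) and (a|ab)/ba (a versus ab). Those are the patterns where the end of r can overlap the start of x, which is the situation the spec leaves unspecified.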
