In “Modern Compiler Implementation in Java”, Andrew Appel claims in an exercise that:
Lex has a lookahead operator / so that the regular expression abc/def matches abc only when followed by def (but def is not part of the matched string, and will be part of the next token(s)). Aho et al. [1986] describe, and Lex [Lesk 1975] uses, an incorrect algorithm for implementing lookahead (it fails on (a|ab)/ba with input aba, matching ab where it should match a). Flex [Paxson 1995] uses a better mechanism that works correctly for (a|ab)/ba but fails (with a warning message) on zx*/xy*. Design a better lookahead mechanism.
Does anyone know the solution to what he is describing?
“Does not work how I think it should” and “incorrect” are not always the same thing. Given the input
and the pattern
it makes a certain amount of sense for the (ab|a) to match greedily, and then for the /ab constraint to be applied separately. You’re thinking that it should work like a single regular expression in which (ab|a) is followed by (ab), with the constraint that the part matched by (ab) is not consumed. That’s probably better because it removes some limitations, but since there weren’t any external requirements for what lex should do at the time it was written, you cannot call either behavior correct or incorrect.

The naive way has the merit that adding a trailing context doesn’t change the meaning of a token, but simply adds a totally separate constraint about what may follow it. But that does lead to limitations and surprises:
Oops, it won’t work because “ab” is swallowed into IDENT precisely because its meaning was not changed by the trailing context. That turns into a limitation, but maybe it’s a limitation that the author was willing to live with in exchange for simplicity. (What is the use case for making it more contextual, anyway?)
How about the other way? That could have surprises also:
Say the user wants this not to match because bracadabra is not an identifier followed by (or ending in) ab. But {IDENT}/ab will match bracad, leaving abra:123 in the input.

A user could have expectations which are foiled no matter how you pin down the semantics.
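Both readings of {IDENT}/ab can be tried out with Python's re module, used here only as a stand-in for a lex implementation. This is a rough sketch: the definition of IDENT as [a-z]+ is an assumption made for illustration, and the input bracadabra:123 is inferred from the description above.

```python
import re

IDENT = r"[a-z]+"        # assumed definition of IDENT, for illustration only
text = "bracadabra:123"  # input inferred from the description above

# Reading 1: IDENT matches as much as it can on its own, and the trailing
# context "ab" is then checked as a separate constraint on what follows.
m = re.match(IDENT, text)
naive = m.group() if text.startswith("ab", m.end()) else None

# Reading 2: IDENT and the trailing context are matched together, and only
# the IDENT part is consumed (a regex lookahead behaves this way).
m2 = re.match(IDENT + r"(?=ab)", text)
contextual = m2.group() if m2 else None

print(naive)       # None
print(contextual)  # bracad
```

Under the first reading the rule does not fire at all, because IDENT has already swallowed the ab; under the second it fires on bracad, which is the surprise described above.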
lex is now standardized by the Single UNIX Specification, which says this:

So you can see that there is room for interpretation here. The r and x can be treated as separate regexes, with a match for r computed in the normal way as if it were alone, and then x applied as a special constraint.
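The difference between the two interpretations of r/x can be sketched in a few lines of Python. The helper names below (longest_prefix, match_separate, match_combined) are hypothetical, and the code illustrates the semantics only; it is not how lex or flex implement trailing context, and match_separate models the "separate constraint" reading described here rather than the exact behavior Appel attributes to Lex.

```python
import re

def longest_prefix(pattern, text):
    """Length of the longest prefix of text that pattern matches in full, or None."""
    for end in range(len(text), -1, -1):
        if re.fullmatch(pattern, text[:end]):
            return end
    return None

def match_separate(r, x, text):
    """r is matched as if it were alone; x is then checked as a separate constraint."""
    end = longest_prefix(r, text)
    if end is None or not re.match(x, text[end:]):
        return None
    return text[:end]

def match_combined(r, x, text):
    """Longest prefix that matches r and is also followed by a match for x."""
    for end in range(len(text), -1, -1):
        if re.fullmatch(r, text[:end]) and re.match(x, text[end:]):
            return text[:end]
    return None

print(match_separate("(a|ab)", "ba", "aba"))  # None
print(match_combined("(a|ab)", "ba", "aba"))  # a
```

On Appel's example, the separate-constraint reading commits to ab and then finds no ba after it, while the combined reading settles on a, the answer the exercise calls correct.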
The spec also has discussion about this very issue (you are in luck):
Unspecified behavior means that there are some choices about what the behavior should be, none of which is more correct than the others (so don’t write patterns like that if you want your lex program to be portable). “As you can see, there are some limitations in this feature.”