Suppose I need a simple grammar that describes a language like
foo 2
bar 21
but not
foo1
Using JFlex I wrote something like
WORD=[a-zA-Z]+
NUMBER=[0-9]+
WHITE_SPACE_CHAR=[\ \n\r\t\f]
%state AFTER_WORD
%state AFTER_WORD_SEPARATOR
%%
<YYINITIAL>{WORD} { yybegin(AFTER_WORD); return TokenType.WORD; }
<AFTER_WORD>{WHITE_SPACE_CHAR}+ { yybegin(AFTER_WORD_SEPARATOR); return TokenType.WHITE_SPACE; }
<AFTER_WORD_SEPARATOR>{NUMBER} { yybegin(YYINITIAL); return TokenType.NUMBER; }
{WHITE_SPACE_CHAR}+ { return TokenType.WHITE_SPACE; }
But I don't like the extra states, which exist only to say that there must be whitespace between the word and the number. How can I simplify my grammar?
From what I know of JFlex, if you are recognizing whitespace correctly (which seems to be the case), you don’t have to use extra states. Just make one rule for “identifiers” and another for “numbers”.
If your language requires each line to consist of exactly one identifier, one space and one number, that should be checked by syntactic analysis (i.e. by a parser), not by lexical analysis.
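To make that split concrete, here is a minimal sketch in plain Java rather than a generated JFlex scanner. It assumes the three token patterns from the question, and it stands in for the lexer with java.util.regex so the example is self-contained. The names LineChecker, tokenize and isValidLine are made up for illustration. The tokenizer has no states at all: it turns "foo1" into WORD followed by NUMBER, and the parser-level check rejects that sequence because the WHITE_SPACE token is missing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineChecker {
    enum TokenType { WORD, NUMBER, WHITE_SPACE }

    // The same three patterns as the macros in the question, with no lexer states.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<WORD>[a-zA-Z]+)|(?<NUMBER>[0-9]+)|(?<WS>[ \\n\\r\\t\\f]+)");

    // "Lexical analysis": split the input into tokens, ignoring their order.
    static List<TokenType> tokenize(String input) {
        List<TokenType> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            if (m.group("WORD") != null) {
                tokens.add(TokenType.WORD);
            } else if (m.group("NUMBER") != null) {
                tokens.add(TokenType.NUMBER);
            } else {
                tokens.add(TokenType.WHITE_SPACE);
            }
            pos = m.end();
        }
        return tokens;
    }

    // "Syntactic analysis": a line must be exactly WORD WHITE_SPACE NUMBER.
    static boolean isValidLine(String line) {
        List<TokenType> expected =
                Arrays.asList(TokenType.WORD, TokenType.WHITE_SPACE, TokenType.NUMBER);
        try {
            return tokenize(line).equals(expected);
        } catch (IllegalArgumentException e) {
            return false; // contains a character that no token pattern accepts
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidLine("foo 2"));  // true
        System.out.println(isValidLine("bar 21")); // true
        System.out.println(isValidLine("foo1"));   // false
    }
}
```

In a real setup the tokens would come from the scanner JFlex generates from the state-free rules, and the ordering check would live in whatever parser consumes them; the hand-written check above only illustrates where the "whitespace between word and number" constraint belongs.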