If I have an ANTLR grammar as follows:
grammar Test;
options {
language = Java;
}
rule : (foo | bar);
foo : FOO ',' FOO;
bar : BAR;
FOO: ('0'..'9')+;
BAR: ('a'..'z' | 'A'..'Z' | '0'..'9' | ' ')+;
WHITESPACE: (' ' | '\t')+ { $channel=HIDDEN; };
And I use a test string:
12abc3
this (I believe) is a BAR token which satisfies a bar rule and is parsed as such. Bravo.
However, if I have this string:
12
I receive line 1:2 mismatched input '' expecting ','
This seems rather non-deterministic although I’m sure it’s not. I understand I’m already in trouble by having two tokens: FOO and BAR that accept digits. But if the parser is going to succeed or fail it should succeed or fail consistently. In other words, in the first case the first character is a 1 and apparently is being evaluated as a member of the BAR token and thus the parser heads down a successful path. In the second case, the SAME first character is being evaluated as a FOO token and thus the path is doomed to fail despite the fact that the string COULD be a successful bar parse. Why the inconsistency? Or am I missing something more fundamental about ANTLR and/or parsing?
ANTLR doesn’t determine the token type until it sees the first character for the next token(or EOF). ANTLR will also attempt a longest match, which is why you see ’12abc3′ as BAR and not as FOO BAR. In the second case ANTLR will use FOO for ’12’ because it is listed first in the grammar.
ANTLR basics
ANTLR lexers