Below is a cut down version of a grammar that is parsing an input assembly file. Everything in my grammar is fine until i use labels that have 3 characters (i.e. same length as an OPCODE in my grammar), so I’m assuming Antlr is matching it as an OPCODE rather than a LABEL, but how do I say “in this position, it should be a LABEL, not an OPCODE”?
Trial input:
set a, label1
set b, abc
Output from a standard rig gives:
line 2:5 missing EOF at ','
(OP_BAS set a (REF label1)) (OP_SPE set b)
When I step debug through ANTLRWorks, I see it start down instruction rule 2, but at the reference to “abc” jumps to rule 3 and then fail at the “,”.
I can solve this with massive left factoring, but it makes the grammar incredibly unreadable. I’m trying to find a compromise (there isn’t so much input that the global backtrack is a hit on performance) between readability and functionality.
grammar TestLabel;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens {
NEGATION;
OP_BAS;
OP_SPE;
OP_CMD;
REF;
DEF;
}
program
: instruction* EOF!
;
instruction
: LABELDEF -> ^(DEF LABELDEF)
| OPCODE dst_op ',' src_op -> ^(OP_BAS OPCODE dst_op src_op)
| OPCODE src_op -> ^(OP_SPE OPCODE src_op)
| OPCODE -> ^(OP_CMD OPCODE)
;
operand
: REG
| LABEL -> ^(REF LABEL)
| expr
;
dst_op
: PUSH
| operand
;
src_op
: POP
| operand
;
term
: '('! expr ')'!
| literal
;
unary
: ('+'! | negation^ )* term
;
negation
: '-' -> NEGATION
;
mult
: unary ( ( '*'^ | '/'^ ) unary )*
;
expr
: mult ( ( '+'^ | '-'^ ) mult )*
;
literal
: number
| CHAR
;
number
: HEX
| BIN
| DECIMAL
;
REG: ('A'..'C'|'I'..'J'|'X'..'Z'|'a'..'c'|'i'..'j'|'x'..'z') ;
OPCODE: LETTER LETTER LETTER;
HEX: '0x' ( 'a'..'f' | 'A'..'F' | DIGIT )+ ;
BIN: '0b' ('0'|'1')+;
DECIMAL: DIGIT+ ;
LABEL: ( '.' | LETTER | DIGIT | '_' )+ ;
LABELDEF: ':' ( '.' | LETTER | DIGIT | '_' )+ {setText(getText().substring(1));} ;
STRING: '\"' .* '\"' {setText(getText().substring(1, getText().length()-1));} ;
CHAR: '\'' . '\'' {setText(getText().substring(1, 2));} ;
WS: (' ' | '\n' | '\r' | '\t' | '\f')+ { $channel = HIDDEN; } ;
fragment LETTER: ('a'..'z'|'A'..'Z') ;
fragment DIGIT: '0'..'9' ;
fragment PUSH: ('P'|'p')('U'|'u')('S'|'s')('H'|'h');
fragment POP: ('P'|'p')('O'|'o')('P'|'p');
The parser has no influence on what tokens the lexer produces. So, the input
"abc"will always be tokenized as aOPCODE, no matter what the parser tries to match.What you can do is create a
labelparser rules that matches either aLABELorOPCODEand then use thislabelrule in youroperandrule:resulting in the following AST for your example input:
This will only match
OPCODE, but will not change the type of the token. If you want the type to be changed as well, add a bit of custom code to the rule that changes it to typeLABEL: