I am developing a lexer grammar for C/C++ source code. The goal of the grammar is to fight plagiarism between students at university.
To improve the effectiveness of the grammar, I want ANTLR to create the same token for the 4(?) different ways a student could increment a variable:
i++
++i
i += 1
(i = i + 1) [I doubt that this can be solved with ANTLR]
Each of these expressions should result in the token INCREMENT.
What I have come up with so far (only the necessary parts of the grammar are reproduced here):
options {
language = CSharp3;
filter = true;
k = 2;
}
INCREMENT : IDENTIFIER (PLUSPLUS | ADDEQUAL '1') | PLUSPLUS IDENTIFIER ;
IDENTIFIER
: LETTER (LETTER | DIGIT)*;
/*
* covers both decimal and hex integer literals
*/
INTEGER_LITERAL :
DIGIT+ | '0x' HEX_DIGIT+;
ADDEQUAL : '+=';
PLUSPLUS : '++';
fragment
LETTER : 'A'..'Z' | 'a'..'z';
fragment
HEX_DIGIT : DIGIT | 'a'..'f' | 'A'..'F';
fragment
DIGIT : '0'..'9';
Testing this grammar with the input i += 1 results in the token sequence IDENTIFIER ADDEQUAL INTEGER_LITERAL instead of INCREMENT.
Why is that?
As I understand it, rule precedence runs from top to bottom, and INCREMENT is also the “bigger” rule.
What adjustments do I need to make to the grammar to get the desired result?
Because "i += 1" contains spaces you didn’t account for inside your INCREMENT rule. Account for the spaces (and line breaks, possibly).
However, creating a lexer alone does not seem the way to go here. You really need a parser, IMO. Also, the option k = 2; sets the lookahead for parser rules, not lexer rules, so if you stick to lexing only, you might as well remove it.
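If you do stay with a lexer-only approach, the whitespace fix could look roughly like the following. This is a minimal, untested sketch in ANTLR v3 syntax as used in the question; the fragment name SPACE is hypothetical and not part of the original grammar, and the other rules (IDENTIFIER, PLUSPLUS, ADDEQUAL) are assumed to stay as they are.

```
// Sketch only: SPACE is a hypothetical helper fragment, not from the original grammar.
INCREMENT
    : IDENTIFIER SPACE* (PLUSPLUS | ADDEQUAL SPACE* '1')   // i++  and  i += 1
    | PLUSPLUS SPACE* IDENTIFIER                           // ++i
    ;

// Whitespace that may appear between the parts of an increment.
fragment
SPACE : ' ' | '\t' | '\r' | '\n';
```

Note that a rule like this would presumably also match the first part of i += 10 and leave the trailing 0 as a separate INTEGER_LITERAL, so you would still need a check that no digit follows the '1'. Cases like that, and i = i + 1, are where a parser becomes the more natural tool.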