Specifically, I am trying to implement a RegExp parser in ANTLR.
Here are the relevant parts of my grammar:
grammar JavaScriptRegExp;

options {
    language = 'CSharp3';
}

tokens {
    /* snip */
    QUESTION = '?';
    STAR = '*';
    PLUS = '+';
    L_CURLY = '{';
    R_CURLY = '}';
    COMMA = ',';
}

/* snip */

quantifier returns [Quantifier value]
    : q=quantifierPrefix QUESTION?
      {
          var quant = $q.value;
          quant.Eager = $QUESTION == null;
          return quant;
      }
    ;

quantifierPrefix returns [Quantifier value]
    : STAR { return new Quantifier { Min = 0 }; }
    | PLUS { return new Quantifier { Min = 1 }; }
    | QUESTION { return new Quantifier { Min = 0, Max = 1 }; }
    | L_CURLY min=DEC_DIGITS (COMMA max=DEC_DIGITS?)? R_CURLY
      {
          var minValue = int.Parse($min.Text);
          if ($COMMA == null)
          {
              return new Quantifier { Min = minValue, Max = minValue };
          }
          else if ($max == null)
          {
              return new Quantifier { Min = minValue, Max = null };
          }
          else
          {
              var maxValue = int.Parse($max.Text);
              return new Quantifier { Min = minValue, Max = maxValue };
          }
      }
    ;

DEC_DIGITS
    : ('0'..'9')+
    ;

/* snip */

CHAR
    : ~('^' | '$' | '\\' | '.' | '*' | '+' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '|')
    ;
Now, INSIDE of the curly braces, I would like to tokenize ‘,’ as COMMA, but OUTSIDE, I would like to tokenize it as CHAR.
Is this possible?
This is not the only case where this happens. I will run into the same problem in many other places (decimal digits, hyphens in character classes, etc.).
EDIT:
I now realize that this is called context-sensitive lexing. Is this possible with ANTLR?
It is possible to do this using gated semantic predicates in the lexer. If the COMMA rule is guarded by a boolean flag such as isComma, then ‘,’ will match COMMA only while isComma is true. Otherwise it will match CHAR, provided CHAR appears after COMMA in the grammar. I don’t know CSharp, so I can’t give a complete example.
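A minimal sketch of the gated-predicate approach, written against the question's CSharp3 target. It assumes that L_CURLY, R_CURLY and COMMA are moved out of the tokens { } section into real lexer rules, since the literal definitions there cannot carry actions or predicates. The isComma field and the exact form of the member declaration are assumptions and have not been checked against the C# target.

```
@lexer::members {
    // Assumed flag: true while the lexer is between '{' and '}'.
    private bool isComma = false;
}

// Entering and leaving a quantifier toggles the flag.
L_CURLY : '{' { isComma = true; } ;
R_CURLY : '}' { isComma = false; } ;

// Gated semantic predicate: this rule is only viable while isComma is true.
COMMA   : {isComma}?=> ',' ;

// CHAR must appear after COMMA so that COMMA wins inside the braces.
// Outside the braces COMMA is switched off and ',' falls through to CHAR.
CHAR
    : ~('^' | '$' | '\\' | '.' | '*' | '+' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '|')
    ;
```

With this in place, the input a,b{1,2} should produce CHAR for the first ‘,’ and COMMA for the second, while the parser rules from the question stay unchanged.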
Obviously, if curly braces are used in other contexts as well, this may not work. I recommend avoiding context-sensitive lexing like this unless handling the distinction in the parser would make a real mess of the grammar.
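For comparison, here is a sketch of the parser-side alternative: the lexer stays context-free and always emits ‘,’ as COMMA, and a parser rule accepts COMMA (and DEC_DIGITS) wherever an ordinary pattern character is allowed. The rule name literalChar and its string return value are hypothetical. The snipped parts of the grammar may already contain an equivalent rule that could be extended instead.

```
// Hypothetical rule: anything that should be treated as a plain character
// when it appears outside a quantifier.
literalChar returns [string value]
    : c=CHAR        { $value = $c.Text; }
    | c=COMMA       { $value = $c.Text; }   // ',' outside '{...}' is just a character
    | c=DEC_DIGITS  { $value = $c.Text; }   // a run of digits outside '{...}'
    ;
```

The cost is one extra alternative per ambiguous token, which is manageable for a few tokens but grows with every additional context (hyphens in character classes and so on). At that point the lexer flag may become the cleaner option.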