I’m writing a parser/interpreter for a C-like language and I need to interpret escaped characters. One of them is the Unicode escape sequence, which has the pattern “\uXXXX”, where each X is a hex digit.
My ANTLR rules look like this:
public char returns [char c]
: '\\"' { $c = '"'; }
| '\\\\' { $c = '\\'; }
| '\\/' { $c = '/'; }
| '\\b' { $c = '\b'; }
| '\\f' { $c = '\f'; }
| '\\n' { $c = '\n'; }
| '\\r' { $c = '\r'; }
| '\\t' { $c = '\t'; }
| '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT { $c = 'e'; }
| ~('\\' | '"') { $c = '/'; }
;
fragment HEXDIGIT
: ('0'..'9'|'a'..'f'|'A'..'F')
;
I’m feeding it the string “\u1234”, for which I expect an ‘e’, but I’m getting a ‘/’ instead, which is what the fallback alternative for everything else produces.
Is there some magic juju going on with fragments and rules or something that I’m not aware of?
As mentioned by Adam, char is a parser rule at the moment, but should be made a lexer rule instead, in which case you can’t let it return a char (lexer rules always return an instance of a Token!).
You can adjust the inner text of a token using its setText(...) method like this (assuming Java is the target language):
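The answer's own listing is not shown here, so what follows is only a rough sketch of the idea, not the original code. It assumes the Java target and a hypothetical helper class, EscapeDecoder (the name and its decode method are made up for illustration and are not part of ANTLR). The helper turns the raw text matched for one escape sequence into the character it stands for, which a lexer rule action could then pass to setText(...).

```java
// Hypothetical helper, not part of ANTLR: maps the raw text of a single
// escape sequence (as matched by the lexer) to the character it denotes.
final class EscapeDecoder {
    private EscapeDecoder() {}

    static String decode(String raw) {
        // Backslash-u followed by four hex digits: one UTF-16 code unit.
        // Assumes the lexer rule already guaranteed the four hex digits.
        if (raw.length() == 6 && raw.startsWith("\\u")) {
            int codeUnit = Integer.parseInt(raw.substring(2), 16);
            return String.valueOf((char) codeUnit);
        }
        // The simple two-character escapes from the grammar in the question.
        switch (raw) {
            case "\\\"": return "\"";
            case "\\\\": return "\\";
            case "\\/":  return "/";
            case "\\b":  return "\b";
            case "\\f":  return "\f";
            case "\\n":  return "\n";
            case "\\r":  return "\r";
            case "\\t":  return "\t";
            default:     return raw; // ordinary character, left unchanged
        }
    }
}
```

Under these assumptions, the action of the lexer rule could be something along the lines of setText(EscapeDecoder.decode(getText())), so the token's text becomes the decoded character instead of a placeholder such as 'e'.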