I have a really simple ANTLR grammar for parsing XML (HTML):
wiki: ggg+;
ggg: tag | text;
tag: '<' tx=TEXT { System.out.println($tx.getText()); } '>';
text: tx=TEXT { System.out.println($tx.getText()); };
CHAR: ~('<'|'>');
TEXT: CHAR+;
With input such as "<ggg> fff" it works fine.
But when whitespace gets involved, it fails. For example:
" <ggg> fff " - fails at the beginning
"<ggg> <hhh> " - fails after <ggg>
"<ggg> fff " - works fine
"<ggg> " - fails at the end
I don’t know what is wrong. Maybe there is some special grammar option to handle this. ANTLRWorks gives me NoViableAltException.
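To see which tokens each of these inputs would produce, here is a minimal plain-Java sketch of the tokenization. It assumes the lexer takes the longest match and, on a tie, prefers the rule defined first (CHAR before TEXT). The names LexerSketch and tokenize are hypothetical; this is not ANTLR API and not the generated lexer, only an illustration of that matching behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the grammar's lexer; NOT ANTLR-generated code.
final class LexerSketch {

    // Assumption: longest match wins; on a tie, the rule defined first (CHAR) wins.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '<' || c == '>') {
                // Literal tokens used by the parser rule `tag`.
                tokens.add("'" + c + "'");
                pos++;
                continue;
            }
            // Consume the longest run of characters other than '<' and '>'.
            int end = pos;
            while (end < input.length()
                    && input.charAt(end) != '<'
                    && input.charAt(end) != '>') {
                end++;
            }
            String text = input.substring(pos, end);
            // A run of length 1 matches both CHAR and TEXT; CHAR is defined first.
            String type = (text.length() == 1) ? "CHAR" : "TEXT";
            tokens.add(type + "(\"" + text + "\")");
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = {"<ggg> fff", " <ggg> fff ", "<ggg> <hhh> ", "<ggg> fff ", "<ggg> "};
        for (String in : inputs) {
            System.out.println("\"" + in + "\" -> " + tokenize(in));
        }
    }
}
```

Under those assumptions, every failing input contains a lone space next to a '<' or '>', which comes out as a CHAR token, while the two working inputs produce only TEXT tokens and the literals.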
ANTLR’s lexer rules match as much as possible. Only when two (or more) rules match the same number of characters does the rule defined first “win”. Because of that, a single character other than '<' and '>' is tokenized as a CHAR token, and not as a TEXT token, regardless of what the parser “needs” (the lexer operates independently from the parser, remember that!). Only two or more characters other than '<' and '>' are tokenized as a (single) TEXT token.
So the input " <ggg> fff " creates the following 5 tokens: CHAR (the leading space), '<', TEXT (ggg), '>' and TEXT (" fff ").
And since the token CHAR is not accounted for in your parser rule(s), the parse fails.
Simply remove CHAR and do: