I have really simple XML (HTML) parsing ANTLR grammar: wiki: ggg+; ggg: tag |

Asked: June 6, 2026 (2026-06-06T09:05:21+00:00)

I have really simple XML (HTML) parsing ANTLR grammar:

wiki: ggg+;

ggg: tag | text;

tag: '<' tx=TEXT { System.out.println($tx.getText()); } '>';

text: tx=TEXT { System.out.println($tx.getText()); };

CHAR: ~('<'|'>');
TEXT: CHAR+;

With input such as "<ggg> fff" it works fine.

But it fails once whitespace appears in certain positions. For example:

" <ggg> fff " – fails at the beginning
"<ggg> <hhh> " – fails after <ggg>
"<ggg> fff " – works fine
"<ggg> " – fails at end

I don’t know what is wrong. Maybe there is some special grammar option to handle this. ANTLRWorks gives me a NoViableAltException.

1 Answer

Editorial Team · Answer 1 · 2026-06-06T09:05:22+00:00

ANTLR’s lexer rules match as many characters as possible. When two or more rules match the same number of characters, the rule defined first “wins”. Because CHAR is defined before TEXT, a single character other than '<' or '>' is tokenized as a CHAR token and not as a TEXT token, regardless of what the parser “needs” (remember that the lexer operates independently of the parser). Only a run of two or more such characters is tokenized as a single TEXT token.

Therefore, the input " <ggg> fff " produces the following 5 tokens:

type    | text
--------+-----------
CHAR    |   ' '
'<'     |   '<'
TEXT    |   'ggg'
'>'     |   '>'
TEXT    |   ' fff '

Since none of your parser rules accepts a CHAR token, the parse fails.
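
To make the longest-match and first-rule-wins behaviour concrete, here is a minimal Java sketch that imitates it for the four token types in the question. It is a simplified model and not the lexer ANTLR actually generates; the class name LongestMatchDemo and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the lexer behaviour described above.
class LongestMatchDemo {
    // Rule names in definition order; on a tie in match length the earlier rule wins.
    static final String[] RULES = {"'<'", "'>'", "CHAR", "TEXT"};

    // Number of characters rule r matches at position pos (0 means no match).
    static int matchLength(int r, String in, int pos) {
        char c = in.charAt(pos);
        boolean plain = c != '<' && c != '>';
        switch (r) {
            case 0: return c == '<' ? 1 : 0;
            case 1: return c == '>' ? 1 : 0;
            case 2: return plain ? 1 : 0;   // CHAR: exactly one plain character
            default:                        // TEXT: CHAR+
                int end = pos;
                while (end < in.length() && in.charAt(end) != '<' && in.charAt(end) != '>') {
                    end++;
                }
                return end - pos;
        }
    }

    static List<String> tokenize(String in) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < in.length()) {
            int best = 0, bestLen = 0;
            for (int r = 0; r < RULES.length; r++) {
                int len = matchLength(r, in, pos);
                // Strictly longer only, so an earlier rule keeps the win on a tie.
                if (len > bestLen) { best = r; bestLen = len; }
            }
            tokens.add(RULES[best] + ":" + in.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(" <ggg> fff "));
    }
}
```

In this model the leading space matches both CHAR and TEXT with length 1, so CHAR wins because it is defined first. The same happens to the lone trailing space in "<ggg> ", which is why that input fails at the end.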

Simply remove the CHAR rule and define TEXT directly:

TEXT : ~('<'|'>')+;
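
As a rough check of what the single TEXT rule does to the failing inputs, the sketch below approximates the corrected token definitions with a Java regular expression. The class name FixedLexerDemo is hypothetical, and the regex only stands in for the lexer; it is not code produced by ANTLR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical approximation of the corrected lexer: '<', '>' and one TEXT rule.
class FixedLexerDemo {
    // '<' and '>' stay separate tokens; TEXT is any non-empty run of other characters.
    private static final Pattern TOKEN = Pattern.compile("<|>|[^<>]+");

    static List<String> tokenize(String in) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(in);
        while (m.find()) {
            String s = m.group();
            String type = (s.equals("<") || s.equals(">")) ? "'" + s + "'" : "TEXT";
            tokens.add(type + ":" + s);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(" <ggg> fff "));
        System.out.println(tokenize("<ggg> "));
    }
}
```

With no CHAR rule competing, a single space becomes a TEXT token like any longer run, so every token is one the parser rules tag and text can accept. Note that whitespace-only TEXT tokens will also reach the text rule and be printed by its action.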
