For a markup language I’m trying to parse, I decided to give parser generation

Question

Asked: June 15, 2026 (2026-06-15T09:52:33+00:00)

For a markup language I’m trying to parse, I decided to give parser generation a try with ANTLR. I’m new to the field, and I’m messing something up.

My grammar is

grammar Test;
DIGIT   :   ('0'..'9');
LETTER  :   ('A'..'Z');
SLASH   :   '/'; 
restriction
    :   ('E' ap)
    |   ('L' ap)
    |   'N';
ap  :   LETTER LETTER LETTER;
car :   LETTER LETTER;
fnum    :   DIGIT DIGIT DIGIT DIGIT? LETTER?;
flt :   car fnum?;
message :   'A' (SLASH flt)? (SLASH restriction)?;

This does exactly what I want when I give it the input string A/KK543/EPOS. When I give it A/KL543/EPOS, however, it fails with MismatchedTokenException(9!=5). It looks like some sort of conflict: the parser seems to want to start a restriction at the first L, so I’m presumably doing something wrong in the language definition, but I can’t work out what.

1 Answer

Editorial Team · Answer 1 · 2026-06-15T09:52:34+00:00

For the input "A/KK543/EPOS", the following tokens are created:

'A'        'A'
SLASH      '/'
LETTER     'K'
LETTER     'K'
DIGIT      '5'
DIGIT      '4'
DIGIT      '3'
SLASH      '/'
'E'        'E'
LETTER     'P'
LETTER     'O'
LETTER     'S'

But for the input "A/KL543/EPOS", these are created:

'A'        'A'
SLASH      '/'
LETTER     'K'
'L'        'L'
DIGIT      '5'
DIGIT      '4'
DIGIT      '3'
SLASH      '/'
'E'        'E'
LETTER     'P'
LETTER     'O'
LETTER     'S'

As you can see, the char 'L' does not get tokenized as a LETTER. For the literal tokens 'A', 'E', 'L' and 'N' inside your parser rules, ANTLR automatically creates separate lexer rules that are placed before all other lexer rules. This causes your lexer to look like this behind the scenes:

A      : 'A';
E      : 'E';
L      : 'L';
N      : 'N';
DIGIT  : '0'..'9';
LETTER : 'A'..'Z';
SLASH  : '/';
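
To make the effect of that rule order concrete, here is a minimal Python sketch of a "first matching rule wins" lexer for single characters. It is only a simplified model of the behaviour described above, not how ANTLR's generated lexer is actually implemented; the names RULES and tokenize are made up for this illustration.

```python
# Lexer rules in priority order. The implicit literal rules come first,
# mirroring the rule list above. This is an illustrative model only.
RULES = [
    ("'A'", lambda c: c == "A"),
    ("'E'", lambda c: c == "E"),
    ("'L'", lambda c: c == "L"),
    ("'N'", lambda c: c == "N"),
    ("DIGIT", lambda c: "0" <= c <= "9"),
    ("LETTER", lambda c: "A" <= c <= "Z"),
    ("SLASH", lambda c: c == "/"),
]


def tokenize(text):
    """Return (token_type, char) pairs; the first rule that matches wins."""
    tokens = []
    for ch in text:
        for name, matches in RULES:
            if matches(ch):
                tokens.append((name, ch))
                break
        else:
            raise ValueError(f"no lexer rule matches {ch!r}")
    return tokens


for token_type, ch in tokenize("A/KL543/EPOS"):
    print(f"{token_type:<10} {ch!r}")
```

Because the rule for 'L' sits above LETTER, the fourth character of "A/KL543/EPOS" comes out as 'L' rather than LETTER, which matches the second token listing above.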

Therefore, any single 'A', 'E', 'L' or 'N' will never become a LETTER token. This is simply how ANTLR works. If you want to match them as letters, you’ll need to create a parser rule letter and let it match these tokens too. Something like this:

message
 : A (SLASH flt)? (SLASH restriction)?
 ;

flt
 : car fnum?
 ;

fnum
 : DIGIT DIGIT DIGIT DIGIT? letter?
 ;

restriction
 : E ap
 | L ap
 | N
 ;

ap
 : letter letter letter
 ;

car
 : letter letter
 ;

letter
 : A
 | E
 | L
 | N
 | LETTER
 ;

A      : 'A';
E      : 'E';
L      : 'L';
N      : 'N';
DIGIT  : '0'..'9';
LETTER : 'A'..'Z';
SLASH  : '/';

which will parse the input "A/KL543/EPOS" like this:

[Image: parse tree for the input "A/KL543/EPOS"]
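
As a rough cross-check of why the letter rule helps, the sketch below is a small hand-written recognizer in Python that follows the parser rules message, flt, fnum, restriction, ap and car. It is not ANTLR output and does not use ANTLR's prediction algorithm; it simply tries the message with and without the optional flt part. The helper names (tokenize, make_recognizer, parses_original, parses_fixed) are hypothetical. The only difference between the two recognizers is which token types count as a letter.

```python
def tokenize(text):
    """Single-character lexer; the literals A, E, L, N get their own token types."""
    tokens = []
    for ch in text:
        if ch in "AELN":
            tokens.append(ch)
        elif "0" <= ch <= "9":
            tokens.append("DIGIT")
        elif "A" <= ch <= "Z":
            tokens.append("LETTER")
        elif ch == "/":
            tokens.append("SLASH")
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


def make_recognizer(letter_types):
    """Build a recognizer; letter_types is the set of token types usable as a letter."""

    def letters(toks, i, n):
        # Match n consecutive letters; return the new position or None.
        if i + n <= len(toks) and all(t in letter_types for t in toks[i:i + n]):
            return i + n
        return None

    def fnum(toks, i):
        # fnum : DIGIT DIGIT DIGIT DIGIT? letter?
        if toks[i:i + 3] != ["DIGIT"] * 3:
            return None
        i += 3
        if i < len(toks) and toks[i] == "DIGIT":
            i += 1
        if i < len(toks) and toks[i] in letter_types:
            i += 1
        return i

    def flt(toks, i):
        # flt : car fnum?   with   car : letter letter
        j = letters(toks, i, 2)
        if j is None:
            return None
        k = fnum(toks, j)
        return j if k is None else k

    def restriction(toks, i):
        # restriction : E ap | L ap | N   with   ap : letter letter letter
        if i < len(toks) and toks[i] in ("E", "L"):
            return letters(toks, i + 1, 3)
        if i < len(toks) and toks[i] == "N":
            return i + 1
        return None

    def slash_then(rule, toks, i):
        if i < len(toks) and toks[i] == "SLASH":
            return rule(toks, i + 1)
        return None

    def parses(text):
        # message : A (SLASH flt)? (SLASH restriction)?
        toks = tokenize(text)
        if not toks or toks[0] != "A":
            return False
        starts = [1]  # position after 'A', i.e. the optional flt part skipped
        after_flt = slash_then(flt, toks, 1)
        if after_flt is not None:
            starts.append(after_flt)
        for s in starts:
            if s == len(toks) or slash_then(restriction, toks, s) == len(toks):
                return True
        return False

    return parses


# Original grammar: only LETTER tokens count as letters.
parses_original = make_recognizer({"LETTER"})
# Revised grammar: the letter rule also accepts the A, E, L and N tokens.
parses_fixed = make_recognizer({"A", "E", "L", "N", "LETTER"})

print(parses_original("A/KK543/EPOS"))  # True
print(parses_original("A/KL543/EPOS"))  # False
print(parses_fixed("A/KL543/EPOS"))     # True
```

With only LETTER accepted, the 'L' token in "KL" cannot be part of car, so the second input is rejected; once letter also covers A, E, L and N, both inputs are accepted.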
