I’m trying to use Antlr to process a simple text file, mostly to re-learn

Asked: June 16, 2026 (2026-06-16T17:14:14+00:00)

I’m trying to use ANTLR to process a simple text file, mostly to re-learn grammar design.

Each line in the text file is composed of the keyword ‘BY: ’ followed by an EOL-terminated string; the file ends with a line of ‘-’ characters, like so:

BY: abc123@gmail.com
BY: myCrazy@#$%ID
BY: first_name second_name
-------------------

I defined my grammar as follows:

grammar authors;

prog    :   author+ DASHES;
author  :   BY STRING NEWLINE;

BY  :   'BY: ';
STRING  :   ('!'..'~')*;
NEWLINE :   '\r'? '\n' ;
DASHES  :   '-'+ NEWLINE;

This grammar recognizes the first and second authors but fails on the third because of the space. So I changed STRING to include a space, STRING : ('!'..'~'|' ')*, but then it stopped working altogether (it throws MissingTokenException).

I think it is because the STRING rule matches the entire line before the BY is matched. But then why does it work when the space is excluded from the STRING? Is there a way I can force the lexer to match the BY rule first?

In general, how can I consume a free-form, newline-terminated Unicode string (names can contain accented characters as well)?

Thanks!
P.S. I know it is easy to do this with Java, Perl, awk, etc.


1 Answer

Editorial Team · Answer 1 · 2026-06-16T17:14:16+00:00

In ANTLR, a lexer deals in characters and a parser deals in abstract tokens. So whenever you find yourself saying “start with characters ABC and read every character indiscriminately until characters XYZ”, you’re probably better off writing a lexer rule rather than a parser one because “every character” is meaningful to the lexer but not to the parser.

Along these lines, consider the similarity between the English definition of the author parser rule and the boilerplate lexer rule for a C++-style, single-line comment:

An author is some text that starts with ‘BY: ’ followed by every character until the end of the line.
A single-line comment is some text that starts with ‘//’ followed by every character until the end of the line.

A lexer rule for this kind of single-line comment generally follows this form:

SINGLE_LINE_COMMENT : '//' ~('\r'|'\n')*;

A lexer rule for an author line would look similar:

AUTHOR : 'BY: ' ~('\r'|'\n')*;

But this won’t work quite right because the AUTHOR token produced will start with “BY: ” and you only want what follows that. You can either trim the first characters off or, preferably, have the text separated to begin with, like so:

AUTHOR: BY RESTOFLINE; //TODO ignore BY

This separation can be done with lexer fragments:

AUTHOR  : BY RESTOFLINE; //TODO ignore BY

fragment BY :   'BY: ';
fragment RESTOFLINE  
        :   ~('\r'|'\n')*;

A lexer fragment behaves like a private lexer-level macro: it’s only “active” when it’s referenced in a lexer rule, and only a lexer rule can activate it. (A parser can reference a fragment by name, but it generally shouldn’t… but that’s a different topic.)

Now we just need AUTHOR tokens to contain only RESTOFLINE’s text. That’s easy enough with a lexer action:

    AUTHOR  : BY RESTOFLINE {setText($RESTOFLINE.text);};

Now after the AUTHOR rule has finished reading the RESTOFLINE fragment, setText is called to change the outgoing AUTHOR token’s text to that which came only from the RESTOFLINE fragment.
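As a rough plain-Java analogy (not how ANTLR implements it), the effect of BY RESTOFLINE plus setText resembles a regular expression with one capture group: the literal prefix is matched but discarded, and only the captured remainder is kept. The class and method names below are hypothetical and exist only for this sketch. In the Java regex, the negated class [^\r\n] excludes only the two line-break characters, so spaces and accented characters are kept as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AuthorLineSketch {
    // Group 1 plays the role of RESTOFLINE: every character up to,
    // but not including, the first carriage return or line feed.
    private static final Pattern AUTHOR = Pattern.compile("BY: ([^\\r\\n]*)");

    /** Returns the text after "BY: ", or null if the input does not start with an author line. */
    static String authorText(String input) {
        Matcher m = AUTHOR.matcher(input);
        // lookingAt() anchors the match at the start of the input only,
        // much like a lexer rule that starts at the current character.
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(authorText("BY: first_name second_name\n"));
        System.out.println(authorText("BY: Jos\u00e9 Mu\u00f1oz\r\n"));
        System.out.println(authorText("-------------------\n"));
    }
}
```

Returning group 1 instead of the whole match corresponds to the setText call: the prefix has to be consumed for the rule to apply, but it never becomes part of the token text.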

So after adapting the parser rules to accommodate the new lexer rules, you end up with a grammar like this:

grammar authors;

prog    :   author+ DASHES;
author  :   AUTHOR NEWLINE;


NEWLINE :   '\r'? '\n' ;
DASHES  :   '-'+ NEWLINE;

AUTHOR  : BY RESTOFLINE {setText($RESTOFLINE.text);};

fragment BY       
        :   'BY: ';
fragment RESTOFLINE  
        :   ~('\r'|'\n')*;

Here’s a quick test case:

Input

BY: abc123@gmail.com
BY: myCrazy@#$%ID
BY: first_name second_name
-------------------

Tokens Produced

[AUTHOR : abc123@gmail.com] [NEWLINE : ] [AUTHOR : myCrazy@#$%ID] [NEWLINE : ] [AUTHOR : first_name second_name] [NEWLINE : ] [DASHES : -------------------]
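To make the character-level work concrete, here is a minimal hand-written scanner in Java that mimics the three lexer rules above. It is only a sketch under stated assumptions: the class and method names are hypothetical, it is not the lexer ANTLR generates, and it renders the DASHES token without its trailing line break so the result can be compared with the listing above.

```java
import java.util.ArrayList;
import java.util.List;

class AuthorsLexerSketch {
    /** Scans the input character by character and renders each token as "[TYPE : text]". */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("BY: ", i)) {
                // AUTHOR: skip the "BY: " prefix, then take everything up to the line break.
                int start = i + 4;
                int end = start;
                while (end < input.length() && input.charAt(end) != '\r' && input.charAt(end) != '\n') {
                    end++;
                }
                tokens.add("[AUTHOR : " + input.substring(start, end) + "]");
                i = end;
            } else if (input.charAt(i) == '-') {
                // DASHES: one or more '-' characters followed by a line break.
                int end = i;
                while (end < input.length() && input.charAt(end) == '-') {
                    end++;
                }
                int nl = newlineLength(input, end);
                if (nl == 0) {
                    throw new IllegalArgumentException("expected a line break at index " + end);
                }
                // The line break is consumed but left out of the rendered text.
                tokens.add("[DASHES : " + input.substring(i, end) + "]");
                i = end + nl;
            } else {
                // NEWLINE: an optional carriage return followed by a line feed.
                int nl = newlineLength(input, i);
                if (nl == 0) {
                    throw new IllegalArgumentException("unexpected character at index " + i);
                }
                tokens.add("[NEWLINE : ]");
                i += nl;
            }
        }
        return tokens;
    }

    /** Length of a line break (LF or CR LF) starting at pos, or 0 if there is none. */
    static int newlineLength(String s, int pos) {
        if (pos < s.length() && s.charAt(pos) == '\n') {
            return 1;
        }
        if (pos + 1 < s.length() && s.charAt(pos) == '\r' && s.charAt(pos + 1) == '\n') {
            return 2;
        }
        return 0;
    }

    public static void main(String[] args) {
        String input = "BY: abc123@gmail.com\nBY: first_name second_name\n-----\n";
        System.out.println(String.join(" ", tokenize(input)));
    }
}
```

Every decision in this sketch is made by looking at individual characters, which is exactly the kind of work that belongs in lexer rules; the parser rules prog and author only ever see the finished AUTHOR, NEWLINE and DASHES tokens.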

I’m not sure how much this helps you with grammar design in general, but I hope it helps show the distinction between a token parser and a character parser/lexer, and some of the limitations of each.
