
The Archive Base


Editorial Team
Asked: June 16, 2026

I’m trying to use ANTLR to process a simple text file, mostly to re-learn grammar design.

Each line in the text file consists of the keyword ‘BY: ’ followed by an EOL-terminated string; the file ends with a line of ‘-’ characters, like so:

BY: abc123@gmail.com
BY: myCrazy@#$%ID
BY: first_name second_name
-------------------

I defined my grammar as follows:

grammar authors;

prog    :   author+ DASHES;
author  :   BY STRING NEWLINE;

BY  :   'BY: ';
STRING  :   ('!'..'~')*;
NEWLINE :   '\r'? '\n' ;
DASHES  :   '-'+ NEWLINE;

This grammar recognizes the first and second authors but fails on the third because of the space. So I changed STRING to include a space, STRING : ('!'..'~'|' ')*, but then it stopped working altogether (it throws MissingTokenException).

I think it is because the STRING rule matches the entire line before the BY is matched. But then why does it work when the space is excluded from the STRING? Is there a way I can force the lexer to match the BY rule first?

In general, how can I consume a free-form, newline-terminated Unicode string (names can contain accented characters as well)?

Thanks!
P.S. I know it is easy to do this with Java, Perl, awk, etc.

1 Answer

  1. Editorial Team
    Added an answer on June 16, 2026 at 5:14 pm

    In ANTLR, a lexer deals in characters and a parser deals in abstract tokens. So whenever you find yourself saying “start with characters ABC and read every character indiscriminately until characters XYZ”, you’re probably better off writing a lexer rule rather than a parser one because “every character” is meaningful to the lexer but not to the parser.

    Along these lines, consider the similarity between the English definition of the author parser rule and the boilerplate lexer rule for a C++-style, single-line comment:

    • An author is some text that starts with ‘BY: ‘ followed by every character until the end of the line.
    • A single-line comment is some text that starts with ‘//’ followed by every character until the end of the line.

    A lexer rule for this kind of single-line comment generally follows this form:

    SINGLE_LINE_COMMENT : '//' ~('\r'|'\n')*;
    

    A lexer rule for an author line would look similar:

    AUTHOR : 'BY: ' ~('\r'|'\n')*;
    

    But this won’t work quite right because the AUTHOR token produced will start with “BY: ” and you only want what follows that. You can either trim the first characters off or, preferably, have the text separated to begin with, like so:

    AUTHOR: BY RESTOFLINE; //TODO ignore BY
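
For comparison, the "trim the first characters off" option can be sketched outside ANTLR with a Python regular expression. This is only an analogy to the lexer rule, not how ANTLR implements it; the names `AUTHOR_LINE` and `author_text` are made up for this illustration. It shows that the raw match still carries the `BY: ` prefix and what removing it by hand looks like.

```python
import re

# Regex analogue of the lexer rule  AUTHOR : 'BY: ' ~('\r'|'\n')* ;
# (hypothetical name, used only for this sketch)
AUTHOR_LINE = re.compile(r"BY: [^\r\n]*")

def author_text(line):
    """Return the author part of a 'BY: ...' line, or None if it does not match."""
    match = AUTHOR_LINE.match(line)
    if match is None:
        return None
    raw = match.group(0)        # the whole match still starts with "BY: "
    return raw[len("BY: "):]    # trim the prefix by hand

raw_match = AUTHOR_LINE.match("BY: first_name second_name\n").group(0)
print(repr(raw_match))
print(repr(author_text("BY: first_name second_name\n")))
```

Trimming works, but every consumer of the token has to know how long the prefix is, which is why keeping the two parts separate from the start is the tidier choice.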
    

    This separation can be done with lexer fragments:

    AUTHOR  : BY RESTOFLINE; //TODO ignore BY
    
    fragment BY :   'BY: ';
    fragment RESTOFLINE  
            :   ~('\r'|'\n')*;
    

    A lexer fragment behaves like a private lexer-level macro: it’s only “active” when it’s referenced in a lexer rule, and only a lexer rule can activate it. (A parser can reference a fragment by name, but it generally shouldn’t… but that’s a different topic.)

    Now we just need AUTHOR tokens to contain only RESTOFLINE’s text. That’s easy enough with a lexer action:

    AUTHOR  : BY RESTOFLINE {setText($RESTOFLINE.text);};
    

    Once the AUTHOR rule has finished reading the RESTOFLINE fragment, setText replaces the outgoing AUTHOR token’s text with only the text that RESTOFLINE matched.
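
The same idea can be mimicked in plain Python, where named regex groups stand in for the BY and RESTOFLINE fragments and picking one group stands in for the setText action. This is a loose analogy under those assumptions, not ANTLR's mechanism; `AUTHOR` and `author_token` are hypothetical names for this sketch.

```python
import re

# Named groups play the role of the BY and RESTOFLINE fragments.
AUTHOR = re.compile(r"(?P<BY>BY: )(?P<RESTOFLINE>[^\r\n]*)")

def author_token(line):
    """Return a (type, text) pair whose text is only the RESTOFLINE part."""
    m = AUTHOR.match(line)
    if m is None:
        return None
    # Analogue of {setText($RESTOFLINE.text);}: keep one sub-match as the token text.
    return ("AUTHOR", m.group("RESTOFLINE"))

token = author_token("BY: myCrazy@#$%ID")
print(token)
```

As with fragments, the `BY` group never becomes a token of its own; it only exists inside the rule that uses it.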

    So after adapting the parser rules to accommodate the new lexer rules, you end up with a grammar like this:

    grammar authors;
    
    prog    :   author+ DASHES;
    author  :   AUTHOR NEWLINE;
    
    
    NEWLINE :   '\r'? '\n' ;
    DASHES  :   '-'+ NEWLINE;
    
    AUTHOR  : BY RESTOFLINE {setText($RESTOFLINE.text);};
    
    fragment BY       
            :   'BY: ';
    fragment RESTOFLINE  
            :   ~('\r'|'\n')*;
    

    Here’s a quick test case:

    Input

    BY: abc123@gmail.com
    BY: myCrazy@#$%ID
    BY: first_name second_name
    -------------------
    

    Tokens Produced

    [AUTHOR : abc123@gmail.com] [NEWLINE : ] [AUTHOR : myCrazy@#$%ID] [NEWLINE : ] [AUTHOR : first_name second_name] [NEWLINE : ] [DASHES : -------------------] 
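
To experiment with the token stream without running ANTLR, here is a minimal Python stand-in for the three lexer rules. It is a sketch built on Python's `re` module, not generated ANTLR code, and `TOKEN_RE` and `tokenize` are names invented for this example. Because `[^\r\n]` matches any character except a line break, accented names pass through unchanged, much as `~('\r'|'\n')` is meant to in the grammar.

```python
import re

# One regex alternative per lexer rule, tried in this order at each position.
TOKEN_RE = re.compile(
    r"BY: (?P<AUTHOR>[^\r\n]*)"   # AUTHOR: only the text after the "BY: " prefix is captured
    r"|(?P<DASHES>-+\r?\n)"       # DASHES: one or more '-' followed by a newline
    r"|(?P<NEWLINE>\r?\n)"        # NEWLINE
)

def tokenize(text):
    """Split text into (type, text) pairs; raise ValueError if no rule applies."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no rule matches at offset {pos}")
        kind = m.lastgroup                 # name of the group that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

sample = (
    "BY: abc123@gmail.com\n"
    "BY: myCrazy@#$%ID\n"
    "BY: first_name second_name\n"
    "-------------------\n"
)
for kind, text in tokenize(sample):
    print(kind, repr(text))
```

Unlike the token dump above, this sketch keeps the trailing newline in the DASHES text, since the DASHES rule in the grammar includes NEWLINE.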
    

    I’m not sure how much this helps you with grammar design in general, but I hope it shows the distinction between a token parser and a character parser/lexer, and some of the limitations of each.
