I’m trying to use ANTLR to process a simple text file, mostly to re-learn grammar design.
Each line in the text file is composed of the keyword ‘BY: ’ and an EOL-terminated string; the file ends with a series of ‘-’ characters, like so:
BY: abc123@gmail.com
BY: myCrazy@#$%ID
BY: first_name second_name
-------------------
I defined my grammar as follows:
grammar authors;
prog : author+ DASHES;
author : BY STRING NEWLINE;
BY : 'BY: ';
STRING : ('!'..'~')*;
NEWLINE : '\r'? '\n' ;
DASHES : '-'+ NEWLINE;
This grammar recognizes the first and second author but fails to recognize the third because of the space. So I changed STRING to include a space, STRING : ('!'..'~'|' ')*; but then it stopped working altogether (it throws MissingTokenException).
I think it is because the STRING rule matches the entire line before the BY is matched. But then why does it work when the space is excluded from the STRING? Is there a way I can force the lexer to match the BY rule first?
In general, how can I consume a free-form, newline-terminated Unicode string (names can have accented characters as well)?
Thanks!
P.S. I know it is easy to do this with Java, Perl, awk, etc.
In ANTLR, a lexer deals in characters and a parser deals in abstract tokens. So whenever you find yourself saying “start with characters ABC and read every character indiscriminately until characters XYZ”, you’re probably better off writing a lexer rule rather than a parser one because “every character” is meaningful to the lexer but not to the parser.
Along these lines, consider the similarity between the English definition of the author parser rule and that of a C++-style, single-line comment:
author is some text that starts with ‘BY: ’ followed by every character until the end of the line.
A single-line comment is some text that starts with ‘//’ followed by every character until the end of the line.
A lexer rule for this kind of single-line comment generally matches the literal prefix and then any run of characters other than ‘\r’ and ‘\n’. A lexer rule for an author line would look similar.
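As a rough sketch of the two rules being compared, assuming ANTLR 3 syntax: the rule name SINGLE_COMMENT is an assumption chosen for illustration, while AUTHOR is the token name used in the rest of this answer.

```antlr
// Sketch only, ANTLR 3 syntax assumed.
// SINGLE_COMMENT is a hypothetical rule name used for comparison.

// '//' followed by every character up to (not including) the line break
SINGLE_COMMENT : '//' ~('\r' | '\n')* ;

// 'BY: ' followed by every character up to (not including) the line break
AUTHOR : 'BY: ' ~('\r' | '\n')* ;
```

Because ~('\r' | '\n') excludes only the two line-break characters, spaces and accented characters fall inside the match, which is what the free-form author string needs.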
But this won’t work quite right because the AUTHOR token produced will start with “BY: ” and you only want what follows that. You can either trim the first characters off or, preferably, have the text separated to begin with, so that the ‘BY: ’ prefix and the rest of the line are matched as two distinct pieces. This separation can be done with lexer fragments.
A lexer fragment behaves like a private lexer-level macro: it’s only “active” when it’s referenced in a lexer rule, and only a lexer rule can activate it. (A parser can reference a fragment by name, but it generally shouldn’t… but that’s a different topic.)
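A minimal sketch of that separation, again assuming ANTLR 3 syntax. RESTOFLINE is the fragment name used in this answer; turning BY into a fragment is an assumption made here for illustration.

```antlr
// Sketch only, ANTLR 3 syntax assumed.

// Fragments never become tokens on their own; they only take part
// in the lexer rules that reference them.
fragment BY         : 'BY: ' ;               // assumed fragment for the keyword
fragment RESTOFLINE : ~('\r' | '\n')* ;      // everything up to the line break

// The only token the parser sees for an author line.
AUTHOR : BY RESTOFLINE ;
```

With BY demoted to a fragment, the lexer no longer has to choose between a BY token and a competing free-text token at the start of a line; AUTHOR is the only rule that can start with ‘BY: ’.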
Now we just need AUTHOR tokens to contain only RESTOFLINE’s text. That’s easy enough with a lexer action: after the AUTHOR rule has finished reading the RESTOFLINE fragment, setText is called to change the outgoing AUTHOR token’s text to that which came only from the RESTOFLINE fragment.
The parser rules then need adapting to accommodate the new lexer rules: author no longer matches BY and STRING separately but a single AUTHOR token.
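Putting the pieces together, the complete grammar might look like the sketch below. It assumes ANTLR 3 with the Java target (setText is a method on the generated lexer), and the label r on the RESTOFLINE reference is an assumption introduced so the action can refer to that fragment's text. The parser rules and the NEWLINE and DASHES tokens are carried over from the question.

```antlr
// Sketch only: ANTLR 3 syntax with the Java target assumed.
grammar authors;

// Parser rules: an author is now a single AUTHOR token plus the line break.
prog   : author+ DASHES ;
author : AUTHOR NEWLINE ;

// Lexer fragments: usable only from other lexer rules.
fragment BY         : 'BY: ' ;
fragment RESTOFLINE : ~('\r' | '\n')* ;

// Match the whole line, then keep only the text captured by RESTOFLINE.
// The label 'r' is an assumption used to reference the fragment's text.
AUTHOR  : BY r=RESTOFLINE { setText($r.text); } ;

NEWLINE : '\r'? '\n' ;
DASHES  : '-'+ NEWLINE ;
```

With the sample input from the question, each AUTHOR token's text should be the part of the line after ‘BY: ’, for example first_name second_name for the third line, spaces included.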
Here’s a quick test case:
Input
Tokens Produced
I’m not sure how much this helps you with grammar design in general, but I hope it helps show the distinction between a token parser and a character parser/lexer, and a little of the limitations of each.