First question stream Hello everyone, This could be a follow-up on this question: Antlr

Question

0

Editorial Team

Asked: May 22, 20262026-05-22T21:24:48+00:00 2026-05-22T21:24:48+00:00

First question stream Hello everyone, This could be a follow-up on this question: Antlr

0

First question stream

Hello everyone,

This could be a follow-up on this question: Antlr rule priorities

I’m trying to write an ANTLR grammar for the reStructuredText markup language.

The main problem I’m facing is : “How to match any sequence of characters (regular text) without masking other grammar rules?”

Let’s take an example with a paragraph with inline markup:

In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element 
*before* the insert point. The variable ``after_ptr`` points to the element *after* the 
insert. In other words, we are going to put our new element **in between** ``before_ptr`` 
and ``after_ptr``.

I thought that writing rules for inline markup text would be easy. So I wrote a simple grammar:

grammar Rst;

options {
    output=AST;
    language=Java;
    backtrack=true;
    //memoize=true;
}

@members {
boolean inInlineMarkup = false;
}

// PARSER

text
    : inline_markup (WS? inline_markup)* WS? EOF
    ;


inline_markup
@after {
inInlineMarkup = false;
}
    : {!inInlineMarkup}? (emphasis|strong|litteral|link)
    ;

emphasis
@init {
inInlineMarkup = true;
}
    : '*' (~'*')+ '*' {System.out.println("emphasis: " + $text);}
    ;

strong
@init {
inInlineMarkup = true;
}
    : '**' (~'*')+ '**' {System.out.println("bold: " + $text);}
    ;

litteral
@init {
inInlineMarkup = true;
}
    : '``' (~'`')+ '``' {System.out.println("litteral: " + $text);}
    ;

link
@init {
inInlineMarkup = true;
}
    : inline_internal_target
    | footnote_reference
    | hyperlink_reference
    ;

inline_internal_target
    : '_`' (~'`')+ '`' {System.out.println("inline_internal_target: " + $text);}
    ;

footnote_reference
    : '[' (~']')+ ']_' {System.out.println("footnote_reference: " + $text);}
    ;


hyperlink_reference
    : ~(' '|'\t'|'\u000C'|'_')+ '_' {System.out.println("hyperlink_reference: " + $text);}
    |   '`' (~'`')+ '`_' {System.out.println("hyperlink_reference (long): " + $text);}
    ;

// LEXER

WS  
  : (' '|'\t'|'\u000C')+
  ; 

NEWLINE
  : '\r'? '\n'
  ;

This simple grammar doesn’t work. And I didn’t even try to match regular text…

My questions:

Could someone point to my errors and maybe give me a hint on how to match regular text?
Is there a way to set priority on the grammar rules? Maybe this could be a lead.

Thanks in advance for your help 🙂

Robin

Second question stream

Thank you very much for your help! I would have had a hard time figuring my errors… I’m not writing that grammar (only) to learn ANTLR, I’m trying to code an IDE plugin for eclipse. And for that, I need a grammar 😉

I managed to go further in the grammar and wrote a text rule:

grammar Rst;

options {
    output=AST;
    language=Java;
}



@members {
boolean inInlineMarkup = false;
}

//////////////////
// PARSER RULES //
//////////////////

file
  : line* EOF
  ;


line
  : text* NEWLINE
  ;

text
    : inline_markup
    | normal_text
    ;

inline_markup
@after {
inInlineMarkup = false;
}
    : {!inInlineMarkup}? {inInlineMarkup = true;} 
  (
  | STRONG
  | EMPHASIS
  | LITTERAL
  | INTERPRETED_TEXT
  | SUBSTITUTION_REFERENCE
  | link
  )
    ;


link
    : INLINE_INTERNAL_TARGET
    | FOOTNOTE_REFERENCE
    | HYPERLINK_REFERENCE
    ;

normal_text
  : {!inInlineMarkup}? 
   ~(EMPHASIS
      |SUBSTITUTION_REFERENCE
      |STRONG
      |LITTERAL
      |INTERPRETED_TEXT
      |INLINE_INTERNAL_TARGET
      |FOOTNOTE_REFERENCE
      |HYPERLINK_REFERENCE
      |NEWLINE
      )
  ;
//////////////////
// LEXER TOKENS //
//////////////////

EMPHASIS
    : STAR ANY_BUT_STAR+ STAR {System.out.println("EMPHASIS: " + $text);}
    ;

SUBSTITUTION_REFERENCE
  : PIPE ANY_BUT_PIPE+ PIPE  {System.out.println("SUBST_REF: " + $text);}
  ;

STRONG
    : STAR STAR ANY_BUT_STAR+ STAR STAR {System.out.println("STRONG: " + $text);}
    ;

LITTERAL
    : BACKTICK BACKTICK ANY_BUT_BACKTICK+ BACKTICK BACKTICK {System.out.println("LITTERAL: " + $text);}
    ;
INTERPRETED_TEXT
  : BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("LITTERAL: " + $text);}
  ;

INLINE_INTERNAL_TARGET
    : UNDERSCORE BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("INLINE_INTERNAL_TARGET: " + $text);}
    ;

FOOTNOTE_REFERENCE
    : L_BRACKET ANY_BUT_BRACKET+ R_BRACKET UNDERSCORE {System.out.println("FOOTNOTE_REFERENCE: " + $text);}
    ;


HYPERLINK_REFERENCE
  : BACKTICK ANY_BUT_BACKTICK+ BACKTICK UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (long): " + $text);}
  | ANY_BUT_ENDLINK+ UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (short): " + $text);}
  ;

WS  
  : (' '|'\t')+ {$channel=HIDDEN;}
  ; 

NEWLINE
  : '\r'? '\n' {$channel=HIDDEN;}
  ;


///////////////
// FRAGMENTS //
///////////////

fragment ANY_BUT_PIPE
  : ESC PIPE
  | ~(PIPE|'\n'|'\r')
  ;
fragment ANY_BUT_BRACKET
  : ESC R_BRACKET
  | ~(R_BRACKET|'\n'|'\r')
  ;
fragment ANY_BUT_STAR
  : ESC STAR
  | ~(STAR|'\n'|'\r')
  ;
fragment ANY_BUT_BACKTICK
  : ESC BACKTICK
  | ~(BACKTICK|'\n'|'\r')
  ;
fragment ANY_BUT_ENDLINK
  : ~(UNDERSCORE|' '|'\t'|'\n'|'\r')
  ;



fragment ESC
  : '\\'
  ;
fragment STAR
  : '*'
  ;
fragment BACKTICK
  : '`'
  ;
fragment PIPE
  : '|'
  ;
fragment L_BRACKET
  : '['
  ;
fragment R_BRACKET
  : ']'
  ;
fragment UNDERSCORE
  : '_'
  ;

The grammar is working fine for inline_markup but normal_text is not matched.

Here is my test class:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.tree.Tree;

public class Test {

    public static void main(String[] args) throws RecognitionException, IOException {

        InputStream is = Test.class.getResourceAsStream("test.rst");
        Reader r = new InputStreamReader(is);
        StringBuilder source = new StringBuilder();
        char[] buffer = new char[1024];
        int readLenght = 0;
        while ((readLenght = r.read(buffer)) > 0) {
            if (readLenght < buffer.length) {
                source.append(buffer, 0, readLenght);
            } else {
                source.append(buffer);
            }
        }
        r.close();
        System.out.println(source.toString());

        ANTLRStringStream in = new ANTLRStringStream(source.toString());
        RstLexer lexer = new RstLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        RstParser parser = new RstParser(tokens);
        RstParser.file_return out = parser.file();
        System.out.println(((Tree)out.getTree()).toStringTree());
    }
}

And the input file I use:

In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element 
*before* the insert point. The variable ``after_ptr`` points to the |element| *after* the 
insert. In other words, `we are going`_ to put_ our new element **in between** ``before_ptr`` 
and ``after_ptr``.

And I get this output:

HYPERLINK_REFERENCE (short): 7-6`_
line 1:2 mismatched character ' ' expecting '_'
line 1:10 mismatched character ' ' expecting '_'
line 1:18 mismatched character ' ' expecting '_'
line 1:21 mismatched character ' ' expecting '_'
line 1:26 mismatched character ' ' expecting '_'
line 1:37 mismatched character ' ' expecting '_'
LITTERAL: `before_ptr`
line 1:86 no viable alternative at character '\r'
line 1:55 mismatched character ' ' expecting '_'
line 1:60 mismatched character ' ' expecting '_'
line 1:63 mismatched character ' ' expecting '_'
line 1:70 mismatched character ' ' expecting '_'
line 1:73 mismatched character ' ' expecting '_'
line 1:77 mismatched character ' ' expecting '_'
line 1:85 mismatched character ' ' expecting '_'
EMPHASIS: *before*
line 2:12 mismatched character ' ' expecting '_'
line 2:19 mismatched character ' ' expecting '_'
line 2:26 mismatched character ' ' expecting '_'
LITTERAL: `after_ptr`
line 2:30 mismatched character ' ' expecting '_'
line 2:39 mismatched character ' ' expecting '_'
line 2:90 no viable alternative at character '\r'
line 2:60 mismatched character ' ' expecting '_'
line 2:63 mismatched character ' ' expecting '_'
line 2:67 mismatched character ' ' expecting '_'
line 2:77 mismatched character ' ' expecting '_'
line 2:85 mismatched character ' ' expecting '_'
line 2:89 mismatched character ' ' expecting '_'
line 3:7 mismatched character ' ' expecting '_'
line 3:10 mismatched character ' ' expecting '_'
line 3:16 mismatched character ' ' expecting '_'
line 3:23 mismatched character ' ' expecting '_'
line 3:27 mismatched character ' ' expecting '_'
line 3:31 mismatched character ' ' expecting '_'
line 3:42 mismatched character ' ' expecting '_'
line 3:51 mismatched character ' ' expecting '_'
line 3:55 mismatched character ' ' expecting '_'
line 3:63 mismatched character ' ' expecting '_'
line 3:94 mismatched character '\r' expecting '*'
line 4:3 mismatched character ' ' expecting '_'
line 4:18 no viable alternative at character '\r'
line 4:18 mismatched character '\r' expecting '_'
HYPERLINK_REFERENCE (short): oing`_
HYPERLINK_REFERENCE (short): ut_
EMPHASIS: *in between*
LITTERAL: `after_ptr`
BR.recoverFromMismatchedToken
line 0:-1 mismatched input '<EOF>' expecting NEWLINE
null

Can you point to my error(s)? (the parser works for inline markup without errors when I add the filter=true; option to the grammar)

Robin

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-22T21:24:48+00:00

Here’s a quick demo how you could parse this reStructeredText. Note that it just handles a minor set of all available markup-syntax, and by adding more to it, you will affect the existing parser/lexer rules: so there is much, much more work to be done!

Demo

grammar RST;

options {
  output=AST;
  backtrack=true;
  memoize=true;
}

tokens {
  ROOT;
  PARAGRAPH;
  INDENTATION;
  LINE;
  WORD;
  BOLD;
  ITALIC;
  INTERPRETED_TEXT;
  INLINE_LITERAL;
  REFERENCE;
}

parse
  :  paragraph+ EOF -> ^(ROOT paragraph+)
  ;

paragraph
  :  line+ -> ^(PARAGRAPH line+)
  |  Space* LineBreak -> /* omit line-breaks between paragraphs from AST */
  ;

line
  :  indentation text+ LineBreak -> ^(LINE text+)
  ;

indentation
  :  Space* -> ^(INDENTATION Space*)
  ;

text
  :  styledText
  |  interpretedText
  |  inlineLiteral
  |  reference
  |  Space
  |  Star
  |  EscapeSequence
  |  Any
  ;

styledText
  :  bold
  |  italic
  ;

bold
  :  Star Star boldAtom+ Star Star -> ^(BOLD boldAtom+)
  ;  

italic
  :  Star italicAtom+ Star -> ^(ITALIC italicAtom+)
  ;

boldAtom
  :  ~(Star | LineBreak)
  |  italic
  ;

italicAtom
  :  ~(Star | LineBreak)
  |  bold
  ;

interpretedText
  :  BackTick interpretedTextAtoms BackTick -> ^(INTERPRETED_TEXT interpretedTextAtoms)
  ;

interpretedTextAtoms
  :  ~BackTick+
  ;

inlineLiteral
  :  BackTick BackTick inlineLiteralAtoms BackTick BackTick -> ^(INLINE_LITERAL inlineLiteralAtoms)
  ;

inlineLiteralAtoms
  :  inlineLiteralAtom+
  ;

inlineLiteralAtom
  :  ~BackTick
  |  BackTick ~BackTick
  ;

reference
  :  Any+ UnderScore -> ^(REFERENCE Any+)
  ;

UnderScore
  :  '_'
  ;

BackTick
  :  '`'
  ;

Star
  :  '*'
  ;

Space
  :  ' ' 
  |  '\t'
  ;

EscapeSequence
  :  '\\' ('\\' | '*')
  ;

LineBreak
  :  '\r'? '\n'
  |  '\r'
  ;

Any
  :  .
  ;

When you generate a parser and lexer from the above, and let it parse the following input file:

***x*** **yyy** *zz* *
a b c

P2 ``*a*`b`` `q`
Python_

(note the trailing line break!)

the parser will produce the following AST:

enter image description here

EDIT

The graph can be created by running this class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source =
        "***x*** **yyy** *zz* *\n" +
        "a b c\n" +
        "\n" +
        "P2 ``*a*`b`` `q`\n" +
        "Python_\n";
    RSTLexer lexer = new RSTLexer(new ANTLRStringStream(source));
    RSTParser parser = new RSTParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

or if your source comes from a file, do:

RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst"));

or

RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst", "???"));

where "???" is the encoding of your file.

The class above will print the AST as a DOT file to the console. You can use a DOT viewer to display the AST. In this case, I posted an image created by kgraphviewer. But there are many more viewers around. A nice online one is this one, which appears to be using kgraphviewer under “the hood”. Good luck!

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

First question stream Hello everyone, This could be a follow-up on this question: Antlr

First question stream

Second question stream

Leave an answerCancel reply

1 Answer

Demo

EDIT

Leave an answer
Cancel reply