Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6723939
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 26, 20262026-05-26T09:37:44+00:00 2026-05-26T09:37:44+00:00

I’ve been working on a parser for some template language embeded in HTML (FreeMarker),

  • 0

I’ve been working on a parser for some template language embeded in HTML (FreeMarker), piece of example here:

${abc}
<html> 
<head> 
  <title>Welcome!</title> 
</head> 
<body> 
  <h1> 
    Welcome ${user}<#if user == "Big Joe">, our beloved 
leader</#if>! 
  </h1> 
  <p>Our latest product: 
  <a href="${latestProduct}">${latestProduct}</a>! 
</body> 
</html>

The template language is between some specific tags, e.g. ‘${‘ ‘}’, ‘<#’ ‘>’. Other raw texts in between can be treated like as the same tokens (RAW).

The key point here is that the same text, e.g. an integer, will mean differently thing for the parser depends on whether it’s between those tags or not, and thus needs to be treated as different tokens.

I’ve tried with the following ugly implementation, with a self-defined state to indicate whether it’s in those tags. As you see, I have to check the state almost in every rule, which drives me crazy…

I also thinked about the following two solutions:

  1. Use multiple lexers. I can switch between two lexers when inside/outside those tags. However, the document for this is poor for ANTLR3. I don’t know how to let one parser share two different lexers and switch between them.

  2. Move the RAW rule up, after the NUMERICAL_ESCAPE rule. Check the state there, if it’s in the tag, put back the token and continue trying the left rules. This would save lots of state checking. However, I don’t find any ‘put back’ function and ANTLR complains about some rules can never be matched…

Is there an elegant solution for this?

grammar freemarker_simple;

@lexer::members {
int freemarker_type = 0;
}

expression
    :   primary_expression ;

primary_expression
    :   number_literal | identifier | parenthesis | builtin_variable
    ;

parenthesis
    :   OPEN_PAREN expression CLOSE_PAREN ;

number_literal
    :   INTEGER | DECIMAL
    ;

identifier
    :   ID
    ;

builtin_variable
    :   DOT ID
    ;

string_output
    :   OUTPUT_ESCAPE expression CLOSE_BRACE
    ;

numerical_output
    :   NUMERICAL_ESCAPE expression  CLOSE_BRACE
    ;

if_expression
    :   START_TAG IF expression DIRECTIVE_END optional_block
        ( START_TAG ELSE_IF expression loose_directive_end optional_block )*
        ( END_TAG ELSE optional_block )?
        END_TAG END_IF
    ;

list    :   START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;

for_each
    :   START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;

loose_directive_end
    :   ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;

freemarker_directive
    :   ( if_expression | list | for_each  ) ;
content :   ( RAW |  string_output | numerical_output | freemarker_directive ) + ;
optional_block
    :   ( content )? ;

root    :   optional_block EOF  ;

START_TAG
    :   '<#'
        { freemarker_type = 1; }
    ;

END_TAG :   '</#'
        { freemarker_type = 1; }
    ;

DIRECTIVE_END
    :   '>'
        {
        if(freemarker_type == 0) $type=RAW;
        freemarker_type = 0;
        }
    ;
EMPTY_DIRECTIVE_END
    :   '/>'
        {
        if(freemarker_type == 0) $type=RAW;
        freemarker_type = 0;
        }
    ;

OUTPUT_ESCAPE
    :   '${'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;
NUMERICAL_ESCAPE
    :   '#{'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;

IF  :   'if'
        { if(freemarker_type == 0) $type=RAW; }
    ;
ELSE    :   'else' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ; 
ELSE_IF :   'elseif'
        { if(freemarker_type == 0) $type=RAW; }
    ; 
LIST    :   'list'
        { if(freemarker_type == 0) $type=RAW; }
    ; 
FOREACH :   'foreach'
        { if(freemarker_type == 0) $type=RAW; }
    ; 
END_IF  :   'if' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ; 
END_LIST
    :   'list' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ; 
END_FOREACH
    :   'foreach' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ;


FALSE: 'false' { if(freemarker_type == 0) $type=RAW; };
TRUE: 'true' { if(freemarker_type == 0) $type=RAW; };
INTEGER: ('0'..'9')+ { if(freemarker_type == 0) $type=RAW; };
DECIMAL: INTEGER '.' INTEGER { if(freemarker_type == 0) $type=RAW; };
DOT: '.' { if(freemarker_type == 0) $type=RAW; };
DOT_DOT: '..' { if(freemarker_type == 0) $type=RAW; };
PLUS: '+' { if(freemarker_type == 0) $type=RAW; };
MINUS: '-' { if(freemarker_type == 0) $type=RAW; };
TIMES: '*' { if(freemarker_type == 0) $type=RAW; };
DIVIDE: '/' { if(freemarker_type == 0) $type=RAW; };
PERCENT: '%' { if(freemarker_type == 0) $type=RAW; };
AND: '&' | '&&' { if(freemarker_type == 0) $type=RAW; };
OR: '|' | '||' { if(freemarker_type == 0) $type=RAW; };
EXCLAM: '!' { if(freemarker_type == 0) $type=RAW; };
OPEN_PAREN: '(' { if(freemarker_type == 0) $type=RAW; };
CLOSE_PAREN: ')' { if(freemarker_type == 0) $type=RAW; };
OPEN_BRACE
    :   '{'
    { if(freemarker_type == 0) $type=RAW; }
    ;
CLOSE_BRACE
    :   '}'
    {
        if(freemarker_type == 0) $type=RAW;
        if(freemarker_type == 2) freemarker_type = 0;
    }
    ;
IN: 'in' { if(freemarker_type == 0) $type=RAW; };
AS: 'as' { if(freemarker_type == 0) $type=RAW; };
ID  :   ('A'..'Z'|'a'..'z')+
    //{ if(freemarker_type == 0) $type=RAW; }
    ;

BLANK   :   ( '\r' | ' ' | '\n' | '\t' )+
    {
        if(freemarker_type == 0) $type=RAW;
        else $channel = HIDDEN;
    }
    ;

RAW
    :   .
    ;

EDIT

I found the problem similar to How do I lex this input? , where a “start condition” is needed. But unfortunately, the answer uses a lot of predicates as well, just like my states.

Now, I tried to move the RAW higher with a predicate. Hoping to eliminate all the state checks after RAW rule. However, my example input failed, the first line end is recogonized as BLANK instead of RAW it should be.

I guess something wrong is about the rule priority:
After CLOSE_BRACE is matched, the next token is matched from rules after the CLOSE_BRACE rule, rather than start from the begenning again.

Any way to resolve this?

New grammar below with some debug outputs:

grammar freemarker_simple;

@lexer::members {
int freemarker_type = 0;
}

expression
    :   primary_expression ;

primary_expression
    :   number_literal | identifier | parenthesis | builtin_variable
    ;

parenthesis
    :   OPEN_PAREN expression CLOSE_PAREN ;

number_literal
    :   INTEGER | DECIMAL
    ;

identifier
    :   ID
    ;

builtin_variable
    :   DOT ID
    ;

string_output
    :   OUTPUT_ESCAPE expression CLOSE_BRACE
    ;

numerical_output
    :   NUMERICAL_ESCAPE expression  CLOSE_BRACE
    ;

if_expression
    :   START_TAG IF expression DIRECTIVE_END optional_block
        ( START_TAG ELSE_IF expression loose_directive_end optional_block )*
        ( END_TAG ELSE optional_block )?
        END_TAG END_IF
    ;

list    :   START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;

for_each
    :   START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;

loose_directive_end
    :   ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;

freemarker_directive
    :   ( if_expression | list | for_each  ) ;
content :   ( RAW |  string_output | numerical_output | freemarker_directive ) + ;
optional_block
    :   ( content )? ;

root    :   optional_block EOF  ;

START_TAG
    :   '<#'
        { freemarker_type = 1; }
    ;

END_TAG :   '</#'
        { freemarker_type = 1; }
    ;

OUTPUT_ESCAPE
    :   '${'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;
NUMERICAL_ESCAPE
    :   '#{'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;
RAW
    :
        { freemarker_type == 0 }?=> .
        {System.out.printf("RAW \%s \%d\n",getText(),freemarker_type);}
    ;

DIRECTIVE_END
    :   '>'
        { if(freemarker_type == 1) freemarker_type = 0; }
    ;
EMPTY_DIRECTIVE_END
    :   '/>'
        { if(freemarker_type == 1) freemarker_type = 0; }
    ;

IF  :   'if'

    ;
ELSE    :   'else' DIRECTIVE_END

    ; 
ELSE_IF :   'elseif'

    ; 
LIST    :   'list'

    ; 
FOREACH :   'foreach'

    ; 
END_IF  :   'if' DIRECTIVE_END
    ; 
END_LIST
    :   'list' DIRECTIVE_END
    ; 
END_FOREACH
    :   'foreach' DIRECTIVE_END
    ;


FALSE: 'false' ;
TRUE: 'true' ;
INTEGER: ('0'..'9')+ ;
DECIMAL: INTEGER '.' INTEGER ;
DOT: '.' ;
DOT_DOT: '..' ;
PLUS: '+' ;
MINUS: '-' ;
TIMES: '*' ;
DIVIDE: '/' ;
PERCENT: '%' ;
AND: '&' | '&&' ;
OR: '|' | '||' ;
EXCLAM: '!' ;
OPEN_PAREN: '(' ;
CLOSE_PAREN: ')' ;
OPEN_BRACE
    :   '{'
    ;
CLOSE_BRACE
    :   '}'
    { if(freemarker_type == 2) {freemarker_type = 0;} }
    ;
IN: 'in' ;
AS: 'as' ;
ID  :   ('A'..'Z'|'a'..'z')+
    { System.out.printf("ID \%s \%d\n",getText(),freemarker_type);}
    ;

BLANK   :   ( '\r' | ' ' | '\n' | '\t' )+
    {
        System.out.printf("BLANK \%d\n",freemarker_type);
        $channel = HIDDEN;
    }
    ;

My input results with the output:

ID abc 2
BLANK 0  <<< incorrect, should be RAW when state==0
RAW < 0  <<< correct
ID html 0 <<< incorrect, should be RAW RAW RAW RAW
RAW > 0

EDIT2

Also tried the 2nd approach with Bart’s grammar, still didn’t work the ‘html’ is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn’t RAW get matched first? Or the lexer still chooses the longest match here?

grammar freemarker_bart;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@parser::members {

  // merge a given list of tokens into a single AST
  private CommonTree merge(List tokenList) {
    StringBuilder b = new StringBuilder();
    for(int i = 0; i < tokenList.size(); i++) {
      Token token = (Token)tokenList.get(i);
      b.append(token.getText());
    }
    return new CommonTree(new CommonToken(RAW, b.toString()));
  }
}

@lexer::members {
  private boolean mmode = false;
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

raw_block
  :  (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)* 
  ;

atom
  :  STRING
  |  ID
  ;

// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

RAW           : {!mmode}?=> . ;

// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END    : '}' {mmode=false;};
TAG_END       : '>' {mmode=false;};

// valid tokens only when in "markup mode"
EQUALS        : '==';
IF            : 'if';
STRING        : '"' ~'"'* '"';
ID            : ('a'..'z' | 'A'..'Z')+;
SPACE         : (' ' | '\t' | '\r' | '\n')+ {skip();};
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-26T09:37:45+00:00Added an answer on May 26, 2026 at 9:37 am

    You could let lexer rules match using gated semantic predicates where you test for a certain boolean expression.

    A little demo:

    freemarker_simple.g

    grammar freemarker_simple;
    
    options {
      output=AST;
      ASTLabelType=CommonTree;
    }
    
    tokens {
      FILE;
      OUTPUT;
      RAW_BLOCK;
    }
    
    @parser::members {
    
      // merge a given list of tokens into a single AST
      private CommonTree merge(List tokenList) {
        StringBuilder b = new StringBuilder();
        for(int i = 0; i < tokenList.size(); i++) {
          Token token = (Token)tokenList.get(i);
          b.append(token.getText());
        }
        return new CommonTree(new CommonToken(RAW, b.toString()));
      }
    }
    
    @lexer::members {
      private boolean mmode = false;
    }
    
    parse
      :  content* EOF -> ^(FILE content*)
      ;
    
    content
      :  (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
      |  if_stat
      |  output
      ;
    
    if_stat
      :  TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
      ;
    
    output
      :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
      ;
    
    raw_block
      :  (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
      ;
    
    expression
      :  eq_expression
      ;
    
    eq_expression
      :  atom (EQUALS^ atom)* 
      ;
    
    atom
      :  STRING
      |  ID
      ;
    
    // these tokens denote the start of markup code (sets mmode to true)
    OUTPUT_START  : '${'  {mmode=true;};
    TAG_START     : '<#'  {mmode=true;};
    TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
    
    // these tokens denote the end of markup code (sets mmode to false)
    OUTPUT_END    : {mmode}?=> '}' {mmode=false;};
    TAG_END       : {mmode}?=> '>' {mmode=false;};
    
    // valid tokens only when in "markup mode"
    EQUALS        : {mmode}?=> '==';
    IF            : {mmode}?=> 'if';
    STRING        : {mmode}?=> '"' ~'"'* '"';
    ID            : {mmode}?=> ('a'..'z' | 'A'..'Z')+;
    SPACE         : {mmode}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};
    
    RAW           : . ;
    

    which parses your input:

    test.html

    ${abc}
    <html> 
    <head> 
      <title>Welcome!</title> 
    </head> 
    <body> 
      <h1> 
        Welcome ${user}<#if user == "Big Joe">, our beloved leader</#if>! 
      </h1> 
      <p>Our latest product: <a href="${latestProduct}">${latestProduct}</a>!</p>
    </body> 
    </html>
    

    into the following AST:

    enter image description here

    as you can test yourself with the class:

    Main.java

    import org.antlr.runtime.*;
    import org.antlr.runtime.tree.*;
    import org.antlr.stringtemplate.*;
    
    public class Main {
      public static void main(String[] args) throws Exception {
        freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRFileStream("test.html"));
        freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
        CommonTree tree = (CommonTree)parser.parse().getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
      }
    }
    

    EDIT 1

    When I run your example input with a parser generated from the second grammar you posted, the following are wthe first 5 lines being printed to the console (not counting the many warnings that are generated):

    ID abc 2
    RAW 
     0
    RAW < 0
    ID html 0
    ...
    

    EDIT 2

    Bood wrote:

    Also tried the 2nd approach with Bart’s grammar, still didn’t work the ‘html’ is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn’t RAW get matched first? Or the lexer still chooses the longest match here?

    Yes, that is correct: ANTLR chooses the longer match in that case.

    But now that I (finally :)) see what you’re trying to do, here’s a last proposal: you could let the RAW rule match characters as long as the rule can’t see one of the following character sequences ahead: "<#", "</#" or "${". Note that the rule must still stay at the end in the grammar. This check is performed inside the lexer. Also, in that case you don’t need the merge(...) method in the parser:

    grammar freemarker_simple;
    
    options {
      output=AST;
      ASTLabelType=CommonTree;
    }
    
    tokens {
      FILE;
      OUTPUT;
      RAW_BLOCK;
    }
    
    @lexer::members {
      
      private boolean mmode = false;
      
      private boolean rawAhead() {
        if(mmode) return false;
        int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
        return !(
            (ch1 == '<' && ch2 == '#') ||
            (ch1 == '<' && ch2 == '/' && ch3 == '#') ||
            (ch1 == '$' && ch2 == '{')
        );
      }
    }
    
    parse
      :  content* EOF -> ^(FILE content*)
      ;
    
    content
      :  RAW
      |  if_stat
      |  output
      ;
    
    if_stat
      :  TAG_START IF expression TAG_END RAW TAG_END_START IF TAG_END -> ^(IF expression RAW)
      ;
    
    output
      :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
      ;
    
    expression
      :  eq_expression
      ;
    
    eq_expression
      :  atom (EQUALS^ atom)*
      ;
    
    atom
      :  STRING
      |  ID
      ;
    
    OUTPUT_START  : '${'  {mmode=true;};
    TAG_START     : '<#'  {mmode=true;};
    TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
    
    OUTPUT_END    : '}' {mmode=false;};
    TAG_END       : '>' {mmode=false;};
    
    EQUALS        : '==';
    IF            : 'if';
    STRING        : '"' ~'"'* '"';
    ID            : ('a'..'z' | 'A'..'Z')+;
    SPACE         : (' ' | '\t' | '\r' | '\n')+ {skip();};
    
    RAW           : ({rawAhead()}?=> . )+;
    

    The grammar above will produce the following AST from the input posted at the start of this answer:

    enter image description here

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

link Im having trouble converting the html entites into html characters, (&# 8217;) i
I'm trying to decode HTML entries from here NYTimes.com and I cannot figure out
i got an object with contents of html markup in it, for example: string
I'm working with an upstream system that sometimes sends me text destined for HTML/XML
I have a jquery bug and I've been looking for hours now, I can't
I have just tried to save a simple *.rtf file with some websites and
For some reason, after submitting a string like this Jack’s Spindle from a text
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this
I have this code: - (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock { NSString *someString = [[NSString
I have thousands of HTML files to process using Groovy/Java and I need to

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.