I am trying to parse C++/Java style source files and would like to isolate


Asked: May 25, 2026


I am trying to parse C++/Java style source files and would like to isolate comments, string literals, and whitespace as tokens.

For whitespace and comments, the commonly suggested solution is (using an ANTLR grammar):

// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' (options {greedy=false;}: .)* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' (options {greedy=false;}: .)* '\r'? '\n' {$channel=HIDDEN;};

The problem is that my source files also contain string literals, e.g.

printf("   /* something looks like comment and whitespace \n");
printf("    something looks like comment and whitespace */ \n");

Everything inside each pair of "" should be a single token, but my ANTLR lexer rules will obviously treat the following as an ML_COMMENT token:

    /* something looks like comment and whitespace \n");
printf("    something looks like comment and whitespace */

But I cannot simply add another lexer rule that defines a token as everything inside a pair of " (assuming the \" escape sequence is handled properly), because the following would then be erroneously treated as a string token:

/*  comment...."comment that looks */   /*like a string literal"...more comment */

In short, the two pairs /**/ and "" will interfere with one another, because each can contain the start of the other as valid content. So how should I define a lexer grammar that handles both cases?


1 Answer

Editorial Team · Answer 1 · 2026-05-25T19:19:18+00:00

JavaMan wrote:

I am trying to parse C++/Java style source files and would like to isolate comment, string literal, and whitespace as tokens.

Shouldn’t you match char literals as well? Consider:

char c = '"';

The double quote should not be considered as the start of a string literal!
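To see why the char literal matters, here is a small sketch in plain Java using `java.util.regex` rather than ANTLR. The class name `CharLiteralDemo`, the two patterns, and the `firstMatch` helper are all made up for this illustration, and the patterns only approximate real string and char literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharLiteralDemo {

    // A string literal only: '"', then escapes or ordinary chars, then '"'.
    public static final Pattern STRING_ONLY =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\\\r\\n])*\"");

    // Same string literal, but a single-char literal is tried as well.
    public static final Pattern CHAR_OR_STRING =
            Pattern.compile("'(?:\\\\.|[^'\\\\\\r\\n])'|\"(?:\\\\.|[^\"\\\\\\r\\n])*\"");

    /** Returns the text of the first match of p in src, or null if there is none. */
    public static String firstMatch(Pattern p, String src) {
        Matcher m = p.matcher(src);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String src = "char c = '\"'; String s = \"x\";";
        // Without a char rule, the quote inside '"' opens a bogus string.
        System.out.println(firstMatch(STRING_ONLY, src));    // prints "'; String s = "
        // With a char rule, the char literal is recognised first.
        System.out.println(firstMatch(CHAR_OR_STRING, src)); // prints '"'
    }
}
```

With only a string rule, the double quote inside the char literal is taken as the start of a string, which then runs up to the opening quote of the real string literal.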

JavaMan wrote:

In short, the 2 pairs /**/ and “” will interfere with one another.

Err, no. If a /* is “seen” first, it would consume all the way to the first */. For input like:

/*  comment...."comment that looks like a string literal"...more comment */

this would mean the double quotes are also consumed. The same for string literals: when a double quote is seen first, the /* and/or */ would be consumed until the next (un-escaped) " is encountered.
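The "whichever opener is seen first wins" behaviour can be sketched with a hand-written scanner in plain Java. This is only an illustration of the idea, not what ANTLR generates: the class `FirstSeenScanner` and its methods are hypothetical, whitespace is lumped into OTHER, and the CHAR branch does not check that the literal holds exactly one character.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstSeenScanner {

    /** Splits src into "TYPE:text" entries; unrecognised chars become one-char OTHER tokens. */
    public static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            int end;
            String type;
            if (src.startsWith("/*", i)) {
                // Comment opener seen first: any quotes inside are plain content.
                int close = src.indexOf("*/", i + 2);
                end = (close < 0) ? src.length() : close + 2;
                type = "ML_COMMENT";
            } else if (src.startsWith("//", i)) {
                // Line comment: runs up to, but not including, the line break.
                end = i + 2;
                while (end < src.length() && src.charAt(end) != '\n' && src.charAt(end) != '\r') {
                    end++;
                }
                type = "SL_COMMENT";
            } else if (src.charAt(i) == '"' || src.charAt(i) == '\'') {
                // Quote seen first: any comment markers inside are plain content.
                end = endOfQuoted(src, i);
                type = (src.charAt(i) == '"') ? "STRING" : "CHAR";
            } else {
                end = i + 1;
                type = "OTHER";
            }
            tokens.add(type + ":" + src.substring(i, end));
            i = end;
        }
        return tokens;
    }

    /** Returns the index just past the closing quote, skipping backslash escapes. */
    private static int endOfQuoted(String src, int start) {
        char quote = src.charAt(start);
        int i = start + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') { i += 2; continue; } // skip the escaped character
            if (c == quote) return i + 1;
            i++;
        }
        return src.length(); // unterminated literal: consume the rest
    }

    public static void main(String[] args) {
        for (String t : scan("/* a \"b */ \"c /* d\"")) {
            if (!t.startsWith("OTHER")) System.out.println(t);
        }
        // prints:
        // ML_COMMENT:/* a "b */
        // STRING:"c /* d"
    }
}
```

Because the scanner always continues from the end of the previous token, a quote inside a comment or a comment marker inside a string is never examined as a possible token start.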

Or did I misunderstand?

Note that you can drop the options {greedy=false;}: that precedes .* or .+ in your grammar: in ANTLR 3 these are non-greedy by default.

Here’s a way:

grammar T;

parse
  :  (t=. 
       {
         if($t.type != OTHER) {
           System.out.printf("\%-10s >\%s<\n", tokenNames[$t.type], $t.text);
         }
       }
     )+
     EOF
  ;

ML_COMMENT
  :  '/*' .* '*/'
  ;

SL_COMMENT
  :  '//' ~('\r' | '\n')*
  ;

STRING
  :  '"' (STR_ESC | ~('\\' | '"' | '\r' | '\n'))* '"'
  ;

CHAR
  :  '\'' (CH_ESC | ~('\\' | '\'' | '\r' | '\n')) '\''
  ;

SPACE
  :  (' ' | '\t' | '\r' | '\n')+
  ;

OTHER
  :  . // fall-through rule: matches any char if none of the above matched
  ;

fragment STR_ESC
  :  '\\' ('\\' | '"' | 't' | 'n' | 'r') // add more: Unicode escapes, ...
  ;

fragment CH_ESC
  :  '\\' ('\\' | '\'' | 't' | 'n' | 'r') // add more: Unicode escapes, Octal, ...
  ;

which can be tested with:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = 
        "String s = \" foo \\t /* bar */ baz\";\n" +
        "char c = '\"'; // comment /* here\n" +
        "/* multi \"no string\"\n" +
        "   line */";
    System.out.println(source + "\n-------------------------");
    TLexer lexer = new TLexer(new ANTLRStringStream(source));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

If you run the class above, the following is printed to the console:

String s = " foo \t /* bar */ baz";
char c = '"'; // comment /* here
/* multi "no string"
   line */
-------------------------

SPACE      > <
SPACE      > <
SPACE      > <
STRING     >" foo \t /* bar */ baz"<
SPACE      >
<
SPACE      > <
SPACE      > <
SPACE      > <
CHAR       >'"'<
SPACE      > <
SL_COMMENT >// comment /* here<
SPACE      >
<
ML_COMMENT >/* multi "no string"
   line */<
