I am trying to parse C++/Java style source files and would like to isolate comments, string literals, and whitespace as tokens.
For whitespace and comments, the commonly suggested solution (as an ANTLR grammar) is:
// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' (options {greedy=false;}: .)* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' (options {greedy=false;}: .)* '\r'? '\n' {$channel=HIDDEN;};
But the problem is that my source files also contain string literals, e.g.
printf(" /* something looks like comment and whitespace \n");
printf(" something looks like comment and whitespace */ \n");
The whole thing inside "" should be considered a single token, but my ANTLR lexer rules will obviously treat the following as an ML_COMMENT token:
/* something looks like comment and whitespace \n");
printf(" something looks like comment and whitespace */
But I cannot simply add another lexer rule that defines a token as everything inside a pair of " characters (assuming the \" escape sequence is handled properly), because part of the following would then erroneously be considered a string token:
/* comment...."comment that looks */ /*like a string literal"...more comment */
In short, the two pairs /* */ and "" will interfere with one another, because each can contain the start of the other as valid content. So how should we define a lexer grammar that handles both cases?
Shouldn't you match char literals as well? Consider a char literal that contains a double quote, such as '"'.
The double quote should not be considered as the start of a string literal!
Err, no. If a /* is "seen" first, it would consume all the way to the first */. For input like the comment in your last example, this would mean the double quotes are also consumed. The same goes for string literals: when a double quote is seen first, any /* and/or */ would be consumed until the next (un-escaped) " is encountered.
Or did I misunderstand?
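To illustrate the "whichever delimiter is seen first wins" behaviour, here is a minimal hand-written scanner in Java. It is only a sketch and not ANTLR-generated code: the class name FirstDelimiterWins, the tokenize method, and the TYPE:text output format are all made up for this illustration, and anything that is not a comment, literal, or whitespace is emitted one character at a time as OTHER.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstDelimiterWins {

    /** Splits src into tokens, each rendered as "TYPE:text" (hypothetical format). */
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            int start = i;
            String type;
            char c = src.charAt(i);
            if (src.startsWith("/*", i)) {
                // Block comment: runs to the first "*/"; quotes inside are just content.
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? src.length() : end + 2;
                type = "ML_COMMENT";
            } else if (src.startsWith("//", i)) {
                // Line comment: runs up to, but not including, the line break.
                while (i < src.length() && src.charAt(i) != '\n' && src.charAt(i) != '\r') {
                    i++;
                }
                type = "SL_COMMENT";
            } else if (c == '"' || c == '\'') {
                // String or char literal: runs to the next un-escaped matching quote;
                // "/*" and "*/" inside are just content.
                i++;
                while (i < src.length() && src.charAt(i) != c) {
                    if (src.charAt(i) == '\\') {
                        i++; // skip the escaped character
                    }
                    i++;
                }
                i = Math.min(i + 1, src.length()); // include the closing quote, if any
                type = (c == '"') ? "STRING" : "CHAR";
            } else if (Character.isWhitespace(c)) {
                while (i < src.length() && Character.isWhitespace(src.charAt(i))) {
                    i++;
                }
                type = "WS";
            } else {
                i++;
                type = "OTHER";
            }
            tokens.add(type + ":" + src.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A comment containing quotes, then a string containing comment markers.
        String input = "/* a \"b */ /* c\" d */ s = \" /* not a comment */ \";";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

Because the scanner always decides at the current position and then consumes to the matching terminator, the comment and string rules never get a chance to interfere with each other. The ANTLR lexer behaves the same way in this respect.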
Note that you can drop the options {greedy=false;}: from your grammar before .* or .+, which are ungreedy by default.
Here's a way:
which can be tested with:
If you run the class above, the following is printed to the console: