I’m struggling to write a rule that will match the various comment forms and also report "unended" (unterminated) comment errors.
This is for a language based on Pascal. Comments can be of the following forms:
(* ...with any characters within... *)
(*
* separated onto multiple lines
*)
(* they can contain "any" symbol, so -, +, :, ; , etc. should be ignored *)
but I need to catch any comment errors, like:
(* this comment has no closing r-parenthesis *
or
(* this comment is missing an asterisk )
I have this so far:
%{
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int yylval;
vector<string> string_table;
int string_table_index = 0;
int yyline = 1, yycolumn = 1;
%}
delim [ \t\n]
ws {delim}+
letter [a-zA-Z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+
float {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws} {yycolumn += yyleng;}
"(*" {
    int c;
    yycolumn += yyleng;
    while ((c = yyinput()) != '*' && c != EOF) {
        c = yyinput(); /* read additional text */
        if (c == '*') {
            while ((c = yyinput()) == '*') {
                c = yyinput();
                if (c == ')') {
                    break; /* found the end */
                } else if (c == EOF) {
                    cout << "EOF in comment\n";
                    break;
                } else {
                    cout << "unended comment, line = "
                         << yyline << ", column = "
                         << yycolumn-yyleng << "\n";
                }
            }
        }
    }
}
- it’s not catching the last parenthesis (always prints out RPARENtoken!),
- it’s not ignoring all the characters inside the comment (i.e., prints MINUStoken for "-"),
- it can’t catch comments on multiple lines,
- I’m not sure it’s catching unended comment errors correctly.
I think I’m close… can anyone see where I went wrong?
Consider using start conditions to avoid having to write all that extra code in the (* pattern. I’ve written a short example below.
Basically, when the lexer finds the beginning of a comment, it enters the COMMENT state and will only check the rules within the <COMMENT> block. When it finds *), it returns to the initial state. Note that if you plan on using multiple states, it’d probably be better to use yy_push_state(COMMENT) and yy_pop_state() instead of BEGIN(STATENAME); yy_pop_state() takes no argument, and both functions require %option stack.
I’m not entirely sure what your criteria for comment errors are (e.g., how it’s different from encountering an EOF in a comment), but this can likely be expanded to handle those cases as well.
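A minimal sketch of the start-condition approach might look like the following. It assumes a stand-alone flex file compiled as C++. The comment_line and comment_column variables, the catch-all rule at the end, and the wording of the error message are placeholders I made up for illustration, so adapt them to your own token rules. It uses plain BEGIN rather than the state stack to keep things short.

```lex
%{
/* Sketch only. comment_line, comment_column and the error text are
   placeholders, not part of the original lexer. */
#include <iostream>

int yyline = 1, yycolumn = 1;
int comment_line = 0, comment_column = 0;   /* where the open comment began */
%}

%option noyywrap
%x COMMENT

%%

"(*"             { comment_line = yyline; comment_column = yycolumn;
                   yycolumn += yyleng;
                   BEGIN(COMMENT); }

<COMMENT>"*)"    { yycolumn += yyleng; BEGIN(INITIAL); }
<COMMENT>\n      { ++yyline; yycolumn = 1; }
<COMMENT>[^*\n]+ { yycolumn += yyleng; }
<COMMENT>"*"     { yycolumn += yyleng; }
<COMMENT><<EOF>> { std::cout << "unended comment, line = " << comment_line
                             << ", column = " << comment_column << "\n";
                   yyterminate(); }

\n               { ++yyline; yycolumn = 1; }
[ \t]+           { yycolumn += yyleng; }
.                { yycolumn += yyleng; /* real token rules would go here */ }

%%

int main() { return yylex(); }
```

Because COMMENT is declared with %x (exclusive), none of the ordinary token rules are active inside a comment, so characters such as - or ) are consumed silently and comments can span lines. Note that under this sketch a comment like (* this comment is missing an asterisk ) is only reported if no later *) appears before the end of the file; otherwise everything up to that *) is treated as comment text. Assuming the file is saved as lexer.l (a hypothetical name), something like flex lexer.l followed by g++ lex.yy.c -o lexer should build it.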