My grammar is producing an unexpected result. I am not sure whether it is a bug in my grammar or an issue with how ANTLR handles ambiguous alternatives.
Here is my grammar:
grammar PPMacro;
options {
language=Java;
backtrack=true;
}
file: (inputLines)+ EOF;
inputLines
: ( preprocessorLineSet | oneNormalInputLine ) ;
oneNormalInputLine @after{System.out.print("["+$text+"]");}
: (any_token_except_crlf)* CRLF ;
preprocessorLineSet
: ifPart endifLine;
ifPart: ifLine inputLines* ;
ifLine @after{System.out.print("{"+$text+"}" );}
: '#' IF (any_token_except_crlf)* CRLF ;
endifLine @after{System.out.print("{"+$text+"}" );}
: '#' ENDIF (any_token_except_crlf)* CRLF ;
any_token_except_crlf: (ANY_ID | WS | '#'|IF|ENDIF);
// just matches everything
CRLF: '\r'? '\n' ;
WS: (' '|'\t'|'\f' )+;
Hash: '#' ;
IF : 'if' ;
ENDIF : 'endif' ;
ANY_ID: ( 'a'..'z'|'A'..'Z'|'0'..'9'| '_')+ ;
Explanation:
It is for parsing a C++ #if … #endif block.
I am trying to recognize nested #if … #endif blocks. This is done by my preprocessorLineSet rule, which contains a recursive definition to support nested blocks. oneNormalInputLine handles anything that is not of the #if form. This rule is a match-anything rule and can actually match a #if line too, but I deliberately put it after preprocessorLineSet in inputLines. I’m expecting this ordering to prevent it from matching a #if or #endif line. The reason for using a catch-all rule is that I want a rule that accepts any other C++ syntax and simply echoes it back to the output.
In my test, I just print out everything. Lines matched by preprocessorLineSet should be surrounded by {}, while those matched by oneNormalInputLine should be surrounded by [].
Sample inputs :
#if s
s
#if a
s
s
#endif
#endif
and this
#if
abc
#endif
The corresponding outputs:
[#if s
][s
][#if a
][s
][s
][#endif
][#endif
]
and this
[#if
][abc
][#endif
]
Problem:
All the output lines, including the #if and #endif lines, are surrounded by [], meaning they are matched ONLY by oneNormalInputLine! I did not expect this: preprocessorLineSet should be able to match the #if lines. Why do I get this result?
This line contains ambiguous alternatives:
inputLines : ( preprocessorLineSet | oneNormalInputLine );
since both can match the #if and #endif lines. But I expect the first alternative to be used rather than the latter one. Also note that backtracking is on.
EDIT
The reason my oneNormalInputLine rule accepts everything is that it is difficult to write a rule for "a line that does not have a specific pattern", because the #if pattern can be rather complicated:
/***
comments
*/ # /***
comments
*/ if
is a valid pattern. Writing a rule not having this pattern seems difficult.
Your approach is not really robust. I’d suggest keeping it simple and using the actual language rule, which says that every line that begins with # is a preprocessor directive, and every line that doesn’t begin with # isn’t. A grammar built on this rule would have no ambiguity and would be much simpler to understand.
Now why doesn’t your grammar work? The problem is that your preprocessorLineSet rule can’t match anything.
It starts by matching #if ..., then should match other lines, and when the first matching #endif comes, it should match it and finish. However, it doesn’t actually do that. inputLines can match pretty much any line (pretty much: it won’t match e.g. C++ operators and other non-identifiers), including all preprocessor directives. That means the ifPart rule will match to the end of the input and there will be no endifLine left. Note that backtracking has no effect on this, because once ANTLR matches a rule (in this case ifPart, which will succeed on the whole rest of the input, since * is greedy), it will never backtrack into it. ANTLR’s rules for backtracking are hairy…
Note that if you made oneNormalInputLine not match preprocessor directives (e.g. something like (nonHash any* | ) CRLF), it would start to work.
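As a rough sketch of that last suggestion (untested, and assuming the rest of the grammar from the question stays unchanged), the catch-all rule could be restricted so that it never starts with '#'. The nonHash rule name is my own invention for illustration; it is not part of the original grammar.

```antlr
// Sketch only: replaces oneNormalInputLine from the question.
// A normal line is either empty or starts with a token other than '#'.
oneNormalInputLine @after{System.out.print("["+$text+"]");}
    : (nonHash (any_token_except_crlf)*)? CRLF ;

// Hypothetical helper rule (not in the original grammar):
// every token from any_token_except_crlf except '#'.
nonHash
    : ANY_ID | WS | IF | ENDIF ;
```

With this change a line starting with '#' can no longer be consumed by inputLines inside ifPart unless it begins a nested #if, so the loop should stop at the matching #endif and leave it for endifLine. Two caveats under these assumptions: a directive with leading whitespace before the '#' would still be treated as a normal line, and other directives such as #define or #else are matched by neither alternative, so they would need a rule of their own.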