I have the following grammar:
cmds
    : cmd+
    ;
cmd
    : include_cmd | other_cmd
    ;
include_cmd
    : INCLUDE DOUBLE_QUOTE FILE_NAME DOUBLE_QUOTE
    ;
other_cmd
    : CMD_NAME ARG+
    ;
INCLUDE
    : '#include'
    ;
DOUBLE_QUOTE
    : '"'
    ;
CMD_NAME
    : ('a'..'z')*
    ;
ARG
    : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
    ;
FILE_NAME
    : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.')+
    ;
The differences between CMD_NAME, ARG and FILE_NAME are small: CMD_NAME must consist of lowercase letters only, ARG can also contain uppercase letters, digits and "_", and FILE_NAME can additionally contain ".".
But there is a problem: when I test the rule with #include "abc", abc is interpreted as CMD_NAME instead of FILE_NAME, which leads to a parsing error. I think this is because CMD_NAME comes before FILE_NAME in the grammar file.
Do I have to rely on a technique such as predicates to deal with this? Is there a pure EBNF solution that does not rely on the host programming language?
Thanks.
The set of all valid CMD_NAMEs intersects with the set of all valid FILE_NAMEs: input abc qualifies as both. When more than one lexer rule matches the same input, the lexer uses the rule listed first (as you suspected), so abc becomes a CMD_NAME.
It depends on what you're willing to accept in your grammar. Consider changing your include_cmd rule to something more conventional, where the quoted text is matched as a single STRING token. Input #include "abc" then turns into tokens [INCLUDE : #include] [STRING : abc].
I don't think the grammar should be responsible for determining whether a file name is valid or not: a valid file name doesn't imply a valid file, and the grammar would have to understand OS file naming conventions (valid characters, paths, etc.) that probably have no bearing on the grammar itself. I think you'll be fine if you're willing to drop rule FILE_NAME in favor of a STRING rule like this.
Also worth noting: your CMD_NAME rule matches zero-length input. Consider changing ('a'..'z')* to ('a'..'z')+ unless a CMD_NAME really can be empty.
Keep in mind, too, that you'll have the same problem with ARG that you did with FILE_NAME. It's listed after CMD_NAME, so any input that qualifies for both rules (like abc again) will hit CMD_NAME. Consider breaking these rules up into more conventional ones, and adding a rule SEMI to mark the end of a command. Otherwise the parser won't know if input a b c d is supposed to be one command with three arguments (a(b,c,d)) or two commands with one argument each (a(b), c(d)).
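
The suggestions above could be combined roughly as in the grammar below. This is only a sketch under a few assumptions, not the exact rules from this answer: the grammar name Cmds, the single ID token (used for both command names and arguments, so the lexer never has to choose between overlapping rules), and the precise STRING definition are all illustrative. Whitespace handling is left out, as in the question's grammar, because the syntax for skipping tokens differs between ANTLR versions.

```antlr
grammar Cmds; // hypothetical grammar name

cmds
    : cmd+
    ;

cmd
    : include_cmd
    | other_cmd
    ;

// The file name is just a quoted string; whether it names a real
// file is checked after parsing, not by the grammar.
include_cmd
    : INCLUDE STRING
    ;

// The first ID is the command name, the remaining IDs are arguments.
// SEMI ends the command, so "a b c d;" is one command with three
// arguments, while "a b; c d;" is two commands with one argument each.
other_cmd
    : ID ID+ SEMI
    ;

INCLUDE
    : '#include'
    ;

// Everything between two double quotes on one line. In this sketch the
// quotes stay part of the token text; removing them would need an
// action or a post-processing step.
STRING
    : '"' ~('"' | '\r' | '\n')* '"'
    ;

// One token for both command names and arguments (assumption):
// no two lexer rules can match the same input any more.
ID
    : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
    ;

SEMI
    : ';'
    ;
```

With a single ID token, the "command names are lowercase only" restriction no longer lives in the lexer. If it matters, it would have to be checked after parsing, which is the trade-off for keeping the grammar itself free of overlapping token rules.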