I have created a very simple SQL parser, but during fuzz testing I’ve come across this situation:
SELECT 123 + ,
K_SELECT INTEGER T_PLUS T_COMMA
Of course this is a syntax error, but I don’t know how to “catch” it.
How does it decide between “next_column_expression came too early” and “binary_expression didn’t finish”? I’ve worked with ANTLR3 a fair bit on a Java project, but this is totally different.
Here are the skeleton parser rules:
/* be more verbose about error messages */
%error-verbose
/* keywords */
%token K_CREATE
%token K_FROM
%token K_INTEGER
%token K_SELECT
%token K_TABLE
%token K_TEXT
%token K_WHERE
%token K_VALUES
%token K_INSERT
%token K_INTO
/* variable tokens */
%token IDENTIFIER
%token INTEGER
/* fixed tokens */
%token T_ASTERISK
%token T_PLUS
%token T_EQUALS
%token T_END ";"
%token T_COMMA
%token T_BRACKET_OPEN
%token T_BRACKET_CLOSE
%token END 0 "end of file"
%%
input:
statement {
}
END
;
statement:
select_statement {
}
|
create_table_statement {
}
|
insert_statement {
}
;
keyword:
K_CREATE | K_FROM | K_INTEGER | K_SELECT | K_TABLE | K_TEXT | K_WHERE | K_VALUES | K_INSERT | K_INTO
;
table_name:
error {
// "Expected table name"
}
|
keyword {
// "You cannot use a keyword for a table name."
}
|
IDENTIFIER {
}
;
select_statement:
K_SELECT column_expression_list {
// "Expected FROM after column list."
}
error
|
K_SELECT error {
// "Expected column list after SELECT."
}
|
K_SELECT column_expression_list {
}
K_FROM table_name {
}
;
column_expression_list:
column_expression {
}
next_column_expression
;
column_expression:
T_ASTERISK {
}
|
expression {
}
;
next_column_expression:
|
T_COMMA column_expression {
}
next_column_expression
;
binary_expression:
value {
}
operator {
}
value {
}
;
expression:
value
|
binary_expression
;
operator:
T_PLUS {
}
|
T_EQUALS {
}
;
value:
INTEGER {
}
|
IDENTIFIER {
}
;
%%
You need to understand LR (shift-reduce) parsing, and you need to understand how yacc recovers from errors, using error rules in the grammar. The former is a big topic, and there are a number of books that cover the theory and practice of PDAs and shift-reduce parsing (the classics, Hopcroft & Ullman and Aho, Sethi & Ullman, are complete if rather dense).
Once you understand shift-reduce parsing, yacc error recovery is reasonably straightforward. Basically, whenever it gets into a state where it can’t shift or reduce on the current token, it takes a simple sequence of steps to try to recover:
1. It pops states until it gets to one that can shift the special error token. This might be zero pops if the current state can shift error.
2. It shifts the error token, and then does any default reductions in the target state.
3. It throws away input tokens until it finds one that can be handled in the current state. As with the state dropping, that might be zero discards if the state after shifting error can handle the next token.

And that’s it.
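As an illustration of those three steps, here is a small grammar that is separate from the one in the question. It is only a sketch: the token NUM, the rule names, the hand-written lexer and the printed messages are all made up for the example. The one real API it leans on is yyerrok, the standard yacc/bison macro that tells the parser recovery is finished so later errors are reported normally.

```yacc
%{
#include <ctype.h>
#include <stdio.h>
int yylex(void);
void yyerror(const char *msg);
%}

%token NUM

%%

lines:
    /* empty */
    | lines line
    ;

line:
    expr ';'        { printf("ok\n"); }
    /* Recovery point: states are popped back to the start of a line,
       error is shifted, and input is discarded up to the next ';'. */
    | error ';'     { printf("skipped a bad line\n"); yyerrok; }
    ;

expr:
    NUM
    | expr '+' NUM
    ;

%%

/* Minimal lexer for the sketch: digits become NUM, whitespace is skipped,
   any other character is returned as a single-character token. */
int yylex(void)
{
    int c = getchar();
    while (c == ' ' || c == '\t' || c == '\n')
        c = getchar();
    if (c == EOF)
        return 0;
    if (isdigit(c)) {
        while (isdigit(c = getchar()))
            ;
        ungetc(c, stdin);
        return NUM;
    }
    return c;
}

void yyerror(const char *msg)
{
    fprintf(stderr, "%s\n", msg);
}

int main(void)
{
    return yyparse();
}
```

Fed something like "1 + ; 2 + 3 ;", the parser hits the first ; in a state that only accepts NUM. It pops the states for + and expr, shifts error in the state at the start of a line, and then finds that ; can be shifted straight away, so nothing is discarded. The second line then parses normally.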
So if we look at what happens with your current grammar and example erroneous input, we see that it:

1. shifts the SELECT token, going into a state “select_statement: K_SELECT ...”
2. shifts the 123 token, reduces it to a value and shifts to a state “*expr: value ...”
3. shifts the + token, reduces it to an operator and shifts to a state “binary_expression: value operator ...”
4. sees the , token and can’t shift or reduce in the current state, so issues a syntax error.
5. pops states until it finds one that can shift error. The top two states (from 3 and 2 above) can’t, so they are discarded. The next state can, so we end up in a state “select_statement: K_SELECT error”
6. reduces that to a select_statement, which is then reduced to statement, which shifts to a state “input: statement END”
7. discards input until it finds a token this state can handle, and the only one is END. So it throws away everything until it gets to END or eof.

Now your question seems to be, “How do I do something different?”
If you want a ‘binary expression not complete’ recovery, you could add a rule like:
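A minimal sketch of what such a rule might look like, assuming the error token goes directly after the first value. That placement is inferred from the description of where recovery stops, and the message text is hypothetical. The rule is shown without the empty actions from the skeleton:

```yacc
binary_expression:
    value operator value
    |
    /* assumed form: a value followed by something that is not a
       complete "operator value" pair */
    value error {
        /* hypothetical message: "Incomplete binary expression." */
    }
    ;
```

Because select_statement also allows error after the column list, an alternative like this may introduce a shift/reduce conflict on error in this grammar. The states report will show whether it does and how it was resolved.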
This would end up as part of the “*expr: value” state above, so error recovery would stop popping there and shift the error token, ending up in a state that can shift the , token.

Whenever you’re trying to untangle the states in a large grammar and understand what error recovery will do, it helps tremendously to run yacc/bison with the -v flag to produce a .output file with all the states in it.
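With Bison specifically, the same report can also be requested from inside the grammar file rather than on the command line, using the %verbose directive in the declarations section. This is a Bison feature, so other yacc implementations may not accept it. Placed next to the existing %error-verbose line, it would look like:

```yacc
/* same effect as passing -v: write the .output report of parser states */
%verbose
/* be more verbose about error messages */
%error-verbose
```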