I’d like to parse a CSV file using a context-free grammar (CFG). I already have an implementation in C++, but I want to scale CFGs up to harder problems, so first I need to solve an easy one.
So here’s what I have so far (my syntax is similar to boost spirit):
A CSV consists of one or more rows
Start >> +Line
A row consists of comma-separated symbols followed by an EOL
Line >> Symbol >> *(',' >> Symbol) >> EOL
An EOL delimiter can be either Windows or Unix style
EOL >> -'\r' >> '\n'
Here is where I am stuck handling quoted strings:
Symbol >>
string |
????
Examples of complex quoted strings that must be parsed properly:
"This, is a ""complex"" example of a CSV string!"
"This, is a more """"""complex"""""" but theoretically possible example of a CSV string!"
I am new to CFGs and cannot figure out how to express this in a CFG. Basically, once the parser enters quote mode, commas must be treated as ordinary characters and each doubled double quote ("") must be treated as one literal quote rather than as the end of the string.
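To make the intended quote-mode behavior concrete, here is a minimal hand-written sketch of it as a two-state machine in C++. It is not a grammar and not my actual implementation: the function name splitCsvLine is made up for illustration, it assumes the line has already been split off at the EOL, and it only handles double quotes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only: splits one CSV line (without
// its EOL) into fields. Assumes only double quotes delimit quoted fields.
std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;  // true while inside a double-quoted field

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';  // "" inside quotes is one literal quote
                    ++i;             // skip the second quote of the pair
                } else {
                    inQuotes = false;  // a lone quote closes the field
                }
            } else {
                current += c;  // commas are ordinary characters in quote mode
            }
        } else if (c == '"') {
            inQuotes = true;  // opening quote: enter quote mode
        } else if (c == ',') {
            fields.push_back(current);  // unquoted comma ends the field
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);  // the last field has no trailing comma
    return fields;
}
```

The one-character lookahead after a quote is the part I don't know how to express in the grammar: inside quote mode, "" has to be tried before a lone " is accepted as the closing delimiter.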
UPDATE:
Recalling from automata theory that a CFG can be recognized by a pushdown automaton, I just realized that I need to add more states to my conceptual state machine:
Symbol -->
string |
" doublequotemode " |
' singlequotemode '
doublequotemode -->
*"" string *""
The question is: how does this work with Boost and greedy/non-greedy parsing?
This will process (double-)quoted strings, replacing "" with a single quote " inside the quoted string: