I have defined the following Parsec parser for parsing CSV files into a table of strings, i.e. [[String]]:
-- A CSV parser is some rows separated, and possibly ended, by a newline character
csvParser = sepEndBy row (char '\n')
-- A row is some cells separated by a comma character
row = sepBy cell (char ',')
-- A cell is either a quoted cell or a normal cell
cell = qcell <|> ncell
-- A normal cell is a series of characters, each of which is either an escaped character or anything other than a comma or a newline
ncell = many (escChar <|> noneOf ",\n")
-- A quoted cell is a " followed by some characters, each of which is either an escaped character or anything other than a ", followed by a closing "
qcell = do
    char '"'
    res <- many (escChar <|> noneOf "\"")
    char '"'
    return res
-- An escape character is a \ followed by any character. The \ is discarded.
escChar = char '\\' >> anyChar
I don't really know whether the comments are too much and annoying, or whether they are helping. As a Parsec noob they would help me, so I thought I would add them.
It works pretty well, but there is a problem: it creates an extra, empty row in the table. For example, if I have a CSV file with 10 rows (that is, only 10 lines, with no empty lines at the end*), the [[String]] structure will have length 11, and the last list of Strings will contain one element: an empty String (at least this is how it appears when printing it using show).
My main question is: Why does this extra row appear, and what can I do to stop it?
Another thing I have noticed is that if there are empty lines after the data in the CSV file, these end up in the table as rows containing only an empty String. I thought that using sepEndBy instead of sepBy would cause the extra empty lines to be ignored. Is this not the case?
*After looking at the text file in a hex editor, it seems that it indeed actually ends in a newline character, even though vim doesn’t show it…
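A minimal reproduction of the behaviour described above, as a sketch. It assumes the parsec package (Text.Parsec, plus the Parser alias from Text.Parsec.String); the type signatures, the main function, and the two-line sample input are additions for illustration and are not part of the original code. After the final newline, sepEndBy tries row once more. Because ncell is built with many and row with sepBy, both succeed on empty input, so that last attempt yields a row holding a single empty cell.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- The parser from the question, unchanged apart from added type signatures.
csvParser :: Parser [[String]]
csvParser = sepEndBy row (char '\n')

row :: Parser [String]
row = sepBy cell (char ',')

cell :: Parser String
cell = qcell <|> ncell

-- 'many' accepts zero characters, so ncell succeeds on empty input.
ncell :: Parser String
ncell = many (escChar <|> noneOf ",\n")

qcell :: Parser String
qcell = do
    _ <- char '"'
    res <- many (escChar <|> noneOf "\"")
    _ <- char '"'
    return res

escChar :: Parser Char
escChar = char '\\' >> anyChar

-- Sample input (an assumption): two rows, with a trailing newline.
main :: IO ()
main = print (parse csvParser "" "a,b\nc,d\n")
-- prints: Right [["a","b"],["c","d"],[""]]
```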
If you want each row to have at least one cell, you can use sepBy1 instead of sepBy. This should also stop empty rows being parsed as a row. The difference between sepBy and sepBy1 is the same as the difference between many and many1: the 1 version only parses sequences of at least one element. So row becomes this:
row = sepBy1 cell (char ',')
Also, the usual style is to use sepBy1 in infix: cell `sepBy1` char ','. This reads more naturally: you have a "cell separated by a comma" rather than "separated by cell a comma".
EDIT: If you don't want to accept empty cells, you have to specify that ncell has at least one character using many1:
ncell = many1 (escChar <|> noneOf ",\n")
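Putting the sepBy1 and many1 changes together, a self-contained sketch might look like the following. It assumes the parsec package (Text.Parsec and the Parser alias from Text.Parsec.String); the type signatures, the main function, and the sample input are illustrative additions rather than part of the original code.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

csvParser :: Parser [[String]]
csvParser = sepEndBy row (char '\n')

-- sepBy1: a row must contain at least one cell.
row :: Parser [String]
row = cell `sepBy1` char ','

cell :: Parser String
cell = qcell <|> ncell

-- many1: an unquoted cell must contain at least one character,
-- so ncell (and therefore row) fails on empty input without consuming anything.
ncell :: Parser String
ncell = many1 (escChar <|> noneOf ",\n")

qcell :: Parser String
qcell = do
    _ <- char '"'
    res <- many (escChar <|> noneOf "\"")
    _ <- char '"'
    return res

escChar :: Parser Char
escChar = char '\\' >> anyChar

-- Sample input (an assumption): two rows, with a trailing newline.
main :: IO ()
main = print (parse csvParser "" "a,b\nc,d\n")
-- prints: Right [["a","b"],["c","d"]]
```

With these definitions the trailing newline no longer produces an extra row, because row fails on empty input and sepEndBy then stops. Two consequences are worth keeping in mind: unquoted empty cells are rejected, so an input such as "a,,b" gives a parse error, and the parser stops at the first blank line, leaving whatever follows unconsumed.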