I need to parse a CSV file using AWK. A line in the CSV could look like this:
"hello, world?",1 thousand,"oneword",,,"last one"
Some important observations:
-a field inside a quoted string can contain commas and multiple words
-an unquoted field can contain multiple words
-a field can be empty, shown by two commas in a row
Any clues on writing a regular expression to split this line up properly?
Thanks!
As many have observed, CSV is a harder format than it first appears. There are many edge cases and ambiguities. As an example ambiguity, in your example, is ‘,,,’ a field with a comma or two blank fields?
Perl, Python, Java, etc. are better equipped to deal with CSV because they have well-tested libraries for it. A regex will be more fragile.
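To illustrate the library route, here is a minimal Python sketch that runs the standard csv module on the sample line from the question. It assumes the default dialect (comma delimiter, double-quote quoting), and the variable names are only for illustration.

```python
import csv
import io

# Sample line copied from the question.
line = '"hello, world?",1 thousand,"oneword",,,"last one"'

# csv.reader takes any iterable of lines. With the default dialect,
# commas inside double quotes stay part of the field, and the quotes
# themselves are stripped from the result.
fields = next(csv.reader(io.StringIO(line)))

for i, field in enumerate(fields, start=1):
    print(f"{i}: {field!r}")
```

Under that dialect the line splits into six fields, and the ,,, run is read as two empty fields rather than a field containing a comma.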
With AWK, I have had some success with this AWK function. It works under AWK, gawk, and nawk.
Running it on your example data produces:
An example Perl solution: