I’m having a hard time breaking a large (50GB) CSV file into smaller parts. Each line has a few thousand fields. Some of the fields are strings in double quotes; others are integers, decimals and booleans.
I want to parse the file line by line and split it by the number of fields in each row. The strings may contain several commas, and each row also has a number of empty fields, for example:
,,1,30,50,"Sold by father,son and daughter for $4,000" , ,,,, 12,,,20.9,0,
I tried using
perl -pe' s{("[^"]+")}{($x=$1)=~tr/,/|/;$x}ge ' file >> file2
to change the commas inside the quotes to | but that didn’t work. I plan to use
awk -F"|" conditional statement appending to new k_fld_files file2
Is there an easier way to do this? I’m looking at Python, but I probably need a utility that will stream-process the file, line by line.
Using Python – if you just want to parse CSV including embedded delimiters, and stream out with a new delimiter, then something such as:
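A minimal sketch of that idea, using only the standard `csv` module. The function name `convert_delimiter` and the default `|` delimiter are my own choices for illustration. It reads one row at a time, so memory use should stay flat even on a 50GB file:

```python
import csv


def convert_delimiter(src, dst, new_delimiter="|"):
    """Stream CSV rows from src to dst, re-delimited; return the row count.

    src and dst are open text file objects (ideally opened with newline="").
    """
    reader = csv.reader(src)  # understands commas inside double quotes
    writer = csv.writer(dst, delimiter=new_delimiter, lineterminator="\n")
    rows = 0
    for row in reader:  # one row at a time, never the whole file
        writer.writerow(row)
        rows += 1
    return rows
```

Something like `convert_delimiter(open("file", newline=""), open("file2", "w", newline=""))` would then replace the perl step. Commas that were inside quotes stay as they are, since they no longer clash with the new delimiter; a field that itself contains `|` would be quoted by the writer.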
Otherwise, it’s not much more difficult to extend this to other kinds of processing, such as routing each row to a different output file.
Example of outputting to files per column (untested):
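A sketch along those lines, reading "per column" as "one output file per number of columns in the row", which is what the question asks for. The names `split_by_field_count`, `out_dir` and the `fields_<k>.csv` naming scheme are assumptions of mine, and empty fields are counted as fields:

```python
import csv
import os


def split_by_field_count(src_path, out_dir, prefix="fields_"):
    """Write each row to <out_dir>/<prefix><k>.csv, where k is its field count.

    Returns a dict mapping field count -> number of rows written.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # field count -> open output file
    writers = {}  # field count -> csv writer on that file
    counts = {}
    try:
        with open(src_path, newline="") as src:
            for row in csv.reader(src):  # streams one row at a time
                k = len(row)  # empty fields count too (assumption)
                if k not in writers:
                    path = os.path.join(out_dir, f"{prefix}{k}.csv")
                    handles[k] = open(path, "w", newline="")
                    writers[k] = csv.writer(handles[k])
                writers[k].writerow(row)
                counts[k] = counts.get(k, 0) + 1
    finally:
        for fh in handles.values():
            fh.close()
    return counts
```

To count only non-empty fields instead, `k` could be computed as `sum(1 for f in row if f.strip())`. One output file stays open per distinct field count, so if the data has very many distinct counts you may run into the operating system's open-file limit and need to open files in append mode on demand.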