I have CSV files that have multiple columns that are sorted. For instance, I might have lines like this:
19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2
20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1
I would like to divide up the file based on the 3rd column, e.g. put PLXS and PCP entries into their own files called PLXS.csv and PCP.csv. Because the file happens to be pre-sorted, all of the PLXS entries are before the PCP entries and so on.
I generally end up doing things like this in C++ since that’s the language I know the best, but in this case, my input CSV file is several gigabytes and too large to load into memory in C++.
Can somebody show how this can be accomplished? Perl/Python/PHP/bash solutions are all okay; they just need to be able to handle the huge file without excessive memory usage.
C++ is fine if you know it best. Why would you try to load the entire file into memory anyway?
Since the output file depends only on one column of the record being read, you could easily keep a buffered stream for each output file and write each record to the appropriate file as you process it, flushing and cleaning up as you go to keep the memory footprint relatively small.
I do this (albeit in Java) when I need to take massive extracts from a database. The records are pushed into a file buffer stream and anything in memory is cleaned up, so the footprint of the program never grows beyond what it starts out at.
Fly by the seat of my pants pseudo-code:
Basically, keep processing like this until we reach the end of the file.
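For what it's worth, here is a minimal sketch of that loop in Python (one of the languages the question allows). It is an illustration, not a definitive implementation: the names split_by_column, output_dir and key_index are made up for this example, and it assumes the values in the key column are safe to use as file names and that output_dir already exists.

```python
import csv
import os


def split_by_column(input_path, output_dir, key_index=2):
    """Stream input_path and write each row to <output_dir>/<key>.csv.

    key_index=2 means the 3rd column. Returns the sorted list of keys seen.
    """
    streams = {}  # key -> (file handle, csv writer); the only state we keep
    try:
        with open(input_path, newline="") as src:
            for row in csv.reader(src):  # one record in memory at a time
                if len(row) <= key_index:
                    continue  # skip blank or short lines
                key = row[key_index]
                if key not in streams:
                    path = os.path.join(output_dir, key + ".csv")
                    handle = open(path, "w", newline="")
                    streams[key] = (handle, csv.writer(handle, lineterminator="\n"))
                handle, writer = streams[key]
                writer.writerow(row)
                handle.flush()  # push the record out as soon as it is written
    finally:
        for handle, _ in streams.values():
            handle.close()
    return sorted(streams)
```

Calling split_by_column("input.csv", "out") on the sample data (with "input.csv" and "out" as placeholder paths) would produce out/PLXS.csv and out/PCP.csv. Note that this keeps one stream open per distinct key, so a file with a very large number of distinct keys could run into the operating system's limit on open files.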
Since we never store more than pointers to the streams, and we flush as soon as we write to them, we never hold anything resident in the application's memory other than one record from the input file. Thus the footprint is kept manageable.
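Because the question says the file is pre-sorted on the 3rd column, a variation is possible that only ever has one output stream open. The following Python sketch is hedged in the usual way: split_sorted, output_dir and key_index are hypothetical names, the key values are assumed to be usable as file names, and output_dir is assumed to exist. It opens files in append mode so that a key which unexpectedly reappears later does not overwrite its earlier rows.

```python
import csv
import os


def split_sorted(input_path, output_dir, key_index=2):
    """Split a file whose rows are already grouped by the key column.

    Only one output file is open at any moment. Returns the number of
    rows written.
    """
    current_key = None
    out = None
    writer = None
    written = 0
    try:
        with open(input_path, newline="") as src:
            for row in csv.reader(src):
                if len(row) <= key_index:
                    continue  # skip blank or short lines
                key = row[key_index]
                if key != current_key:
                    # The key changed: finish the old file, start the next one.
                    if out is not None:
                        out.close()
                    path = os.path.join(output_dir, key + ".csv")
                    out = open(path, "a", newline="")  # append, never truncate
                    writer = csv.writer(out, lineterminator="\n")
                    current_key = key
                writer.writerow(row)
                written += 1
    finally:
        if out is not None:
            out.close()
    return written
```

The trade-off is that append mode also means a second run adds to files left over from a previous run, so the output directory should start out empty.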