I have a 2 GB CSV file that has a few columns and several million rows (including a date column formatted as 2010-12-15). I am looking to split this CSV into smaller CSVs that are arranged in folders by date (for example, all entries for December 15, 2010 go inside a folder named 20101215).
I am fairly new to this stuff but am aware of the `split` command. Can you guys point me in the right direction?
Thanks in advance!
Depending on how regular and clean your data is, something like this may suffice:
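A minimal sketch of such a loop, pieced together from the explanation below rather than taken from any tested script. The function name `split_by_date_grep`, its two parameters and the output name `data.csv` are made up for illustration; it assumes comma-separated data, no header row, and dates in the second column:

```shell
# Sketch only: one grep pass over the input per unique date.
split_by_date_grep() {
    input=$1     # hypothetical parameter: path to the big CSV
    outdir=$2    # hypothetical parameter: where the per-date folders go

    # Unique dates from column 2; relies on the dates containing no spaces.
    for d in $(cut -d, -f2 "$input" | sort -u); do
        # 2010-12-15 -> 20101215, the folder naming asked for in the question
        dir="$outdir/$(echo "$d" | sed 's/-//g')"
        mkdir -p "$dir"
        # -F treats the date as a fixed string; it still matches anywhere on the line
        grep -F -- "$d" "$input" > "$dir/data.csv"
    done
}
```

Calling `split_by_date_grep csv .` would then create folders such as `20101215` in the current directory.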
(assuming your data is in the file `csv`, and your date info is in the second column)
What is this doing? The `cut` pulls out the dates in the second column, and they’re run through `sort -u` to create a sorted list of unique dates. We then iterate through this (the `for` command) and for each entry we `mkdir` a corresponding directory, and `grep` results out into a csv file within that directory.
It’s not ideal, e.g. we grep through the whole input file once for each date. I’m assuming the data is regular, that a date string (2012-08-06 for example) doesn’t appear elsewhere in your data, and that it doesn’t contain characters that would screw up the above script (e.g. spaces and/or `/`).
I don’t think the `split` command will help you here. It’s more useful for splitting files into regular chunks (by size or by number of lines).
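On the cost of grepping the whole file once per date: if that turns out to be too slow for a 2 GB file, a single-pass variant with `awk` is one option. This is only a sketch under the same assumptions (comma-separated, no header row, date in the second column, no quoted fields containing commas); the function name `split_by_date_awk` and the output name `data.csv` are hypothetical:

```shell
# Sketch only: read the input once and append each row to its date folder.
split_by_date_awk() {
    input=$1     # hypothetical parameter: path to the big CSV
    outdir=$2    # hypothetical parameter: where the per-date folders go

    awk -F, -v outdir="$outdir" '
        {
            day = $2
            gsub(/-/, "", day)                 # 2010-12-15 -> 20101215
            dir = outdir "/" day
            if (!(day in seen)) {              # create each folder only once
                system("mkdir -p \"" dir "\"")
                seen[day] = 1
            }
            file = dir "/data.csv"
            if (file != prev) {                # keep a single output file open
                if (prev != "") close(prev)
                prev = file
            }
            print >> file                      # append, so reopened files are not truncated
        }
    ' "$input"
}
```

Because it appends, running it twice into the same output directory would duplicate rows, so start from an empty directory. It also matches only on the second column, so a date string appearing elsewhere in a row does no harm.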