I have two groups of files that contain data in CSV format with a common key (Timestamp) – I need to walk through all the records chronologically.
Group A: ‘Environmental Data’
- Filenames are in format A_0001.csv, A_0002.csv, etc.
- Pre-sorted ascending
- Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
- Contains environmental data in CSV/column format
- Very large, several GB of data in total
Group B: ‘Event Data’
- Filenames are in format B_0001.csv, B_0002.csv
- Pre-sorted ascending
- Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
- Contains event-based data in CSV/column format
- Relatively small compared to Group A files, < 100 MB
What is the best approach?
- Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
- Real-time merge: Implement code to ‘merge’ the files in real time as they are read (a streaming sketch is shown after this question)
I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.
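For reference, here is a minimal sketch of what the real-time merge option could look like using heapq.merge from the standard library. It assumes the Timestamp is the first column of every file, that the files have no header rows, and that the numbered filenames sort chronologically; group_stream is an illustrative name, not something from my existing code.

```python
import csv
import glob
import heapq

def group_stream(pattern):
    """Yield CSV rows from every file matching the pattern, in filename order.

    Each file is pre-sorted and the numbered filenames continue the sequence,
    so chaining the files gives one sorted stream per group.
    """
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            yield from csv.reader(f)

# heapq.merge consumes both streams lazily, so only a handful of rows are
# held in memory at a time, even though Group A is several GB in total.
merged = heapq.merge(
    group_stream("A_*.csv"),
    group_stream("B_*.csv"),
    key=lambda row: row[0],  # assumes Timestamp is the first column
)

for row in merged:
    ...  # chronological post-processing goes here
```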
I would suggest pre-merge.
Reading files carries real overhead, and interleaving two sets of files on every pass roughly doubles it. Since your program will be dealing with a large input (lots of files, especially in Group A) and you will be running many iterations of the post-processing, I think it is better to pay the merge cost once and have all your relevant data in a single sorted file. It would also reduce the number of variables and read statements you will need. This will improve the runtime of each iteration, and I think that is a good enough reason in this scenario to use this approach.
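For the one-off merge step itself, something along these lines should be enough. It is only a sketch: it assumes the Timestamp is the first column, that the files have no header rows, and that the numbered filenames sort chronologically (merged.csv is just an illustrative output name).

```python
import csv
import glob
import heapq

def rows(pattern):
    # Chain each group's pre-sorted files into one sorted stream.
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            yield from csv.reader(f)

# Merge the two sorted streams once and write a single sorted file;
# every later post-processing iteration then reads only merged.csv.
with open("merged.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerows(
        heapq.merge(rows("A_*.csv"), rows("B_*.csv"),
                    key=lambda r: r[0])  # r[0] assumed to be the Timestamp
    )
```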
Hope this helps