I am about to start collecting large amounts of numeric data in real time (for those interested, the bid/ask/last or ‘tape’ for various stocks and futures). The data will later be retrieved for analysis and simulation. That’s not hard at all, but I would like to do it efficiently, and that brings up a lot of questions. I don’t need the best solution (and there are probably many ‘bests’ depending on the metric, anyway). I would just like a solution that a computer scientist would approve of. (Or not laugh at?)
(1) Optimize for disk space, I/O speed, or memory?
For simulation, the overall speed is important. The read speed of the data (really just the ‘I’ in I/O) only needs to be slightly faster than the computational engine, so that we are not I/O-limited.
(2) Store text, or something else (binary numeric)?
(3) Given a set of choices from (1)-(2), are there any standout language/library combinations to do the job: Java, Python, C++, or something else?
I would classify this code as “write and forget”, so more points for efficiency over clarity/compactness of code. I would very, very much like to stick with Python for the simulation code (because the sims do change a lot and need to be clear). So bonus points for good Pythonic solutions.
Edit: this is for a Linux system (Ubuntu)
Thanks
Fame is an often-used commercial solution for time-series storage.
If you are serious about this, building your own will be a big job. HDF might be useful: its developers claim that it is suitable for tick data handling, and it has C++ access. There is Python support here.
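To give a feel for what the binary option in question (2) looks like, here is a minimal sketch that uses only Python's standard `struct` module. It is not HDF5 and not a recommendation over it. The record layout (timestamp, bid, ask, last, each a little-endian float64) and the names `TICK`, `append_ticks`, `read_ticks` and `demo` are assumptions made up for illustration.

```python
import os
import struct
import tempfile

# Hypothetical fixed-width record: timestamp, bid, ask, last.
# "<dddd" = little-endian, four float64 values, no padding (32 bytes).
TICK = struct.Struct("<dddd")


def append_ticks(path, ticks):
    """Append an iterable of (timestamp, bid, ask, last) tuples to a binary file."""
    with open(path, "ab") as f:
        for tick in ticks:
            f.write(TICK.pack(*tick))


def read_ticks(path):
    """Read the whole file back as a list of 4-tuples of floats."""
    with open(path, "rb") as f:
        data = f.read()
    # iter_unpack walks the buffer in TICK.size steps.
    return list(TICK.iter_unpack(data))


def demo():
    """Write two made-up ticks to a temporary file and read them back."""
    ticks = [
        (1.0, 100.25, 100.50, 100.25),
        (2.0, 100.25, 100.75, 100.50),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ticks.bin")
        append_ticks(path, ticks)
        return read_ticks(path), os.path.getsize(path)


if __name__ == "__main__":
    rows, size = demo()
    print(rows, size)
```

Every record is exactly `TICK.size` bytes, so record `n` starts at byte `n * TICK.size` and can be reached with a seek instead of parsing text. A container format such as HDF5 adds things like compression and chunking on top of this basic idea, which is a large part of why rolling your own is a big job.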
There is useful real-life experience from somebody with the same problem here, including HDF5 references.