I have a large data set of stock tick data (200GB uncompressed, 9GB compressed with bz2 -9).
I want to run some basic time series analysis on it.
My machine has 16GB of RAM.
I would prefer to:
- keep all data, compressed, in memory
- decompress that data on the fly, and stream it (so nothing ever hits disk)
- do all analysis in memory
Now, I think there are nice interactions here with Clojure’s laziness and future objects (i.e., I can define objects such that when I try to access them, they are decompressed on the fly).
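To make the idea concrete, here is a minimal sketch of what I have in mind: the compressed bytes stay in a byte array in RAM, and lines are decompressed only as a reduction consumes them. It assumes gzip from the JDK as a stand-in for bz2 (which would need an external library such as Apache Commons Compress), and the names `compress` and `reduce-lines` are hypothetical helpers, not anything standard.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io ByteArrayInputStream ByteArrayOutputStream)
        '(java.util.zip GZIPInputStream GZIPOutputStream))

;; Hypothetical helper: gzip a string into a byte array held in memory.
;; gzip is only a stand-in here; the real data is bz2-compressed.
(defn compress ^bytes [^String s]
  (let [out (ByteArrayOutputStream.)]
    (with-open [gz (GZIPOutputStream. out)]
      (.write gz (.getBytes s "UTF-8")))
    (.toByteArray out)))

;; Hypothetical helper: fold f over the decompressed lines.
;; line-seq is lazy, so only a buffer's worth of text is decompressed
;; at any moment; the reduce runs inside with-open so the reader is
;; still open while the lines are consumed.
(defn reduce-lines [f init ^bytes compressed]
  (with-open [rdr (io/reader (GZIPInputStream.
                               (ByteArrayInputStream. compressed)))]
    (reduce f init (line-seq rdr))))

;; Usage: count ticks without ever materialising the full text.
(def blob (compress "1,10.0,100\n2,10.5,200\n3,11.0,50\n"))
(reduce-lines (fn [n _line] (inc n)) 0 blob)
```

One thing I am unsure about is returning the lazy seq itself out of `with-open`: the reader would be closed before the seq is realised, which is why the sketch reduces inside the scope.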
Question: what are the things I should keep in mind when doing high performance time series analysis in Clojure?
I’m particularly interested in tricks involving:
- efficiently storing tick data in memory
- efficiently doing computation
- weird convolutions to reduce # of passes over the data
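As an example of the kind of trick I mean, the sketch below keeps prices in a primitive `double` array (8 bytes per tick, no boxing) and computes several statistics in a single pass, using Welford's online update for the mean and variance. The function name `summary` and the choice of statistics are just illustrative assumptions; I do not know whether this is the idiomatic way to do it in Clojure.

```clojure
;; Hypothetical one-pass summary over a primitive double array.
;; The type hint and the primitive loop locals avoid boxing and reflection.
(defn summary [^doubles xs]
  (let [n (alength xs)]
    (loop [i    0
           mean 0.0
           m2   0.0                       ; running sum of squared deviations
           lo   Double/POSITIVE_INFINITY
           hi   Double/NEGATIVE_INFINITY]
      (if (< i n)
        (let [x        (aget xs i)
              d        (- x mean)
              new-mean (+ mean (/ d (double (inc i))))]
          (recur (inc i)
                 new-mean
                 (+ m2 (* d (- x new-mean)))  ; Welford update
                 (Math/min lo x)
                 (Math/max hi x)))
        {:n    n
         :mean mean
         :var  (if (> n 1) (/ m2 (dec n)) 0.0) ; sample variance
         :min  lo
         :max  hi}))))

;; Usage on a tiny array of prices.
(summary (double-array [1.0 2.0 3.0 4.0]))
```

Count, mean, variance, min and max all come out of one traversal, so the decompressed data only needs to be streamed through once.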
Books / articles / research paper suggestions welcome. (I’m a CS PhD student).
Thanks.
Some ideas: