Alright so this problem has been breaking my brain all day today.
The Problem: I am currently receiving stock tick data at an extremely high rate through multicasts. I have already parsed this data and am receiving it in the following form.
-StockID: Int-64
-TimeStamp: Microseconds from Epoch
-Price: Int
-Quantity: Int
Hundreds of these packets of data are parsed every second. I am trying to reduce the computation on my storage end by packaging this data into dictionaries/hashtables keyed by stockID (key == stockID, value == array of [timestamp, price, quantity] elements).
I also want each dictionary to cover the timestamps within one 5min interval. When the incoming data’s timestamps pass the end of that interval, I want the new data to go into a new dictionary that represents the next interval. Each dictionary will also hold a special entry at key -1 saying which 5min interval of the day it belongs to (so if you receive something at 12:32am, it should go into the dictionary that has value 7 at key -1, since this represents the interval of 12:30am to 12:35am for that particular day). Once an interval has expired, its dict can be sent off to the dataWrapper.
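To make the layout concrete, here is a minimal sketch of how the interval index and the dictionary could be built. It assumes the timestamps are microseconds since the Unix epoch in UTC and that "per day" means a UTC day; the names `interval_of_day` and `add_tick` are made up for illustration.

```python
INTERVAL_US = 5 * 60 * 1_000_000      # five minutes in microseconds
DAY_US = 24 * 60 * 60 * 1_000_000     # one day in microseconds


def interval_of_day(timestamp_us):
    """Return the 1-based 5min interval of the (assumed UTC) day."""
    return (timestamp_us % DAY_US) // INTERVAL_US + 1


def add_tick(bucket, stock_id, timestamp_us, price, quantity):
    """Append one tick to a bucket dict, tagging the bucket at key -1."""
    # Key -1 is reserved for the interval index, so this assumes
    # that no real stockID is ever -1.
    bucket.setdefault(-1, interval_of_day(timestamp_us))
    bucket.setdefault(stock_id, []).append([timestamp_us, price, quantity])


bucket = {}
add_tick(bucket, 1001, 32 * 60 * 1_000_000, 250, 10)  # 12:32am on day 0
```

With this numbering, 12:32am lands in interval 7, matching the example above.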
Now, you might be coming up with some ideas right about now. But here’s a big constraint. The timestamps that are coming in are not necessarily strictly increasing; however, if one waits about 10 seconds after an interval has ended, it is safe to assume that all data coming in belongs to the current interval.
The reason I am doing all these complicated things is to reduce computation on the storage side of my application. With the setup above, my storage side thread can simply iterate over all of the key, value pairs within the dictionary and store them in the same location on the storage system without having to reopen files, reassign groups or change directories.
Good Luck! I will greatly appreciate ANY answers btw. 🙂
Preferred if you can send me something in python (that’s what I’m doing the project in), but I can perfectly understand Java, C++, Ruby or PHP.
Summary
I am trying to put stock data into dictionaries, with each dictionary representing a 5min interval. The timestamp that comes with the data determines which dictionary it should be put in. This would be relatively easy except that timestamps are not strictly increasing as they come in, so a dictionary cannot be sent off to the datawrapper as soon as the timestamps pass the end of its 5 mins: late data for that interval may still arrive during the next 10 seconds. After that, it’s okay to send it to the wrapper.
I just want any kind of ideas, algorithms, or partial implementations that could help me with the scheduling of this. How can I switch which dictionary is current based on both the timestamps (for the data) and actual time (the 10 second buffer)?
Clarification Edit
The 5 min window should be data driven (based upon timestamps), however the 10 second timeout appears to be clock time.
Perhaps I am missing something ….
It appears you want to keep the data in 5 min buckets, but you can’t be sure you have all the data for a bucket until 10 sec after it has rolled over.
This means for each instrument you need to keep the current bucket and the previous bucket. When it’s 10 seconds past the 5 min boundary you can publish/write out the old bucket.
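A rough sketch of that idea in Python follows. It is only one way to do it and rests on a few assumptions: the data timestamps and the local clock are roughly in sync, the caller passes the current clock time in microseconds, and `publish` is a hypothetical callback that hands a finished bucket to the storage side. `IntervalBucketer` and the other names are made up for illustration.

```python
from collections import defaultdict

INTERVAL_US = 5 * 60 * 1_000_000   # 5 min window, driven by data timestamps
GRACE_US = 10 * 1_000_000          # 10 sec wait, driven by clock time


class IntervalBucketer:
    def __init__(self, publish):
        self.publish = publish     # hypothetical callback: publish(interval, bucket)
        self.buckets = {}          # interval number -> {stock_id: [[ts, price, qty], ...]}

    def add(self, stock_id, timestamp_us, price, quantity):
        """File a tick under the interval its own timestamp falls in."""
        interval = timestamp_us // INTERVAL_US
        bucket = self.buckets.setdefault(interval, defaultdict(list))
        bucket[stock_id].append([timestamp_us, price, quantity])

    def flush_expired(self, now_us):
        """Publish every bucket whose interval ended at least 10 sec ago."""
        for interval in sorted(self.buckets):
            interval_end = (interval + 1) * INTERVAL_US
            if now_us < interval_end + GRACE_US:
                break              # this bucket and all later ones are still open
            self.publish(interval, dict(self.buckets.pop(interval)))
```

In normal operation this holds at most the current and the previous bucket. `flush_expired` could be called from the receive loop or a timer with something like `time.time_ns() // 1000`. A tick that arrives after its bucket was already published would start a fresh bucket for the old interval and go out on the next flush, so the storage side would need to tolerate that or such ticks would have to be dropped.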