I have a Python datetime timestamp and a large dict (index) where keys are timestamps and the values are some other information I’m interested in.
I need to find the datetime (the key) in index that is closest to timestamp, as efficiently as possible.
At the moment I’m doing something like:
for timestamp in timestamps:
    closestTimestamp = min(index, key=lambda dt: abs(timestamp - dt))
This works, but it takes too long: every call to min scans the whole dict, my index dict has millions of entries, and I’m doing the search thousands of times. I’m flexible about data structures and so on. The timestamps I look up are roughly sequential, so I’m iterating from the earliest to the latest. Likewise, the timestamps in the text file that I load into the dict are sequential.
Any ideas for optimisation would be greatly appreciated.
Dictionaries aren’t organized for efficient near-miss searches. They are designed for exact matches (using a hash table).
You may be better off maintaining a separate, fast-searchable ordered structure.
A simple way to start off is to keep the keys in a sorted list and use the bisect module, which gives fast O(log N) searches but slower O(N) insertions:
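The bisect approach might look roughly like the sketch below. It assumes the keys of index are sorted once into a list; the helper name closest_key and the sample data are made up for illustration and are not from the question.

```python
from bisect import bisect_left
from datetime import datetime


def closest_key(sorted_keys, target):
    """Return the element of sorted_keys nearest to target.

    sorted_keys must be sorted ascending and non-empty.
    On a tie, the earlier key is returned.
    """
    if not sorted_keys:
        raise ValueError("sorted_keys is empty")
    pos = bisect_left(sorted_keys, target)  # O(log N) binary search
    if pos == 0:
        return sorted_keys[0]               # target precedes every key
    if pos == len(sorted_keys):
        return sorted_keys[-1]              # target follows every key
    before, after = sorted_keys[pos - 1], sorted_keys[pos]
    return before if target - before <= after - target else after


# Hypothetical sample data standing in for the real index.
index = {
    datetime(2024, 1, 1, 12, 0, 0): "a",
    datetime(2024, 1, 1, 12, 0, 10): "b",
    datetime(2024, 1, 1, 12, 0, 30): "c",
}
sorted_keys = sorted(index)  # one-off O(N log N) cost

timestamps = [
    datetime(2024, 1, 1, 12, 0, 4),
    datetime(2024, 1, 1, 12, 0, 21),
]
closest = [closest_key(sorted_keys, t) for t in timestamps]
values = [index[k] for k in closest]  # look the payload up in the dict
```

Only the two neighbours of the insertion point can be the nearest key, so each lookup costs one binary search instead of a scan over millions of entries. If the file is already sorted, the sorted() call can be replaced by appending keys in file order.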
A more sophisticated approach, suitable for non-static, dynamically updated dicts, would be to use blist, which employs a tree structure for fast O(log N) insertions and lookups. You only need this if the dict is going to change over time.
If you want to stay with a dictionary based approach, consider a dict-of-lists that clusters entries with nearby timestamps:
Note: for exact results near cluster boundaries, store timestamps that are close to a boundary in both the primary cluster and the adjacent cluster.
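A minimal sketch of the clustering idea, under some assumptions: keys are naive datetimes, clusters are fixed one-minute buckets, and the names BUCKET, EPOCH, bucket_id, build_clusters and closest_in_clusters are all invented here. Instead of duplicating boundary entries, this variant looks in the neighbouring buckets at lookup time, which serves the same purpose.

```python
from collections import defaultdict
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=1)  # assumed cluster width; tune to the data
EPOCH = datetime(1970, 1, 1)   # arbitrary reference point for bucket numbering


def bucket_id(ts):
    """Map a datetime to the integer number of its bucket."""
    return (ts - EPOCH) // BUCKET


def build_clusters(keys):
    """Group keys into a dict of lists, one list per bucket."""
    clusters = defaultdict(list)
    for key in keys:
        clusters[bucket_id(key)].append(key)
    return clusters


def closest_in_clusters(clusters, target):
    """Return the nearest key among the target's bucket and its two
    neighbours, or None if all three buckets are empty."""
    b = bucket_id(target)
    candidates = [k for n in (b - 1, b, b + 1) for k in clusters.get(n, ())]
    if not candidates:
        return None
    return min(candidates, key=lambda k: abs(target - k))


# Hypothetical sample keys.
keys = [
    datetime(2024, 1, 1, 12, 0, 5),
    datetime(2024, 1, 1, 12, 0, 58),
    datetime(2024, 1, 1, 12, 1, 2),
    datetime(2024, 1, 1, 12, 5, 0),
]
clusters = build_clusters(keys)
near_boundary = closest_in_clusters(clusters, datetime(2024, 1, 1, 12, 1, 1))
```

The result is exact whenever some key lies within one bucket width of the target. With sparser data the true nearest key can sit further away, so the bucket width needs to be chosen with the typical gap between timestamps in mind, or the search widened when the three buckets come back empty.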