Imagine that you have a large set of #m objects with properties A and B. What data structure can you use as index(s) (or which algorithm) to improve the performance of the following query?
find all objects where A between X and Y, order by B, return first N results;
That is, filter by range A and sort by B, but only return the first few results (say, 1000 at most). Insertions are very rare, so heavy preprocessing is acceptable. I’m not happy with the following options:
-
With records (or index) sorted by B: Scan the records/index in
Border, return the firstNwhereAmatches X-Y. In the worst cases (few objects match the range X-Y, or the matches are at the end of the records/index) this becomesO(m), which for large data sets of sizemis not good enough. -
With records (or index) sorted by A: Do a binary search until the first object is found which matches the range X-Y. Scan and create an array of references to all
kobjects which match the range. Sort the array by B, return the firstN. That’sO(log m + k + k log k). Ifkis small then that’s reallyO(log m), but ifkis large then the cost of the sort becomes even worse than the cost of the linear scan over allmobjects. -
Adaptative 2/1: do a binary search for the first match of the range X-Y (using an index over A); do a binary search for the last match of the range. If the range is small continue with algorithm 2; otherwise revert to algorithm 1. The problem here is the case where we revert to algorithm 1. Although we checked that “many” objects pass the filter, which is the good case for algorithm 1, this “many” is at most a constant (asymptotically the
O(n)scan will always win over theO(k log k)sort). So we still have anO(n)algorithm for some queries.
Is there an algorithm / data structure which allows answering this query in sublinear time?
If not, what could be good compromises to achieve the necessary performance? For instance, if I don’t guarantee returning the objects best ranking for their B property (recall < 1.0) then I can scan only a fraction of the B index. But could I do that while bounding the results’ quality somehow?
The question you are asking is essentially a more general version of:
Am I right?
If so, you might want to check this paper which discusses how to do this in O(k log n) time, where k is the number of elements in the output set desired and n is the number of records in the original input set. We assume that k > log n.
http://dhruvbird.com/autocomplete.pdf
(I am the author).
Update: A further refinement I can add is that the question you are asking is related to 2-dimensional range searching where you want everything in a given X-range and the top-K from the previous set, sorted by the Y-range.
2D range search lets you find everything in an X/Y-range (if both your ranges are known). In this case, you only know the X-range, so you would need to run the query repeatedly and binary search on the Y-range till you get K results. Each query can be performed using O(log n) time if you employ fractional cascading, and O(log2n) if employing the naive approach. Either of them are sub-linear, so you should be okay.
Additionally, the time to list all entries would add an additional O(k) factor to your running time.