Imagine that you have a large set of #m objects with properties A and

Asked: May 30, 2026 (2026-05-30T23:42:23+00:00)

Imagine that you have a large set of m objects with properties A and B. What data structure can you use as an index (or which algorithm) to improve the performance of the following query?

find all objects where A between X and Y, order by B, return first N results;

That is, filter by range A and sort by B, but only return the first few results (say, 1000 at most). Insertions are very rare, so heavy preprocessing is acceptable. I’m not happy with the following options:

1. With records (or an index) sorted by B: scan in B order and return the first N records whose A lies in the range X-Y. In the worst cases (few objects match the range, or the matches sit at the end of the index) this becomes O(m), which is not good enough for large m.
2. With records (or an index) sorted by A: binary search for the first object that matches the range X-Y, then scan and build an array of references to all k matching objects. Sort the array by B and return the first N. That’s O(log m + k + k log k). If k is small this is effectively O(log m), but if k is large the cost of the sort becomes even worse than the linear scan over all m objects.
3. Adaptive 2/1: binary search for the first and the last match of the range X-Y (using an index over A). If the range is small, continue with algorithm 2; otherwise fall back to algorithm 1. The problem is the fallback case. Although we checked that “many” objects pass the filter, which is the good case for algorithm 1, this “many” is at most a constant (asymptotically the O(m) scan always wins over the O(k log k) sort). So some queries still cost O(m).
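For reference, the three options above can be sketched roughly as follows. This is only an illustration: the function names, the (a, b) tuple layout, and the `threshold` parameter are assumptions, not part of the original question. Option 2 uses `heapq.nsmallest` instead of a full sort, which selects the N smallest entries without ordering all k matches.

```python
import bisect
import heapq

def scan_b_index(by_b, x, y, n):
    """Option 1: walk records in B order, keep the first n with A in [x, y]."""
    out = []
    for rec in by_b:
        if len(out) >= n:
            break
        if x <= rec[0] <= y:
            out.append(rec)
    return out

def filter_then_select(by_a, a_keys, x, y, n):
    """Option 2: binary-search the A range, then pick the n smallest B."""
    lo = bisect.bisect_left(a_keys, x)
    hi = bisect.bisect_right(a_keys, y)
    return heapq.nsmallest(n, by_a[lo:hi], key=lambda r: r[1])

def adaptive(by_a, by_b, a_keys, x, y, n, threshold):
    """Option 3: use option 2 for small ranges, option 1 otherwise."""
    lo = bisect.bisect_left(a_keys, x)
    hi = bisect.bisect_right(a_keys, y)
    if hi - lo <= threshold:  # hypothetical cut-off between the two strategies
        return heapq.nsmallest(n, by_a[lo:hi], key=lambda r: r[1])
    return scan_b_index(by_b, x, y, n)
```

Here `by_a` and `by_b` are the same records sorted by A and by B respectively, and `a_keys` is the list of A values in `by_a` order. All three return the same answer; they differ only in which queries hit the O(m) worst case.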

Is there an algorithm / data structure which allows answering this query in sublinear time?

If not, what could be good compromises to achieve the necessary performance? For instance, if I don’t guarantee returning the objects best ranking for their B property (recall < 1.0) then I can scan only a fraction of the B index. But could I do that while bounding the results’ quality somehow?

1 Answer

Editorial Team · Answer 1 · 2026-05-30T23:42:25+00:00

The question you are asking is essentially a more general version of:

Q. You have a sorted list of words with a weight associated with each word, and you want all words which share a prefix with a given query q, and you want this list sorted by the associated weight.

Am I right?

If so, you might want to check this paper which discusses how to do this in O(k log n) time, where k is the number of elements in the output set desired and n is the number of records in the original input set. We assume that k > log n.

http://dhruvbird.com/autocomplete.pdf

(I am the author).
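One well-known way to get a bound of this shape is to sort the records by A, build a range-minimum structure over B, and repeatedly split the matching slice with a priority queue. The sketch below illustrates that idea and is not taken from the linked paper; the class name `RangeTopN` and the (a, b) tuple layout are hypothetical. With a sparse table, preprocessing takes O(m log m) time and memory, and a query takes roughly O(log m + N log N).

```python
import bisect
import heapq

class RangeTopN:
    """Filter by a range on A, return the n records with the smallest B."""

    def __init__(self, records):
        # Sort once by A so that any A-range is a contiguous slice.
        self.records = sorted(records, key=lambda r: r[0])
        self.a_keys = [r[0] for r in self.records]
        m = len(self.records)
        # Sparse table: table[j][i] = index of the min-B record in [i, i + 2**j).
        self.table = [list(range(m))]
        j = 1
        while (1 << j) <= m:
            prev, half = self.table[-1], 1 << (j - 1)
            self.table.append([self._better(prev[i], prev[i + half])
                               for i in range(m - (1 << j) + 1)])
            j += 1

    def _better(self, i, j):
        return i if self.records[i][1] <= self.records[j][1] else j

    def _argmin(self, lo, hi):
        """Index of the min-B record in the inclusive slice [lo, hi]."""
        j = (hi - lo + 1).bit_length() - 1
        return self._better(self.table[j][lo], self.table[j][hi - (1 << j) + 1])

    def query(self, x, y, n):
        lo = bisect.bisect_left(self.a_keys, x)
        hi = bisect.bisect_right(self.a_keys, y) - 1
        out = []
        if lo > hi:
            return out
        i = self._argmin(lo, hi)
        heap = [(self.records[i][1], i, lo, hi)]
        while heap and len(out) < n:
            _, i, l, h = heapq.heappop(heap)
            out.append(self.records[i])
            # Split the slice around the record just emitted and queue the
            # best candidate of each non-empty half.
            for sub_lo, sub_hi in ((l, i - 1), (i + 1, h)):
                if sub_lo <= sub_hi:
                    k = self._argmin(sub_lo, sub_hi)
                    heapq.heappush(heap, (self.records[k][1], k, sub_lo, sub_hi))
        return out
```

A call such as `RangeTopN(records).query(X, Y, N)` returns the matches in ascending B order. The query cost no longer depends on how many objects fall inside the A range, only on N.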

Update: As a further refinement, the question is related to 2-dimensional range searching: you want everything in a given X-range and, from that set, the top K results when sorted by Y.

2D range search lets you find everything in an X/Y-range when both ranges are known. In this case you only know the X-range, so you would need to run the query repeatedly, binary searching on the upper bound of the Y-range until you get K results. Each query takes O(log n) time if you employ fractional cascading, and O(log² n) with the naive approach. Both are sublinear, so you should be okay.

Additionally, listing the entries adds an O(k) term to the running time.
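The repeated-query idea can be sketched with a merge sort tree, which is the naive variant without fractional cascading. Here A plays the role of the X dimension and B the Y dimension. The class name `MergeSortTree` and the (a, b) tuple layout are assumptions made for illustration. Each count costs O(log² m) in this version, and the binary search over the B threshold multiplies that by another O(log m).

```python
import bisect

class MergeSortTree:
    """Count and report records with A in a range and B below a threshold."""

    def __init__(self, records):
        self.records = sorted(records, key=lambda r: r[0])
        self.a_keys = [r[0] for r in self.records]
        self.all_b = sorted(r[1] for r in self.records)
        self.m = len(self.records)
        self.size = 1
        while self.size < self.m:
            self.size *= 2
        # tree[node] = (b, index) pairs of that node's A-slice, sorted by b.
        self.tree = [[] for _ in range(2 * self.size)]
        for i, r in enumerate(self.records):
            self.tree[self.size + i] = [(r[1], i)]
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = sorted(self.tree[2 * node] + self.tree[2 * node + 1])

    def _nodes(self, lo, hi):
        """Canonical nodes covering the half-open A-sorted slice [lo, hi)."""
        lo += self.size
        hi += self.size
        nodes = []
        while lo < hi:
            if lo & 1:
                nodes.append(lo)
                lo += 1
            if hi & 1:
                hi -= 1
                nodes.append(hi)
            lo //= 2
            hi //= 2
        return nodes

    def query(self, x, y, n):
        if not self.records:
            return []
        nodes = self._nodes(bisect.bisect_left(self.a_keys, x),
                            bisect.bisect_right(self.a_keys, y))

        def prefix(node, t):
            # Number of entries in this node with b <= t (self.m exceeds any index).
            return bisect.bisect_right(self.tree[node], (t, self.m))

        # Binary search for the smallest B threshold that yields >= n matches.
        left, right = 0, len(self.all_b)
        while left < right:
            mid = (left + right) // 2
            if sum(prefix(v, self.all_b[mid]) for v in nodes) >= n:
                right = mid
            else:
                left = mid + 1
        t = self.all_b[min(left, len(self.all_b) - 1)]
        hits = []
        for v in nodes:
            hits.extend(self.tree[v][:prefix(v, t)])
        hits.sort()
        return [self.records[i] for _, i in hits[:n]]
```

If fewer than N objects match the A range, the search settles on the largest B value and everything in the range is returned. The structure uses O(m log m) memory, which fits the question's premise that heavy preprocessing is acceptable.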
