I solve it myself. The problem arises when you compile…

Question

0

Asked: May 11, 20262026-05-11T00:09:43+00:00 2026-05-11T00:09:43+00:00

Is there any way to select a subset from a large set based on

0

Is there any way to select a subset from a large set based on a property or predicate in less than O(n) time?

For a simple example, say I have a large set of authors. Each author has a one-to-many relationship with a set of books, and a one-to-one relationship with a city of birth.

Is there a way to efficiently do a query like ‘get all books by authors who were born in Chicago’? The only way I can think of is to first select all authors from the city (fast with a good index), then iterate through them and accumulate all their books (O(n) where n is the number of authors from Chicago).

I know databases do something like this in certain joins, and Endeca claims to be able to do this ‘fast’ using what they call ‘Record Relationship Navigation’, but I haven’t been able to find anything about the actual algorithms used or even their computational complexity.

I’m not particularly concerned with the exact data structure… I’d be jazzed to learn about how to do this in a RDBMS, or a key/value repository, or just about anything.

Also, what about third or fourth degree requests of this nature? (Get me all the books written by authors living in cities with immigrant populations greater than 10,000…) Is there a generalized n-degree algorithm, and what is its performance characteristics?

Edit:

I am probably just really dense, but I don’t see how the inverted index suggestion helps. For example, say I had the following data:

DATA 1.  Milton        England 2.  Shakespeare   England 3.  Twain         USA  4.  Milton        Paridise Lost 5.  Shakespeare   Hamlet 6.  Shakespeare   Othello 7.  Twain         Tom Sawyer 8.  Twain         Huck Finn  INDEX 'Milton'         (1, 4) 'Shakespeare'    (2, 5, 6) 'Twain'          (3, 7, 8) 'Paridise Lost'  (4) 'Hamlet'         (5) 'Othello'        (6) 'Tom Sawyer'     (7) 'Huck Finn'      (8) 'England'        (1, 2) 'USA'            (3)

Say I did my query on ‘books by authors from England’. Very quickly, in O(1) time via a hashtable, I could get my list of authors from England: (1, 2). But then, for the next step, in order retrieve the books, I’d have to, for EACH of the set {1, 2}, do ANOTHER O(1) lookup: 1 -> {4}, 2 -> {5, 6} then do a union of the results {4, 5, 6}.

Or am I missing something? Perhaps you meant I should explicitly store an index entry linking Book to Country. That works for very small data sets. But for a large data set, the number of indexes required to match any possible combination of queries would make the index grow exponentially.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

score 0 · Answer 1 · 2026-05-11T00:09:44+00:00

For joins like this on large data sets, a modern RDBMS will often use an algorithm called a list merge. Using your example:

Prepare a list, A, of all authors who live in Chicago and sort them by author in O(Nlog(N)) time.*
Prepare a list, B, of all (author, book name) pairs and sort them by author in O(Mlog(M)) time.*
Place these two lists ‘side by side’, and compare the authors from the ‘top’ (lexicographically minimum) element in each pile.
- Are they the same? If so:
  - Output the (author, book name) pair from top(B)
  - Remove the top element of the B pile
  - Goto 3.
- Otherwise, is top(A).author < top(B).author? If so:
  - Remove the top element of the A pile
  - Goto 3.
- Otherwise, it must be that top(A).author > top(B).author:
  - Remove the top element of the B pile
  - Goto 3.

* (Or O(0) time if the table is already sorted by author, or has an index which is.)

The loop continues removing one item at a time until both piles are empty, thus taking O(N + M) steps, where N and M are the sizes of piles A and B respectively. Because the two ‘piles’ are sorted by author, this algorithm will discover every matching pair. It does not require an index (although the presence of indexes may remove the need for one or both sort operations at the start).

Note that the RDBMS may well choose a different algorithm (e.g. the simple one you mentioned) if it estimates that it would be faster to do so. The RDBMS’s query analyser generally estimates the costs in terms of disk accesses and CPU time for many thousands of different approaches, possibly taking into account such information as the statistical distributions of values in the relevant tables, and selects the best.

How to approach applying for a job at a company ...

How to handle personal stress caused by utterly incompetent and ...

What is a programmer’s life like?

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions