Consider a problem where a random sublist of k items, Y, must be selected from X, a list of n items, where the items in Y must appear in the same order as they do in X. The selected items in Y need not be distinct. One solution is this:
for i = 1 to k
A[i] = floor(rand * n) + 1
Y[i] = X[A[i]]
sort Y according to the ordering of A
However, this has running time O(k log k) due to the sort operation. To remove this it’s tempting to
high_index = n
for i = 1 to k
index = floor(rand * high_index) + 1
Y[k - i + 1] = X[index]
high_index = index
But this gives a clear bias to the returned list due to the uniform index selection. It feels like a O(k) solution is attainable if the indices in the second solution were distributed non-uniformly. Does anyone know if this is the case, and if so what properties the distribution the marginal indices are drawn from has?
For the first index in Y, the distribution of indices in X is given by:
P(x; n, k) = binomial(n – x + k – 2, k – 1) / norm
where binomial denotes calculation of the binomial coefficient, and norm is a normalisation factor, equal to the total number of possible sublist configurations.
norm = binomial(n + k – 1, k)
So for k = 5 and n = 10 we have:
We can sample the X index of the first item in Y from this distribution (call it x1). The distribution of the second index in Y can then be sampled in the same way with P(x; (n – x1), (k – 1)), and so on for all subsequent indices.
My feeling now is that the problem is not solvable in O(k), because in general we are unable to sample from the distribution described in constant time. If k = 2 then we can solve in constant time using the quadratic formula (because the probability function simplifies to 0.5(x^2 + x)) but I can’t see a way to extend this to all k (my maths isn’t great though).