Given an array of n word-frequency pairs:
[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]
where wi is a word, fi is an integer frequencey, and the sum of the frequencies ∑fi = m,
I want to use a pseudo-random number generator (pRNG) to select p words wj0, wj1, ..., wjp-1 such that
the probability of selecting any word is proportional to its frequency:
P(wi = wjk) = P(i = jk) = fi / m
(Note, this is selection with replacement, so the same word could be chosen every time).
I’ve come up with three algorithms so far:
-
Create an array of size
m, and populate it so the firstf0entries arew0, the nextf1entries arew1, and so on, so the lastfp-1entries arewp-1.[ w0, ..., w0, w1,..., w1, ..., wp-1, ..., wp-1 ]
Then use the pRNG to select
pindices in the range0...m-1, and report the words stored at those indices.
This takesO(n + m + p)work, which isn’t great, sincemcan be much much larger than n. -
Step through the input array once, computing
mi = ∑h≤ifh = mi-1 + fi
and after computing
mi, use the pRNG to generate a numberxkin the range0...mi-1for eachkin0...p-1
and selectwiforwjk(possibly replacing the current value ofwjk) ifxk < fi.
This requiresO(n + np)work. - Compute
mias in algorithm 2, and generate the following array on n word-frequency-partial-sum triples:[ (w0, f0, m0), (w1, f1, m1), ..., (wn-1, fn-1, mn-1) ]
and then, for each k in
0...p-1, use the pRNG to generate a numberxkin the range0...m-1then do binary search on the array of triples to find theis.t.mi-fi ≤ xk < mi, and selectwiforwjk.
This requiresO(n + p log n)work.
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:
npartitions, all of the same widthrs.t.nr = m.wi,fi = ∑partitions t s.t wi ∈ t r × ratio(t,wi)Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from
0...n-1at random), and the partition’s ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means thepselections can be done inO(p)work, given such a partition.The reason that such a partitioning exists is that there exists a word
wis.t.fi < r, if and only if there exists a wordwi's.t.fi' > r, since r is the average of the frequencies.Given such a pair
wiandwi'we can replace them with a pseudo-wordw'iof frequencyf'i = r(that representswiwith probabilityfi/randwi'with probability1 - fi/r) and a new wordw'i'of adjusted frequencyf'i' = fi' - (r - fi)respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words which are the desired partition.To construct this partition in
O(n)time,This actually still works if the number of partitions
q > n(you just have to prove it differently). If you want to make sure that r is integral, and you can’t easily find a factorqofms.t.q > n, you can pad all the frequencies by a factor ofn, sof'i = nfi, which updatesm' = mnand setsr' = mwhenq = n.In any case, this algorithm only takes
O(n + p)work, which I have to think is optimal.In ruby: