Given an array of n word-frequency pairs: [ (w 0 , f 0 ),

Question


Asked: May 11, 2026 (2026-05-11T20:32:23+00:00)


Given an array of n word-frequency pairs:

[ (w₀, f₀), (w₁, f₁), ..., (w_{n-1}, f_{n-1}) ]

where w_i is a word, f_i is an integer frequency, and the sum of the frequencies ∑f_i = m,

I want to use a pseudo-random number generator (pRNG) to select p words w_{j_0}, w_{j_1}, ..., w_{j_{p-1}} such that
the probability of selecting any word is proportional to its frequency:

P(w_i = w_{j_k}) = P(i = j_k) = f_i / m

(Note: this is selection with replacement, so the same word could be chosen every time.)

I’ve come up with three algorithms so far:

1. Create an array of size m, and populate it so the first f₀ entries are w₀, the next f₁ entries are w₁, and so on, so the last f_{n-1} entries are w_{n-1}.
```
[ w₀, ..., w₀, w₁, ..., w₁, ..., w_{n-1}, ..., w_{n-1} ]
```
Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices.
This takes O(n + m + p) work, which isn’t great, since m can be much, much larger than n.

2. Step through the input array once, computing
```
m_i = ∑_{h ≤ i} f_h = m_{i-1} + f_i
```
and after computing m_i, use the pRNG to generate a number x_k in the range 0...m_i-1 for each k in 0...p-1,
and select w_i for w_{j_k} (possibly replacing the current value of w_{j_k}) if x_k < f_i.
This requires O(n + np) work.

3. Compute m_i as in algorithm 2, and generate the following array of n word-frequency-partial-sum triples:
```
[ (w₀, f₀, m₀), (w₁, f₁, m₁), ..., (w_{n-1}, f_{n-1}, m_{n-1}) ]
```
and then, for each k in 0...p-1, use the pRNG to generate a number x_k in the range 0...m-1, then do a binary search on the array of triples to find the i s.t. m_i - f_i ≤ x_k < m_i, and select w_i for w_{j_k}.
This requires O(n + p log n) work.
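
For concreteness, the binary-search approach (algorithm 3) could be sketched in Ruby roughly as follows. This is only an illustration under some assumptions: the method name `sample_by_prefix_sums` and the optional `rng` parameter are made up for this example, the input is taken to be an array of `[word, freq]` pairs, and the total frequency m is assumed to be positive.

```ruby
# Sketch of algorithm 3: partial sums plus binary search.
# Assumes `input` is an array of [word, freq] pairs with non-negative
# integer frequencies whose total m is positive. `rng` is a hypothetical
# parameter that makes runs reproducible.
def sample_by_prefix_sums(input, p, rng = Random.new)
  # Build the partial sums m_0, m_1, ..., m_{n-1} in O(n).
  total = 0
  partial_sums = input.map { |_word, freq| total += freq }

  Array.new(p) do
    x = rng.rand(total) # uniform integer in 0...m
    # Smallest i with x < m_i, i.e. the i with m_i - f_i <= x < m_i.
    i = partial_sums.bsearch_index { |m_i| x < m_i }
    input[i][0]
  end
end
```

Only the partial sums are kept here rather than full triples, since the binary search needs nothing else. A call might look like `sample_by_prefix_sums([["a", 1], ["b", 2], ["c", 3]], 5)`.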

My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?


1 Answer

Editorial Team · Answer 1 · 2026-05-11T20:32:23+00:00

Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:

- There are n partitions, all of the same width r s.t. nr = m.
- Each partition contains one or two words in some ratio (which is stored with the partition).
- For each word w_i, f_i = ∑_{partitions t s.t. w_i ∈ t} r × ratio(t, w_i).

Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from 0...n-1 at random), and the partition’s ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.
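
To make the constant-work selection concrete, here is a small sketch that uses a hand-built partition table for the frequencies a: 1, b: 2, c: 3 (so n = 3, m = 6, r = 2). The table layout `[first_word, second_word, first_share]` and the method names `draw` and `recovered_frequencies` are assumptions chosen for this illustration, not part of any library.

```ruby
# Hand-built partition table for frequencies a: 1, b: 2, c: 3.
# Each entry is [first_word, second_word, first_share], where
# first_share (out of r) is the part of the partition owned by first_word.
r = 2
partitions = [
  ["a", "c", 1], # a owns 1 of 2, c owns the remaining 1
  ["b", nil, 2], # b owns the whole partition
  ["c", nil, 2]  # c owns the whole partition
]

# One draw: pick a partition uniformly, then pick a side. Both steps are O(1).
def draw(partitions, r, rng)
  first, second, first_share = partitions[rng.rand(partitions.size)]
  rng.rand(r) < first_share ? first : second
end

# Recover each word's frequency by summing the shares it owns,
# which is the third property of the partition listed above.
def recovered_frequencies(partitions, r)
  totals = Hash.new(0)
  partitions.each do |first, second, first_share|
    totals[first] += first_share
    totals[second] += r - first_share if second
  end
  totals
end

rng = Random.new(2026)
samples = Array.new(5) { draw(partitions, r, rng) }
frequencies = recovered_frequencies(partitions, r)
```

Here `frequencies` maps each word back to its original frequency, which is a cheap way to check that a table is a valid partition before sampling from it.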

The reason that such a partitioning exists is that there exists a word w_i s.t. f_i < r, if and only if there exists a word w_i' s.t. f_i' > r, since r is the average of the frequencies.

Given such a pair w_i and w_i' we can replace them with a pseudo-word w'_i of frequency f'_i = r (that represents w_i with probability f_i/r and w_i' with probability 1 - f_i/r) and a new word w'_i' of adjusted frequency f'_i' = f_i' - (r - f_i) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words which are the desired partition.
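
A single pairing step from the paragraph above might look like this in Ruby. It is a sketch only: the method name `pair_step` and the `[word, freq]` pair representation are assumptions made for the example, and the ratio is kept as an exact `Rational`.

```ruby
# One pairing step: `small` has frequency < r and `large` has frequency > r.
# Returns the finished pseudo-word and the large word with its adjusted frequency.
def pair_step(small, large, r)
  small_word, small_freq = small
  large_word, large_freq = large

  # The pseudo-word has total frequency r: small_word with probability
  # small_freq / r, large_word with the remaining probability.
  pseudo_word = [small_word, large_word, Rational(small_freq, r)]

  # The large word gives up (r - small_freq) of its frequency.
  leftover = [large_word, large_freq - (r - small_freq)]

  [pseudo_word, leftover]
end

pseudo_word, leftover = pair_step(["a", 1], ["c", 3], 2)
```

With frequencies a: 1 and c: 3 and r = 2, the pseudo-word represents a half of the time, and c goes back into the pool with frequency 2.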

To construct this partition in O(n) time:

1. Go through the list of the words once, constructing two lists:
   - one of words with frequency ≤ r
   - one of words with frequency > r
2. Then repeatedly pull a word from the first list until n partitions have been built:
   - if its frequency = r, then make it into a one-element partition
   - otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.

This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can’t easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'_i = nf_i, which updates m' = mn and sets r' = m when q = n.
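
As a small illustration of the padding trick for q = n, the following sketch scales a set of frequencies whose average is not an integer. The method name `pad_frequencies` is hypothetical.

```ruby
# Scale every frequency by n so that the partition width becomes integral.
# Returns the padded frequencies and the new width r' (which equals m).
def pad_frequencies(freqs)
  n = freqs.size
  padded = freqs.map { |f| f * n } # f'_i = n * f_i
  padded_r = padded.sum / n        # m' / n = (m * n) / n = m
  [padded, padded_r]
end

# m = 7 and n = 3, so r = 7/3 is not an integer before padding.
padded, padded_r = pad_frequencies([1, 2, 4])
```

The relative proportions of the words are unchanged by the scaling, so sampling against the padded frequencies gives the same distribution.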

In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

In Ruby:

```ruby
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.sum { |_word, freq| freq }

  # Pad each frequency by a factor of n so it stays integral when
  # subdivided, then split the words into those with padded frequency
  # at or below the average (m) and those above it.
  lessers, greaters = input
    .map { |word, freq| [word, freq * n] }
    .partition { |_word, adj_freq| adj_freq <= m }

  partitions = Array.new(n) do
    word, adj_freq = lessers.shift

    other_word = if adj_freq < m
                   # use part of another word's frequency to pad
                   # out the partition
                   other_word, other_adj_freq = greaters.shift
                   other_adj_freq -= (m - adj_freq)
                   (other_adj_freq <= m ? lessers : greaters) << [other_word, other_adj_freq]
                   other_word
                 end

    [word, other_word, adj_freq]
  end

  Array.new(p) do
    # pick a partition at random
    word, other_word, adj_freq = partitions[rand(n)]
    # select the first word in the partition with appropriate probability
    rand(m) < adj_freq ? word : other_word
  end
end
```
