Imagine we have four symbols: ‘a’, ‘b’, ‘c’, ‘d’. We also have four given probabilities of those symbols appearing in the function output: P1, P2, P3, P4 (which sum to 1). How would one implement a function that generates a random sample of three of those symbols, such that each symbol appears in the output with its specified probability?
Example: ‘a’, ‘b’, ‘c’ and ‘d’ have the probabilities of 9/30, 8/30, 7/30 and 6/30 respectively. The function outputs various random samples of any three out of those four symbols: ‘abc’, ‘dca’, ‘bad’ and so on. We run this function many times, counting the number of times each symbol is encountered in its output. In the end, the count for ‘a’ divided by the total number of symbols output should converge to 9/30, for ‘b’ to 8/30, for ‘c’ to 7/30, and for ‘d’ to 6/30.
E.g. the function generates 10 outputs:
adc
dab
bca
dab
dba
cab
dcb
acd
cab
abc
which, out of 30 symbols, contain 9 of ‘a’, 8 of ‘b’, 7 of ‘c’ and 6 of ‘d’. This is an idealized example, of course, as the values would only converge when the number of samples is much larger, but it should hopefully convey the idea.
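To make the counting concrete, here is a minimal sketch that tallies the ten outputs listed above and turns the counts into frequencies. The variable names are only illustrative.

```python
from collections import Counter
from fractions import Fraction

# The ten example outputs listed above.
outputs = ["adc", "dab", "bca", "dab", "dba",
           "cab", "dcb", "acd", "cab", "abc"]

# Count every symbol across all outputs.
counts = Counter("".join(outputs))
total = sum(counts.values())

# Observed frequency of each symbol among all symbols output.
frequencies = {symbol: Fraction(count, total) for symbol, count in counts.items()}

for symbol in sorted(frequencies):
    print(symbol, counts[symbol], frequencies[symbol])
```

The same tally, run over a much larger number of outputs, is what should converge to the requested probabilities.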
Obviously, all of this is only possible when none of the probabilities is larger than 1/3, since each single sample output would always contain three distinct symbols. It is OK for the function to enter an infinite loop or otherwise behave erratically if it’s impossible to satisfy the values provided.
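A small sketch of that feasibility check, generalised to a sample size k (so the bound becomes 1/k). The function name `is_feasible` is made up for illustration, and exact fractions are assumed as input so the comparisons are not disturbed by floating-point rounding.

```python
from fractions import Fraction


def is_feasible(probs, k):
    """Check the condition described above for a sample of k distinct symbols.

    probs: list of Fractions, one per symbol (hypothetical input format).
    A symbol can occur at most once per sample, so it can make up at most
    1/k of all symbols output.
    """
    if k < 1 or len(probs) < k:
        return False
    if sum(probs) != 1:
        return False
    return all(0 <= p <= Fraction(1, k) for p in probs)


example = [Fraction(9, 30), Fraction(8, 30), Fraction(7, 30), Fraction(6, 30)]
print(is_feasible(example, 3))
```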
Note: the function should obviously use an RNG, but should otherwise be stateless. Each new invocation should be independent from any of the previous ones, except for the RNG state.
EDIT: Even though the description mentions choosing 3 out of 4 values, ideally the algorithm should be able to cope with any sample size.
Your problem is underdetermined.
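If we assign a probability to each string of three letters that we allow, p(abc), p(abd), p(acd) etc., we can generate a series of equations, one per symbol: the probabilities of all strings containing that symbol must add up to the chance that the symbol appears in a sample (3·P1 for ‘a’ when three symbols are drawn), and all the string probabilities together must add up to 1.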
This has more unknowns than equations, so there are many ways of solving it. Once a solution is found, by whatever method you choose (use the simplex algorithm if you are me), sample from the probabilities of each string using the roulette method that @alestanis describes.
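As an illustration only, here is a rough sketch for the special case where the sample size is one less than the number of symbols (3 out of 4). If order is ignored, each symbol is missing from exactly one of the possible samples, so the sample that omits a symbol must have probability 1 - k·P for that symbol, and no simplex run is needed. The function names are invented for this sketch, treating samples as unordered and shuffling the letters afterwards is an assumption, and the draw uses ordinary cumulative-sum (roulette wheel) selection. A general sample size would still need a solver.

```python
import random
from fractions import Fraction
from itertools import combinations


def solve_subset_probs(probs, k):
    """Return {subset: probability} for the special case k == n - 1.

    probs: dict mapping symbol -> Fraction, summing to 1 (assumed input format).
    """
    symbols = list(probs)
    if k != len(symbols) - 1:
        raise ValueError("this sketch only covers k == number of symbols - 1")
    if sum(probs.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    q = {}
    for subset in combinations(symbols, k):
        # Exactly one symbol is left out of each subset.
        (missing,) = set(symbols) - set(subset)
        # The missing symbol appears in every other subset, so
        # 1 - q[subset] must equal k * probs[missing].
        q[subset] = 1 - k * probs[missing]
        if q[subset] < 0:
            raise ValueError("a probability exceeds 1/k; no solution exists")
    return q


def roulette(q, rng):
    """Pick one key of q with probability equal to its value."""
    r = rng.random()
    cumulative = Fraction(0)
    outcome = None
    for outcome, p in q.items():
        cumulative += p
        if r < cumulative:
            return outcome
    return outcome  # fallback; not reached when the values sum to 1


def draw_sample(q, rng):
    """Draw one subset, then shuffle it so the letter order is random too."""
    letters = list(roulette(q, rng))
    rng.shuffle(letters)
    return "".join(letters)


if __name__ == "__main__":
    example = {"a": Fraction(9, 30), "b": Fraction(8, 30),
               "c": Fraction(7, 30), "d": Fraction(6, 30)}
    q = solve_subset_probs(example, 3)
    rng = random.Random()
    print([draw_sample(q, rng) for _ in range(10)])
```

For the example probabilities this gives 12/30 for the sample without ‘d’, 9/30 without ‘c’, 6/30 without ‘b’ and 3/30 without ‘a’. The only state kept between calls is the RNG.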
EDIT: Python code