I need to explain to the client why dupes are showing up between 2 supposedly different exams. It’s been 20 years since Prob and Stats.
I have a generated multiple-choice exam. There are 192 questions in the database, and 100 are chosen at random for each exam (no dupes within an exam).
Obviously, there is a 100% chance of at least 8 dupes between any two exams generated this way: two sets of 100 drawn from 192 must share at least 100 + 100 - 192 = 8 questions (pigeonhole principle).
How do I calculate the probability of there being 25 dupes? 50 dupes? 75 dupes?
Edit after the fact: I ran this through Excel, summing the probabilities of exactly k dupes for k from n to 100. For this particular problem, the probabilities were:
n     P(n+ dupes)
40    97.5%
52    ~50%
61    ~0
Erm, this is really really hazy for me. But there are (192 choose 100) possible exams, right?
And there are (100 choose N) ways of picking N dupes, each with (92 choose 100-N) ways of picking the rest of the questions, no?
So isn’t the probability of picking N dupes just:
(100 choose N) * (92 choose 100-N) / (192 choose 100)
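For anyone who wants to check this without a spreadsheet, here is a minimal Python sketch of that fraction. The function name `p_exact_dupes` and its `pool` and `exam` parameters are made up for illustration, and it assumes both exams are drawn independently and uniformly from the same 192 questions.

```python
from math import comb


def p_exact_dupes(n, pool=192, exam=100):
    """Probability that two exams share exactly n questions.

    Hypothetical helper: assumes each exam is `exam` distinct questions
    drawn uniformly at random from the same `pool` questions.
    """
    if n < 0 or n > exam:
        return 0.0
    # Choose the n shared questions from the first exam's 100, then the
    # remaining exam - n questions from the 92 the first exam did not use.
    # math.comb returns 0 when asked for more items than are available,
    # so impossible overlaps (fewer than 8 here) come out as 0.
    return comb(exam, n) * comb(pool - exam, exam - n) / comb(pool, exam)


for n in (25, 50, 75):
    print(n, p_exact_dupes(n))
```

Python integers are arbitrary precision, so the huge binomial coefficients are computed exactly and only the final division is floating point.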
EDIT: So if you want the chance of N or more dupes instead of exactly N, sum the numerator of that fraction over every dupe count from N up to 100, then divide the total by (192 choose 100).
Errrr, maybe…
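The "N or more" sum can be sketched the same way. This is only an illustration under the same assumptions (two independent, uniformly random exams of 100 from 192); `p_at_least` is a hypothetical name, not something from the original post.

```python
from math import comb


def p_at_least(n, pool=192, exam=100):
    """Probability that two exams share n or more questions.

    Hypothetical helper: sums the numerators for every overlap k from n
    up to `exam`, then divides once by (pool choose exam).
    """
    start = max(n, 0)
    numerator = sum(
        comb(exam, k) * comb(pool - exam, exam - k)
        for k in range(start, exam + 1)
    )
    return numerator / comb(pool, exam)


for n in (25, 50, 75):
    print(n, p_at_least(n))
```

As a sanity check, `p_at_least(8)` should come out as exactly 1, which matches the pigeonhole argument in the question.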