I need to take a random sample without replacement (each element only occurring once in the sample) from a longer list. I’m using the code below, but now I’d like to know:
- Is there a library function that does this?
- How can I improve this code? (I’m a Haskell beginner, so this would be useful even if there is a library function).
The purpose of the sampling is to be able to generalize findings from analyzing the sample to the population.
import System.Random

-- | Take a random sample without replacement of size size from a list.
takeRandomSample :: Int -> Int -> [a] -> [a]
takeRandomSample seed size xs
  | size < hi = subset xs rs
  | otherwise = error "Sample size must be smaller than population."
  where
    rs = randomSample seed size lo hi
    lo = 0
    hi = length xs - 1

getOneRandomV g lo hi = randomR (lo, hi) g

rsHelper size lo hi g x acc
  | x `notElem` acc && length acc < size = rsHelper size lo hi new_g new_x (x:acc)
  | x `elem` acc && length acc < size = rsHelper size lo hi new_g new_x acc
  | otherwise = acc
  where (new_x, new_g) = getOneRandomV g lo hi

-- | Get a random sample without replacement of size size between lo and hi.
randomSample seed size lo hi = rsHelper size lo hi g x [] where
  (x, g) = getOneRandomV (mkStdGen seed) lo hi

subset l = map (l !!)
Here’s a quick ‘back-of-the-envelope’ implementation of what Daniel Fischer suggested in his comment, using my preferred PRNG (mwc-random):
This is pretty much a (terse) functional rewrite of R’s internal C version of sample() as it’s called without replacement. sample is just a wrapper over a recursive worker function that incrementally shuffles the population until the desired sample size is reached, returning only that many shuffled elements. Writing the function like this ensures that GHC can inline it.

It’s easy to use:
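A minimal, self-contained sketch of the approach described above, together with a sample call. It is a reconstruction under assumptions, not the original listing: it assumes the mwc-random and containers packages are available, and the names go, picked and the error message are illustrative.

```haskell
{-# LANGUAGE BangPatterns #-}

import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Foldable (toList)
import qualified Data.Sequence as Seq
import System.Random.MWC (Gen, createSystemRandom, uniformR)

-- | Sketch: draw @size@ elements without replacement using a partial
-- Fisher-Yates shuffle over a Data.Sequence.
sample :: PrimMonad m => [a] -> Int -> Gen (PrimState m) -> m [a]
sample ys size gen
  | size < 0 || size > l = error "sample: size must be between 0 and the population size"
  | otherwise            = go 0 (l - 1) (Seq.fromList ys)
  where
    l = length ys
    -- n counts the elements fixed so far; i is the last unshuffled index.
    go !n !i xs
      | n >= size = return (toList (Seq.drop (l - size) xs))
      | otherwise = do
          j <- uniformR (0, i) gen            -- inclusive range for Int
          let xi  = Seq.index xs i
              xj  = Seq.index xs j
              xs' = Seq.update i xj (Seq.update j xi xs)  -- swap i and j
          go (n + 1) (i - 1) xs'
{-# INLINE sample #-}

main :: IO ()
main = do
  gen    <- createSystemRandom
  picked <- sample [1 .. 100 :: Int] 10 gen
  print picked
```

Because the generator is passed explicitly and the function is polymorphic in the PrimMonad, the same sketch should work in both IO and ST.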
A production version might want to use something like a mutable vector instead of Data.Sequence in order to cut down on time spent doing GC.
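As a rough illustration of that suggestion, the same partial shuffle can be done in place on a mutable vector. This is only a sketch under assumptions: it assumes the vector and mwc-random packages, sampleMV is a hypothetical name, and no benchmarking has been done to confirm how much GC time it actually saves.

```haskell
import Control.Monad (forM_)
import Control.Monad.Primitive (PrimMonad, PrimState)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV
import System.Random.MWC (Gen, createSystemRandom, uniformR)

-- | Sketch (hypothetical name): partial Fisher-Yates shuffle that swaps
-- elements in place in a mutable copy of the population.
sampleMV :: PrimMonad m => V.Vector a -> Int -> Gen (PrimState m) -> m (V.Vector a)
sampleMV pop size gen
  | size < 0 || size > l = error "sampleMV: size must be between 0 and the population size"
  | otherwise = do
      mv <- V.thaw pop                     -- mutable copy; the input is untouched
      forM_ [0 .. size - 1] $ \i -> do
        j <- uniformR (i, l - 1) gen       -- pick from the not-yet-chosen suffix
        MV.swap mv i j
      V.freeze (MV.take size mv)           -- the first `size` slots hold the sample
  where
    l = V.length pop

main :: IO ()
main = do
  gen    <- createSystemRandom
  picked <- sampleMV (V.fromList [1 .. 100 :: Int]) 10 gen
  print (V.toList picked)
```

Only the first size positions are ever shuffled, so the work after the initial copy is proportional to the sample size rather than the population size.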