I have a collection which I’d like to split by an arbitrary percentage. The actual problem I’m trying to solve is to split a dataset into a training and cross-validation set.
The destination of each element should be chosen at random, but each source element should appear exactly once in the result, and the partition sizes are fixed. If the source collection contains duplicates, they may end up in the same partition or in different ones.
I have this implementation:
(defn split-shuffled
  "Returns a 2 element vector partitioned by the percentage
  specified by p. Elements are selected at random. Each
  element of the source collection will appear only once in
  the result."
  [c p]
  (let [m    (count c)
        idxs (into #{} (take (* m p) (shuffle (range m))))
        afn  (fn [i x] (if (idxs i) x))
        bfn  (fn [i x] (if-not (idxs i) x))]
    [(keep-indexed afn c) (keep-indexed bfn c)]))
repl> (split-shuffled (range 10) 0.2)
[(4 6) (0 1 2 3 5 7 8 9)]
repl> (split-shuffled (range 10) 0.4)
[(1 4 6 7) (0 2 3 5 8 9)]
But I’m not happy that keep-indexed is called twice.
How can this be improved?
EDIT: I originally wanted to keep the order in the partitions, but I dropped that requirement without re-thinking, so @mikera’s solution is correct!
Why do you need the indexes at all?
Just shuffle the collection directly:
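A minimal sketch of that idea: shuffle the collection once, then cut it at the desired size with split-at. Rounding up with Math/ceil is an assumption, chosen to match how take treats a fractional count in the question's version. The sketch also assumes c is something shuffle accepts, such as a vector, list, or range.

```clojure
(defn split-shuffled
  "Returns a 2 element vector [a b], where a holds roughly the
  fraction p of c (rounded up) and b holds the rest. Elements are
  assigned at random and the original order is not preserved."
  [c p]
  (let [shuffled (shuffle c)
        ;; number of elements in the first partition, rounded up
        n        (long (Math/ceil (double (* p (count shuffled)))))]
    (split-at n shuffled)))
```

Since every element of the shuffled collection lands on exactly one side of the cut, each source element appears once in the result, and the collection is traversed without any index bookkeeping. The contents of each partition vary from call to call, so only the sizes are predictable: (split-shuffled (range 10) 0.2) yields partitions of 2 and 8 elements.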