What parallel algorithms could I use to generate random permutations from a given set?
Especially proposals or links to papers suitable for CUDA would be helpful.
A sequential version of this would be the Fisher-Yates shuffle.
Example:
Let S={1, 2, …, 7} be the set of source indices.
The goal is to generate n random permutations in parallel.
Each of the n permutations contains each of the source indices exactly once,
e.g. {7, 6, …, 1}.
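For reference, the sequential Fisher-Yates shuffle mentioned above can be sketched as follows. This is a minimal host-side C++ sketch, not CUDA code; the function name random_permutation and the choice of std::mt19937 as the generator are assumptions made for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Sequential Fisher-Yates shuffle (hypothetical helper name).
// Returns a random permutation of the indices 1..m; every permutation is
// equally likely as long as the generator itself is uniform.
std::vector<int> random_permutation(int m, std::mt19937 &rng) {
    assert(m >= 0);
    std::vector<int> perm(m);
    std::iota(perm.begin(), perm.end(), 1);  // 1, 2, ..., m
    for (int i = m - 1; i > 0; --i) {
        // Pick j uniformly from [0, i] and move perm[j] into its final slot i.
        std::uniform_int_distribution<int> pick(0, i);
        std::swap(perm[i], perm[pick(rng)]);
    }
    return perm;
}
```

Calling random_permutation(7, rng) n times with the same generator gives n permutations of S; the loop over i is inherently sequential, which is what makes a parallel variant interesting.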
Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle a vector of 8 elements. On the first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on the second iteration 0<->2, 1<->3, 4<->6, 5<->7; and on the last iteration 0<->4, 1<->5, 2<->6, 3<->7. Each swap is performed or skipped at random.
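The pairing schedule described above can be simulated on the host as a rough sketch. This is plain C++ rather than the CUDA code of the answer; the names pair_low and butterfly_shuffle, the fair coin per pair, and the use of std::mt19937 are assumptions for illustration. On a GPU each iteration of the inner loop would correspond to one thread, with a barrier between rounds.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Lower index of the pair handled by `worker` when partners are `shift`
// apart (shift is a power of two). Inserts a 0 bit at the position of
// `shift` into the worker id, so the pairs of one round never overlap.
std::size_t pair_low(std::size_t worker, std::size_t shift) {
    std::size_t below = worker & (shift - 1);
    std::size_t above = worker & ~(shift - 1);
    return (above << 1) | below;
}

// Host-side simulation: size/2 workers, log2(size) rounds, and each worker
// swaps its pair with probability 1/2 (an assumed choice of coin).
void butterfly_shuffle(std::vector<int> &v, std::mt19937 &rng) {
    const std::size_t n = v.size();
    assert(n >= 2 && (n & (n - 1)) == 0);  // length must be a power of two
    std::bernoulli_distribution coin(0.5);
    for (std::size_t shift = 1; shift < n; shift <<= 1) {
        for (std::size_t w = 0; w < n / 2; ++w) {  // one "thread" per pair
            std::size_t lo = pair_low(w, shift);
            if (coin(rng)) std::swap(v[lo], v[lo + shift]);
        }
    }
}
```

For 8 elements, pair_low reproduces the pairs listed above: shift 1 gives 0<->1, 2<->3, 4<->5, 6<->7, shift 2 gives 0<->2, 1<->3, 4<->6, 5<->7, and shift 4 gives 0<->4, 1<->5, 2<->6, 3<->7. One caveat worth keeping in mind: with 4 workers and 3 rounds only 12 coin flips are made, so at most 2^12 = 4096 of the 8! = 40320 permutations can be produced, and the result is not a uniform shuffle in the Fisher-Yates sense.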
This could be easily implemented as CUDA __device__ code (inspired by standard min/max reduction):
Here the curand initialization code is omitted, and the method swap(int *p, int i, int j) exchanges the values p[i] and p[j].
Note that the code above has the following assumptions:
… __shared__ memory of CUDA device
To generate more than one permutation I would suggest utilizing different CUDA blocks. If the goal is to make a permutation of 7 elements (as mentioned in the original question), then I believe it will be faster to do it in a single thread.