I’m working on an algorithm that has to do a small number
of operations on a large number of small arrays, somewhat independently.
To give an idea:
- 1k sorts of arrays, typically 0.5k-1k elements long.
- 1k LU solves of matrices of rank 10-20.
Everything is in floats.
The problem is also highly parallel: the above
operations have to be carried out independently on 10k arrays.
Also, the intermediate results need not be stored: for example, I don’t
need to keep the sorted arrays, only the sum of the smallest $m$ elements.
The whole thing has been programmed in C++ and runs. My question is:
would you expect a problem like this to enjoy significant speedups
(a factor of 2 or more) with CUDA?
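For reference, the "sum of the smallest $m$ elements" step can be sketched on the CPU without a full sort. This is only an illustration, not the code from the question: the function name sum_smallest is hypothetical, and it assumes the order of the $m$ smallest values does not matter, so std::nth_element (a partial partition) can stand in for the sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sum of the m smallest values in `data`.
// The vector is taken by value, so the caller's array is left untouched
// and no sorted copy has to be kept afterwards.
float sum_smallest(std::vector<float> data, std::size_t m) {
    assert(m <= data.size());
    if (m < data.size()) {
        // Partition so that the m smallest values occupy [begin, begin + m),
        // in unspecified order. Average cost is linear in the array length.
        std::nth_element(data.begin(), data.begin() + m, data.end());
    }
    // Add up the first m values; if m == data.size() this is the whole array.
    return std::accumulate(data.begin(), data.begin() + m, 0.0f);
}
```

If the full sorted order is needed elsewhere in the algorithm, this shortcut does not apply and a real sort is still required.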
You can run this in 5 lines of ArrayFire code. I’m getting speedups of ~6X over the CPU and ~4X over Thrust (which was designed for vectors, not matrices). Since you’re only using a single GPU, you can run the free version of ArrayFire.
Keep in mind that GPUs perform best when memory accesses are aligned to multiples of 32, so a bunch of 32×32 matrices will perform better than a bunch of 31×31.
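One way to act on the multiple-of-32 advice is to pad the storage of each small matrix on the host before uploading it. The following is a minimal sketch under stated assumptions: round_up and pad_rows are hypothetical helper names, the matrices are dense, square and row-major, and only the row stride is padded while the logical size stays n. Whether this actually helps depends on how the GPU library lays out and accesses the data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: round n up to the next multiple of `multiple`.
std::size_t round_up(std::size_t n, std::size_t multiple) {
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

// Hypothetical helper: copy a dense row-major n x n matrix into a
// zero-filled buffer whose row stride is a multiple of 32 floats.
// The chosen stride is returned through `stride`.
std::vector<float> pad_rows(const std::vector<float>& a, std::size_t n,
                            std::size_t& stride) {
    assert(a.size() == n * n);
    stride = round_up(n, 32);
    std::vector<float> out(n * stride, 0.0f);
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t c = 0; c < n; ++c) {
            out[r * stride + c] = a[r * n + c];  // padding columns stay zero
        }
    }
    return out;
}
```

For a 31×31 matrix this gives a stride of 32, so every row starts at a multiple of 32 floats from the start of the buffer. The padding columns are not part of the matrix and should not be passed to the solver as data.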