I’m working on an algorithm that has to do a small number of operations


Asked: June 8, 2026


I’m working on an algorithm that has to do a small number
of operations on a large number of small arrays, somewhat independently.

To give an idea:

1k sorts of arrays, each typically 0.5k-1k elements long.
1k LU solves of matrices of rank 10-20.

Everything is in floats.

There is also a horizontal dimension to this problem: the above
operations have to be carried out independently on 10k arrays.

Also, the intermediate results need not be stored: for example, I don’t
need to keep the sorted arrays, only the sum of the smallest $m$ elements.

The whole thing has been programmed in C++ and runs. My question is:
would you expect a problem like this to enjoy a significant speedup
(factor 2 or more) with CUDA?


1 Answer

Editorial Team · Answer 1 · 2026-06-08T01:26:34+00:00

You can run this in 5 lines of ArrayFire code. I’m getting speedups of ~6X over the CPU and ~4X over Thrust (which was designed for vectors, not matrices). Since you’re only using a single GPU, you can run the free version of ArrayFire.

array x = randu(512,1000,f32);
array y = sort(x); // sort each 512-element column independently
array a = randu(15,15,1000,f32), b; // new names: x and y are already declared above
gfor (array i, a.dim(2))
  b(span,span,i) = lu(a(span,span,i)); // LU-decomposition of each 15x15 matrix
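The question only needs the sum of the smallest $m$ elements of each array, not the sorted array itself. A plain C++ reference for that reduction can be handy for checking GPU output against the existing CPU code. The sketch below is only an illustration: `sum_of_smallest` is a hypothetical helper (not an ArrayFire function), and it swaps the full sort for `std::nth_element`, which is my substitution rather than something the question prescribes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of ArrayFire): sums the m smallest values of
// one column. The column is taken by value because nth_element reorders it.
float sum_of_smallest(std::vector<float> column, std::size_t m) {
    assert(m <= column.size());
    const auto mid = column.begin() + static_cast<std::ptrdiff_t>(m);
    // Partition so that the m smallest values occupy the first m slots
    // (in unspecified order); no full sort is needed for a sum.
    std::nth_element(column.begin(), mid, column.end());
    return std::accumulate(column.begin(), mid, 0.0f);
}
```

Running this on a handful of columns and comparing against the GPU result (within a float tolerance, since summation order differs) is a cheap sanity check before timing anything.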

Keep in mind that GPUs perform best when memory accesses are aligned to multiples of 32, so a bunch of 32×32 matrices will perform better than a bunch of 31×31.
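One way to act on the alignment remark is to pad each matrix up to the next multiple of 32 before uploading it. The sketch below assumes column-major storage (which is what ArrayFire uses) and places ones on the padded diagonal, so the padded matrix is block-diagonal and a solve on it leaves the original unknowns unchanged. `round_up` and `pad_matrix` are hypothetical helper names, and whether the extra work pays off for matrices as small as 15×15 is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: smallest multiple of `multiple` that is >= n.
std::size_t round_up(std::size_t n, std::size_t multiple) {
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

// Hypothetical helper: embeds an n x n column-major matrix in the top-left
// corner of a p x p matrix, where p = round_up(n, 32). The padded part of
// the diagonal is set to 1 so the result is [A 0; 0 I].
std::vector<float> pad_matrix(const std::vector<float>& a, std::size_t n) {
    assert(a.size() == n * n);
    const std::size_t p = round_up(n, 32);
    std::vector<float> padded(p * p, 0.0f);
    for (std::size_t col = 0; col < n; ++col)
        for (std::size_t row = 0; row < n; ++row)
            padded[col * p + row] = a[col * n + row];
    for (std::size_t d = n; d < p; ++d)
        padded[d * p + d] = 1.0f;  // identity block in the padding
    return padded;
}
```

The right-hand side of each system would need the same padding (zeros in the extra entries), and only the first n entries of the solution are meaningful.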
