I’ve got a nested loop with a counter in between.
I’ve managed to use CUDA indices for the outer loop, but I can’t think of any way of exploiting more parallelism in this kind of loop.
Does anyone have experience working with something similar?
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < Nx) {
        counter = 0;
        for (k = 0; k < Ny; k++) {
            d_V[i*Ny + k] = 0;
            if ( d_X[i*Ny + k] >= 2e2 ) {
                /* do stuff with i and k and counter i.e.*/
                d_example[i*length + counter] = k;
                ...
                /* increment counter */
                counter++;
            }
        }
    }
The problem that I see is how to deal with counter, as k could also be indexed in CUDA with threadIdx.y + blockIdx.y * blockDim.y
Having a counter or loop variable that carries state between loop iterations is the natural antithesis of parallelisation. Ideal parallel loops have iterations that could run in any order, with no knowledge of each other. Unfortunately, a shared variable makes the iterations both order-dependent and mutually aware.
It looks like you’re using the counter to pack your d_example array without gaps. This kind of thing could well be more efficient in compute time at the cost of some memory: if you pack d_example inefficiently, letting the elements that won’t be set stay as zero, you can filter d_example later, after any expensive computational steps. In fact, you could even leave the filtering to a modified iterator used when the array is read, which just skips over any zero values. If zero is a valid value in the array, use a sentinel instead (a particular NaN value for floating-point data, for example) or a separate mask array.
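To make the suggestion concrete, here is a minimal sketch of the "leave gaps, filter later" idea. It rests on several assumptions that are not in your post: d_X and d_V are float arrays, d_example holds int values and has Ny slots per row (that is, length == Ny), and the elided "..." work is left out. Because k = 0 is a legitimate entry in your case, the sketch marks unused slots with a hypothetical sentinel, EMPTY_SLOT = -1, rather than zero. The kernel name markSparse and the test data are made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sentinel for "slot not set"; valid entries are indices k >= 0.
#define EMPTY_SLOT (-1)

// One thread per (i, k) pair. There is no shared counter, so every
// iteration of the former inner loop is independent of the others.
__global__ void markSparse(const float* d_X, float* d_V, int* d_example,
                           int Nx, int Ny, float threshold)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // outer index, as in the question
    int k = threadIdx.y + blockIdx.y * blockDim.y;  // former inner-loop index
    if (i < Nx && k < Ny) {
        int idx = i * Ny + k;
        d_V[idx] = 0.0f;
        // Slot k of row i holds k if the test passes, EMPTY_SLOT otherwise.
        d_example[idx] = (d_X[idx] >= threshold) ? k : EMPTY_SLOT;
    }
}

int main()
{
    const int Nx = 4, Ny = 8;                 // small made-up sizes
    const size_t n = size_t(Nx) * Ny;

    // Made-up input: every third element exceeds the threshold.
    std::vector<float> h_X(n);
    for (size_t j = 0; j < n; ++j) h_X[j] = (j % 3 == 0) ? 250.0f : 10.0f;

    float *d_X = nullptr, *d_V = nullptr;
    int *d_example = nullptr;
    cudaMalloc(&d_X, n * sizeof(float));
    cudaMalloc(&d_V, n * sizeof(float));
    cudaMalloc(&d_example, n * sizeof(int));
    cudaMemcpy(d_X, h_X.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((Nx + block.x - 1) / block.x, (Ny + block.y - 1) / block.y);
    markSparse<<<grid, block>>>(d_X, d_V, d_example, Nx, Ny, 2e2f);

    std::vector<int> h_example(n);
    cudaError_t err = cudaMemcpy(h_example.data(), d_example, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The "modified iterator": read each row and skip the sentinel slots.
    for (int i = 0; i < Nx; ++i) {
        printf("row %d:", i);
        for (int k = 0; k < Ny; ++k) {
            int v = h_example[i * Ny + k];
            if (v != EMPTY_SLOT) printf(" %d", v);
        }
        printf("\n");
    }

    cudaFree(d_X);
    cudaFree(d_V);
    cudaFree(d_example);
    return 0;
}
```

The index mapping follows the question (i on x, k on y). Since neighbouring threads in x then touch addresses Ny elements apart, it may be worth trying the mapping the other way round so that k varies fastest within a warp. If you do need the packed layout afterwards, a stream-compaction step such as thrust::copy_if applied per row is one way to do the filtering on the device.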