I’ve got a sequential section within my kernel that is really slowing it down. However, I don’t see how I could get rid of the inner loop. Any suggestions here?
__global__ void myKernel( int keep, int inc, int width, int* d_Xnum,
                          int* d_Xco, bool* d_Xvalid, int* d_A )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k1;
    if( i < keep && j <= i ){
        int counter = 0;
        for(k1 = 0; k1 < inc; k1++){
            if(d_Xvalid[j*inc + k1] == 0)
                counter += (d_Xvalid[i*inc + d_Xco[j*width + k1]]);
        }
        d_A[i*keep+j] = inc - d_Xnum[i] - counter;
    }
}
I believe that eliminating the k1 loop will speed up my code by a good factor. However, I don’t see how to do it while counter is being accumulated. Any suggestions, ideas, or thoughts would be more than welcome!
This kernel is called:
...
int t = 32;
int b = keep/(32)+1;
int b2 = (inc/32)+1;
dim3 thread (t, t);
dim3 block (b, inc);
// kernel call
myKernel<<<block, thread>>>(k, inc, width, d_Xnum,
                            d_Xco, d_Xvalid, d_A);
cudaThreadSynchronize();
...
keep is around 9000 and inc around 20000
This is not an exact answer to your question, but it is something that can probably speed up your code, and it may help you implement a parallel reduction for the k1 sum, because you get rid of if( i < keep && j <= i). There are other optimizations you could implement depending on your GPU model, like using textures to access those read-only vectors.
Because of the way you generate the indices, many threads sit idle waiting for the others to finish. You are launching at least keep*inc threads, but at most keep*(keep+1)/2 of them actually do any work (because of the condition j <= i).
I think you can make it better with the following changes:
- launch keep*(keep+1)/2 threads
- do the following to your code
What you’re doing (for keep = 4 you launch 4*4 = 16 threads in the best-case scenario; if inc > keep, as seems to be the case, you’re launching even more threads) can be seen as (each ‘box’ is a thread)
I propose you add an index k and generate i and j from it according to your needs (for keep = 4, launch 4*(4+1)/2 = 10 threads).
This can be done with
i = (int)(sqrt(0.25+2*k)-0.5)
j = k - i*(i+1)/2
You can accept this as a recipe or look a little at the mathematics behind it.
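If you want to convince yourself that the recipe covers every (i, j) pair with j <= i exactly once, a small host-side check is enough. The following is only a sketch: the names triIndex and countMismatches are made up for illustration, and the integer correction after the sqrt is an extra safety net I added against floating-point rounding, not part of the recipe itself. It compiles as plain C++, and the HD macro also makes triIndex callable from a kernel when built with nvcc.

```cpp
#include <assert.h>
#include <math.h>

// Under nvcc the helper becomes usable from device code as well.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: maps a linear index k onto the lower-triangular
// pair (i, j) with 0 <= j <= i, following i = (int)(sqrt(0.25+2*k)-0.5).
HD inline void triIndex(long long k, int* i, int* j)
{
    assert(k >= 0);
    // Closed-form estimate, evaluated in double precision.
    long long r = (long long)(sqrt(0.25 + 2.0 * (double)k) - 0.5);
    // Integer correction (an added assumption): nudge r if rounding
    // put the estimate one row too high or too low.
    while (r * (r + 1) / 2 > k) --r;
    while ((r + 1) * (r + 2) / 2 <= k) ++r;
    *i = (int)r;
    *j = (int)(k - r * (r + 1) / 2);
}

// Walks the triangle row by row, comparing each (i, j) with triIndex(k).
// Returns the number of mismatches (0 means the mapping is consistent).
int countMismatches(int keep)
{
    int bad = 0;
    long long k = 0;
    for (int i = 0; i < keep; ++i) {
        for (int j = 0; j <= i; ++j, ++k) {
            int ti, tj;
            triIndex(k, &ti, &tj);
            if (ti != i || tj != j) ++bad;
        }
    }
    return bad;
}
```

Calling countMismatches(keep) with your own keep should return 0. Note that 0.25 + 2*k is evaluated in double precision; with single-precision floats and keep around 9000 (k up to roughly 40 million) I would not trust the rounding without a correction step like the one above.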
To get here, note that for j = 0 you have i*(i+1)/2 = k (because k = 0+1+2+…+i = i*(i+1)/2). Solving this quadratic, i*i + i - 2*k = 0, for its positive root gives i = sqrt(0.25+2*k) - 0.5, which is the formula for i (the cast to int rounds down, which ensures the right result when j != 0 too).
To get j, subtract from k the value it would have if j were 0: i*(i+1)/2.
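Putting the pieces together, the kernel could look something like the sketch below. Treat it as one possible shape rather than a drop-in replacement: computeEntry and myKernelTri are hypothetical names, the grid-stride loop is my own choice so that the grid size does not have to match keep*(keep+1)/2 exactly, and the inner k1 loop is copied unchanged from your kernel. The per-thread body is a separate function so that it can be checked on the host with a plain C++ compiler; the __global__ wrapper is only compiled under nvcc.

```cpp
#include <assert.h>
#include <math.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical per-thread body: computes one entry A[i*keep + j] for the
// linear index k, using the same arithmetic as the original kernel.
HD inline void computeEntry(long long k, int keep, int inc, int width,
                            const int* Xnum, const int* Xco,
                            const bool* Xvalid, int* A)
{
    assert(k >= 0 && k < (long long)keep * (keep + 1) / 2);

    // Recover (i, j) from k; the two while loops guard against rounding.
    long long i = (long long)(sqrt(0.25 + 2.0 * (double)k) - 0.5);
    while (i * (i + 1) / 2 > k) --i;
    while ((i + 1) * (i + 2) / 2 <= k) ++i;
    long long j = k - i * (i + 1) / 2;

    // Inner loop unchanged from the question.
    int counter = 0;
    for (int k1 = 0; k1 < inc; ++k1) {
        if (!Xvalid[j * inc + k1])
            counter += Xvalid[i * inc + Xco[j * width + k1]];
    }
    A[i * keep + j] = inc - Xnum[i] - counter;
}

#ifdef __CUDACC__
// 1D launch: every thread handles one or more linear indices k.
__global__ void myKernelTri(int keep, int inc, int width, const int* d_Xnum,
                            const int* d_Xco, const bool* d_Xvalid, int* d_A)
{
    long long total  = (long long)keep * (keep + 1) / 2;
    long long k      = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    long long stride = (long long)gridDim.x * blockDim.x;
    for (; k < total; k += stride)  // grid-stride loop
        computeEntry(k, keep, inc, width, d_Xnum, d_Xco, d_Xvalid, d_A);
}
#endif
```

You would launch it with a 1D configuration, for example myKernelTri<<<numBlocks, 256>>>(keep, inc, width, d_Xnum, d_Xco, d_Xvalid, d_A), where numBlocks is whatever your device allows. This only removes the idle threads; each thread still runs the k1 loop sequentially, so a reduction over k1 would be a further step on top of this.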