I am trying to write a parallel prefix scan on cuda by following this tutorial –
I am trying the work-inefficient ‘double buffered one’ as explained in the tutorial.
This is what I have:
// double buffered naive.
// d = number of iterations, N - size, and input.
__global__ void prefixsum(int* in, int d, int N)
{
//get the block index
int idx = blockIdx.x*blockDim.x + threadIdx.x;
// allocate shared memory
extern __shared__ int temp_in[], temp_out[];
// copy data to it.
temp_in[idx] = in[idx];
temp_out[idx] = 0;
// block until all threads copy
__syncthreads();
int i = 1;
for (i; i<=d; i++)
{
if (idx < N+1 && idx >= (int)pow(2.0f,(float)i-1))
{
// copy new result to temp_out
temp_out[idx] += temp_in[idx - (int)pow(2.0f,(float)i-1)] + temp_in[idx];
}
else
{
// if the element is to remain unchanged, copy the same thing
temp_out[idx] = temp_in[idx];
}
// block until all theads do this
__syncthreads();
// copy the result to temp_in for next iteration
temp_in[idx] = temp_out[idx];
// wait for all threads to do so
__syncthreads();
}
//finally copy everything back to global memory
in[idx] = temp_in[idx];
}
Can you point out what’s wrong with this? I have written comments for what I think should happen.
This is the kernel invocation –
prefixsum<<<dimGrid,dimBlock>>>(d_arr, log(SIZE)/log(2), N);
This is the grid and block allocations:
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
The problem is that I don’t get the correct output for any input that’s more than 8 elements long.
I see two problems in your code
Problem 1: extern shared memory
Agh…. I hate
extern __shared__memory. The problem is, that the compiler does not know how big are the arrays. As a result, they both point to the same piece of memory!So, in your case:
temp_in[5]andtemp_out[5]refer to the same word in shared memory.If you really want the
extern __shared__memory, you can manually offset the second array, for example something like this:Problem 2: Shared array index
Shared memory is private for each block. That is,
temp[0]in one block can be different thantemp[0]in another block. However, you index it byblockIdx.x*blockDim.x + threadIdx.xas if the temp arrays were shared between the blocks.Instead, you should most likely index your temp arrays just by
threadIdx.x.Of course, the
idxarray is global and you index that one correctly.