I am working through the CUDA parallel reduction whitepaper, but my algorithm repeatedly produces incorrect results and I cannot figure out why. (Surely a textbook example must work? Surely I'm just doing something very obvious wrong?) Here is my code.
My define:
#define BLOCK_SIZE 512
My Kernel function:
__global__ void total(float * inputList, float * outputList, int len) {
    __shared__ float sdata[2*BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = inputList[i]+inputList[i+blockDim.x];
    __syncthreads();
    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0)
        outputList[blockIdx.x] = sdata[0];
}
My memory allocation:
outputSize = inputSize / (BLOCK_SIZE<<1);
cudaMalloc((void**) &deviceInput, inputSize*sizeof(float));
cudaMalloc((void**) &deviceOutput, outputSize*sizeof(float));
cudaMemcpy(deviceInput, hostInput, inputSize*sizeof(float), cudaMemcpyHostToDevice);
My device call:
dim3 dimGrid((inputSize-1)/BLOCK_SIZE +1, 1, 1);
dim3 dimBlock(BLOCK_SIZE,1,1);
total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);
cudaDeviceSynchronize();
My memory fetch:
cudaMemcpy(hostOutput, deviceOutput, outputSize*sizeof(float), cudaMemcpyDeviceToHost);
And finally my final calculation:
for (int counter = 1; counter < outputSize; counter++) {
    hostOutput[0] += hostOutput[counter];
}
Any help would be appreciated.
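Your kernel launch configuration in the following line of your code is incorrect:
total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);
The first argument of the kernel launch configuration is the grid size and the second argument is the block size.
You should be doing this:
total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize);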
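To make the argument order concrete, here is a minimal sketch that launches an empty kernel with a given grid and block size and reports whether the launch was accepted. The names `noop` and `launchSucceeds` are hypothetical and not from your code, and the sketch assumes a device whose limit is 1024 threads per block, so treat it as an illustration rather than a definitive test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: only the launch configuration matters in this sketch.
__global__ void noop() {}

// Hypothetical helper: launches noop with gridX blocks of blockX threads
// and returns true only if both the launch and the execution succeed.
bool launchSucceeds(unsigned int gridX, unsigned int blockX) {
    dim3 dimGrid(gridX, 1, 1);    // first launch argument: number of blocks
    dim3 dimBlock(blockX, 1, 1);  // second launch argument: threads per block
    noop<<<dimGrid, dimBlock>>>();

    // cudaGetLastError reports launch errors such as an invalid configuration.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch <<<%u, %u>>> failed: %s\n",
                gridX, blockX, cudaGetErrorString(err));
        return false;
    }
    // cudaDeviceSynchronize reports errors raised while the kernel runs.
    return cudaDeviceSynchronize() == cudaSuccess;
}
```

Under the stated assumption, `launchSucceeds(2048, 512)` is accepted, while the swapped `launchSucceeds(512, 2048)` is rejected because it asks for 2048 threads in a single block.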
Please always perform error checking on CUDA Runtime function calls and inspect the returned error codes to find the reason your program fails.
Your kernel launch is likely to fail in your current code (for example, whenever the swapped block dimension exceeds the device's limit on threads per block, typically 1024). Error checking on the
cudaDeviceSynchronize call would have led you to the reason for the incorrect results.
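One common way to do this is to wrap every runtime call in a small macro. The sketch below applies that idea to the host code from the question. `CUDA_CHECK` and `checkedSum` are hypothetical names, not part of the CUDA Runtime API. The sketch also sizes the grid at one block per 2*BLOCK_SIZE inputs and guards the loads against running past the end of the array; both are my assumptions so that the example is self-contained for any non-empty input, not something taken from the whitepaper.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 512

// Hypothetical helper macro: prints the error string and exits if a
// CUDA Runtime call returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,    \
                    __LINE__, cudaGetErrorString(err_));              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Reduction kernel as in the question, with bounds checks added
// (an assumption of this sketch) so len can be any positive value.
__global__ void total(const float *inputList, float *outputList, int len) {
    __shared__ float sdata[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    float v = 0.0f;
    if (i < (unsigned int)len) v += inputList[i];
    if (i + blockDim.x < (unsigned int)len) v += inputList[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) outputList[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper: sums a non-empty vector, checking every call.
float checkedSum(const std::vector<float> &hostInput) {
    int inputSize = (int)hostInput.size();
    int outputSize = (inputSize - 1) / (BLOCK_SIZE << 1) + 1;
    float *deviceInput = nullptr, *deviceOutput = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&deviceInput, inputSize * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void **)&deviceOutput, outputSize * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(deviceInput, hostInput.data(),
                          inputSize * sizeof(float), cudaMemcpyHostToDevice));

    dim3 dimGrid(outputSize, 1, 1);
    dim3 dimBlock(BLOCK_SIZE, 1, 1);
    total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, inputSize);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution

    std::vector<float> hostOutput(outputSize);
    CUDA_CHECK(cudaMemcpy(hostOutput.data(), deviceOutput,
                          outputSize * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(deviceInput));
    CUDA_CHECK(cudaFree(deviceOutput));

    float sum = 0.0f;  // final pass over the per-block partial sums on the host
    for (float partial : hostOutput) sum += partial;
    return sum;
}
```

Calling `checkedSum` on a vector of 5000 ones should give 5000; if any runtime call or the launch fails, the macro stops the program at the failing line instead of letting it continue with garbage results.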