I am working through the Cuda Parallel reduction Whitepaper, but unfortunately my algorithm seems

Question

Asked: June 17, 2026 (2026-06-17T07:37:51+00:00)

I am working through the CUDA parallel reduction whitepaper, but my algorithm repeatedly produces incorrect results and I cannot figure out why. Surely a textbook example must work, so I assume I am doing something obviously wrong. Here is my code:

My define:

 #define BLOCK_SIZE 512

My Kernel function:

 __global__ void total(float * inputList, float * outputList, int len) {
      __shared__ float sdata[2*BLOCK_SIZE];
      unsigned int tid = threadIdx.x;
      unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
      sdata[tid] = inputList[i]+inputList[i+blockDim.x];
      __syncthreads();
      for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
          sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
      }
      if (tid == 0) 
        outputList[blockIdx.x] = sdata[0];
}

My memory allocation:

  outputSize = inputSize / (BLOCK_SIZE<<1);
  cudaMalloc((void**) &deviceInput, inputSize*sizeof(float));
  cudaMalloc((void**) &deviceOutput, outputSize*sizeof(float));
  cudaMemcpy(deviceInput, hostInput, inputSize*sizeof(float), cudaMemcpyHostToDevice);

My device call:

 dim3 dimGrid((inputSize-1)/BLOCK_SIZE +1, 1, 1);
 dim3 dimBlock(BLOCK_SIZE,1,1);

 total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);
 cudaDeviceSynchronize();

My memory fetch:

 cudaMemcpy(hostOutput, deviceOutput, outputSize*sizeof(float), cudaMemcpyDeviceToHost);

And finally my final calculation:

 for (int counter = 1; counter < outputSize; counter++) {
    hostOutput[0] += hostOutput[counter];
 }

Any help would be appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T07:37:52+00:00

The kernel launch configuration in this line of your code is incorrect:

total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);

The first argument of the kernel launch configuration is the grid size and the second is the block size, so your call passes them in the wrong order.

The launch should read:

total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize);
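
To see the argument order for yourself, here is a minimal sketch in which a kernel records the `gridDim.x` and `blockDim.x` it actually receives. The names `reportDims` and `observeLaunch` are made up for this illustration and are not part of your code or of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread writes down the launch dimensions the kernel sees.
__global__ void reportDims(unsigned int *out) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    out[0] = gridDim.x;   // number of blocks in the grid
    out[1] = blockDim.x;  // number of threads per block
  }
}

// Hypothetical helper: launch with <<<grid, block>>> and copy back what the
// kernel observed. Returns false if any CUDA call reports an error.
bool observeLaunch(dim3 grid, dim3 block, unsigned int seen[2]) {
  unsigned int *dev = nullptr;
  if (cudaMalloc((void **)&dev, 2 * sizeof(unsigned int)) != cudaSuccess)
    return false;

  reportDims<<<grid, block>>>(dev);  // first argument: grid, second: block

  // cudaMemcpy on the default stream waits for the kernel to finish.
  bool ok = cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(seen, dev, 2 * sizeof(unsigned int),
                       cudaMemcpyDeviceToHost) == cudaSuccess;
  cudaFree(dev);
  return ok;
}
```

On a working device, calling `observeLaunch(dim3(4, 1, 1), dim3(512, 1, 1), seen)` should leave 4 in `seen[0]` and 512 in `seen[1]`. Swapping the two arguments swaps what the kernel sees, which is what happened in the original launch.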

Always check the error codes returned by CUDA Runtime function calls; they tell you why your program failed.

With the arguments swapped, the kernel launch in your current code should fail. Checking the error returned by the cudaDeviceSynchronize call would have led you to the cause of the incorrect results.
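
As a rough sketch of what such checking could look like: the `CUDA_CHECK` macro, `checkCuda`, `emptyKernel`, and `launchChecked` below are hypothetical names chosen for this example, while `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are the actual runtime calls. The sketch checks both right after the launch and after synchronizing, on the assumption that a bad launch configuration may be reported at either point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a readable message when a runtime call fails.
inline bool checkCuda(cudaError_t err, const char *what,
                      const char *file, int line) {
  if (err != cudaSuccess) {
    fprintf(stderr, "%s:%d: %s failed: %s\n", file, line, what,
            cudaGetErrorString(err));
    return false;
  }
  return true;
}

// Hypothetical convenience macro that records the call text and location.
#define CUDA_CHECK(call) checkCuda((call), #call, __FILE__, __LINE__)

// Stand-in kernel; in your program this would be total().
__global__ void emptyKernel() {}

// Launch the kernel and check both stages. Returns true only if both pass.
bool launchChecked(dim3 grid, dim3 block) {
  emptyKernel<<<grid, block>>>();
  // Launch-time problems (for example, too many threads per block).
  if (!CUDA_CHECK(cudaGetLastError())) return false;
  // Problems raised while the kernel was executing.
  return CUDA_CHECK(cudaDeviceSynchronize());
}
```

For instance, `launchChecked(dim3(2), dim3(256))` is a valid configuration, whereas `launchChecked(dim3(1), dim3(4096))` asks for more threads per block than current GPUs allow (1024) and should print an error and return false. The same pattern applies to `cudaMalloc` and `cudaMemcpy`: wrap each call in `CUDA_CHECK`.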
