Can someone please help me to understand this following partial parallel sum algorithm implementation

Asked: June 17, 2026

Can someone please help me understand the following CUDA C implementation of a partial parallel sum? I have trouble understanding the initial fill of the shared partialSum array [lines 3 to 8]. I have traced it for hours, but I do not see why start in the following code should be 2*blockIdx.x*blockDim.x; instead of blockIdx.x*blockDim.x;.

Host code:

#define BLOCK_SIZE 512

// One block per 2*BLOCK_SIZE input elements, rounded up.
numOutputElements = numInputElements / (BLOCK_SIZE<<1);
if (numInputElements % (BLOCK_SIZE<<1)) {
    numOutputElements++;
}
dim3 dimGrid(numOutputElements, 1, 1);
dim3 dimBlock(BLOCK_SIZE, 1, 1);
total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, numInputElements);

Kernel code:

1    __global__ void total(float * input, float * output, int len) {
2    
3    __shared__ float partialSum[2*BLOCK_SIZE];
4      
5      unsigned int t = threadIdx.x;
6      unsigned int start = 2*blockIdx.x*blockDim.x;
7      partialSum[t] = input[start + t];
8      partialSum[blockDim.x + t] = input[start + blockDim.x + t];
9     
10    for (unsigned int stride = blockDim.x; stride >=1; stride >>=1)
11      {
12       __syncthreads();
13        
14       if (t < stride)
15         partialSum[t] += partialSum[t + stride];
16      }
17      output[blockIdx.x] = partialSum[0];   
18  }

Suppose I have 10 elements to sum and I choose a block size of 4, so 4 threads per block. There will be 3 blocks in use, right? [Let's forget about the warp size and such for a while.]

When blockIdx.x is 2 (the last block, which has 2 elements), start becomes 2*2*4 = 16, which is greater than 10 and exceeds the input length, so both partialSum[t] and partialSum[blockDim.x + t] will remain unchanged and block 2's shared memory will remain empty. If so, the last 2 elements of my array will be lost!

It makes me think that I am understanding blockIdx.x and blockDim.x the wrong way. Can someone please correct me?

1 Answer

Editorial Team · Answer 1 · 2026-06-17T04:00:11+00:00

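With start = 2*blockIdx.x*blockDim.x, each block loads 2*blockDim.x input elements (two per thread), so you launch only half as many blocks and each block does twice as much work. The benefit is that the scratch space required to store the per-block partial sums (the output array) is cut in half, because only half as many blocks are launched.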
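Applied to the 10-element example from the question, the host code's rounding gives 2 blocks, not 3, so blockIdx.x is only ever 0 or 1 and start is 0 or 8. The sketch below reproduces that arithmetic on the CPU; the helper names numBlocks, blockStart and printCoverage are made up for illustration and are not part of the posted code.

```cpp
#include <cstdio>

// Number of blocks the posted host code launches:
// ceil(numInputElements / (2 * blockSize)).
int numBlocks(int numInputElements, int blockSize) {
    int perBlock = blockSize << 1;          // each block consumes 2 * blockSize inputs
    int n = numInputElements / perBlock;
    if (numInputElements % perBlock) n++;   // one extra block for the remainder
    return n;
}

// First input index a block touches, as in the kernel:
// start = 2 * blockIdx.x * blockDim.x.
int blockStart(int blockIdx, int blockDim) {
    return 2 * blockIdx * blockDim;
}

// Hypothetical helper: print the half-open input range each block covers.
void printCoverage(int numInputElements, int blockSize) {
    int blocks = numBlocks(numInputElements, blockSize);
    for (int b = 0; b < blocks; ++b) {
        int start = blockStart(b, blockSize);
        std::printf("block %d: start = %d, covers [%d, %d)\n",
                    b, start, start, start + 2 * blockSize);
    }
}
```

Calling printCoverage(10, 4) shows block 1 covering indices 8 to 15. Only indices 8 and 9 exist in a 10-element input, and the kernel as posted has no bounds check, so the remaining loads in that block read past the end of the array.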

The usual way to do a reduction (a sum, in this case) is something like this:

1    __global__ void total(float * input, float * output, int len) {
2    
3    __shared__ float partialSum[BLOCK_SIZE];
4      
5     unsigned int t = threadIdx.x;
6     unsigned int start = blockIdx.x*blockDim.x;
7     partialSum[t] = 0;
8     for (unsigned int T = start + t; T < (unsigned int)len; T += blockDim.x * gridDim.x) 
9        partialSum[t] += input[T];   // grid-stride loop: each thread accumulates its own elements
10    for (unsigned int stride = blockDim.x/2; stride >=1; stride >>=1)
11      {
12       __syncthreads();
13        
14       if (t < stride)
15         partialSum[t] += partialSum[t + stride];
16      }
17      if (t == 0) output[blockIdx.x] = partialSum[0];   // only thread 0 is guaranteed to see the final sum
18  }

So if you have len = 1024 and BLOCK_SIZE = 256, you could launch any number of blocks <= 4.

Let us see what happens in the for loop on lines 8 & 9 when you launch various numbers of blocks. Also keep in mind that output needs one element per block.

Blocks == 4 means blockDim.x * gridDim.x = 256 x 4 = 1024, so each thread iterates once. The number of non-coalesced writes to output is 4.
Blocks == 2 means blockDim.x * gridDim.x = 256 x 2 = 512, so each thread iterates twice. The number of non-coalesced writes to output is 2.
Blocks == 1 means blockDim.x * gridDim.x = 256 x 1 = 256, so each thread iterates 4 times. The number of non-coalesced writes to output is 1.
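The three cases above can be checked with a small host-side sketch that counts how often one thread runs the loop body. The function name loopIterations and the parameter first (the thread's first input index) are assumptions made for this illustration.

```cpp
// Count how many times a single thread executes the body of
//   for (T = first; T < len; T += blockDim * gridDim)
// where first is the first input index that thread reads.
int loopIterations(int first, int len, int blockDim, int gridDim) {
    int count = 0;
    for (int T = first; T < len; T += blockDim * gridDim) {
        ++count;
    }
    return count;
}
```

With len = 1024 and blockDim = 256, loopIterations(0, 1024, 256, gridDim) gives 1, 2 and 4 for gridDim = 4, 2 and 1, matching the counts listed above.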

So launching fewer blocks has the benefit of reducing the memory footprint and the number of global writes. However, it comes at the cost of decreased parallelism.

Ideally, you would find heuristically which combination works best for your algorithm. Or you could use an existing library that does it for you.

The kernel in question launches half as many blocks to gain some performance. But using twice as much shared memory may not be required.
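One way to avoid the doubled shared array is to add each thread's two elements while loading, so only blockDim.x slots are needed. The sketch below emulates a single block on the CPU to show the indexing; it is not the original kernel. The name blockSumEmulated is hypothetical, the bounds checks are an addition the posted code does not have, and blockDim is assumed to be a power of two.

```cpp
#include <vector>

// CPU emulation of one block (hypothetical helper, not the posted kernel).
// Each "thread" t adds its two inputs on load, so the scratch array has
// blockDim entries instead of 2 * blockDim. Assumes blockDim is a power of two.
float blockSumEmulated(const std::vector<float>& input, int blockIdx, int blockDim) {
    int len = static_cast<int>(input.size());
    int start = 2 * blockIdx * blockDim;       // same start as the posted kernel
    std::vector<float> partialSum(blockDim);   // stands in for the shared array

    for (int t = 0; t < blockDim; ++t) {
        int i = start + t;
        int j = start + blockDim + t;
        float a = (i < len) ? input[i] : 0.0f; // guarded load: out-of-range reads count as 0
        float b = (j < len) ? input[j] : 0.0f;
        partialSum[t] = a + b;
    }

    // Tree reduction; on the GPU each pass would be preceded by __syncthreads().
    for (int stride = blockDim / 2; stride >= 1; stride >>= 1) {
        for (int t = 0; t < stride; ++t) {
            partialSum[t] += partialSum[t + stride];
        }
    }
    return partialSum[0];
}
```

For an input holding 1 through 10 with blockDim = 4, block 0 sums the first eight elements and block 1 sums the last two, with the out-of-range slots contributing zero.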
