Can someone please help me understand the following CUDA C implementation of a partial parallel sum? I am having trouble understanding the initial fill of the shared partialSum array (lines 3 to 8). I have traced it for hours, but I do not see why start should be 2*blockIdx.x*blockDim.x instead of blockIdx.x*blockDim.x.
Host code:
#define BLOCK_SIZE 512
// Each block consumes 2*BLOCK_SIZE inputs, so round up numInputElements / (2*BLOCK_SIZE).
numOutputElements = numInputElements / (BLOCK_SIZE<<1);
if (numInputElements % (BLOCK_SIZE<<1)) {
numOutputElements++;
}
dim3 dimGrid(numOutputElements, 1, 1);
dim3 dimBlock(BLOCK_SIZE, 1, 1);
total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, numInputElements);
Kernel code:
1 __global__ void total(float * input, float * output, int len) {
2
3 __shared__ float partialSum[2*BLOCK_SIZE];
4
5 unsigned int t = threadIdx.x;
6 unsigned int start = 2*blockIdx.x*blockDim.x;
7 partialSum[t] = input[start + t];
8 partialSum[blockDim.x + t] = input[start + blockDim.x + t];
9
10 for (unsigned int stride = blockDim.x; stride >=1; stride >>=1)
11 {
12 __syncthreads();
13
14 if (t < stride)
15 partialSum[t] += partialSum[t + stride];
16 }
17 output[blockIdx.x] = partialSum[0];
18 }
Suppose I have 10 elements to sum and I choose a block size of 4 (4 threads per block), so there will be 3 blocks in use, right? (Let's ignore warp size and similar details for now.)
When blockIdx.x is 2 (the last block, which has 2 elements), start becomes 2*2*4 = 16, which is greater than 10 and exceeds the input length, so both partialSum[t] and partialSum[blockDim.x + t] will remain unchanged and block 2's shared memory will remain empty. If so, the last 2 elements of my array will be lost!
This makes me think I am misunderstanding blockIdx.x and blockDim.x. Can someone please correct me?
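Each thread in this kernel loads two elements, so each block consumes 2*blockDim.x inputs; that is why start is 2*blockIdx.x*blockDim.x. You'd only launch half the blocks and do twice as much work per block. The benefit of doing this is that the scratch space required to store partial sums is cut in half (because you are only launching half the blocks).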
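To make this concrete for the 10-element example in the question, here is a small host-side sketch of the index arithmetic. The helper names (numBlocksFor, blockStart, printLayout) are made up for illustration; the formulas are copied from the host code and from the kernel's start expression.

```cuda
#include <cstdio>

// Hypothetical helper: number of blocks the host code in the question launches,
// i.e. numInputElements / (2 * blockSize), rounded up.
int numBlocksFor(int numInputElements, int blockSize) {
    int perBlock = blockSize << 1;  // each block consumes 2 * blockSize inputs
    int blocks = numInputElements / perBlock;
    if (numInputElements % perBlock) {
        blocks++;
    }
    return blocks;
}

// Hypothetical helper: first input index a block touches (the kernel's `start`).
int blockStart(int blockIndex, int blockSize) {
    return 2 * blockIndex * blockSize;
}

// Prints which half-open input range each launched block covers.
void printLayout(int numInputElements, int blockSize) {
    int blocks = numBlocksFor(numInputElements, blockSize);
    for (int b = 0; b < blocks; ++b) {
        int start = blockStart(b, blockSize);
        int end = start + 2 * blockSize;
        if (end > numInputElements) {
            end = numInputElements;  // the last block is only partially filled
        }
        printf("block %d: input[%d..%d)\n", b, start, end);
    }
}
```

Calling printLayout(10, 4) reports two blocks, covering input[0..8) and input[8..10). With that host code blockIdx.x is only ever 0 or 1, so start is 0 or 8 and never reaches 16.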
The usual way to do a reduction (in this case a sum) is something like this.
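A minimal sketch of such a kernel, assuming a grid-stride accumulation loop followed by a shared-memory tree reduction. The names reduceSketch, sumOnDevice and SKETCH_BLOCK_SIZE are illustrative and its line numbers are not meant to match any other listing; error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define SKETCH_BLOCK_SIZE 256  // assumed value; must be a power of two

__global__ void reduceSketch(const float *input, float *output, int len) {
    __shared__ float partialSum[SKETCH_BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int n = (unsigned int)len;
    float sum = 0.0f;

    // Grid-stride loop: each thread adds up every (blockDim.x * gridDim.x)-th
    // element. Fewer blocks means more iterations per thread.
    for (unsigned int i = blockIdx.x * blockDim.x + t; i < n;
         i += blockDim.x * gridDim.x) {
        sum += input[i];
    }
    partialSum[t] = sum;
    __syncthreads();

    // Tree reduction of the per-thread sums within the block.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) {
            partialSum[t] += partialSum[t + stride];
        }
        __syncthreads();
    }

    // One write per block, so output needs gridDim.x elements.
    if (t == 0) {
        output[blockIdx.x] = partialSum[0];
    }
}

// Hypothetical host wrapper: launches `blocks` blocks and adds the per-block
// results on the CPU.
float sumOnDevice(const std::vector<float> &host, int blocks) {
    float *dIn = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dIn, host.size() * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, host.data(), host.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    reduceSketch<<<blocks, SKETCH_BLOCK_SIZE>>>(dIn, dOut, (int)host.size());

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partial) {
        total += p;
    }
    return total;
}
```

With 1024 inputs, sumOnDevice(data, 4), sumOnDevice(data, 2) and sumOnDevice(data, 1) should all give the same total; only the number of loop iterations per thread and the number of writes to output change.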
So if you have len = 1024 and BLOCK_SIZE = 256, you could launch anything <= 4 blocks. Let us see what happens in the for loop contained in lines 8 & 9 when you launch various numbers of blocks. Also keep in mind that output needs to have as many elements as there are blocks.
Blocks == 4 would mean blockDim.x * gridDim.x = 256 x 4 = 1024, so it will only iterate once. The number of non-coalesced writes to output = 4.
Blocks == 2 would mean blockDim.x * gridDim.x = 256 x 2 = 512, so it will iterate twice. The number of non-coalesced writes to output = 2.
Blocks == 1 would mean blockDim.x * gridDim.x = 256 x 1 = 256, so it will iterate 4 times. The number of non-coalesced writes to output = 1.
So launching fewer blocks has the benefit of reducing the memory footprint and also reducing the global writes. However, it comes at the cost of decreased parallelism.
Ideally, you would determine heuristically which combination works best for your algorithm. Or you could use existing libraries that do it for you.
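As one example of the library route, here is a short sketch using Thrust, which ships with the CUDA toolkit. Picking Thrust is my assumption, since no particular library is named above, and sumWithThrust is a made-up wrapper name.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <vector>

// Hypothetical wrapper: copies the data to the device and lets Thrust choose
// the launch configuration for the sum.
float sumWithThrust(const std::vector<float> &host) {
    thrust::device_vector<float> device(host.begin(), host.end());
    // thrust::reduce with an initial value and no operator computes a sum.
    return thrust::reduce(device.begin(), device.end(), 0.0f);
}
```

This hides the block-count and shared-memory decisions entirely, which is convenient when the reduction is not the part you want to tune by hand.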
The kernel in question launches half the blocks to gain some performance. However, using twice as much shared memory may not be necessary.
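One way to avoid the doubled shared memory, sketched under my own assumptions: add the two elements while loading, so that partialSum only needs one slot per thread. The names totalAddOnLoad, sumAddOnLoad and ADD_ON_LOAD_BLOCK_SIZE are illustrative. The bounds checks and the t == 0 guard are additions that are not in the kernel from the question.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define ADD_ON_LOAD_BLOCK_SIZE 512  // assumed value; must be a power of two

__global__ void totalAddOnLoad(const float *input, float *output, int len) {
    // One slot per thread instead of two.
    __shared__ float partialSum[ADD_ON_LOAD_BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int n = (unsigned int)len;
    unsigned int start = 2 * blockIdx.x * blockDim.x;

    // Out-of-range positions contribute 0, so a partially filled last block
    // neither reads past the end of input nor loses elements.
    float a = (start + t < n) ? input[start + t] : 0.0f;
    float b = (start + blockDim.x + t < n) ? input[start + blockDim.x + t] : 0.0f;
    partialSum[t] = a + b;  // first reduction step happens during the load
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) {
            partialSum[t] += partialSum[t + stride];
        }
        __syncthreads();
    }

    if (t == 0) {
        output[blockIdx.x] = partialSum[0];
    }
}

// Hypothetical host wrapper: same grid-size rule as the host code in the
// question, then a CPU sum of the per-block results.
float sumAddOnLoad(const std::vector<float> &host) {
    int len = (int)host.size();
    int perBlock = ADD_ON_LOAD_BLOCK_SIZE << 1;
    int blocks = len / perBlock + (len % perBlock ? 1 : 0);

    float *dIn = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dIn, len * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, host.data(), len * sizeof(float), cudaMemcpyHostToDevice);

    totalAddOnLoad<<<blocks, ADD_ON_LOAD_BLOCK_SIZE>>>(dIn, dOut, len);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partial) {
        total += p;
    }
    return total;
}
```

The start expression is unchanged, because each block still consumes 2*blockDim.x inputs; only the shared array shrinks.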