The Archive Base

Editorial Team
Asked: June 17, 2026

Can someone please help me understand the following partial parallel sum algorithm implementation

Can someone please help me understand the following CUDA C implementation of a partial parallel sum algorithm? I am having trouble understanding the initial fill of the shared partialSum array [lines 3 to 8]. I have traced it for hours, but I don't see why start in the following code should be 2*blockIdx.x*blockDim.x instead of blockIdx.x*blockDim.x.

Host code:

#define BLOCK_SIZE 512

// One block reduces 2*BLOCK_SIZE inputs, so round the block count up.
numOutputElements = numInputElements / (BLOCK_SIZE<<1);
if (numInputElements % (BLOCK_SIZE<<1)) {
    numOutputElements++;
}
dim3 dimGrid(numOutputElements, 1, 1);
dim3 dimBlock(BLOCK_SIZE, 1, 1);
total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, numInputElements);

Kernel code:

1    __global__ void total(float * input, float * output, int len) {
2    
3    __shared__ float partialSum[2*BLOCK_SIZE];
4      
5      unsigned int t = threadIdx.x;
6      unsigned int start = 2*blockIdx.x*blockDim.x;
7      partialSum[t] = input[start + t];
8      partialSum[blockDim.x + t] = input[start + blockDim.x + t];
9     
10    for (unsigned int stride = blockDim.x; stride >=1; stride >>=1)
11      {
12       __syncthreads();
13        
14       if (t < stride)
15         partialSum[t] += partialSum[t + stride];
16      }
17      output[blockIdx.x] = partialSum[0];   
18  }

Suppose I have 10 elements to sum and I choose a block size of 4, so 4 threads per block. There will be 3 blocks in use, right? [Let's forget about the warp size and such for a while.]

When blockIdx.x is 2 (the last block, which has 2 elements), start becomes 2*2*4 = 16, which is greater than 10 and exceeds the input length (so both partialSum[t] and partialSum[blockDim.x + t] will remain unchanged and block 2's shared memory will remain empty). If so, the last 2 elements of my array will be lost!

It makes me think that I am interpreting blockIdx.x and blockDim.x the wrong way. Can someone please correct me?

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 4:00 am

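    You'd only launch half the blocks and do twice as much work per block. Each thread loads two elements, so each block consumes 2*blockDim.x inputs, which is why consecutive blocks start 2*blockDim.x apart. The benefit of doing this is that the scratch space required to store partial sums is cut in half (because you are only launching half the blocks).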
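To make the indexing concrete, here is a small host-side sketch of the arithmetic for the 10-element, block-size-4 case from the question. The helper names `num_blocks` and `block_start` are hypothetical; they only mirror the expressions in the posted host and kernel code.

```cuda
// Host-side sketch. The helper names are hypothetical; the expressions
// mirror the posted host code and line 6 of the posted kernel.

// Number of blocks the posted host code launches: ceil(n / (2 * block_size)).
inline int num_blocks(int num_input_elements, int block_size) {
    int n = num_input_elements / (block_size << 1);
    if (num_input_elements % (block_size << 1)) {
        n++;
    }
    return n;
}

// First input index handled by a block (line 6 of the kernel).
inline unsigned int block_start(unsigned int block_idx, unsigned int block_dim) {
    return 2 * block_idx * block_dim;
}
```

With 10 inputs and a block size of 4, `num_blocks(10, 4)` is 2, so `blockIdx.x` only takes the values 0 and 1 and the blocks start at inputs 0 and 8. The second block covers indices 8 to 15, so it does hold the last two elements. Note that the posted kernel also reads indices 10 to 15 without checking `len`, so it still needs a bounds check.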

    The usual way to do reductions (in this case, a sum) is something like this.

    1    __global__ void total(float * input, float * output, int len) {
    2    
    3    __shared__ float partialSum[BLOCK_SIZE];
    4      
    5     unsigned int t = threadIdx.x;
    6     unsigned int start = blockIdx.x*blockDim.x + t;  // this thread's first element
    7     partialSum[t] = 0;
    8     for (int T = start; T < len; T += blockDim.x * gridDim.x)  // grid-stride loop
    9        partialSum[t] += input[T];
    10    for (unsigned int stride = blockDim.x/2; stride >=1; stride >>=1)
    11      {
    12       __syncthreads();  // wait until the previous level's sums are written
    13        
    14       if (t < stride)
    15         partialSum[t] += partialSum[t + stride];
    16      }
    17      if (t == 0) output[blockIdx.x] = partialSum[0];  // one write per block
    18  }
    

    So if you have len = 1024 and BLOCK_SIZE = 256, you could launch any number of blocks <= 4.

    Let us see what happens in the for loop on lines 8 and 9 when you launch various numbers of blocks. Also keep in mind that output needs one element per block.

    • Blocks == 4 means blockDim.x * gridDim.x = 256 x 4 = 1024, so each thread iterates only once. The number of non-coalesced writes to output is 4.
    • Blocks == 2 means blockDim.x * gridDim.x = 256 x 2 = 512, so each thread iterates twice. The number of non-coalesced writes to output is 2.
    • Blocks == 1 means blockDim.x * gridDim.x = 256 x 1 = 256, so each thread iterates 4 times. The number of non-coalesced writes to output is 1.
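The iteration counts in the list above can be checked with a small host-side sketch that replays the loop from lines 8 and 9 for a single thread. The helper `loop_iterations` is hypothetical, and it assumes the thread starts at its own global index.

```cuda
// Hypothetical helper: counts how many times one thread executes the body
// of the loop on lines 8 and 9, assuming it starts at index first_index and
// advances by the total number of launched threads each time.
inline int loop_iterations(int first_index, int len, int block_dim, int grid_dim) {
    int count = 0;
    for (int T = first_index; T < len; T += block_dim * grid_dim) {
        count++;
    }
    return count;
}
```

For len = 1024 and a block size of 256, `loop_iterations(0, 1024, 256, 4)` is 1, `loop_iterations(0, 1024, 256, 2)` is 2, and `loop_iterations(0, 1024, 256, 1)` is 4, matching the three cases listed.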

    So launching fewer blocks has the benefit of reducing the memory footprint and the number of global writes. However, it comes at the cost of decreased parallelism.

    Ideally, you will need to find heuristically which combination works best for your algorithm. Or you could use existing libraries that do it for you.
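As one example of the library route, the sketch below uses Thrust, which ships with the CUDA toolkit. The answer does not name a library, so choosing Thrust here is an assumption, and the wrapper name `sum_with_thrust` is hypothetical.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <vector>

// Hypothetical wrapper: copies the input to the device and lets
// thrust::reduce pick the launch configuration for the sum.
float sum_with_thrust(const std::vector<float> &host_input) {
    thrust::device_vector<float> d_input(host_input.begin(), host_input.end());
    return thrust::reduce(d_input.begin(), d_input.end(), 0.0f,
                          thrust::plus<float>());
}
```

The trade-off is that the block count and the amount of work per thread are no longer under your control, which is usually acceptable for a plain sum.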

    The kernel in question chooses to launch half the blocks to gain some performance. But using twice as much shared memory may not be required: each thread could add its two inputs as it loads them, which needs only BLOCK_SIZE floats of shared memory.
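A minimal sketch of that idea follows. It keeps the two loads per thread from the question's kernel but sums them on the way in and treats out-of-range elements as 0. The names `total_guarded` and `sum_on_device` are hypothetical. The sketch assumes BLOCK_SIZE is a power of two and that the kernel is launched with blockDim.x == BLOCK_SIZE, and it omits CUDA error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 512

// Each thread adds its two inputs while loading, so the block needs only
// BLOCK_SIZE floats of shared memory. Assumes blockDim.x == BLOCK_SIZE.
__global__ void total_guarded(const float *input, float *output, int len) {
    __shared__ float partialSum[BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;
    unsigned int i = start + t;               // first element for this thread
    unsigned int j = start + blockDim.x + t;  // second element for this thread
    unsigned int n = static_cast<unsigned int>(len);

    float sum = 0.0f;
    if (i < n) sum += input[i];  // skip reads past the end of the input
    if (j < n) sum += input[j];
    partialSum[t] = sum;

    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride >>= 1) {
        __syncthreads();  // previous level must be complete before reading it
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0)
        output[blockIdx.x] = partialSum[0];  // one partial sum per block
}

// Hypothetical host wrapper: reduces per block on the GPU, then adds the
// per-block results on the CPU. Error checking is omitted for brevity.
float sum_on_device(const std::vector<float> &host_input) {
    int len = static_cast<int>(host_input.size());
    if (len == 0) return 0.0f;
    int numBlocks = (len + 2 * BLOCK_SIZE - 1) / (2 * BLOCK_SIZE);

    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in, len * sizeof(float));
    cudaMalloc(&d_out, numBlocks * sizeof(float));
    cudaMemcpy(d_in, host_input.data(), len * sizeof(float),
               cudaMemcpyHostToDevice);

    total_guarded<<<numBlocks, BLOCK_SIZE>>>(d_in, d_out, len);

    std::vector<float> partial(numBlocks);
    cudaMemcpy(partial.data(), d_out, numBlocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float result = 0.0f;
    for (float p : partial) result += p;
    return result;
}
```

Because only thread 0 writes the block's result, no thread can store a stale value of `partialSum[0]` to `output` before the final addition finishes.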
