I am trying to write a parallel prefix scan on cuda by following this

Question

0

Editorial Team

Asked: May 28, 20262026-05-28T13:55:32+00:00 2026-05-28T13:55:32+00:00

I am trying to write a parallel prefix scan on cuda by following this

0

I am trying to write a parallel prefix scan on cuda by following this tutorial –

I am trying the work-inefficient ‘double buffered one’ as explained in the tutorial.

This is what I have:

// double buffered naive.

// d = number of iterations, N - size, and input.
__global__ void prefixsum(int* in, int d, int N)
{

        //get the block index
        int idx = blockIdx.x*blockDim.x + threadIdx.x;

        // allocate shared memory
        extern __shared__ int temp_in[], temp_out[];

        // copy data to it.
        temp_in[idx] = in[idx];
        temp_out[idx] = 0;

        // block until all threads copy

        __syncthreads();

        int i = 1;
        for (i; i<=d; i++)
        {
                if (idx < N+1 && idx >= (int)pow(2.0f,(float)i-1))
                {
                        // copy new result to temp_out
                        temp_out[idx] += temp_in[idx - (int)pow(2.0f,(float)i-1)] + temp_in[idx];
                }
                else
                {
                        // if the element is to remain unchanged, copy the same thing
                        temp_out[idx] = temp_in[idx];
                }
                // block until all theads do this
                __syncthreads();
                // copy the result to temp_in for next iteration
                temp_in[idx] = temp_out[idx];
                // wait for all threads to do so
                __syncthreads();
        }

        //finally copy everything back to global memory
        in[idx] = temp_in[idx];
}

Can you point out what’s wrong with this? I have written comments for what I think should happen.

This is the kernel invocation –

prefixsum<<<dimGrid,dimBlock>>>(d_arr, log(SIZE)/log(2), N);

This is the grid and block allocations:

dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);

The problem is that I don’t get the correct output for any input that’s more than 8 elements long.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-28T13:55:33+00:00

I see two problems in your code

Problem 1: extern shared memory

Agh…. I hate extern __shared__ memory. The problem is, that the compiler does not know how big are the arrays. As a result, they both point to the same piece of memory!
So, in your case: temp_in[5] and temp_out[5] refer to the same word in shared memory.

If you really want the extern __shared__ memory, you can manually offset the second array, for example something like this:

size_t size = .... //the size of your array
extern __shared__ int memory[];
int* temp_in=memory;
int* temp_out=memory+size;

Problem 2: Shared array index

Shared memory is private for each block. That is, temp[0] in one block can be different than temp[0] in another block. However, you index it by blockIdx.x*blockDim.x + threadIdx.x as if the temp arrays were shared between the blocks.

Instead, you should most likely index your temp arrays just by threadIdx.x.

Of course, the idx array is global and you index that one correctly.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am trying to write a parallel prefix scan on cuda by following this

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply