The Archive Base

Editorial Team
Asked: May 26, 2026

I have a buffer in global memory that I want to copy into shared memory in each block, so as to speed up my read-only access. Every thread in each block will read the whole buffer, at different positions, concurrently.

How does one do that?

I know the size of the buffer only at run time:

__global__ void foo( int *globalMemArray, int N )
{
    extern __shared__ int s_array[];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if( idx < N )
    {

       ...?
    }
}
1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 5:12 pm

    The first point to make is that shared memory is a small resource. Historically it has been limited to a maximum of either 16 KB or 48 KB per streaming multiprocessor (SM), depending on which GPU you are using and how it is configured (newer architectures offer somewhat more, but the constraint is the same). Unless your global memory buffer is very small, you will not be able to load all of it into shared memory at the same time.
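Rather than relying on a remembered figure, the limit can be read from the device at run time. The following is a minimal sketch, assuming device 0 and a hypothetical helper name `fitsInSharedMemory`; it uses the `sharedMemPerBlock` field of `cudaDeviceProp`, which is the amount a single block can allocate by default.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if `bytes` fits in one block's default
// shared memory allowance on the given device.
bool fitsInSharedMemory(size_t bytes, int device = 0)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false; // no usable device, so nothing fits
    }
    return bytes <= prop.sharedMemPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No CUDA device available\n");
        return 1;
    }
    std::printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);

    // A buffer of one million ints (about 4 MB) as an example size.
    const size_t bytes = static_cast<size_t>(1000000) * sizeof(int);
    std::printf("Buffer of %zu bytes fits: %s\n", bytes,
                fitsInSharedMemory(bytes) ? "yes" : "no");
    return 0;
}
```

If the buffer does not fit, it has to be processed in segments, as discussed next.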

    The second point to make is that the contents of shared memory only have the scope and lifetime of the block they are associated with. Your sample kernel only has a single global memory argument, which makes me think that you are either under the misapprehension that the contents of a shared memory allocation can be preserved beyond the life span of the block that filled it, or that you intend to write the results of the block calculations back into the same global memory array from which the input data was read. The first possibility is wrong and the second will result in memory races and inconsistent results. It is probably better to think of shared memory as a small, block-scope L1 cache which is fully programmer managed than as some sort of faster version of global memory.

    With those points out of the way, a kernel which loaded successive segments of a large input array, processed them and then wrote some per-thread final result back into global memory might look something like this:

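    template <int blocksize>
    __global__ void foo( int *globalMemArray, int *globalMemOutput, int N )
    {
        // Assumes the kernel is launched with blockDim.x == blocksize.
        __shared__ int s_array[blocksize];
        // Number of blocksize-wide segments needed to cover N elements (rounded up).
        int npasses = (N + blocksize - 1) / blocksize;

        // Every thread runs the same number of iterations, so each
        // __syncthreads() below is reached by the whole block.
        for(int pos = threadIdx.x; pos < (blocksize*npasses); pos += blocksize) {
            // Load one segment; threads past the end of the input load nothing.
            if( pos < N ) {
                s_array[threadIdx.x] = globalMemArray[pos];
            }
            __syncthreads(); // segment fully loaded before any thread reads it

            // Calculations using partial buffer contents
            .......

            __syncthreads(); // all reads finished before the next load overwrites the segment
        }

        // write final per thread result to output
        globalMemOutput[threadIdx.x + blockIdx.x*blockDim.x] = .....;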
    } 
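The kernel above leaves the actual calculation elided. To make the load, synchronize, compute, synchronize pattern concrete, here is a self-contained sketch with a made-up calculation: each thread counts how many buffer elements are smaller than its own global index. The names `countSmaller` and `runCountSmaller`, the block size of 128 and the extra `nThreads` bound are assumptions for illustration, not part of the original question, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical calculation: out[gid] = number of elements in `in` that are < gid.
// Every thread scans the whole buffer, one shared-memory segment at a time.
template <int blocksize>
__global__ void countSmaller(const int *in, int *out, int N, int nThreads)
{
    __shared__ int tile[blocksize];
    const int gid = blockIdx.x * blocksize + threadIdx.x;
    int count = 0;

    // The loop bounds depend only on N, so all threads iterate together.
    for (int base = 0; base < N; base += blocksize) {
        const int pos = base + threadIdx.x;
        if (pos < N) tile[threadIdx.x] = in[pos];
        __syncthreads(); // segment is complete

        const int valid = min(blocksize, N - base); // last segment may be short
        for (int j = 0; j < valid; ++j)
            if (tile[j] < gid) ++count;
        __syncthreads(); // done reading before the next load
    }
    if (gid < nThreads) out[gid] = count;
}

// Host wrapper: copies the input to the device, runs the kernel, returns the output.
std::vector<int> runCountSmaller(const std::vector<int> &h_in, int nThreads)
{
    constexpr int kBlock = 128; // assumed block size
    const int N = static_cast<int>(h_in.size());
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, nThreads * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), N * sizeof(int), cudaMemcpyHostToDevice);

    const int grid = (nThreads + kBlock - 1) / kBlock;
    countSmaller<kBlock><<<grid, kBlock>>>(d_in, d_out, N, nThreads);

    std::vector<int> h_out(nThreads);
    cudaMemcpy(h_out.data(), d_out, nThreads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> in(1000);
    for (int i = 0; i < 1000; ++i) in[i] = i; // buffer holds 0..999
    const std::vector<int> out = runCountSmaller(in, 300);
    // Exactly i of the values 0..999 are smaller than i, for i < 1000.
    for (int i = 0; i < 300; ++i) assert(out[i] == i);
    std::printf("all %zu results match\n", out.size());
    return 0;
}
```

Note that the output goes to a separate array, so no block ever writes to memory that another block is still reading.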
    

    In this case I have specified the shared memory array size as a template parameter, because it isn’t really necessary to allocate the shared memory array dynamically at run time, and the compiler has a better chance of performing optimizations when the array size is known at compile time. In the worst case, the host code can select between a few different kernel instances at run time.
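Selecting between kernel instances at run time can be as simple as a `switch` on the host. The sketch below is one way to do it, under stated assumptions: `blockSum` is a hypothetical stand-in kernel (a per-block tree reduction), `launchBlockSum` and `sumOnDevice` are illustrative helper names, and only three power-of-two block sizes are instantiated.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block sums its own segment of `in` into out[blockIdx.x].
// Requires blocksize to be a power of two and blockDim.x == blocksize.
template <int blocksize>
__global__ void blockSum(const int *in, int *out, int N)
{
    __shared__ int tile[blocksize];
    const int pos = blockIdx.x * blocksize + threadIdx.x;
    tile[threadIdx.x] = (pos < N) ? in[pos] : 0;
    __syncthreads();
    for (int stride = blocksize / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// Run-time selection between the compiled instances.
// Returns false if no instance exists for the requested block size.
bool launchBlockSum(const int *d_in, int *d_out, int N, int grid, int blocksize)
{
    switch (blocksize) {
        case 64:  blockSum<64><<<grid, 64>>>(d_in, d_out, N);   break;
        case 128: blockSum<128><<<grid, 128>>>(d_in, d_out, N); break;
        case 256: blockSum<256><<<grid, 256>>>(d_in, d_out, N); break;
        default:  return false;
    }
    return cudaGetLastError() == cudaSuccess;
}

// Host wrapper: sums h_in on the device; `total` is valid only if true is returned.
bool sumOnDevice(const std::vector<int> &h_in, int blocksize, long long &total)
{
    if (blocksize <= 0 || h_in.empty()) return false;
    const int N = static_cast<int>(h_in.size());
    const int grid = (N + blocksize - 1) / blocksize;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, grid * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), N * sizeof(int), cudaMemcpyHostToDevice);

    bool ok = launchBlockSum(d_in, d_out, N, grid, blocksize);
    std::vector<int> partial(grid, 0);
    if (ok) {
        ok = cudaMemcpy(partial.data(), d_out, grid * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);

    total = 0;
    for (int p : partial) total += p; // finish the sum on the host
    return ok;
}

int main()
{
    const std::vector<int> ones(1000, 1);
    long long total = 0;
    if (sumOnDevice(ones, 128, total))
        std::printf("sum with blocksize 128: %lld\n", total);
    if (!sumOnDevice(ones, 100, total))
        std::printf("blocksize 100: no kernel instance compiled\n");
    return 0;
}
```

Each `case` costs one extra compiled kernel, so the set of supported sizes is best kept small.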

    The CUDA SDK contains a number of good example codes which demonstrate different ways that shared memory can be used in kernels to improve memory read and write performance. The matrix transpose, reduction and 3D finite difference method examples are all good models of shared memory usage. Each also has a good paper which discusses the optimization strategies behind the shared memory use in the codes. You would be well served by studying them until you understand how and why they work.
