The Archive Base

Editorial Team
Asked: June 9, 2026

I read some CUDA documentation that refers to local memory. (It is mostly the

I read some CUDA documentation that refers to local memory. (It is mostly the early documentation.) The device properties report a local memory size (per thread). What does ‘local’ memory mean? What is it, where is it, and how do I access it? It is __device__ memory, no?

The device properties also report global, shared, and constant memory sizes.
Are these statements correct:
Global memory is __device__ memory. It has grid scope, and a lifetime of the grid (kernel).
Constant memory is __device__ __constant__ memory. It has grid scope & a lifetime of the grid (kernel).
Shared memory is __device__ __shared__ memory. It has single block scope & a lifetime of that block (of threads).

I’m thinking shared memory is SM memory, i.e. memory that only that single SM has direct access to, and a resource that is rather limited. Isn’t an SM assigned a bunch of blocks at a time? Does this mean an SM can interleave the execution of different blocks (or not)? That is, does it run block*A* threads until they stall, then run block*B* threads until they stall, then swap back to block*A* threads again? Or does the SM run a set of threads for block*A* until they stall, then swap in another set of block*A* threads, and continue this way until block*A* is exhausted, and only then begin work on block*B*?
I ask because of shared memory. If a single SM is swapping code in from two different blocks, then how does the SM quickly swap the shared memory chunks in and out?
(I’m thinking the latter scenario is true, and there is no swapping in/out of shared memory space. Block*A* runs until completion, then block*B* starts execution. Note: block*A* could be from a different kernel than block*B*.)

1 Answer

  1. Editorial Team
    Added an answer on June 9, 2026 at 2:01 am

    From the CUDA C Programming Guide section 5.3.2.2, we see that local memory is used in several circumstances:

    • When a thread has an array that the compiler cannot prove is indexed only with constant quantities (registers are not addressable, so a dynamically indexed array cannot live in them)
    • When an array or struct has a size known at compile time, but that size is too big for the register space
    • When the kernel has already used up all the registers available to it (so if we have filled the registers with n ints, the (n+1)th int will go into local memory). This last case is register spilling, and it should be avoided, because:

    “Local” memory actually resides in device (global) memory, which means reads and writes to it are slow compared to registers and shared memory, although on compute capability 2.x and later they are cached in L2 (and on some architectures in L1). It is private to each thread, and the compiler uses it automatically for any variable, array, etc. in the kernel that doesn’t fit in the registers, isn’t shared memory, and wasn’t passed in as global memory. You don’t have to do anything explicit to use it, and you don’t declare it with __device__; in fact you should try to minimize its use, since registers and shared memory are much faster.
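
As a rough illustration of how a per-thread array can end up in local memory, here is a minimal sketch. The kernel name `dynamicIndexKernel` and the host helper `countMismatches` are made up for this example, and whether `scratch` really lands in local memory is the compiler's decision, so treat it as a likely outcome rather than a guarantee.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread fills a private scratch array and then
// reads it back at an index that is only known at run time. Registers cannot
// be indexed dynamically, so the compiler will typically (not necessarily)
// place `scratch` in local memory.
__global__ void dynamicIndexKernel(const int *idx, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int scratch[32];                     // private to this thread
    for (int i = 0; i < 32; ++i)
        scratch[i] = i * tid;

    out[tid] = scratch[idx[tid] & 31];   // run-time index into the array
}

// Hypothetical host helper: runs the kernel on n elements and returns how
// many results differ from the value computed on the host.
int countMismatches(int n)
{
    std::vector<int> hIdx(n), hOut(n, -1);
    for (int i = 0; i < n; ++i) hIdx[i] = i % 32;

    int *dIdx = nullptr, *dOut = nullptr;
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    dynamicIndexKernel<<<blocks, threads>>>(dIdx, dOut, n);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIdx);
    cudaFree(dOut);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != (i % 32) * i) ++bad;
    return bad;
}

int main()
{
    printf("mismatches: %d\n", countMismatches(1000));
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` makes ptxas print per-kernel resource usage, including the stack frame size and spill stores/loads; nonzero values there suggest the kernel is using local memory.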

    Edit:
    Re: shared memory: two blocks cannot exchange shared memory or look at each other’s shared memory. Since the order of execution of blocks is not guaranteed, a block that waits on another block might tie up an SM for hours waiting for that block to get executed. Similarly, two kernels running on the device at the same time can’t see each other’s memory unless it is global memory, and even then you’re playing with fire (race conditions). As far as I am aware, blocks/kernels can’t really send “messages” to each other. On the swapping question: an SM can have several blocks resident at the same time, each holding its own portion of shared memory for the block’s whole lifetime, and the scheduler switches between their warps, so no shared memory is ever swapped in or out. Beyond that, the order of execution for the blocks can be different every time, and it’s bad practice to stall a block waiting for another.
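
To illustrate the point that blocks only share data through global memory, here is a small sketch. Each block reduces its slice of the input in its own `__shared__` array, then one thread per block adds the block's result to a global total with `atomicAdd`, which avoids the race condition. The names `blockSumKernel`, `sumOnDevice`, and `kThreads` are invented for this example, and the kernel assumes it is launched with exactly `kThreads` threads per block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size (must be a power of two)

// Hypothetical kernel: per-block reduction in shared memory, then a single
// atomic update to a total that lives in global memory.
__global__ void blockSumKernel(const int *in, int *total, int n)
{
    __shared__ int partial[kThreads];  // one copy per block, private to it

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    // Tree reduction within the block; no other block can see `partial`.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Global memory is the only place where blocks can combine results;
    // atomicAdd keeps concurrent updates from different blocks race-free.
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);
}

// Hypothetical host helper: sums n ones on the device and returns the total.
int sumOnDevice(int n)
{
    std::vector<int> hIn(n, 1);
    int *dIn = nullptr, *dTotal = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(int));

    int blocks = (n + kThreads - 1) / kThreads;
    blockSumKernel<<<blocks, kThreads>>>(dIn, dTotal, n);

    int hTotal = -1;
    cudaMemcpy(&hTotal, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTotal);
    return hTotal;
}

int main()
{
    printf("total: %d\n", sumOnDevice(1000));
    return 0;
}
```

Note that no block ever waits for another here: each one finishes its own work and publishes a result, so the outcome does not depend on the order in which blocks are scheduled.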


© 2021 The Archive Base. All Rights Reserved