The Archive Base

Editorial Team
Asked: June 6, 2026 (2026-06-06T20:38:07+00:00)

After Compute Capability 2.0 (Fermi) was released, I’ve wondered if there are any use

After Compute Capability 2.0 (Fermi) was released, I’ve wondered if there are any use cases left for shared memory. That is, when is it better to use shared memory than just let L1 perform its magic in the background?

Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modifications?

To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1) and synchronize with __threadfence_block()? The latter option should be easier to implement, since the code doesn’t have to keep track of two different locations for each value, and it should be faster because there is no explicit copy from global to shared memory. Since the data gets cached in L1, threads don’t have to wait for data to actually make it all the way out to global memory.
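
For reference, the shared-memory pattern described above can be sketched as follows. This is a minimal illustration, not code from any particular application: the kernel name `reverse_chunks`, the host helper `reverse_chunks_on_gpu`, and the block size are all made up for the example, the input length is assumed to be a multiple of the block size, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // hypothetical block size for this sketch

// Reverses each kBlockSize-element chunk of `data` in place.
__global__ void reverse_chunks(int *data)
{
    __shared__ int tile[kBlockSize];
    int base = blockIdx.x * kBlockSize;
    int t = threadIdx.x;

    tile[t] = data[base + t];  // each thread writes one value to shared memory
    __syncthreads();           // barrier: every thread in the block waits here
    data[base + t] = tile[kBlockSize - 1 - t];  // read a value another thread wrote
}

// Hypothetical host helper; assumes in.size() is a non-zero multiple of kBlockSize.
std::vector<int> reverse_chunks_on_gpu(const std::vector<int> &in)
{
    std::vector<int> out(in.size());
    size_t bytes = in.size() * sizeof(int);
    int *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, in.data(), bytes, cudaMemcpyHostToDevice);
    unsigned blocks = static_cast<unsigned>(in.size() / kBlockSize);
    reverse_chunks<<<blocks, kBlockSize>>>(d);
    cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}
```

One thing worth keeping in mind when comparing the two approaches: `__syncthreads()` is a barrier that all threads of the block must reach, whereas `__threadfence_block()` only orders the calling thread's own memory operations and does not make any thread wait for another.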

With shared memory, one is guaranteed that a value that was put there remains there throughout the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it’s better to cache such rarely used data in shared memory than to let the L1 manage them based on the usage pattern that the algorithm actually has?

1 Answer

  1. Editorial Team
    Added an answer on June 6, 2026 at 8:38 pm

    As far as I know, the L1 cache in a GPU behaves much like the cache in a CPU, so your comment that “This is as opposed to values in L1, which get evicted if they are not used often enough” doesn’t make much sense to me.

    Data in the L1 cache isn’t evicted because it isn’t used often enough. Usually it is evicted when a request is made for a memory region that isn’t currently in the cache and whose address maps to a cache location that is already in use. I don’t know the exact caching algorithm NVIDIA employs, but assuming a regular n-way set-associative cache, each memory location can only be cached in a small subset of the entire cache, determined by its address.
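
To make the address-to-set mapping concrete, here is a small sketch of how a simple n-way set-associative cache might pick the set for an address. The geometry below (line size, number of sets, number of ways) is invented purely for illustration and is not NVIDIA's documented L1 layout; the function names are hypothetical too.

```cuda
#include <cstdint>

// Hypothetical cache geometry, chosen only for illustration.
// These are NOT NVIDIA's documented L1 parameters.
constexpr std::uint64_t kLineBytes = 128;  // bytes per cache line
constexpr std::uint64_t kNumSets   = 32;   // number of sets
constexpr std::uint64_t kWays      = 4;    // lines that fit in one set

// Set that an address maps to: line number modulo the number of sets.
__host__ __device__ inline std::uint64_t cache_set(std::uint64_t addr)
{
    return (addr / kLineBytes) % kNumSets;
}

// Two addresses compete for the same kWays slots when they lie in
// different cache lines that map to the same set.
__host__ __device__ inline bool conflicts(std::uint64_t a, std::uint64_t b)
{
    return cache_set(a) == cache_set(b) && (a / kLineBytes) != (b / kLineBytes);
}
```

Under these assumed parameters, addresses that are `kLineBytes * kNumSets` (4096) bytes apart land in the same set, so touching more than `kWays` such lines forces an eviction no matter how recently or how often the evicted line was used elsewhere in the cache.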

    I suppose this may also answer your question. With shared memory, you get full control over what gets stored where, while with a cache everything is done automatically. Even though the compiler and the GPU can be very clever in optimizing memory accesses, you can sometimes still find a better way, since you’re the one who knows what input will be given and which threads will do what (to a certain extent, of course).
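
As one possible illustration of that kind of control, the sketch below stages exactly the elements a block needs (its own tile plus a small halo on each side) into shared memory before computing a 1D stencil. The kernel name `stencil_sum`, the helper `stencil_sum_on_gpu`, the block size, and the radius are assumptions made for this example; error checking is omitted and the input is assumed to be non-empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock  = 128;  // threads per block (assumption)
constexpr int kRadius = 2;    // stencil radius (assumption)

// out[i] = sum of in[i-kRadius .. i+kRadius]; out-of-range inputs count as 0.
__global__ void stencil_sum(const int *in, int *out, int n)
{
    __shared__ int tile[kBlock + 2 * kRadius];
    int g = blockIdx.x * kBlock + threadIdx.x;  // global index
    int l = threadIdx.x + kRadius;              // index inside the tile

    // The programmer decides exactly which values live in the tile.
    tile[l] = (g < n) ? in[g] : 0;
    if (threadIdx.x < kRadius) {
        int left  = g - kRadius;  // left halo element
        int right = g + kBlock;   // right halo element
        tile[l - kRadius] = (left >= 0 && left < n) ? in[left] : 0;
        tile[l + kBlock]  = (right < n) ? in[right] : 0;
    }
    __syncthreads();  // the whole tile is loaded before anyone reads it

    if (g < n) {
        int sum = 0;
        for (int k = -kRadius; k <= kRadius; ++k) sum += tile[l + k];
        out[g] = sum;
    }
}

// Hypothetical host helper; assumes a non-empty input.
std::vector<int> stencil_sum_on_gpu(const std::vector<int> &in)
{
    int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    size_t bytes = n * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    stencil_sum<<<(n + kBlock - 1) / kBlock, kBlock>>>(d_in, d_out, n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Each input element is read from global memory about once per block and then reused from the tile by up to `2 * kRadius + 1` threads. Whether this actually beats relying on L1 depends on the hardware and the access pattern, so it is worth measuring rather than assuming.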
