After Compute Capability 2.0 (Fermi) was released, I’ve wondered if there are any use cases left for shared memory. That is, when is it better to use shared memory than just let L1 perform its magic in the background?
Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modifications?
To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1), and synchronize with __threadfence_block()? The latter option should be easier to implement, since one doesn’t have to keep track of two copies of each value (one in global memory and one in shared memory), and it should be faster because there is no explicit copying from global to shared memory. Since the data gets cached in L1, threads don’t have to wait for data to actually make it all the way out to global memory.
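For reference, the shared-memory pattern I mean looks roughly like the sketch below. It is only a minimal illustration, not taken from any real code: the kernel name reverseTile, the helper reverseTilesOnGpu, and the block size BLOCK are made up for this post. Each block reverses its own tile in place, so every thread needs a value that a different thread loaded.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // hypothetical block size chosen for this sketch

// Each block reverses its own BLOCK-element tile of `data` in place.
__global__ void reverseTile(int *data)
{
    __shared__ int tile[BLOCK];          // one copy of the tile per block
    const int base = blockIdx.x * BLOCK;
    const int t = threadIdx.x;

    tile[t] = data[base + t];            // explicit copy: global -> shared
    __syncthreads();                     // barrier: all loads are done and visible
    data[base + t] = tile[BLOCK - 1 - t];  // read a value another thread wrote
}

// Hypothetical host helper: reverses every BLOCK-sized tile of `v` on the GPU.
// Returns false if the size is not a multiple of BLOCK or a CUDA call fails.
bool reverseTilesOnGpu(std::vector<int> &v)
{
    if (v.empty() || v.size() % BLOCK != 0) return false;
    const size_t bytes = v.size() * sizeof(int);

    int *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseTile<<<static_cast<unsigned>(v.size() / BLOCK), BLOCK>>>(d);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}
```

Calling reverseTilesOnGpu on a vector holding 0..511 should leave the first tile as 255..0 and the second as 511..256. Note that __syncthreads() is an execution barrier, whereas __threadfence_block() only orders the calling thread's memory accesses and does not make threads wait for each other.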
With shared memory, one is guaranteed that a value that was put there remains there throughout the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it’s better to cache such rarely used data in shared memory than to let the L1 manage them based on the usage pattern that the algorithm actually has?
As far as I know, L1 cache in a GPU behaves much like the cache in a CPU. So your comment that “This is as opposed to values in L1, which get evicted if they are not used often enough” doesn’t make much sense to me.
Data in L1 cache isn’t evicted because it isn’t used often enough. Usually it is evicted when a request is made for a memory region that isn’t currently in the cache, and whose address maps to a cache location that is already in use. I don’t know the exact caching algorithm employed by NVIDIA, but assuming a regular n-way set-associative cache, each memory address can only be cached in a small subset of the entire cache, determined by its address.
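To make the set-associative point concrete, here is a small sketch of how an address would map to a cache set. The geometry is purely an assumption for illustration (128-byte lines, 32 sets, 4 ways, which adds up to 16 KB); I don't know the real L1 organization, and CacheGeometry, setIndex, and conflicts are names made up for this post.

```cuda
#include <cstdint>

// Hypothetical cache geometry, assumed only for illustration.
struct CacheGeometry {
    uint32_t lineBytes;  // bytes per cache line
    uint32_t numSets;    // number of sets
    uint32_t ways;       // lines per set (associativity)
};

// Index of the set that the line containing `addr` maps to.
__host__ __device__ inline uint32_t setIndex(uint64_t addr, CacheGeometry g)
{
    return static_cast<uint32_t>((addr / g.lineBytes) % g.numSets);
}

// True if two addresses lie in different lines that compete for the same set.
__host__ __device__ inline bool conflicts(uint64_t a, uint64_t b, CacheGeometry g)
{
    const bool sameLine = (a / g.lineBytes) == (b / g.lineBytes);
    return !sameLine && setIndex(a, g) == setIndex(b, g);
}
```

With the assumed geometry {128, 32, 4}, addresses 0 and 4096 land in the same set, so once more than 4 such lines are in use, one of them has to be evicted no matter how recently or how often it was accessed.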
I suppose this may also answer your question. With shared memory, you get full control over what gets stored where, while with the cache, everything is done automatically. Even though the compiler and the GPU can still be very clever in optimizing memory accesses, you can sometimes still find a better way, since you’re the one who knows what input will be given, and what threads will do what (to a certain extent, of course).
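As one example of that kind of control, a block can stage a small lookup table in shared memory once and keep it there for its whole lifetime, regardless of what else the threads touch. This is just a sketch under my own assumptions: applyLut, lookupOnGpu, and LUT_SIZE are hypothetical names, and every index is assumed to lie in [0, LUT_SIZE).

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int LUT_SIZE = 64;  // hypothetical table size for this sketch

// out[i] = lut[idx[i]]; assumes every idx[i] is in [0, LUT_SIZE).
__global__ void applyLut(const float *lut, const int *idx, float *out, int n)
{
    __shared__ float s_lut[LUT_SIZE];

    // Threads of the block cooperatively copy the table into shared memory.
    for (int i = threadIdx.x; i < LUT_SIZE; i += blockDim.x)
        s_lut[i] = lut[i];
    __syncthreads();  // the table is now complete and stays resident

    const int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = s_lut[idx[gid]];  // arbitrary access pattern, never evicted
}

// Hypothetical host helper; returns an empty vector if a CUDA call fails.
std::vector<float> lookupOnGpu(const std::vector<float> &lut,
                               const std::vector<int> &idx)
{
    if (lut.size() != LUT_SIZE || idx.empty()) return {};
    const int n = static_cast<int>(idx.size());
    float *dLut = nullptr, *dOut = nullptr;
    int *dIdx = nullptr;
    std::vector<float> out(n);

    bool ok = cudaMalloc(&dLut, LUT_SIZE * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dIdx, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&dOut, n * sizeof(float)) == cudaSuccess
           && cudaMemcpy(dLut, lut.data(), LUT_SIZE * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dIdx, idx.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 128;
        applyLut<<<(n + threads - 1) / threads, threads>>>(dLut, dIdx, dOut, n);
        ok = cudaMemcpy(out.data(), dOut, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dLut);  // cudaFree(nullptr) is a no-op, so this is safe on failure
    cudaFree(dIdx);
    cudaFree(dOut);
    if (!ok) out.clear();
    return out;
}
```

Whether this actually beats letting L1 handle the table depends on the access pattern and the hardware, so it would need to be measured rather than assumed.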