Can someone please help me with a very simple example on how to use shared memory? The example included in the Cuda C programming guide seems cluttered by irrelevant details.
For example, if I copy a large array to the device global memory and want to square each element, how can shared memory be used to speed this up? Or is it not useful in this case?
In the specific case you mention, shared memory is not useful, for the following reason: each data element is used only once. For shared memory to be useful, you must use data transferred to shared memory several times, using good access patterns, to have it help. The reason for this is simple: just reading from global memory requires 1 global memory read and zero shared memory reads; reading it into shared memory first would require 1 global memory read and 1 shared memory read, which takes longer.
Here’s a simple example, where each thread in the block computes the corresponding value, squared, plus the average of both its left and right neighbors, squared:
Note that this is envisioned to work using only one block. The extension to more blocks should be straightforward. Assumes block dimension (1024, 1, 1) and grid dimension (1, 1, 1).