I’m trying to understand coalescing global memory.
Say I’d like to load an odd set of floats to global memory. Each thread will process a set of 3 floats. Say these floats are A, B, and C.
A0, B0, C0
A1, B1, C1
A2, B2, C2
..
A19, B19, C19
So the threads would grab the data like such:
Thread 0: A0, B0, C0
Thread 1: A1, B1, C1
Thread 2: A2, B2, C2
..
Thread 19: A19, B19, C19
First approach:
I could load 3 arrays: float A[20]; float B[20]; float C[20]; I’d have to call cudaMemcpy() three separate times to load the data into global memory. This approach would probably not coalesce very well.
Second approach:
A better approach would be something like:
struct dataPt {float A; float B; float C;};
dataPt data[20];
I could load the data with one cudaMemcpy(), but I’m not sure the memory access would coalesce very well.
Third approach:
struct dataPt2 {float A; float B; float C; float padding;};
dataPt2 data2[20];
or
struct __align__(16) dataPt3 {float A; float B; float C;};
dataPt3 data3[20];
I could load the data to global memory with a single cudaMemcpy(), and the thread access to data would be coalesced. (At the cost of wasted global memory.)
1) The 1st approach would not coalesce because each thread will probably need 3 bus cycles to load the input data.
2) The 2nd approach will coalesce for many of the threads, but there will be a few threads that will need two bus cycles to get the input data.
3) The 3rd approach will coalesce for all threads.
Is this accurate? Is there a significant difference between the 2nd & 3rd approach? Is there an approach that uses the 3 thread dimensions (threadIdx.x, threadIdx.y, threadIdx.z)?
Just amplifying on @talonmies' answer.
Let’s assume our kernel looks like this:
ignoring optimizations (which would result in an empty kernel), and assuming we launch 1 block of 32 threads:
Then we have 32 threads (1 warp) executing in lock-step. That means each thread will process the following kernel code line:
at exactly the same time. A load from global memory is coalesced when a warp loads a sequence of data items that all fall within a single 128-byte aligned region of global memory (for CC 2.0 devices). A perfectly coalesced load with 100% bandwidth utilization implies that each thread uses one unique 32-bit quantity within that 128-byte aligned region. If thread 0 loads a[0], thread 1 loads a[1], and so on, that is a typical example of a coalesced load.
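To make the 128-byte rule concrete, here is a minimal host-side C++ sketch that counts how many 128-byte aligned segments one warp-wide load touches. The helper name `segments_touched` and its parameters are hypothetical (they are not part of CUDA or of the original posts), and the model is deliberately simplified: it looks at a single load instruction and ignores caching.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper (an assumption for illustration, not a CUDA API):
// thread t of a warp reads `access_bytes` bytes starting at
// base + t * stride_bytes. Returns the number of distinct
// `segment_bytes`-aligned segments the whole warp touches.
inline std::size_t segments_touched(std::uint64_t base,
                                    std::uint64_t stride_bytes,
                                    std::uint64_t access_bytes = 4,
                                    unsigned warp_size = 32,
                                    std::uint64_t segment_bytes = 128) {
    std::set<std::uint64_t> segments;
    for (unsigned t = 0; t < warp_size; ++t) {
        const std::uint64_t first = base + t * stride_bytes;
        const std::uint64_t last = first + access_bytes - 1;
        // Record every segment overlapped by this thread's access.
        for (std::uint64_t s = first / segment_bytes; s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

With 32 threads each reading a contiguous 4-byte float, `segments_touched(0, 4)` is 1 (the perfectly coalesced case), while the same access pattern starting at byte offset 64, `segments_touched(64, 4)`, straddles two segments and returns 2.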
So in your first case, since the a[] array is all contiguous and aligned, and a[0..31] fit within a 128 byte aligned region in global memory, we get a coalesced load. thread 0 reads a[0], thread 1 reads a[1] etc.
In the second case, a[0] is not contiguous with a[1], and furthermore the elements a[0..31] (which are all loaded at the same code line) do not fit within a 128-byte aligned sequence in global memory. I’m going to let you parse what happens in your third case, but suffice it to say that, like the second case, the elements a[0..31] are neither contiguous nor contained within a single 128-byte aligned region in global memory. While it’s not necessary to have data items that are contiguous to achieve some level of coalescing, a 100% bandwidth utilization (“perfectly”) coalesced load from a 32-thread warp implies that each thread is using a unique 32-bit item, all of which are contiguous and contained within a single 128-byte aligned sequence in global memory.
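As a rough worked example of the three cases, the sketch below counts the 128-byte segments a warp touches when each thread loads only its A field, for element strides of 4 bytes (separate arrays), 12 bytes (unpadded struct) and 16 bytes (padded struct). The function names are hypothetical, and the model assumes a 128-byte aligned base address, one 4-byte load per thread, and no caching, so it only approximates real hardware behavior.

```cpp
#include <cstddef>

// Assumptions for illustration: 32-thread warp, 128-byte segments,
// 4-byte floats, base address aligned to 128 bytes.
const std::size_t kWarpSize = 32;
const std::size_t kSegmentBytes = 128;
const std::size_t kFloatBytes = 4;

// Thread t reads one float at byte offset t * stride_bytes.
// stride_bytes is assumed to be a multiple of 4, so no float straddles
// a segment boundary. Offsets grow monotonically, so counting changes
// of segment index counts the distinct segments.
inline std::size_t segments_for_stride(std::size_t stride_bytes) {
    std::size_t count = 0;
    std::size_t prev = static_cast<std::size_t>(-1);
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t seg = (t * stride_bytes) / kSegmentBytes;
        if (seg != prev) {
            ++count;
            prev = seg;
        }
    }
    return count;
}

// Fraction of the fetched bytes that the warp actually uses for this load.
inline double useful_fraction(std::size_t stride_bytes) {
    const double used = static_cast<double>(kWarpSize * kFloatBytes);
    const double fetched =
        static_cast<double>(segments_for_stride(stride_bytes) * kSegmentBytes);
    return used / fetched;
}
```

Under these assumptions a stride of 4 touches 1 segment, a stride of 12 touches 3, and a stride of 16 touches 4, so the load of A alone uses 100%, about 33%, and 25% of the fetched bytes respectively.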
A handy mental model is to contrast an Array of Structures (AoS) (which corresponds to your cases 2 and 3) and a Structure of Arrays (SoA), which is essentially your first case. SoAs usually present better possibilities for coalescing than AoSs. From the NVIDIA webinar page you may find this presentation interesting, especially slides 11-22 or so.
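For reference, the two layouts can be sketched in host-side C++ roughly as follows. The type and function names (`PointAoS`, `PointAoSPadded`, `PointsSoA`, `to_soa`) are made up for this illustration, and `std::vector` stands in for whatever host buffers you would actually pass to cudaMemcpy(); on the device the SoA form would simply be three float arrays.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures element (cases 2 and 3 in the question).
struct PointAoS {
    float A, B, C;  // 12 bytes per element on typical platforms
};

// Same fields, padded to 16 bytes through alignment (similar to __align__(16)).
struct alignas(16) PointAoSPadded {
    float A, B, C;
};

// Structure of Arrays (case 1): each field is stored contiguously.
struct PointsSoA {
    std::vector<float> A, B, C;
};

static_assert(sizeof(PointAoS) == 3 * sizeof(float), "assumes no padding");
static_assert(sizeof(PointAoSPadded) == 16, "assumes 4-byte float");

// Hypothetical helper: repack AoS data into SoA form, so that element i of
// each field sits next to element i + 1 of the same field.
inline PointsSoA to_soa(const std::vector<PointAoS>& pts) {
    PointsSoA out;
    out.A.reserve(pts.size());
    out.B.reserve(pts.size());
    out.C.reserve(pts.size());
    for (const PointAoS& p : pts) {
        out.A.push_back(p.A);
        out.B.push_back(p.B);
        out.C.push_back(p.C);
    }
    return out;
}
```

In the SoA form, thread i reading A[i] is adjacent to thread i + 1 reading A[i + 1], which is the contiguous pattern described above. The cost is three separate buffers (or three regions of one buffer) to copy to the device.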