In a 2-d or 3-d CUDA block, how are threads grouped into warps? My assumption is that they iterate first by x, then y, then z. For example, in threads with <z,y,x>, <0,0,[0-31]> is a warp, and so is <0,1,[0-31]>, etc. Is this correct?
Yes, that is correct. Threads are grouped first by X, then by Y, then by Z (thread coordinates) when forming warps (groups of 32 threads that execute together). More precisely, the threads of a block are numbered linearly with X varying fastest, and each run of 32 consecutive threads forms a warp, so your example holds when the block's X dimension is a multiple of 32. This has implications for good coalescing: arrange your use of thread coordinates in matrix subscripts so that warp-adjacent threads (typically those adjacent in X) access adjacent elements of the matrix, by using threadIdx.x or a value derived from it in the most rapidly varying matrix dimension. We typically want
data[z][y][x], not data[x][y][z]
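To make the ordering concrete, here is a minimal sketch of both points: the linear thread numbering that determines warp membership, and an indexing pattern that keeps a warp's accesses adjacent. The names `linearThreadId` and `scale3d`, the array sizes, and the block shape are all made up for illustration, and the warp size is assumed to be 32.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Linear thread ID within a block: x varies fastest, then y, then z.
// Threads whose linear IDs fall in the same run of 32 share a warp.
__host__ __device__ inline unsigned linearThreadId(unsigned x, unsigned y, unsigned z,
                                                   unsigned dimX, unsigned dimY) {
    return x + y * dimX + z * dimX * dimY;
}

// Hypothetical kernel: scales a 3-d array laid out as data[z][y][x],
// i.e. x is the contiguous (fastest-varying) dimension in memory.
__global__ void scale3d(float *data, int nx, int ny, int nz, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < nx && y < ny && z < nz) {
        // threadIdx.x feeds the fastest-varying subscript, so the 32 threads
        // of a warp read and write 32 consecutive floats.
        size_t idx = (static_cast<size_t>(z) * ny + y) * nx + x;
        data[idx] *= factor;
    }
}

int main() {
    const int nx = 64, ny = 8, nz = 4;              // illustrative sizes
    const size_t n = static_cast<size_t>(nx) * ny * nz;

    float *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

    // blockDim.x == 32, so each warp is exactly one row of x values.
    dim3 block(32, 4, 2);
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y,
              (nz + block.z - 1) / block.z);
    scale3d<<<grid, block>>>(data, nx, ny, nz, 2.0f);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "kernel failed\n");
        cudaFree(data);
        return 1;
    }

    // Thread (x=0, y=1, z=0) in this block has linear ID 32, so it starts warp 1.
    unsigned id = linearThreadId(0, 1, 0, block.x, block.y);
    printf("linear id = %u, warp = %u, data[0] = %.1f\n", id, id / 32, data[0]);

    cudaFree(data);
    return 0;
}
```

If you swapped the subscripts so that threadIdx.x indexed the slowest-varying dimension (the data[x][y][z] arrangement), neighbouring threads in a warp would be nx*ny elements apart in memory, and the accesses would no longer coalesce.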