Why does this kernel produce incoherent (uncoalesced) stores
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}
and this one doesn’t?
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // Write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}
I’m aware that the second one uses shared memory. But when I look at the indices of d_out, they seem to be the same in both kernels. Could you help me understand this?
Coalescing requires that the addresses within a warp follow a “base + tid” pattern, where tid is the thread index. In other words, as tid increases, so does the address. The comment in your second kernel calls this “forward order”. In the first kernel, the address decreases as tid increases, i.e. the accesses are in “backward order”, so the stores are not coalesced. Each block does write the same set of d_out locations in both kernels; what differs is which thread writes which location.
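To make the difference concrete, here is a small host-side sketch in plain C++ (no GPU involved) that only reproduces the index arithmetic of the two kernels. The helper names outIndexDirect, outIndexShared, blockStores and isBasePlusTid are made up for this illustration, and the integer parameters stand in for blockDim.x, gridDim.x, blockIdx.x and threadIdx.x. It is a model of the addressing pattern under those assumptions, not a statement about what any particular GPU does with it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helpers: they model only the d_out index computed by each thread.

// Output index in the first kernel (reversal done directly in global memory).
int outIndexDirect(int blockDim, int gridDim, int blockIdx, int tid) {
    int outOffset = blockDim * (gridDim - 1 - blockIdx);
    return outOffset + (blockDim - 1 - tid);
}

// Output index in the second kernel (reversal done in shared memory first).
int outIndexShared(int blockDim, int gridDim, int blockIdx, int tid) {
    int outOffset = blockDim * (gridDim - 1 - blockIdx);
    return outOffset + tid;
}

// d_out indices written by one block, listed in increasing tid order.
std::vector<int> blockStores(bool viaShared, int blockDim, int gridDim, int blockIdx) {
    std::vector<int> idx;
    for (int tid = 0; tid < blockDim; ++tid) {
        idx.push_back(viaShared ? outIndexShared(blockDim, gridDim, blockIdx, tid)
                                : outIndexDirect(blockDim, gridDim, blockIdx, tid));
    }
    return idx;
}

// True if thread t writes index base + t, i.e. the "base + tid" pattern.
bool isBasePlusTid(const std::vector<int>& idx) {
    for (std::size_t t = 0; t < idx.size(); ++t) {
        if (idx[t] != idx[0] + static_cast<int>(t)) return false;
    }
    return true;
}
```

For example, with blockDim = 4, gridDim = 2 and blockIdx = 0, blockStores(false, 4, 2, 0) gives 7, 6, 5, 4 for threads 0..3, while blockStores(true, 4, 2, 0) gives 4, 5, 6, 7. The same four locations are written in both cases, but only the second ordering passes isBasePlusTid.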