When I read the nbody code in Cuda-SDK, I went through some lines in the code and I found that it is a little bit different than their paper in GPUGems3 “Fast N-Body Simulation with CUDA”.
My questions are: First, why the blockIdx.x is still involved in loading memory from global to share memory as written in the following code?
for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
multithreadBodies ?
positions[WRAP(blockIdx.x + q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] : //this line
positions[WRAP(blockIdx.x + tile, gridDim.x) * p + threadIdx.x]; //this line
__syncthreads();
// This is the "tile_calculation" function from the GPUG3 article.
acc = gravitation(bodyPos, acc);
__syncthreads();
}
isn’t it supposed to be like this according to paper? I wonder why
sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
multithreadBodies ?
positions[WRAP(q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] :
positions[WRAP(tile, gridDim.x) * p + threadIdx.x];
Second, in the multiple threads per body why the threadIdx.x is still involved? Isn’t it supposed to be a fix value or not involving at all because the sum only due to threadIdx.y
if (multithreadBodies)
{
SX_SUM(threadIdx.x, threadIdx.y).x = acc.x; //this line
SX_SUM(threadIdx.x, threadIdx.y).y = acc.y; //this line
SX_SUM(threadIdx.x, threadIdx.y).z = acc.z; //this line
__syncthreads();
// Save the result in global memory for the integration step
if (threadIdx.y == 0)
{
for (int i = 1; i < blockDim.y; i++)
{
acc.x += SX_SUM(threadIdx.x,i).x; //this line
acc.y += SX_SUM(threadIdx.x,i).y; //this line
acc.z += SX_SUM(threadIdx.x,i).z; //this line
}
}
}
Can anyone explain this to me? Is it some kind of optimization for faster code?
I am an author of this code and the paper. Numbered answers correspond to your numbered questions.
The blockIdx.x offset to the
WRAPmacro is not mentioned in the paper because this is a micro-optimization. I’m not even sure it is worthwhile any more. The purpose was to ensure that different SMs were accessing different DRAM memory banks rather than all pounding on the same bank at the same time, to ensure we maximize the memory throughput during these loads. Without theblockIdx.xoffset, all simultaneously running thread blocks will access the same address at the same time. Since the overall algorithm is compute rather than bandwidth bound, this is definitely a minor optimization. Sadly, it makes the code more confusing.The sum is across
threadIdx.y, as you say, but each thread needs to do a separate sum (each thread computes gravitation for a separate body). Therefore we need to usethreadIdx.xto index the right column of the (conceptually 2D) shared memory array.To Answer SystmD’s question in his (not really correct) answer,
gridDim.yis only 1 in the (default/common) 1D block case.