I am teaching myself CUDA with the programming guide that CUDA offers. For practice, I made a simple kernel that determines the maximum of an array and returns it to the CPU:
__global__ void getTheMaximum(float* myArrayFromCPU, float* returnedMaximum) {
    // Store my current value in shared memory.
    extern __shared__ float sharedData[];
    sharedData[threadIdx.x] = myArrayFromCPU[threadIdx.x];
    // Iteratively calculate the maximum.
    int halfScan = blockDim.x / 2;
    while (halfScan > 0 && threadIdx.x < halfScan) {
        if (sharedData[threadIdx.x] < sharedData[threadIdx.x + halfScan]) {
            sharedData[threadIdx.x] = sharedData[threadIdx.x + halfScan];
        }
        halfScan = halfScan / 2;
    }
    // Put maximum value in global memory for later return to CPU.
    returnedMaximum[0] = sharedData[0];
}
myArrayFromCPU is an array of float values of size 1024. returnedMaximum is a trivial array containing a single item: The computed maximum value.
My idea for this algorithm is that it would iteratively determine the maximum as it whittles down values from half the block size beyond the current value.
However, when I run this code, I get unreliable output. The returned maximum value varies. Why is that? How can a single algorithm produce different values every time?
Update:
I’m also just running on a single block. I guarantee this by setting a 1-dimensional block size of X=1024.
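It is not guaranteed that all threads of a block execute at the very same moment in time. Historically, that guarantee held only within a single warp (a group of 32 threads), and on Volta and newer GPUs, which schedule threads independently, even a warp should not be assumed to run in lockstep without explicit synchronization. In your kernel, a thread may therefore read sharedData[threadIdx.x + halfScan] before the thread responsible for that element has finished writing it, which is why the result changes from run to run.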
To avoid concurrency hazards within a block, you can use the __syncthreads() intrinsic function, which halts each thread that reaches it until all threads of the block have reached that point.
Note that you should not put __syncthreads() in branching code where you cannot guarantee that all threads will reach that spot uniformly.
Try the following loop:
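A minimal sketch of such a loop, reconstructed from the description in this answer rather than quoted from it. It assumes a single block whose size is a power of two and equals the array length. The names getTheMaximumSynced and maxOnGpu are hypothetical, and error checking on the CUDA runtime calls is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical name; same parameters as the kernel in the question.
// Assumes one block, blockDim.x a power of two equal to the array length.
__global__ void getTheMaximumSynced(const float* myArrayFromCPU, float* returnedMaximum) {
    extern __shared__ float sharedData[];
    sharedData[threadIdx.x] = myArrayFromCPU[threadIdx.x];
    // Wait until every thread of the block has stored its element.
    __syncthreads();

    // The loop condition no longer depends on threadIdx.x, so every thread
    // runs the same number of iterations and reaches each __syncthreads().
    int halfScan = blockDim.x / 2;
    while (halfScan > 0) {
        if (threadIdx.x < halfScan &&
            sharedData[threadIdx.x] < sharedData[threadIdx.x + halfScan]) {
            sharedData[threadIdx.x] = sharedData[threadIdx.x + halfScan];
        }
        // Finish this round's writes before the next round reads them.
        __syncthreads();
        halfScan = halfScan / 2;
    }

    // A single thread publishes the result to global memory.
    if (threadIdx.x == 0) {
        returnedMaximum[0] = sharedData[0];
    }
}

// Hypothetical host helper: copies the input to the GPU, launches one block
// with one thread per element, and returns the computed maximum.
// values.size() is assumed to be a power of two no larger than 1024.
float maxOnGpu(const std::vector<float>& values) {
    const int n = static_cast<int>(values.size());
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory for sharedData.
    getTheMaximumSynced<<<1, n, n * sizeof(float)>>>(dIn, dOut);

    // This blocking copy also waits for the kernel to finish.
    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With this version, calling maxOnGpu on the same 1024-element array should return the same value on every run, because no thread reads a shared-memory slot while another thread may still be writing it.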
Note that I removed the condition threadIdx.x < halfScan from the while loop, because I want all threads to execute __syncthreads() at the same spot and the same number of times.
Also, a __syncthreads() before the loop may help to ensure that the load from myArrayFromCPU is complete (for all threads) before the loop starts.