I want to implement Gaussian elimination in CUDA. But I have problem with thread synchronization inside if/else.
Here is my simple code:
__device__ bool zr(float val) {
const float zeroEpsilon = 1e-12f;
return fabs(val) < zeroEpsilon;
}
__global__ void gauss(float* data, unsigned int size, bool* success) {
//unsigned int len = size * (size + 1);
extern __shared__ float matrix[];
__shared__ bool succ;
__shared__ float div;
unsigned int ridx = threadIdx.y;
unsigned int cidx = threadIdx.x;
unsigned int idx = (size + 1) * ridx + cidx;
matrix[idx] = data[idx];
if (idx == 0)
succ = true;
__syncthreads();
for (unsigned int row = 0; row < size; ++row) {
if (ridx == row) {
if (cidx == row) {
div = matrix[idx];
if (zr(div)) {
succ = false;
div = 1.0;
}
}
__syncthreads();
matrix[idx] = matrix[idx] / div;
__syncthreads();
}
else {
__syncthreads();
__syncthreads();
}
if (!succ)
break;
}
__syncthreads();
if (idx == 0)
*success = succ;
data[idx] = matrix[idx];
__syncthreads();
}
It works this way:
- Copy matrix into shared memory.
- Iterate over rows.
- Divide row by value on it’s diagonal.
The problem is inside if/else block inside for loop – deadlock:
==Ocelot== PTX Emulator failed to run kernel "_Z5gaussPfjPb" with exception:
==Ocelot== [PC 30] [thread 0] [cta 0] bar.sync 0 - barrier deadlock:
==Ocelot== context at: [PC: 59] gauss.cu:57:1 11111111111111111111
==Ocelot== context at: [PC: 50] gauss.cu:54:1 11111111111111111111
==Ocelot== context at: [PC: 33] gauss.cu:40:1 00000000000000011111
==Ocelot== context at: [PC: 30] gauss.cu:51:1 11111111111111100000
I hava no idea why is that. When I remove synchronization from if/else block it’s works. Can someone explain me that?
__syncthreads()is doing this.When a
threadreaches__syncthreadsas an instruction it will block/stall, when that happens thewarp(32 threads) will also block, it will block until allthreadsin the sameblock of threadshave reached that statement.However, if one warp or one thread in the same
block of threadsdoes not reach the same__syncthreadsstatement it will deadlock as at least onethreadis waiting for all the otherthreadsto reach the same statement, if if does not happen you will get a deadlock.What you are now doing is excluding at least one
threadsfrom participating in the__syncthreadsevent, by placing a__syncthreadsinside a if statement not all threads will reach. Thus, deadlock.