I am brand new to CUDA programming. I have a CUDA subroutine that, hopefully, models the diffusion equation with a simple source:
    attributes(global) subroutine diff_time_stepper(v,diffconst)
      real*8 :: v(:,:)
      real*8 :: diffconst
      real*8 :: vintermed
      integer :: i,j,m
      integer :: nx, ny
      nx=256
      ny=256
      i=(blockIdx%x-1)*blockDim%x+threadIdx%x
      j=(blockIdx%y-1)*blockDim%y+threadIdx%y
      if (i<nx .and. j<ny .and. i>1 .and. j>1) then
        vintermed=v(i,j)+diffconst*(v(i-1,j)-2.*v(i,j)+v(i+1,j)+v(i,j-1)-2.*v(i,j)+v(i,j+1))
        v(i,j)=vintermed
        ! add a source for the heck of it
        if (i==64 .and. j==64) v(i,j)=v(i,j)+1
      endif
    end subroutine
My question: This routine seems to work, giving reasonable results (even though it is not running as fast as I had hoped). But do I have "backward dependencies" here? In particular, vintermed gets set by a function involving several v's, and then v gets set equal to vintermed. And then v(64,64) gets set after this calculation. Are these potential problems? More generally, even though I have found a lot of discussion about how to program for CUDA, I have found very little discussion of this issue of backward dependence, which seems to me of paramount importance. Can anyone point me to a good discussion of this? Thanks.
Yes, you do. The threads read and update v in an order that is, in general, unpredictable (although there are some things you can say about the behaviour of threads within a warp, or within the same block). The same would be true, with different caveats, of OpenMP or whatever your favourite CPU-based threading framework is.
The diffusion problem is more robust to this than most; Gauss-Seidel iteration, for example, deliberately uses values that have already been updated in the current sweep (unlike Jacobi iteration, which reads only the previous iterate). But you'll still get problems, because different threads see values from different timesteps. In particular, whether a neighbouring thread reads v(64,64) before or after the source is added can vary from run to run, which in general leads to inconsistent shapes of the peak around the source.
You can make sure this doesn't happen within a block by synchronizing threads (call syncthreads()) after the reads and before the writes.
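A minimal sketch of that within-block synchronization, adapted from the kernel in the question. The names diffusion_sync and diff_time_stepper_sync, and the value attribute on diffconst, are my own assumptions rather than anything from the original post. The barrier sits outside the if because syncthreads() has to be reached by every thread in the block, including the ones that do no work.

```fortran
module diffusion_sync            ! hypothetical module name
  use cudafor
  implicit none
contains
  attributes(global) subroutine diff_time_stepper_sync(v, diffconst)
    real(8) :: v(:,:)
    real(8), value :: diffconst  ! assumption: scalar passed by value
    real(8) :: vintermed
    integer :: i, j
    logical :: interior
    integer, parameter :: nx = 256, ny = 256   ! same fixed size as the question

    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    j = (blockIdx%y-1)*blockDim%y + threadIdx%y
    interior = (i > 1 .and. i < nx .and. j > 1 .and. j < ny)

    ! Phase 1: read the neighbours and compute the update in a register.
    vintermed = 0.0d0
    if (interior) then
      vintermed = v(i,j) + diffconst*(v(i-1,j) + v(i+1,j) &
                + v(i,j-1) + v(i,j+1) - 4.0d0*v(i,j))
      ! add the source before anything is written back
      if (i == 64 .and. j == 64) vintermed = vintermed + 1.0d0
    end if

    ! Barrier: no thread in this block writes until all of them have read.
    call syncthreads()

    ! Phase 2: write the result back.
    if (interior) v(i,j) = vintermed
  end subroutine diff_time_stepper_sync
end module diffusion_sync
```

Note that this only orders reads and writes among threads of the same block; a thread on the edge of a block still reads neighbours owned by another block, and nothing here orders those accesses.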
But this just pushes the problem from the thread boundaries to the block boundaries; you still don't know the order in which the blocks read and update v, and the only straightforward way to synchronize across blocks is to end the kernel.
There are a few ways around this, all involving more memory or less parallelism. One way is to have every thread read from one array, vold say, and write to another, vnew; since nothing is written to the array being read, there is no synchronization issue within a timestep. Then you just swap the roles of the two arrays each timestep, so every odd timestep reads vold and writes vnew, and every even timestep goes the other way.
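The two-array scheme might look roughly like the sketch below. Everything beyond vold and vnew is an assumption for illustration: the module and program names, passing nx and ny as arguments, the 16x16 thread block, the step count, and the value 0.1 for diffconst. Each kernel launch finishes before the next one on the same stream starts, so the end of the kernel acts as the synchronization across blocks.

```fortran
module diffusion_double_buffer   ! hypothetical module name
  use cudafor
  implicit none
contains
  attributes(global) subroutine diff_step(vold, vnew, diffconst, nx, ny)
    real(8), intent(in) :: vold(:,:)   ! only ever read in this kernel
    real(8) :: vnew(:,:)               ! only ever written in this kernel
    real(8), value :: diffconst
    integer, value :: nx, ny
    integer :: i, j

    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    j = (blockIdx%y-1)*blockDim%y + threadIdx%y

    if (i <= nx .and. j <= ny) then
      if (i > 1 .and. i < nx .and. j > 1 .and. j < ny) then
        vnew(i,j) = vold(i,j) + diffconst*(vold(i-1,j) + vold(i+1,j) &
                  + vold(i,j-1) + vold(i,j+1) - 4.0d0*vold(i,j))
        ! same source term as in the question
        if (i == 64 .and. j == 64) vnew(i,j) = vnew(i,j) + 1.0d0
      else
        vnew(i,j) = vold(i,j)          ! carry boundary values over unchanged
      end if
    end if
  end subroutine diff_step
end module diffusion_double_buffer

program run_diffusion              ! hypothetical driver
  use cudafor
  use diffusion_double_buffer
  implicit none
  integer, parameter :: n = 256, nsteps = 100   ! illustrative values
  real(8), device, allocatable :: va(:,:), vb(:,:)
  real(8) :: v_host(n,n)
  type(dim3) :: grid, tblock
  integer :: step, istat

  allocate(va(n,n), vb(n,n))
  va = 0.0d0
  vb = 0.0d0
  tblock = dim3(16, 16, 1)
  grid   = dim3((n+15)/16, (n+15)/16, 1)

  do step = 1, nsteps
    if (mod(step,2) == 1) then
      call diff_step<<<grid,tblock>>>(va, vb, 0.1d0, n, n)  ! odd: read va, write vb
    else
      call diff_step<<<grid,tblock>>>(vb, va, 0.1d0, n, n)  ! even: read vb, write va
    end if
  end do
  istat = cudaDeviceSynchronize()

  ! The latest field lives in vb after an odd number of steps, else in va.
  if (mod(nsteps,2) == 1) then
    v_host = vb
  else
    v_host = va
  end if
  print *, 'v(64,64) =', v_host(64,64)
end program run_diffusion
```

Swapping which array is passed as vold and which as vnew avoids any copy between timesteps; the cost is the second array in device memory.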