In my program, I’m tracking a large number of particles through a voxel grid. The ratio of particles to voxels is arbitrary. At a certain point, I need to know which particles lie in which voxels, and how many do. Specifically, the voxels must know exactly which particles are contained within them. Since I can’t use anything like std::vector in CUDA, I’m using the following algorithm (at the high level):
- Allocate an array of ints the size of the number of voxels
- Launch threads for all the particles, determine the voxel each one lies in, and increment the appropriate counter in my ‘bucket’ array
- Allocate an array of pointers the size of the number of particles
- Calculate each voxel’s offset into this new array (summing the number of particles in the voxels preceding it)
- Place the particles in the array in an ordered fashion (I use this data to accelerate an operation later on. The speed increase is well worth the increased memory usage).
This breaks down on the second step though. I haven’t been programming in CUDA for long, and just found out that simultaneous writes among threads to the same location in global memory produce undefined results. This is reflected in the fact that I mostly get 1’s in the buckets, with the occasional 2. Here’s a sketch of the code I’m using for this step:
    __global__ void GPU_AssignParticles(Particle* particles, Voxel* voxels, int* buckets) {
        int tid = threadIdx.x + blockIdx.x*blockDim.x;
        if(tid < num_particles) { // <-- you can assume I actually passed this to the function :)
            // Some math to determine the index of the voxel which this particle
            // resides in.
            buckets[index] += 1;
        }
    }
My question is, what’s the proper way to generate these counts in CUDA?
Also, is there a way to store references to the particles within the voxels? The issue I see there is that the number of particles within a voxel constantly changes, so new arrays would have to be deallocated and reallocated almost every frame.
Although there may be more efficient ways to calculate the bucket counts, a first working solution is to keep your current approach but replace the plain increment with an atomic one (for example atomicAdd). Each increment is then an indivisible read-modify-write on global memory, so concurrent updates to the same bucket from any thread in the grid are serialized instead of being lost:
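As a minimal, self-contained sketch of the steps listed in the question, with the counter update made atomic: the Particle layout, the voxelIndex helper, the unit-sized dim x dim x dim grid, and the names countParticles, scatterParticles and binParticles are all assumptions made for illustration, since the question does not show them. Thrust is used here only for the prefix sum and for memory management; it is not something the question mentions.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Hypothetical particle layout; the real struct is not shown in the question.
struct Particle { float x, y, z; };

// Hypothetical mapping from a position to a flat voxel index, assuming a
// uniform dim^3 grid of unit-sized voxels. Out-of-range positions are clamped.
__device__ int voxelIndex(const Particle& p, int dim) {
    int ix = min(max((int)p.x, 0), dim - 1);
    int iy = min(max((int)p.y, 0), dim - 1);
    int iz = min(max((int)p.z, 0), dim - 1);
    return ix + dim * (iy + dim * iz);
}

// Count particles per voxel. atomicAdd makes the read-modify-write
// indivisible, so no increment is lost when threads hit the same bucket.
__global__ void countParticles(const Particle* particles, int numParticles,
                               int dim, int* counts) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < numParticles)
        atomicAdd(&counts[voxelIndex(particles[tid], dim)], 1);
}

// Write each particle's index into its voxel's range of the output array.
// cursor starts as a copy of the offsets; atomicAdd returns the old value,
// which gives every particle a unique slot inside its voxel's range.
__global__ void scatterParticles(const Particle* particles, int numParticles,
                                 int dim, int* cursor, int* sorted) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < numParticles) {
        int slot = atomicAdd(&cursor[voxelIndex(particles[tid], dim)], 1);
        sorted[slot] = tid;
    }
}

// Host driver: counts -> exclusive prefix sum (offsets) -> scatter.
void binParticles(const thrust::device_vector<Particle>& particles, int dim,
                  thrust::device_vector<int>& counts,
                  thrust::device_vector<int>& offsets,
                  thrust::device_vector<int>& sorted) {
    int n = (int)particles.size();
    int numVoxels = dim * dim * dim;
    counts.assign(numVoxels, 0);
    offsets.assign(numVoxels, 0);
    sorted.resize(n);
    if (n == 0) return;  // a launch with zero blocks would be an error

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    const Particle* p = thrust::raw_pointer_cast(particles.data());

    countParticles<<<blocks, threads>>>(
        p, n, dim, thrust::raw_pointer_cast(counts.data()));

    // offsets[v] = number of particles in all voxels preceding v
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    thrust::device_vector<int> cursor = offsets;  // scratch copy for scatter
    scatterParticles<<<blocks, threads>>>(
        p, n, dim, thrust::raw_pointer_cast(cursor.data()),
        thrust::raw_pointer_cast(sorted.data()));
    cudaDeviceSynchronize();
}
```

With this layout, the particles of voxel v are the indices sorted[offsets[v]] through sorted[offsets[v] + counts[v] - 1]. The three arrays have fixed sizes (number of voxels or number of particles), so they can be allocated once and refilled every frame without per-voxel reallocation. The order of particles inside a voxel depends on thread scheduling and is not deterministic.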