I am writing a CUDA kernel that requires me to allocate an array of aligned structs on the device.
I am getting the correct results from my computations and I need to write the values to this array starting from index 0.
When I write to this array and copy the results back to the host, some of the entries come out as zero.
Clearly, I am not incrementing the index the way I need to. I tried using a counter that I increment with atomicAdd(), but I still get some zero values.
To be precise, I may use 1000 threads in my kernel for the computations, but the allocated output array may have fewer than 100 or more than 10000 elements.
My question is: how do I make each of these threads write its value to exactly one location in the array (as the values are calculated) and increment the array index/counter by 1, without overwriting another thread's value?
Any help will be appreciated. Thanks in advance.
You can use atomicAdd(). It returns the old value, so you can use that returned value as the index to write to.
However, you will get poor performance if many of your threads write out results, as the atomicAdd() will (indirectly) serialize all the writes. In that case, you should let each thread write its result, if any, to a slot set aside for that thread and then use a compaction algorithm (see thrust::copy_if) to gather up the results.
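A minimal sketch of both approaches follows. The `Result` struct, the `has_result()` predicate, the kernel names and the host wrappers are all hypothetical stand-ins, since the question does not show the real struct or computation. It also assumes the counter is zeroed before launch and that results which do not fit in the output array can simply be dropped; CUDA error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Hypothetical result type standing in for the aligned struct in the question.
struct __align__(8) Result {
    int   source;  // id of the producing thread, or -1 for "no result"
    float value;
};

// Hypothetical predicate: a thread produces a result when its id is a multiple of 3.
__host__ __device__ inline bool has_result(int tid) { return tid % 3 == 0; }

// Approach 1: each producing thread reserves a unique slot with atomicAdd().
__global__ void append_results(Result* out, unsigned int* counter,
                               unsigned int capacity, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n || !has_result(tid)) return;
    // atomicAdd returns the value held *before* the increment, so no two
    // threads ever receive the same slot.
    unsigned int slot = atomicAdd(counter, 1u);
    if (slot < capacity) {  // drop results that do not fit
        out[slot].source = tid;
        out[slot].value  = 2.0f * tid;
    }
}

// Returns the packed results; their order is not deterministic.
std::vector<Result> run_atomic_append(int n, unsigned int capacity)
{
    assert(n > 0 && capacity > 0);
    Result* d_out = nullptr;
    unsigned int* d_counter = nullptr;
    cudaMalloc(&d_out, capacity * sizeof(Result));
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));  // counter must start at 0

    int threads = 256;
    append_results<<<(n + threads - 1) / threads, threads>>>(d_out, d_counter, capacity, n);

    unsigned int produced = 0;
    cudaMemcpy(&produced, d_counter, sizeof(produced), cudaMemcpyDeviceToHost);
    // The counter keeps counting past capacity, so clamp it before copying back.
    unsigned int valid = produced < capacity ? produced : capacity;
    std::vector<Result> host(valid);
    cudaMemcpy(host.data(), d_out, valid * sizeof(Result), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_counter);
    return host;
}

// Approach 2: every thread writes to its own slot, then the slots are compacted.
__global__ void write_per_thread(Result* slots, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    slots[tid].source = has_result(tid) ? tid : -1;  // -1 marks an empty slot
    slots[tid].value  = 2.0f * tid;
}

struct IsValid {
    __host__ __device__ bool operator()(const Result& r) const { return r.source >= 0; }
};

// Returns the packed results in thread order (thrust::copy_if is stable).
std::vector<Result> run_compaction(int n)
{
    assert(n > 0);
    thrust::device_vector<Result> slots(n), packed(n);
    int threads = 256;
    write_per_thread<<<(n + threads - 1) / threads, threads>>>(
        thrust::raw_pointer_cast(slots.data()), n);
    auto end = thrust::copy_if(slots.begin(), slots.end(), packed.begin(), IsValid());
    std::vector<Result> host(end - packed.begin());
    thrust::copy(packed.begin(), end, host.begin());
    return host;
}
```

Calling `run_atomic_append(1000, 100)` keeps only the first 100 reserved slots, while `run_compaction(1000)` needs scratch space for all 1000 threads but preserves thread order. In either version, only the entries that were actually written are copied back to the host, so unwritten slots are never read.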