I’m trying very hard to get some code to run in parallel with CUDA. The code, simplified, is:
float sum = ... // sum = some number
for (i = 0; i < N; i++) {
    f = ... // f = the result of a function that returns a float
    sum += f;
}
The issue I’m having is with sum += f, because it requires sum to be shared between the threads. I tried the __shared__ qualifier when declaring sum (__shared__ float sum), but that didn’t work: it gave the wrong result. I’ve also heard about reduction (and know how to use it in OpenMP), but I have no idea how to apply it here.
Any help would be greatly appreciated. Thanks!
Here is the code:
The parallel reduction section can be replaced by a working CUDA parallel sum reduction (for example this one). I will update it when I have time.
EDIT:
Here is the code for performing parallel reduction with CUDA:
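For reference, a minimal sketch of a shared-memory sum reduction, which is not necessarily the exact code this answer refers to. The kernel name reduceSum, the block size of 256, the grid of 64 blocks, and the placeholder f(i) are all assumptions made for illustration; f stands in for whatever "f = ..." computes in the question. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed; must be a power of two)

// Hypothetical stand-in for the per-iteration computation "f = ..." in the question.
__device__ float f(int i) { return 1.0f; }

// Each block sums its share of f(i) in shared memory and writes one partial sum.
__global__ void reduceSum(float *partial, int n)
{
    __shared__ float cache[BLOCK];
    int tid = threadIdx.x;

    // Grid-stride loop: every thread first accumulates into a private register.
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        local += f(i);
    cache[tid] = local;
    __syncthreads();

    // Tree reduction within the block: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();  // all threads reach this barrier, active or not
    }

    // Thread 0 holds the block's total.
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}

int main()
{
    const int N = 1 << 20;   // assumed problem size
    const int blocks = 64;   // assumed grid size
    float *d_partial;
    float h_partial[blocks];

    cudaMalloc(&d_partial, blocks * sizeof(float));
    reduceSum<<<blocks, BLOCK>>>(d_partial, N);
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_partial);

    // Finish on the host; the initial value of sum from the question goes here.
    float sum = 0.0f;
    for (int b = 0; b < blocks; b++)
        sum += h_partial[b];

    printf("sum = %f\n", sum);
    return 0;
}
```

Declaring sum as __shared__ on its own does not work for two reasons: shared memory is only visible within one block, and sum += f from many threads at once is an unsynchronized read-modify-write, so updates get lost. The sketch avoids both by giving each thread its own slot in cache, combining slots pairwise between __syncthreads() barriers, and adding up the per-block results on the host. With the placeholder f returning 1.0f, the total should equal N.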