If multiple threads write to a single memory location simultaneously, there will be a race condition, right?
The same is happening in my case.
Consider a module from ‘reduce.cl’:
int i = get_global_id(0);
int n, j;
n = keyMobj[i];  // n is the key; it can be either 0 or 1
for (j = 0; j < 2; j++)
    sumMobj[n*2 + j] += dataMobj[i].dattr[j];  // summing operation
Here, the memory locations
sumMobj[0] and sumMobj[1] are accessed by 4 threads simultaneously, and
sumMobj[2] and sumMobj[3] are accessed by 6 threads simultaneously.
Is there any way to still do this in parallel, e.g. using locking or a semaphore? This summing is a very big part of my algorithm.
I can give you some hints, as I was facing a similar problem.
I can think of three different methods for achieving this goal.
Consider a simple kernel, assuming you launched 4 threads (0-3).
You want to add the values p[0], p[1], p[2], p[3], p[4] and store the final sum in p[4], right? I.e. p[4] = p[0] + p[1] + p[2] + p[3] + p[4].
Method 1 (no parallelism)
Assign this job to only one thread.
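A minimal sketch of Method 1 as an OpenCL kernel, assuming p holds n ints (n = 5 in the example above). The kernel name sum_serial and the parameter n are made up for illustration. The #ifndef block is only a host-side stand-in so the same source also compiles as plain C for a quick check; an OpenCL compiler skips it.

```c
#ifndef __OPENCL_VERSION__   /* host-only stand-ins, skipped by an OpenCL compiler */
#include <stddef.h>
#define __kernel
#define __global
static size_t g_id;          /* emulated work-item index (hypothetical helper) */
static size_t get_global_id(unsigned dim) { (void)dim; return g_id; }
#endif

/* Only work-item 0 does the summing, so no two work-items ever write p[n-1]. */
__kernel void sum_serial(__global int *p, int n)
{
    if (get_global_id(0) == 0) {
        int total = 0;
        for (int k = 0; k < n; k++)
            total += p[k];
        p[n - 1] = total;    /* final sum goes into the last element, p[4] for n = 5 */
    }
}
```

This is race-free but every other work-item sits idle, so it only makes sense when the array is small.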
Method 2 (with parallelism)
Express your problem as a reduction.
So launch 3 threads, i=0 to i=2. In the first iteration, each thread adds one pair of values, with 0 standing in for a missing partner.
Now you have three numbers. Apply the same logic as above and add these numbers (with suitable padding of 0 to make the count a power of two) until a single sum is left.
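One pass of such a reduction could look roughly like this. It is only a sketch under some assumptions: the data is int, the kernel name pair_sum and the in/out/n parameters are made up, and the result goes to a separate output buffer so that no work-item reads a location another one is writing. Instead of physically padding the array, a missing partner is simply treated as 0. The #ifndef block is a host-side stand-in so the source also compiles as plain C.

```c
#ifndef __OPENCL_VERSION__   /* host-only stand-ins, skipped by an OpenCL compiler */
#include <stddef.h>
#define __kernel
#define __global
static size_t g_id;          /* emulated work-item index (hypothetical helper) */
static size_t get_global_id(unsigned dim) { (void)dim; return g_id; }
#endif

/* One reduction pass: work-item i adds the pair in[2i], in[2i+1] into out[i].
   Launch with (n + 1) / 2 work-items, where n is the number of valid inputs. */
__kernel void pair_sum(__global const int *in, __global int *out, int n)
{
    int i = get_global_id(0);
    int a = in[2 * i];
    int b = (2 * i + 1 < n) ? in[2 * i + 1] : 0;   /* pad with 0 if there is no partner */
    out[i] = a + b;
}
```

The host would enqueue this kernel repeatedly, swapping the two buffers and setting n to (n + 1) / 2 after each pass, until n is 1. For the five values above that means passes with 3, 2 and 1 work-items.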
Method 3 (atomic operations)
If you still need to implement this atomically, you can use atomic_add().
This assumes the data is of type int. Otherwise, see the link suggested above.
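Applied to the summing loop from the question, it might look like the sketch below. The assumptions: dattr holds ints, the Data struct layout is a guess (the original definition is not shown), the kernel name reduce_atomic is made up, and the device supports OpenCL 1.1 or later, where atomic_add() on 32-bit ints in global memory is built in. The #ifndef block is a host-side stand-in so the source also compiles as plain C; its atomic_add is a serial placeholder, not a real atomic.

```c
#ifndef __OPENCL_VERSION__   /* host-only stand-ins, skipped by an OpenCL compiler */
#include <stddef.h>
#define __kernel
#define __global
static size_t g_id;          /* emulated work-item index (hypothetical helper) */
static size_t get_global_id(unsigned dim) { (void)dim; return g_id; }
/* Serial placeholder with the same shape as OpenCL's atomic_add: returns the old value. */
static int atomic_add(volatile int *p, int val) { int old = *p; *p = old + val; return old; }
#endif

typedef struct { int dattr[2]; } Data;   /* assumed layout, for illustration only */

__kernel void reduce_atomic(__global const int *keyMobj,
                            __global const Data *dataMobj,
                            volatile __global int *sumMobj)
{
    int i = get_global_id(0);
    int n = keyMobj[i];                  /* key: 0 or 1 */
    for (int j = 0; j < 2; j++)
        /* read-modify-write happens as one indivisible step, so no update is lost */
        atomic_add(&sumMobj[n * 2 + j], dataMobj[i].dattr[j]);
}
```

Keep in mind that atomic updates to the same address are serialized by the hardware. With only four target locations, many work-items will be waiting on each other, so a per-key reduction as in Method 2 may well be faster for large inputs.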