I’m calculating the Euclidean distance between n-dimensional points using OpenCL. I get two lists of n-dimensional points and I should return an array that contains just the distances from every point in the first table to every point in the second table.
My approach is to do the regular double loop (for every point in Table1 { for every point in Table2 { … } }) and then do the calculation for every pair of points in parallel.
The Euclidean distance is then split into 4 steps:
1. take the difference between each dimension in the points
2. square that difference (still for every dimension)
3. sum all the values obtained in 2.
4. Take the square root of the value obtained in 3. (this step has been omitted in this example.)
Everything works like a charm until I try to accumulate the sum of all differences (namely, executing step 3. of the procedure described above, line 49 of the code below).
As test data I’m using DescriptorLists with 2 points each:
DescriptorList1: 001,002,003,…,127,128; (p1)
129,130,131,…,255,256; (p2)
DescriptorList2: 000,001,002,…,126,127; (p1)
128,129,130,…,254,255; (p2)
So the resulting vector should have the values: 128, 2064512, 2130048, 128
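As a sanity check on these expected values, here is a minimal CPU reference in plain C. It is not part of the kernel; the names squared_distance and reference_distances are made up for illustration, and it assumes each list is stored as one raw float array with its points back to back.

```c
#define DIM 128   /* dimensions per point, as in the test data */
#define NPTS 2    /* points per descriptor list */

/* Squared Euclidean distance between two dim-dimensional points
   (steps 1 to 3; the square root of step 4 is left out, as in the kernel). */
static float squared_distance(const float *p, const float *q, int dim)
{
    float sum = 0.0f;
    for (int d = 0; d < dim; d++) {
        float diff = p[d] - q[d];   /* step 1: difference per dimension */
        sum += diff * diff;         /* steps 2 and 3: square and accumulate */
    }
    return sum;
}

/* Fills out[i * NPTS + k] with the squared distance from point i of
   DescriptorList1 to point k of DescriptorList2, using the test data. */
static void reference_distances(float *out)
{
    float list1[NPTS * DIM], list2[NPTS * DIM];
    for (int j = 0; j < NPTS * DIM; j++) {
        list1[j] = (float)(j + 1);  /* 1, 2, ..., 256 */
        list2[j] = (float)j;        /* 0, 1, ..., 255 */
    }
    for (int i = 0; i < NPTS; i++)
        for (int k = 0; k < NPTS; k++)
            out[i * NPTS + k] =
                squared_distance(&list1[i * DIM], &list2[k * DIM], DIM);
}
```

Calling reference_distances on a 4-element array gives 128, 2064512, 2130048, 128, in that order (all sums stay below 2^24, so float arithmetic is exact here).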
Right now I’m getting random numbers that vary with every run.
I appreciate any help or leads on what I’m doing wrong. Hopefully everything is clear about the scenario I’m working in.
#define BLOCK_SIZE 128
typedef struct
{
    //How large each point is
    int length;
    //How many points in every list
    int num_elements;
    //Pointer to the elements of the descriptor (stored as a raw array)
    __global float *elements;
} DescriptorList;
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float As[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);
    //temporary array to store the difference between each dimension of 2 points
    float dif_acum[BLOCK_SIZE];
    //counter to track the iterations of the inner loop
    int loop = 0;
    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        //take the i-th descriptor. Returns a DescriptorList with just the i-th
        //descriptor in DescriptorList A
        DescriptorList tmpA = GetDescriptor(A, i);
        //copy the current descriptor to local memory.
        //returns one element of the only descriptor in DescriptorList tmpA
        //and index featA
        As[featA] = GetElement(tmpA, 0, featA);
        //wait for all the threads to finish copying before continuing
        barrier(CLK_LOCAL_MEM_FENCE);
        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current points
            dif_acum[featA] = As[featA]-B.elements[k*BLOCK_SIZE + featA];
            //wait again
            barrier(CLK_LOCAL_MEM_FENCE);
            //square value of the difference in dif_acum and store in C
            //which is where the results should be stored at the end.
            C[loop] = 0;
            C[loop] += dif_acum[featA]*dif_acum[featA];
            loop += 1;
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
}
Your problem lies in these lines of code:
    C[loop] = 0;
    C[loop] += dif_acum[featA]*dif_acum[featA];
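All threads in your workgroup (well, actually all your threads, but let’s come to that later) are trying to modify this array position concurrently without any synchronization whatsoever. Two things make this really problematic: the threads of a workgroup are not guaranteed to run in lockstep, so one thread may execute C[loop] = 0 after another has already added its term; and += on global memory is a non-atomic read-modify-write, so threads that do run concurrently read the same old value and overwrite each other’s results.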
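To see why the result ends up wrong, here is a small plain-C emulation of one unlucky schedule. It is only an illustration: racy_sum is a made-up name, real hardware can interleave the loads and stores in many other ways, and a sequential loop stands in for the concurrent threads.

```c
/* Emulates `C[loop] += term` executed by n work-items when every
   work-item loads the old value before any of them stores its result. */
static float racy_sum(const float *terms, int n)
{
    float c = 0.0f;                 /* stands in for C[loop] in global memory */
    const float loaded = c;         /* every work-item loads the same old value */
    for (int lid = 0; lid < n; lid++)
        c = loaded + terms[lid];    /* each store overwrites the previous one */
    return c;                       /* equals the last term, not the sum */
}
```

With 128 terms that are all 1, this schedule leaves 1 in the output slot instead of 128; other schedules leave other values, which matches results that change from run to run.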
Now let’s fix this:
While we might be able to get this to work on global memory using atomics, it won’t be fast, so let’s accumulate in local memory:
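A common way to accumulate in local memory is a tree reduction. The sketch below emulates one workgroup in plain C so that it can be run without an OpenCL device; it is not the kernel itself. The name group_squared_distance is made up, scratch plays the role of a __local float buffer with one slot per thread, the group size is assumed to be a power of two, and the comments mark where the kernel would need barrier(CLK_LOCAL_MEM_FENCE).

```c
/* Sequential emulation of one workgroup computing one squared distance.
   a and b each hold group_size floats (one dimension per work-item). */
static float group_squared_distance(const float *a, const float *b,
                                    float *scratch, int group_size)
{
    /* Phase 1: work-item lid stores its own squared difference. */
    for (int lid = 0; lid < group_size; lid++) {
        float diff = a[lid] - b[lid];
        scratch[lid] = diff * diff;
    }
    /* kernel: barrier(CLK_LOCAL_MEM_FENCE) here */

    /* Phase 2: tree reduction; each pass halves the number of active
       work-items, and only work-items with lid < stride do any work. */
    for (int stride = group_size / 2; stride > 0; stride /= 2) {
        for (int lid = 0; lid < stride; lid++)
            scratch[lid] += scratch[lid + stride];
        /* kernel: barrier(CLK_LOCAL_MEM_FENCE) after every pass */
    }

    /* Phase 3: in the kernel, only work-item 0 would write this value
       to the output slot in global memory. */
    return scratch[0];
}
```

Because exactly one thread writes each output value, the global write needs neither atomics nor a prior C[loop] = 0.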
Of course you can reuse other local buffers for this, but I think the point is clear. (By the way: are you sure that dif_acum will be created in local memory? An array declared inside a kernel without the __local qualifier lives in private memory, with one copy per thread, which would make preloading A into local memory kind of pointless.)
Some other points about this code:
It is better to use get_local_size(0) for the workgroup size than to use a #define (since you might change it in the host code without realizing you should have changed your OpenCL code too).
Considering the last bullet you could simply do:
This would make the code (not considering the first two bullets):