I wrote an OpenCL kernel that initialises every element of a 3D array to i*i*i + j*j*j. I’m now having trouble creating a grid of threads so that the elements are initialised concurrently. I know that the code I have now only uses 3 threads; how can I expand on that?
Please help. I’m new to OpenCL, so any suggestion or explanation might be handy. Thanks!
This is the code:
__kernel void initialize(
    int X,
    int Y,
    int Z,
    __global float *A) {
    // Get global position in X direction
    int dirX = get_global_id(0);
    // Get global position in Y direction
    int dirY = get_global_id(1);
    // Get global position in Z direction
    int dirZ = get_global_id(2);
    int A[2000][100][4];
    int i, j, k;
    for (i = 0; i < 2000; i++)
    {
        for (j = 0; j < 100; j++)
        {
            for (k = 0; k < 4; k++)
            {
                A[dirX*X+i][dirY*Y+j][dirZ*Z+k] = i*i*i + j*j*j;
            }
        }
    }
}
You create the buffer that stores your output ‘A’ in the calling (host) code. It is passed to your kernel as a pointer, which is correct in your function definition above. However, you don’t need to declare it again inside your kernel function, so remove the line int A[2000][100][4];.
You can also simplify the code greatly. If the 3D global ID is used as the 3D index into the array, each work-item writes exactly one element and the three nested loops disappear (assuming that for a given i and j, all elements along Z should have the same value).
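A minimal sketch of such a one-element-per-work-item kernel might look like the following. It rests on a few assumptions that are not stated in the question: X, Y and Z are taken to be the array extents (2000, 100 and 4), the buffer is treated as a flat row-major array of X*Y*Z floats, and the cubes are computed in float because 1999*1999*1999 would overflow a 32-bit int. The #ifndef block at the top is only a host-side stand-in so the sketch can be compiled and checked as plain C; an OpenCL compiler skips it.

```c
#ifndef __OPENCL_VERSION__
/* Host-only stand-ins (hypothetical) so the sketch compiles as plain C. */
#include <stddef.h>
#define __kernel
#define __global
static size_t gid[3];
#define get_global_id(d) (gid[d])
#endif

/* Assumption: X, Y, Z are the array extents and A holds X*Y*Z floats. */
__kernel void initialize(int X, int Y, int Z, __global float *A)
{
    /* One work-item per element: the 3D global ID is the (i, j, k) index. */
    int i = (int)get_global_id(0);
    int j = (int)get_global_id(1);
    int k = (int)get_global_id(2);

    /* Guard in case the host rounds the global size up. */
    if (i >= X || j >= Y || k >= Z)
        return;

    /* Cube in float to avoid 32-bit integer overflow for large i. */
    float fi = (float)i;
    float fj = (float)j;

    /* Flat row-major index into the buffer. */
    A[(i * Y + j) * Z + k] = fi * fi * fi + fj * fj * fj;
}
```

Note that a __global float * cannot be indexed as A[i][j][k], which is why the sketch computes the flat offset by hand.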
In your calling code you would then enqueue the kernel (clEnqueueNDRangeKernel) with a three-dimensional global work-size of 2000x100x4.
In practice that is 800,000 work-items to schedule, each doing very little work, so you would likely get better performance from a one-dimensional global work-size of 2000 and a loop over j and k inside the kernel.
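A rough sketch of that one-dimensional variant is below, under the same assumptions as before: X, Y and Z are the extents 2000, 100 and 4, the buffer is a flat row-major array of floats, and the arithmetic is done in float to avoid integer overflow. Again, the #ifndef block is only a hypothetical host-side stand-in for checking the logic as plain C.

```c
#ifndef __OPENCL_VERSION__
/* Host-only stand-ins (hypothetical) so the sketch compiles as plain C. */
#include <stddef.h>
#define __kernel
#define __global
static size_t gid[3];
#define get_global_id(d) (gid[d])
#endif

/* Assumption: X, Y, Z are the array extents and A holds X*Y*Z floats. */
__kernel void initialize(int X, int Y, int Z, __global float *A)
{
    /* One work-item per value of i; the global work-size is 1D (2000). */
    int i = (int)get_global_id(0);
    if (i >= X)
        return;

    /* i*i*i is the same for the whole slab, so compute it once. */
    float fi = (float)i;
    float i3 = fi * fi * fi;

    for (int j = 0; j < Y; j++) {
        float fj = (float)j;
        float value = i3 + fj * fj * fj;

        /* Every element along Z gets the same value. */
        for (int k = 0; k < Z; k++)
            A[(i * Y + j) * Z + k] = value;
    }
}
```

Each work-item now fills a contiguous slab of Y*Z = 400 elements. Whether this actually beats the 3D version depends on the device, so it is worth timing both.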