I have a long sequence of kernels I need to run on some data like
data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.
I need all the intermediate results to copied back to the host as well, so the idea would be something like (pseudo code):
inputdata = clCreateBuffer(...hostBuffer[0]);
for (int i = 0; i < N; ++i)
{
// create output buffer
outputdata = clCreateBuffer(...);
// run kernel
kernel = clCreateKernel(...);
kernel.setArg(0, inputdata);
kernel.setArg(1, outputdata);
enqueueNDRangeKernel(kernel);
// read intermediate result
enqueueReadBuffer(outputdata, hostBuffer[i]);
// output of operation becomes input of next
inputdata = outputdata;
}
There are several ways to schedule these operations:
- Simplest is to always wait for the event of previous enqueue operation, so we wait for a read operation to complete before proceeding with the next kernel. I can release buffers as soon as they are not needed.
- OR Make everything as asynchronous as possible, where kernel and read enqueues only wait for previous kernels, so buffer reads can happen while another kernel is running.
In the second (asynchronous) case I have a few questions:
- Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
- Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds that of the total memory available on the device? At any point a kernel only needs the input and output kernels (which should fit in memory), but what if 4 or 5 of these buffers exceed the total, how does OpenCL allocate/deallocate these memory objects behind the scenes? How does this affect the reads?
I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.
Thank you.
Your Second case is the way to go.
Yes. But If all the data arrays are of the same size I would use just 2, and overwrite one after the other each iteration.
Then you will only need to have 2 memory zones, and the release and allocation should only occur at the beggining/end.
Don’t worry about the data having bad values, if you set proper events the processing will wait to the I/O to finish. ie:
For doing that just set a condition that forces the kernel3 to start only if the first I/O has finished. You can chain all the events that way.
NOTE: Use 2 queues, one for I/O and another for processing will bring you parallel I/O, which is 2 times faster.
Gives an error OUT_OF_RESOURCES or similar when allocating.
It will not do this automatically, except you have set the memory as a host PTR. But I’m unsure if that way the OpenCL driver will handle it properly. I would not allocate more than the maximum if I were you.