I have a long sequence of kernels I need to run on some data

Asked: June 17, 2026

I have a long sequence of kernels I need to run on some data like

data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.

I need all the intermediate results to be copied back to the host as well, so the idea would be something like (pseudo code):

inputdata = clCreateBuffer(...hostBuffer[0]);

for (int i = 0; i < N; ++i)
{
    // create output buffer
    outputdata = clCreateBuffer(...);

    // run kernel
    kernel = clCreateKernel(...);
    kernel.setArg(0, inputdata);
    kernel.setArg(1, outputdata);
    enqueueNDRangeKernel(kernel);

    // read intermediate result
    enqueueReadBuffer(outputdata, hostBuffer[i]);

    // output of operation becomes input of next
    inputdata = outputdata;
}

There are several ways to schedule these operations:

1. Simplest is to always wait for the event of the previous enqueue operation, so we wait for a read operation to complete before proceeding with the next kernel. I can release buffers as soon as they are not needed.
2. Make everything as asynchronous as possible, where kernel and read enqueues only wait for previous kernels, so buffer reads can happen while another kernel is running.

In the second (asynchronous) case I have a few questions:

Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds the total memory available on the device? At any point a kernel only needs its input and output buffers (which should fit in memory), but what if 4 or 5 of these buffers exceed the total? How does OpenCL allocate/deallocate these memory objects behind the scenes, and how does this affect the reads?

I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.

Thank you.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T14:08:35+00:00

Your second case is the way to go.

In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects 
in the long chain of actions and release them after 
everything is complete?

Yes. But if all the data arrays are the same size, I would use just two buffers and alternate between them, each iteration overwriting the one that is no longer needed.
Then you only need two device allocations, and allocation and release happen only at the beginning and the end.

Don’t worry about the data being overwritten too early: if you set up the events properly, each kernel will wait for the pending I/O to finish. For example:

data -> kernel1 -> data1 -> kernel2 -> data -> kernel3 -> data1
                -> I/O operation    -> I/O operation

To do that, make kernel3 start only after the first I/O operation has finished, by passing that read's event in kernel3's event wait list. You can chain all the events that way.
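The dependency rule for two alternating buffers can be sketched as a small pure C helper. This is only an illustration under stated assumptions: `StepDeps` and `step_deps` are hypothetical names, not part of OpenCL, steps are numbered from 0 (so kernel1 in the diagram is step 0), and the initial data is assumed to live in buffer 0. In real code, the two "wait" entries would become the events passed in the event wait list of `clEnqueueNDRangeKernel`.

```c
#include <assert.h>

/* Hypothetical helper type (not an OpenCL type): describes one step of the
   chain when only two device buffers, numbered 0 and 1, are reused. */
typedef struct {
    int in_buf;      /* buffer the kernel of this step reads */
    int out_buf;     /* buffer the kernel of this step writes */
    int wait_kernel; /* step whose kernel must finish first, or -1 if none */
    int wait_read;   /* step whose read-back must finish first, or -1 if none */
} StepDeps;

/* Assumption: step 0 reads the initial data from buffer 0. */
StepDeps step_deps(int i)
{
    StepDeps d;
    assert(i >= 0);
    d.in_buf = i % 2;
    d.out_buf = (i + 1) % 2;
    /* The input of step i is produced by the kernel of step i-1. */
    d.wait_kernel = (i >= 1) ? i - 1 : -1;
    /* The output buffer of step i still holds the result of step i-2,
       so that result must have been copied to the host first. */
    d.wait_read = (i >= 2) ? i - 2 : -1;
    return d;
}
```

For example, `step_deps(2)` reports that the third kernel must wait for the second kernel and for the first read-back, which matches the diagram. The read-back of step i itself only needs to wait for the kernel of step i.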

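NOTE: Using two command queues, one for I/O and another for processing, lets transfers overlap with kernel execution on devices that support it. Commands in different queues are ordered only by events, so the event dependencies above are required. The gain depends on how transfer time compares with kernel time; it approaches 2x only when the two are roughly equal.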
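How much a second queue helps depends on the ratio of kernel time to transfer time. The toy timing model below is a sketch, not a benchmark: `serial_time` and `overlapped_time` are hypothetical names, every kernel is assumed to take `k` time units and every read `r` units, two alternating buffers are assumed, and the device is assumed to be able to run one kernel and one transfer at the same time.

```c
/* Toy model, not OpenCL code: all names here are hypothetical. */

static long max_long(long a, long b) { return a > b ? a : b; }

/* Fully synchronous schedule: each kernel and each read runs alone. */
long serial_time(int n, long k, long r)
{
    return (long)n * (k + r);
}

/* Two queues with two alternating buffers:
   kernel i waits for kernel i-1 and for read i-2,
   read i waits for kernel i and for read i-1 (one I/O queue, in order). */
long overlapped_time(int n, long k, long r)
{
    long kernel_end_prev = 0; /* finish time of kernel i-1 */
    long read_end_prev = 0;   /* finish time of read i-1 */
    long read_end_prev2 = 0;  /* finish time of read i-2 */
    for (int i = 0; i < n; ++i) {
        long kernel_end = max_long(kernel_end_prev, read_end_prev2) + k;
        long read_end = max_long(kernel_end, read_end_prev) + r;
        read_end_prev2 = read_end_prev;
        read_end_prev = read_end;
        kernel_end_prev = kernel_end;
    }
    return read_end_prev;
}
```

In this model, four steps with `k = 1` and `r = 1` take 8 units serially and 5 units overlapped, while `k = 3` and `r = 1` take 16 and 13 units. The overlap hides whichever of the two operations is shorter, so the benefit shrinks as the times diverge.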

Importantly, how does OpenCL handle the case when the sum
of all memory objects exceeds that of the total memory available on the
device?

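Allocation fails with an error such as CL_MEM_OBJECT_ALLOCATION_FAILURE or CL_OUT_OF_RESOURCES. Because implementations may defer the actual device allocation, the error can be reported when the buffer is first used by an enqueued command rather than by clCreateBuffer.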

At any point a kernel only needs the input and output buffers
(which should fit in memory), but what if 4 or 5 of these buffers
exceed the total, how does OpenCL allocate/deallocate these memory
objects behind the scenes? How does this affect the reads?

It will not do this automatically, unless you created the buffers with a host pointer (CL_MEM_USE_HOST_PTR). Even then, I’m unsure whether the OpenCL driver will handle it properly. I would not allocate more than the device maximum if I were you.
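A simple way to stay under the limit is to check the planned buffers against the device limits before creating them. The sketch below is a pure C helper with a hypothetical name, `buffers_fit`. It assumes the two limits were obtained beforehand from `clGetDeviceInfo` using `CL_DEVICE_MAX_MEM_ALLOC_SIZE` and `CL_DEVICE_GLOBAL_MEM_SIZE`, and that all buffers have the same size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of OpenCL.
   buffer_bytes:     size of each buffer
   count:            number of buffers alive at the same time
   max_alloc_bytes:  value reported for CL_DEVICE_MAX_MEM_ALLOC_SIZE
   global_mem_bytes: value reported for CL_DEVICE_GLOBAL_MEM_SIZE */
bool buffers_fit(uint64_t buffer_bytes, unsigned count,
                 uint64_t max_alloc_bytes, uint64_t global_mem_bytes)
{
    /* A single buffer may not exceed the per-allocation limit. */
    if (buffer_bytes > max_alloc_bytes)
        return false;
    /* All buffers together may not exceed global memory.
       Dividing instead of multiplying avoids integer overflow. */
    if (count != 0 && buffer_bytes > global_mem_bytes / count)
        return false;
    return true;
}
```

Passing this check is necessary but not sufficient: other allocations and the driver itself also use device memory, so the error codes of the OpenCL calls still need to be checked.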
