The Archive Base

Editorial Team
Asked: June 17, 2026

I have a long sequence of kernels I need to run on some data

I have a long sequence of kernels I need to run on some data, like this:

data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.

I need all the intermediate results to be copied back to the host as well, so the idea would be something like this (pseudo code):

inputdata = clCreateBuffer(...hostBuffer[0]);

for (int i = 0; i < N; ++i)
{
    // create output buffer
    outputdata = clCreateBuffer(...);

    // run kernel
    kernel = clCreateKernel(...);
    kernel.setArg(0, inputdata);
    kernel.setArg(1, outputdata);
    enqueueNDRangeKernel(kernel);

    // read intermediate result
    enqueueReadBuffer(outputdata, hostBuffer[i]);

    // output of operation becomes input of next
    inputdata = outputdata;
}

There are several ways to schedule these operations:

  • The simplest is to always wait for the event of the previous enqueue operation, so that each read completes before the next kernel starts. I can then release buffers as soon as they are no longer needed.
  • Or make everything as asynchronous as possible, where kernel and read enqueues only wait for previous kernels, so buffer reads can happen while another kernel is running.

In the second (asynchronous) case I have a few questions:

  • Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
  • Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds the total memory available on the device? At any point a kernel only needs its input and output buffers (which should fit in memory), but what if 4 or 5 of these buffers exceed the total? How does OpenCL allocate/deallocate these memory objects behind the scenes, and how does this affect the reads?

I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.

Thank you.

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 2:08 pm

    Your second case is the way to go.

    In the second (asynchronous) case I have a few questions:

    Do I have to keep references to all cl_mem objects 
    in the long chain of actions and release them after 
    everything is complete?
    

    Yes. But if all the data arrays are the same size, I would use just 2 buffers and overwrite them alternately on each iteration.
    Then you only need 2 memory regions, and allocation and release only happen at the beginning and the end.

    Don’t worry about the data having bad values: if you set up the events properly, processing will wait for the I/O to finish, i.e.:

    data -> kernel1 -> data1 -> kernel2 -> data -> kernel3 -> data1
                    -> I/O operation    -> I/O operation
    

    To do that, just add a condition that forces kernel3 to start only once the first I/O operation has finished, since kernel3 overwrites data1, which that I/O operation is reading. You can chain all the events that way.
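The chaining rule above can be sketched as a small, OpenCL-free model that only builds the wait lists. This is a minimal illustration, not the author's code: `Command` and `build_schedule` are hypothetical names, steps are numbered from 0 (so the answer's kernel3 is `kernel2` here), and in a real host program each wait list would be passed as the `event_wait_list` argument of `clEnqueueNDRangeKernel` or `clEnqueueReadBuffer`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One enqueued command in the two-buffer (ping-pong) scheme.
// Hypothetical type, used only for this illustration.
struct Command {
    std::string name;                // e.g. "kernel2" or "read0"
    int buffer;                      // buffer written (kernel) or copied out (read): 0 or 1
    std::vector<std::string> waits;  // commands that must finish before this one starts
};

// Build the wait lists for n_steps kernels alternating between two buffers.
// Kernel i reads buffer i % 2 and writes buffer (i + 1) % 2.
std::vector<Command> build_schedule(int n_steps) {
    assert(n_steps >= 0);
    std::vector<Command> cmds;
    for (int i = 0; i < n_steps; ++i) {
        Command kernel{"kernel" + std::to_string(i), (i + 1) % 2, {}};
        // The input of kernel i is the output of kernel i - 1.
        if (i >= 1) kernel.waits.push_back("kernel" + std::to_string(i - 1));
        // Kernel i overwrites the buffer that read i - 2 copies to the host.
        if (i >= 2) kernel.waits.push_back("read" + std::to_string(i - 2));
        cmds.push_back(kernel);

        // The read of step i only needs the kernel that produced the data.
        Command read{"read" + std::to_string(i), (i + 1) % 2,
                     {"kernel" + std::to_string(i)}};
        cmds.push_back(read);
    }
    return cmds;
}
```

Calling `build_schedule(3)` yields `kernel2` with the wait list `kernel1, read0`, which is the condition described above: the third kernel starts only after the first read has drained the buffer it is about to overwrite.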

    NOTE: Using 2 queues, one for I/O and another for processing, gives you I/O in parallel with computation, which can be up to 2 times faster.
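To see where the "up to 2 times" figure can come from, here is a rough timing model under stated assumptions: every kernel takes `kernel_t`, every read takes `read_t`, two buffers are reused alternately, and the device can really overlap a transfer with a kernel. The function names are hypothetical and nothing here calls OpenCL; actual gains depend on the hardware and driver.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Finish time when every command waits for the previous one
// (kernel, read, kernel, read, ... on a single in-order queue).
double serial_makespan(int n, double kernel_t, double read_t) {
    assert(n >= 0);
    return n * (kernel_t + read_t);
}

// Finish time with one compute queue and one I/O queue sharing two buffers.
double overlapped_makespan(int n, double kernel_t, double read_t) {
    assert(n >= 0);
    std::vector<double> kernel_end(n), read_end(n);
    for (int i = 0; i < n; ++i) {
        double k_start = 0.0;
        if (i >= 1) k_start = kernel_end[i - 1];                   // input is ready
        if (i >= 2) k_start = std::max(k_start, read_end[i - 2]);  // output buffer is free
        kernel_end[i] = k_start + kernel_t;

        double r_start = kernel_end[i];                            // data was produced
        if (i >= 1) r_start = std::max(r_start, read_end[i - 1]);  // I/O queue is in-order
        read_end[i] = r_start + read_t;
    }
    return n == 0 ? 0.0 : read_end[n - 1];
}
```

With equal kernel and read times and 100 steps, the model gives 200 time units serialized against 101 overlapped, so the ratio approaches 2 for long chains. When one of the two times dominates, the benefit shrinks accordingly.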

    Importantly, how does OpenCL handle the case when the sum
    of all memory objects exceeds that of the total memory available on the
    device?

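    It gives an error such as CL_OUT_OF_RESOURCES or CL_MEM_OBJECT_ALLOCATION_FAILURE, either when allocating or, since some implementations defer allocation, when the buffer is first used.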

    At any point a kernel only needs the input and output buffers
    (which should fit in memory), but what if 4 or 5 of these buffers
    exceed the total, how does OpenCL allocate/deallocate these memory
    objects behind the scenes? How does this affect the reads?

    It will not do this automatically, unless you have set up the memory with a host pointer. Even then, I’m unsure whether the OpenCL driver will handle it properly. I would not allocate more than the maximum if I were you.
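A quick way to follow the advice above is to compare the planned allocations against the device limits before creating any buffer. The sketch below assumes the two limits have already been queried with `clGetDeviceInfo` (`CL_DEVICE_GLOBAL_MEM_SIZE` and `CL_DEVICE_MAX_MEM_ALLOC_SIZE`); `DeviceLimits` and the helper functions are hypothetical names, and the check ignores memory used by other buffers or other processes.

```cpp
#include <cassert>
#include <cstdint>

// Limits as reported by the device (hypothetical holder type).
struct DeviceLimits {
    std::uint64_t global_mem_bytes;  // from CL_DEVICE_GLOBAL_MEM_SIZE
    std::uint64_t max_alloc_bytes;   // from CL_DEVICE_MAX_MEM_ALLOC_SIZE
};

// One input buffer plus one new output buffer per step, all kept alive.
std::uint64_t chain_bytes(std::uint64_t n_steps, std::uint64_t buffer_bytes) {
    return (n_steps + 1) * buffer_bytes;
}

// Two buffers reused alternately, whatever the number of steps.
std::uint64_t ping_pong_bytes(std::uint64_t buffer_bytes) {
    return 2 * buffer_bytes;
}

// True if each buffer respects the per-allocation limit and the total
// stays within global memory.
bool fits(const DeviceLimits& dev, std::uint64_t buffer_bytes,
          std::uint64_t total_bytes) {
    assert(buffer_bytes > 0);
    return buffer_bytes <= dev.max_alloc_bytes &&
           total_bytes <= dev.global_mem_bytes;
}
```

For example, on a device with 4 GiB of global memory and a 1 GiB allocation limit, five steps with 1 GiB buffers need 6 GiB when every intermediate buffer is kept, which does not fit, while the two-buffer scheme needs 2 GiB, which does.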

© 2021 The Archive Base. All Rights Reserved