We have a multi-device system that distributes a main task across its devices. Each subtask consists of:
- enqueue write buffer
- enqueue kernel
- enqueue read buffer
All enqueues are asynchronous and the command queues are in-order. We attach a callback to the cl_event of the read-buffer enqueue, and in that callback we determine whether the main task is complete. If it’s not, we schedule one more subtask on the queue.
Unfortunately, we’ve found that keeping the host’s CPU busy as a compute device prevents it from processing callbacks from the other devices (the GPUs), so most of the time they sit idle. The idea is to exclude the host’s CPU from the list of devices we use to complete the main task.
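A minimal sketch of the decision made in that callback, written in plain C so it stands alone. The names main_task, claim_subtask and on_read_complete are hypothetical, not part of OpenCL. In a real program the function would be called from the callback registered with clSetEventCallback, and the SCHEDULE_MORE branch would enqueue the next write, kernel and read on the same queue. Atomics are used on the assumption that callbacks for different devices may run concurrently on runtime threads.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state for one main task split into `total` subtasks. */
typedef struct {
    atomic_uint next;      /* index of the next subtask to hand out */
    atomic_uint finished;  /* number of subtasks whose read has completed */
    unsigned total;        /* number of subtasks in the main task */
} main_task;

typedef enum { SCHEDULE_MORE, NOTHING_LEFT, MAIN_TASK_DONE } next_step;

void main_task_init(main_task *t, unsigned total)
{
    atomic_init(&t->next, 0);
    atomic_init(&t->finished, 0);
    t->total = total;
}

/* Claims the next unscheduled subtask. Returns true and stores its index in
 * *index, or returns false when every subtask has already been handed out. */
bool claim_subtask(main_task *t, unsigned *index)
{
    unsigned i = atomic_fetch_add(&t->next, 1);
    if (i >= t->total)
        return false;
    *index = i;
    return true;
}

/* Called once per completed read. Exactly one caller sees MAIN_TASK_DONE. */
next_step on_read_complete(main_task *t, unsigned *index)
{
    unsigned done = atomic_fetch_add(&t->finished, 1) + 1;
    if (done == t->total)
        return MAIN_TASK_DONE;
    /* On SCHEDULE_MORE the caller enqueues write, kernel and read for
     * subtask *index and registers this callback on the new read event. */
    return claim_subtask(t, index) ? SCHEDULE_MORE : NOTHING_LEFT;
}
```

Each device's queue is primed with one claim_subtask call before the first enqueue; after that the callbacks keep the queues fed until the work runs out.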
You should look into device fission. If your platform supports this feature, you will be able to create an OpenCL device from any combination of CPU cores. Look here for details. This extension lets you keep some number of cores free for your host application.
I like how it allows you to create sub-devices that share various levels of cache memory. You might be interested in CL_DEVICE_PARTITION_BY_NAMES_EXT (search for “CL_DEVICE_PARTITION_BY NAMES_EXT” on the page).
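A rough sketch of what that could look like. Note the assumption: it uses the core OpenCL 1.2 call clCreateSubDevices with CL_DEVICE_PARTITION_BY_COUNTS rather than the extension's own entry point and CL_DEVICE_PARTITION_BY_NAMES_EXT, since the core call is what newer platforms expose. The helper names units_for_opencl and make_cpu_subdevice and the USE_OPENCL switch are made up for this example. Whether the cores left out really stay available to the host thread depends on the implementation and the OS scheduler.

```c
#include <stddef.h>

/* Hypothetical helper: number of compute units to give to OpenCL when
 * `reserved_for_host` of them should stay free for the host application.
 * Returns 0 when nothing would be left to give away. */
unsigned units_for_opencl(unsigned total_units, unsigned reserved_for_host)
{
    return total_units > reserved_for_host ? total_units - reserved_for_host : 0;
}

#ifdef USE_OPENCL  /* assumed opt-in: cc -DUSE_OPENCL file.c -lOpenCL */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

/* Creates one sub-device of `cpu` that contains all but `reserved_for_host`
 * compute units. Returns CL_SUCCESS and stores the sub-device in *out. */
cl_int make_cpu_subdevice(cl_device_id cpu, cl_uint reserved_for_host,
                          cl_device_id *out)
{
    cl_uint total = 0;
    cl_int err = clGetDeviceInfo(cpu, CL_DEVICE_MAX_COMPUTE_UNITS,
                                 sizeof total, &total, NULL);
    if (err != CL_SUCCESS)
        return err;

    cl_uint n = units_for_opencl(total, reserved_for_host);
    if (n == 0)
        return CL_INVALID_VALUE;

    /* One sub-device with n compute units; the rest belong to no sub-device. */
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS, (cl_device_partition_property)n,
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
    };
    return clCreateSubDevices(cpu, props, 1, out, NULL);
}
#endif
```

The context and command queue then have to be created from the returned sub-device instead of the root CPU device, otherwise kernels can still occupy every core.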