I was curious as to how the GPU executes the same kernel multiple times.
I have a kernel which is being queued hundreds (possibly thousands) of times in a row, and using the AMD App Profiler I noticed that it would execute clusters of kernels extremely fast, then like clockwork every so often a kernel would “hang” (i.e. take orders of magnitude longer to execute). I think it’s every 64th kernel that hangs.
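One way to confirm the "every 64th" suspicion rather than eyeballing the timeline is to check the spacing between slow launches. The snippet below is only a sketch: it assumes the per-kernel durations have already been copied out of the profiler into a plain list (how they get exported is left open). The function names and the 10x-median threshold are made up for illustration, and the sample data is synthetic, not a measurement.

```python
from statistics import median

def slow_kernel_indices(durations, factor=10.0):
    """Indices of launches whose duration exceeds factor * median duration.

    The threshold factor is an arbitrary assumption; tune it to the data.
    """
    if not durations:
        return []
    threshold = factor * median(durations)
    return [i for i, d in enumerate(durations) if d > threshold]

def slow_kernel_period(durations, factor=10.0):
    """Constant spacing between slow launches, or None if it is irregular
    (or if fewer than two slow launches were found)."""
    slow = slow_kernel_indices(durations, factor)
    gaps = {b - a for a, b in zip(slow, slow[1:])}
    return gaps.pop() if len(gaps) == 1 else None

# Synthetic example (NOT measured data): 256 launches, every 64th one slow.
sample = [5.0 if (i + 1) % 64 == 0 else 0.02 for i in range(256)]
print(slow_kernel_period(sample))  # 64 for this synthetic data
```

If the result is a stable number such as 64, that points towards some fixed-size batching in the runtime or driver rather than random interference from other programs, though it does not by itself say which.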
This is odd because, each time through, the kernel performs the exact same operations with the same local and global sizes. I’m even re-using the same buffers.
Is there something about the execution model that I’m missing, such as other programs or the OS accessing the GPU, or the timing frequency of the GPU memory? I’m testing this on an ATI HD5650 card under Windows 7 (64-bit), with AMD App SDK 2.5 and in-order queue execution.
As a side note, if I don’t have any global memory accesses in my kernel (a rather impractical prospect), the profiler shows a gap between the quick-executing kernels, and where the slow kernels used to be there is now a large empty gap in which none of my kernels are being executed.
As a follow-up question, is there anything that can be done to fix this?
It’s probable you’re seeing the effects of your GPU’s maximum number of concurrent tasks. Each enqueued task is assigned to one or more multiprocessors (compute units), each of which can often run hundreds of work-items at a time, but only work-items of the same kernel, enqueued in the same call. Perhaps what you’re seeing is the OpenCL runtime waiting for one of the multiprocessors to free up. This relates most directly to occupancy: if the work size can’t keep a multiprocessor busy through memory latencies and the like, it sits idle for some cycles. The limit here depends on how many registers (private memory) and how much local memory your kernel requires. In summary, you want to write your kernel to operate on many pieces of data per enqueue rather than queueing it many times.
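To make the last point concrete, here is a rough sketch of the restructuring. It is a pure-Python emulation, not OpenCL: `enqueue` is a hypothetical stand-in that just runs every work-item in a loop on the CPU, and `scale_kernel` is a placeholder for whatever the real kernel computes. It only shows that one launch over the whole data set gives the same result as many small launches, with far fewer enqueues.

```python
def scale_kernel(gid, src, dst, offset=0):
    """Stand-in for a kernel body: one work-item handles element offset + gid."""
    dst[offset + gid] = 2.0 * src[offset + gid]

def enqueue(kernel, global_size, *args, **kwargs):
    """Hypothetical CPU stand-in for enqueueing an NDRange.

    A real runtime would run the work-items in parallel on the device;
    here they simply run one after another.
    """
    for gid in range(global_size):
        kernel(gid, *args, **kwargs)

def many_small_launches(src, chunk):
    """The pattern from the question: one enqueue per small chunk of data."""
    dst = [0.0] * len(src)
    launches = 0
    for offset in range(0, len(src), chunk):
        size = min(chunk, len(src) - offset)
        enqueue(scale_kernel, size, src, dst, offset=offset)
        launches += 1
    return dst, launches

def one_large_launch(src):
    """The suggested pattern: a single enqueue covering all the data."""
    dst = [0.0] * len(src)
    enqueue(scale_kernel, len(src), src, dst)
    return dst, 1

data = [float(i) for i in range(1000)]
small_result, small_count = many_small_launches(data, chunk=64)
large_result, large_count = one_large_launch(data)
print(small_result == large_result, small_count, large_count)
```

Whether this is possible depends on whether each launch in your real code needs the results of the previous one. If the launches are independent, merging them into one larger global size gives the scheduler more work-items to hide memory latency with.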
Did your measurement include reading back results from the apparently fast executions?