I’m writing a program using JOGL/openCL to utilize the GPU. I have code that kicks in when we work with data sizes which is suppose to detect the available memory on the GPU. If there is insufficient memory on the GPU to process the entire calculation at once it will break the process up into sub process with X number of frames which utilizes less then the max GPU global memory to store.
I had expected that using the maximum possible value of X would give me the largest speed up by minimizing the number of kernels used. Instead I found using a smaller group (X/2 or X/4) gives me better speeds. I’m trying to figure out why breaking the GPU processing into smaller groups rather then having the GPU process the maximum amount it can handle at one time gives me a speed increase; and how I can optimize to figure out what the best value of X is.
My current tests have been running on a GPU kernel which uses very little processing power (both kernels decimate output by selecting part of input and returning it) However, I am fairly certain the same effects occur when I activate all kernels which do a larger degree of processing on the value before returning.
The short answer is, it’s complicated. There are many factors at play. These include (but are not limited to):
I recommend you check out the following link:
http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
In particular, check out section 5.3. Dynamic Partitioning of SM Resources. This text is meant to be general purpose, but uses CUDA for its examples. However, the concepts still apply just the same to OpenCL.
This text comes from the following book:
http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1314279939&sr=8-2
For what its worth, I found this book to be very informative. It will give you a deeper understanding of the hardware that will allow you to answer questions like this.