I’m writing a program using JOGL/openCL to utilize the GPU. I have code that

Asked: May 25, 2026 (2026-05-25T01:32:49+00:00)

I’m writing a program using JOGL/OpenCL to utilize the GPU. When we work with large data sizes, code kicks in that is supposed to detect the available memory on the GPU. If there is insufficient memory on the GPU to process the entire calculation at once, it breaks the process up into sub-processes of X frames each, where X frames take less than the maximum GPU global memory to store.
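To make the batching concrete, here is a minimal Java sketch of how X might be derived. It is not the actual code from this project: the class `BatchPlanner`, its method names, and the `usableFraction` safety margin are all hypothetical, and it assumes every frame needs the same number of bytes. The global memory size would come from the device query (`CL_DEVICE_GLOBAL_MEM_SIZE` in OpenCL). A single buffer is additionally capped by `CL_DEVICE_MAX_MEM_ALLOC_SIZE`, which this sketch ignores.

```java
/** Hypothetical helper that plans how many frames go into one GPU batch. */
final class BatchPlanner {

    /**
     * Largest number of frames (X) whose buffers fit in the memory budget.
     *
     * @param globalMemBytes device global memory in bytes (from the device query)
     * @param bytesPerFrame  input plus output bytes per frame (assumed constant)
     * @param usableFraction share of global memory we allow ourselves (an assumption)
     */
    static long maxFramesPerBatch(long globalMemBytes, long bytesPerFrame, double usableFraction) {
        if (globalMemBytes <= 0 || bytesPerFrame <= 0
                || usableFraction <= 0.0 || usableFraction > 1.0) {
            throw new IllegalArgumentException("invalid memory parameters");
        }
        long budget = (long) (globalMemBytes * usableFraction);
        return budget / bytesPerFrame;
    }

    /** Number of batches needed to cover all frames (ceiling division). */
    static long batchCount(long totalFrames, long framesPerBatch) {
        if (totalFrames < 0 || framesPerBatch < 1) {
            throw new IllegalArgumentException("invalid frame counts");
        }
        return (totalFrames + framesPerBatch - 1) / framesPerBatch;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 1 GiB device, 4 MiB per frame, half the memory usable.
        long x = maxFramesPerBatch(1L << 30, 4L << 20, 0.5);
        System.out.println("X = " + x + ", batches for 1000 frames = " + batchCount(1000, x));
    }
}
```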

I had expected that using the maximum possible value of X would give me the largest speed-up by minimizing the number of kernels used. Instead, I found that using a smaller group (X/2 or X/4) gives me better speeds. I’m trying to figure out why breaking the GPU processing into smaller groups, rather than having the GPU process the maximum amount it can handle at one time, gives me a speed increase, and how I can determine the best value of X.
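One way to look for the best X empirically is to time the same workload at X, X/2, X/4 and so on, then keep the fastest. The Java sketch below only illustrates that idea. `BatchSizeTuner` and the `processBatch` callback are hypothetical stand-ins for whatever uploads a batch, runs the kernels, and reads the results back. The callback is assumed to block until the GPU has finished (for example by waiting on the command queue), otherwise the timings mean little.

```java
import java.util.function.LongConsumer;

/** Hypothetical timing harness for choosing a batch size. */
final class BatchSizeTuner {

    /** Processes all frames in batches of the given size and returns elapsed nanoseconds. */
    static long timeRun(long totalFrames, long framesPerBatch, LongConsumer processBatch) {
        if (framesPerBatch < 1) {
            throw new IllegalArgumentException("framesPerBatch must be at least 1");
        }
        long start = System.nanoTime();
        for (long done = 0; done < totalFrames; done += framesPerBatch) {
            // The last batch may be smaller than framesPerBatch.
            processBatch.accept(Math.min(framesPerBatch, totalFrames - done));
        }
        return System.nanoTime() - start;
    }

    /** Tries maxFrames, maxFrames/2, maxFrames/4, ... and returns the fastest candidate. */
    static long fastestBatchSize(long totalFrames, long maxFrames, int halvings,
                                 LongConsumer processBatch) {
        if (maxFrames < 1) {
            throw new IllegalArgumentException("maxFrames must be at least 1");
        }
        long best = maxFrames;
        long bestTime = Long.MAX_VALUE;
        long candidate = maxFrames;
        for (int i = 0; i <= halvings && candidate >= 1; i++, candidate /= 2) {
            long elapsed = timeRun(totalFrames, candidate, processBatch);
            if (elapsed < bestTime) {
                bestTime = elapsed;
                best = candidate;
            }
        }
        return best;
    }
}
```

In practice each candidate would be run several times after a warm-up pass, since the first launch usually includes one-off setup cost.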

My current tests have been running on GPU kernels that use very little processing power (both kernels decimate the output by selecting part of the input and returning it). However, I am fairly certain the same effects occur when I activate all kernels, which do a larger degree of processing on the values before returning.

1 Answer

Editorial Team · Answer 1 · 2026-05-25T01:32:50+00:00

The short answer is, it’s complicated. There are many factors at play. These include (but are not limited to):

Amount of local memory you are using.
Amount of private memory you are using.
A limit on the maximum number of work groups a streaming multiprocessor (SM) can handle at once.
Exceeding register limits, which forces values to spill into slower memory and slows down access.
And many more…
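To show how these limits interact, here is a rough Java sketch that estimates how many work groups can be resident on one SM at a time. It is a simplified model, not a vendor formula: `OccupancyEstimate` and all parameter names are hypothetical, the numbers in `main` are illustrative, and real hardware adds allocation granularity and other limits that are ignored here. The point is that the tightest resource wins, so a small increase in per-work-item register use can drop the resident count by a whole work group.

```java
/** Hypothetical, simplified model of per-SM resource partitioning. */
final class OccupancyEstimate {

    /** Work groups that fit on one SM, limited by local memory, registers, and a hardware cap. */
    static int residentWorkGroups(int localMemPerSm, int localMemPerGroup,
                                  int registersPerSm, int registersPerItem,
                                  int itemsPerGroup, int maxGroupsPerSm) {
        // A kernel that uses no local memory is not limited by it.
        int byLocalMem = localMemPerGroup == 0
                ? maxGroupsPerSm
                : localMemPerSm / localMemPerGroup;

        // Registers are consumed per work item, so a group needs items * registers.
        int registersPerGroup = registersPerItem * itemsPerGroup;
        int byRegisters = registersPerGroup == 0
                ? maxGroupsPerSm
                : registersPerSm / registersPerGroup;

        return Math.min(maxGroupsPerSm, Math.min(byLocalMem, byRegisters));
    }

    public static void main(String[] args) {
        // Illustrative device: 16 KiB local memory, 8192 registers, at most 8 groups per SM.
        // Illustrative kernel: 4 KiB local memory per group, 256 work items per group.
        System.out.println(residentWorkGroups(16384, 4096, 8192, 10, 256, 8));
        // Same kernel with one more register per work item.
        System.out.println(residentWorkGroups(16384, 4096, 8192, 11, 256, 8));
    }
}
```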

I recommend you check out the following link:

http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

In particular, check out section 5.3, “Dynamic Partitioning of SM Resources.” The text is meant to be general purpose but uses CUDA for its examples. The concepts apply to OpenCL just the same.

This text comes from the following book:

http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1314279939&sr=8-2

For what it’s worth, I found this book to be very informative. It will give you a deeper understanding of the hardware, which will allow you to answer questions like this.
