I’ve programmed a least squares optimization for CUDA. It works fine when it optimizes

Question

Asked: June 17, 2026 (2026-06-17T16:51:43+00:00)

I’ve programmed a least squares optimization for CUDA. It works fine when it optimizes one data set. For further use, I have to extend it to handle three data sets at the same time.
The code consists of three kernels, with some host code between them to prepare data and so on.
A simple implementation would be to call the program three times, once for each data set.

[Figure: serial computation]

My task, however, is to find out how to run it on all three data sets at the same time.
[Figure: multithreaded computation]

Is it possible, or even a good idea, to call the program or launch concurrent kernels from three host threads at the same time, using a library such as OpenMP or POSIX threads? Or should I try to write my own scheduler?

1 Answer

Editorial Team · Answer 1 · 2026-06-17T16:51:44+00:00

When you say “four blocks at the same time”, do you mean four blocks per multiprocessor (MP)?

According to the additional comments on your question, you probably have 384/32 = 12 multiprocessors (MPs) on your 560 Ti. If you launch more than 12 × 4 = 48 blocks for one kernel, that kernel alone occupies every available block slot, so you won’t be able to run your three kernels concurrently.
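The estimate above can be checked on the actual card instead of being derived from the core count, since the number of cores per multiprocessor differs between GPU architectures. The following is a minimal sketch, assuming device 0 is the GPU in question and that the limit of four resident blocks per multiprocessor from the question holds for your kernel; `maxResidentBlocks` and `blocksPerMP` are hypothetical names introduced for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on blocks resident on the whole device, given an assumed
// per-multiprocessor limit (hypothetical helper, not part of the CUDA API).
int maxResidentBlocks(int multiprocessors, int blocksPerMP) {
    return multiprocessors * blocksPerMP;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "No usable CUDA device found\n");
        return 1;
    }

    // Assumption: four blocks per multiprocessor, as stated in the question.
    const int blocksPerMP = 4;

    std::printf("Device: %s\n", prop.name);
    std::printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    std::printf("Supports concurrent kernels: %s\n",
                prop.concurrentKernels ? "yes" : "no");
    std::printf("Blocks that fit at once (assumed limit): %d\n",
                maxResidentBlocks(prop.multiProcessorCount, blocksPerMP));
    return 0;
}
```

If the grid of a single kernel is larger than the last number printed, kernels from other streams will mostly have to wait until that grid starts to drain.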

If your kernel does exceed that limit, the task is too large for concurrent kernel execution, but you may still be able to overlap data transfer with kernel execution, as shown in this blog.

You can find more info in the section Asynchronous Concurrent Execution of the CUDA programming guide.
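As a rough illustration of overlapping transfers with kernel execution, the sketch below gives each of the three data sets its own stream and queues copy-in, kernel, and copy-out on it. It is not the asker's code: `scaleKernel` is a hypothetical stand-in for the real optimization kernels, and the data size and launch configuration are arbitrary assumptions. Pinned host memory (`cudaMallocHost`) is used because asynchronous copies from pageable memory generally do not overlap with other work.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s\n",                  \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for one of the three optimization kernels.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Number of blocks needed to cover n elements.
int gridSize(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

int main() {
    const int kSets = 3;          // three data sets, as in the question
    const int n = 1 << 20;        // assumed size of each data set
    const int blockSize = 256;    // assumed launch configuration
    const size_t bytes = n * sizeof(float);

    float *host[kSets];
    float *dev[kSets];
    cudaStream_t stream[kSets];

    for (int s = 0; s < kSets; ++s) {
        CHECK(cudaMallocHost((void **)&host[s], bytes));  // pinned host memory
        CHECK(cudaMalloc((void **)&dev[s], bytes));
        CHECK(cudaStreamCreate(&stream[s]));
        for (int i = 0; i < n; ++i) host[s][i] = 1.0f;
    }

    // Work in different streams may overlap; work within one stream
    // runs in the order it was queued.
    for (int s = 0; s < kSets; ++s) {
        CHECK(cudaMemcpyAsync(dev[s], host[s], bytes,
                              cudaMemcpyHostToDevice, stream[s]));
        scaleKernel<<<gridSize(n, blockSize), blockSize, 0, stream[s]>>>(
            dev[s], n, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(host[s], dev[s], bytes,
                              cudaMemcpyDeviceToHost, stream[s]));
    }

    CHECK(cudaDeviceSynchronize());  // wait for all three streams

    for (int s = 0; s < kSets; ++s) {
        std::printf("set %d: first element = %f\n", s, host[s][0]);
        CHECK(cudaStreamDestroy(stream[s]));
        CHECK(cudaFree(dev[s]));
        CHECK(cudaFreeHost(host[s]));
    }
    return 0;
}
```

Whether the copies and kernels really overlap depends on the hardware (for example the number of copy engines), so a profiler timeline is the reliable way to confirm it.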

On the other hand, since you also have some host code for each data set, you could speed up your program by running the host code of one data set and the kernel of another data set concurrently.

For host-code parallelism, you could use POSIX threads or OpenMP, and have each host thread launch its kernels in its own CUDA stream.
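One way to sketch the thread-per-data-set idea is shown below. It uses C++11 `std::thread` in place of POSIX threads or OpenMP purely to keep the example short; the structure would be the same with either library. `prepareOnHost`, `processDataset`, and `scaleKernel` are hypothetical placeholders for the real host preparation and kernels, and the data size is an arbitrary assumption.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the three optimization kernels.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical host-side preparation step for one data set.
void prepareOnHost(std::vector<float> &data, int setIndex) {
    for (float &v : data) v = static_cast<float>(setIndex + 1);
}

// Runs in its own host thread: prepare on the CPU, then use a private stream.
void processDataset(int setIndex, std::vector<float> *data) {
    prepareOnHost(*data, setIndex);

    const int n = static_cast<int>(data->size());
    const size_t bytes = n * sizeof(float);
    cudaStream_t stream;
    float *dev = nullptr;

    if (cudaStreamCreate(&stream) != cudaSuccess) return;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return;
    }

    // The host buffer here is pageable, so these copies may not overlap with
    // other work; pinned memory would be needed for full overlap.
    cudaMemcpyAsync(dev, data->data(), bytes, cudaMemcpyHostToDevice, stream);
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, 2.0f);
    cudaMemcpyAsync(data->data(), dev, bytes, cudaMemcpyDeviceToHost, stream);

    // Blocks only this host thread; the other threads keep working.
    cudaStreamSynchronize(stream);

    cudaFree(dev);
    cudaStreamDestroy(stream);
}

int main() {
    const int kSets = 3;  // three data sets, as in the question
    std::vector<std::vector<float>> sets(kSets, std::vector<float>(1 << 16));

    std::vector<std::thread> workers;
    for (int s = 0; s < kSets; ++s)
        workers.emplace_back(processDataset, s, &sets[s]);
    for (std::thread &w : workers) w.join();

    for (int s = 0; s < kSets; ++s)
        std::printf("set %d: first element = %f\n", s, sets[s][0]);
    return 0;
}
```

With this layout, while one thread waits on its stream, another can still be running its host preparation, which is the overlap of host code and kernels described above.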
