I’ve programmed a least squares optimization in CUDA. It works fine when it optimizes one data set. Now I have to extend it so that it handles three data sets at the same time.
The code consists of three kernels and some host code between them to prepare data etc.
A simple implementation would be to call the program three times, once for each data set.

But my task is to find out how I can run the three optimizations at the same time.

Is it possible, or even a good idea, to call the program or launch concurrent kernels from three host threads at the same time, using something like OpenMP or POSIX threads? Or should I try to write my own scheduler?
When you say “four blocks at the same time”, do you mean four blocks per multiprocessor (MP)?
Going by your additional comments on the question, your 560 Ti probably has 384/48 = 8 multiprocessors (compute capability 2.1 devices have 48 cores per MP). If a single kernel launches more than 8*4 = 32 blocks, it already occupies every MP, so your three kernels will not really run concurrently.
In this case, the scale of your task is too large for concurrent kernel execution, but you may still be able to overlap the data transfer and the kernel execution as shown in this blog.
You can find more info in the section Asynchronous Concurrent Execution of the CUDA programming guide.
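As a rough illustration of the overlap described above, here is a minimal sketch that gives each of the three data sets its own CUDA stream and issues copy-in, kernel and copy-out asynchronously. The kernel `scaleKernel`, the function `runThreeStreams` and the problem size are made up for the example and only stand in for your real optimization kernels. It assumes pinned host memory, which asynchronous copies need in order to overlap with kernel execution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the optimization kernels.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Report a CUDA error and bail out (resources are not freed on this path).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            return -1;                                                \
        }                                                             \
    } while (0)

// Returns the number of wrong results, or -1 on a CUDA error.
int runThreeStreams() {
    const int kSets = 3;          // three data sets
    const int n = 1 << 20;        // elements per data set (arbitrary)
    const size_t bytes = n * sizeof(float);
    float *h[kSets], *d[kSets];
    cudaStream_t s[kSets];

    for (int k = 0; k < kSets; ++k) {
        CHECK(cudaMallocHost((void **)&h[k], bytes));  // pinned host memory
        CHECK(cudaMalloc((void **)&d[k], bytes));
        CHECK(cudaStreamCreate(&s[k]));
        for (int i = 0; i < n; ++i) h[k][i] = 1.0f;
    }

    // Work in different streams may overlap; work within one stream
    // still executes in issue order.
    for (int k = 0; k < kSets; ++k) {
        CHECK(cudaMemcpyAsync(d[k], h[k], bytes, cudaMemcpyHostToDevice, s[k]));
        scaleKernel<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h[k], d[k], bytes, cudaMemcpyDeviceToHost, s[k]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for all three streams

    int errors = 0;
    for (int k = 0; k < kSets; ++k) {
        for (int i = 0; i < n; ++i)
            if (h[k][i] != 2.0f) ++errors;
        CHECK(cudaStreamDestroy(s[k]));
        CHECK(cudaFree(d[k]));
        CHECK(cudaFreeHost(h[k]));
    }
    return errors;
}

int main() {
    int errors = runThreeStreams();
    printf("errors: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Whether the copies and kernels actually overlap depends on the device and, on older hardware, on the order in which the operations are issued, so it is worth checking the timeline in a profiler rather than assuming it.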
On the other hand, since you also have some host code for each data set, you could speed up your program by running the host code of one data set and the kernel of another data set concurrently.
For host code parallelism, you could use POSIX threads or OpenMP, with one host thread per data set, and have each host thread launch its kernels into its own CUDA stream.
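A minimal sketch of that idea, using OpenMP for the host threads: each thread prepares its own data set on the CPU, then drives the GPU through a private stream and waits only on that stream. The names `addOne`, `prepareOnHost`, `processDataset` and `runAllDatasets` are hypothetical placeholders for your own kernels and host code. It assumes a CUDA version in which all host threads share one device context (4.0 or later) and a GCC host compiler, built with something like `nvcc -Xcompiler -fopenmp`; without that flag the pragma is ignored and the loop simply runs serially.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the optimization kernels.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical stand-in for the host code that prepares one data set.
static void prepareOnHost(std::vector<float> &buf, int set) {
    for (size_t i = 0; i < buf.size(); ++i) buf[i] = (float)set;
}

// Report a CUDA error and bail out (resources are not freed on this path).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            return -1;                                                \
        }                                                             \
    } while (0)

// Runs on one host thread. Returns the number of wrong results, or -1.
int processDataset(int set, int n) {
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n);
    float *dev = NULL;
    cudaStream_t stream;

    prepareOnHost(host, set);  // CPU work; can overlap other threads' GPU work

    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc((void **)&dev, bytes));
    // The vector is pageable memory, so these copies may not overlap with
    // kernels; pinned memory would be needed for that.
    CHECK(cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, stream));
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(host.data(), dev, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));  // wait for this thread's stream only

    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != (float)set + 1.0f) ++errors;

    CHECK(cudaFree(dev));
    CHECK(cudaStreamDestroy(stream));
    return errors;
}

// Returns 0 if all three data sets were processed correctly.
int runAllDatasets() {
    const int kSets = 3;
    const int n = 1 << 20;  // elements per data set (arbitrary)
    int errors[kSets];

    // One host thread per data set.
    #pragma omp parallel for num_threads(kSets)
    for (int set = 0; set < kSets; ++set)
        errors[set] = processDataset(set, n);

    int failed = 0;
    for (int set = 0; set < kSets; ++set)
        if (errors[set] != 0) ++failed;
    return failed;
}

int main() {
    int failed = runAllDatasets();
    printf("failed data sets: %d\n", failed);
    return failed == 0 ? 0 : 1;
}
```

The same structure works with POSIX threads; OpenMP is used here only because it keeps the thread setup to a single pragma.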