Which is better? I need to process data in several steps, and it seems to me that I have two options:
1) use one big kernel
2) use streams with one kernel for each step
There is some latency before a kernel is executed, but does it really matter in this case? Is the latency for one big kernel the same as the sum of the latencies for several smaller kernels?
Are there any advantages to one approach over the other?
Thanks guys.
Launch latency for a kernel on a Fermi card is on the order of 10 µs, so it is nothing to worry about. That makes sense: to render a scene in a game, one has to run many different shaders (which are kernels).
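If you would rather measure the launch cost on your own card than take a ballpark figure, something like the rough sketch below should do. The empty kernel and the launch count are made up for illustration. Note that it times back-to-back launches, so it reports the average cost per launch rather than the latency of a single isolated launch.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical do-nothing kernel: whatever time we measure is overhead, not work.
__global__ void emptyKernel() {}

int main() {
    const int launches = 10000;  // assumed sample size, pick whatever you like

    // Warm-up launch so one-time context setup is not included in the timing.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();  // wait until every queued launch has finished
    auto stop = std::chrono::steady_clock::now();

    double totalUs =
        std::chrono::duration<double, std::micro>(stop - start).count();
    std::printf("average cost per launch: %.2f us\n", totalUs / launches);
    return 0;
}
```

The number you get depends on the GPU, driver and OS, so treat it as a rough indication only.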
A kernel has to read the data that it will process from global memory and write the results back to global memory, so each separate kernel incurs a full read/write cycle. You may be able to speed things up if you can chain multiple steps together in one big kernel that is bracketed by a single read/write cycle, with the intermediate results kept in registers.
As an example, if you need to perform operations A, B and C, chaining them might give you READ – A – B – C – WRITE while separate kernels would give you READ – A – WRITE – READ – B – WRITE – READ – C – WRITE.
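To make the two patterns concrete, here is a minimal sketch. The steps are placeholders I made up (A doubles, B adds one, C squares), and the kernel names, array size and launch configuration are assumptions too. The three small kernels each go through global memory, while the fused kernel reads and writes it only once.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Separate kernels: each one is its own READ - step - WRITE cycle.
__global__ void kernelA(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
}
__global__ void kernelB(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] + 1.0f;
}
__global__ void kernelC(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

// Fused kernel: READ - A - B - C - WRITE, intermediate value stays in a register.
__global__ void fusedABC(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = d[i];  // READ
        x = x * 2.0f;    // A
        x = x + 1.0f;    // B
        x = x * x;       // C
        d[i] = x;        // WRITE
    }
}

int main() {
    const int n = 1 << 20;  // assumed problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);

    // Small integer inputs so both paths give exactly representable results.
    std::vector<float> input(n), viaSeparate(n), viaFused(n);
    for (int i = 0; i < n; ++i) input[i] = float(i % 100);

    float *dSeparate, *dFused;
    cudaMalloc(&dSeparate, bytes);
    cudaMalloc(&dFused, bytes);
    cudaMemcpy(dSeparate, input.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dFused, input.data(), bytes, cudaMemcpyHostToDevice);

    // Three launches on the same (default) stream run in order.
    kernelA<<<blocks, threads>>>(dSeparate, n);
    kernelB<<<blocks, threads>>>(dSeparate, n);
    kernelC<<<blocks, threads>>>(dSeparate, n);

    // One launch doing all three steps.
    fusedABC<<<blocks, threads>>>(dFused, n);

    // These copies wait for the kernels above to finish.
    cudaMemcpy(viaSeparate.data(), dSeparate, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(viaFused.data(), dFused, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (viaSeparate[i] != viaFused[i]) ++mismatches;
    }
    std::printf("mismatches between the two versions: %d\n", mismatches);

    cudaFree(dSeparate);
    cudaFree(dFused);
    return mismatches != 0;
}
```

Fusing like this only works directly when each output element depends on the same thread's input. If step B needs results that other blocks produced in step A, you generally still need a kernel boundary between them.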
Remember, even if you run only a single kernel, you can still keep your code readable by breaking the individual steps out into separate device functions.
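A minimal sketch of that layout, with hypothetical names (stepA, stepB, stepC, pipeline) and placeholder arithmetic: each step lives in its own __device__ function and a single kernel chains them. The compiler will typically inline small device functions like these, so the split should cost little or nothing at run time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-element steps; the arithmetic is only a stand-in.
__device__ float stepA(float x) { return x * 2.0f; }
__device__ float stepB(float x) { return x + 1.0f; }
__device__ float stepC(float x) { return x * x; }

// One kernel, one read and one write per element, steps kept readable.
__global__ void pipeline(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = stepC(stepB(stepA(in[i])));
    }
}

int main() {
    const int n = 8;  // tiny assumed size, just to show the call
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = float(i);

    float *dIn, *dOut;
    cudaMalloc(&dIn, sizeof(host));
    cudaMalloc(&dOut, sizeof(host));
    cudaMemcpy(dIn, host, sizeof(host), cudaMemcpyHostToDevice);

    pipeline<<<1, n>>>(dIn, dOut, n);

    // The copy back waits for the kernel to finish.
    cudaMemcpy(host, dOut, sizeof(host), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) std::printf("%g ", host[i]);
    std::printf("\n");

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```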