The NVIDIA CUDA C Programming Guide 4.0, section 3.2.5.5.4, says that two commands from different streams cannot run concurrently if a device-to-device memory copy is issued in between them. I am not sure exactly what this means, and I hope someone can clear up my confusion.
Let’s say my program has two streams, stream 0 and stream 1. The kernels are launched into these streams in the following order:
kernel 0.0 (stream 0; assume the execution time is 10 ms)
kernel 1.0 (stream 1; assume the execution time is 1 ms)
kernel 1.1 (stream 1; assume the execution time is 3 ms)
kernel 1.2 (stream 1; this kernel causes a device-to-device memory copy, assume the execution time is 1 ms)
kernel 1.3 (stream 1; assume the execution time is 6 ms)
Let’s also assume the program has no other overhead and the GPU has enough SMs to run these kernels concurrently. Can kernel 0.0 run concurrently with kernel 1.2 and kernel 1.3? And what is the running time of the whole program?
As mentioned, a device-to-device memory copy in the sense of that section is one issued with cudaMemcpy() from the host; kernels are free to read and write global memory as they please, so memory traffic inside kernel 1.2 does not count as such a copy. Kernels may overlap if they are in different streams, but there is no guarantee. The exact speedup will depend on how much of the SMs each kernel uses. NVIDIA recommends timing kernel execution with events (one recorded at the start and one at the stop) to determine whether the overlapping version is faster than the sequential one. You can compare this result against a version with all kernels launched into stream 0, or against a run in the profiler, which serializes kernel execution.
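To make the event-timing suggestion concrete, here is a minimal sketch, not a definitive benchmark. It assumes the kernels from the question can be stood in for by a hypothetical busy-wait kernel (spinKernel, my own name) whose durations are only approximate, and a hypothetical helper timeLaunches that runs the same five launches either in two streams or all in one stream. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernels in the question: spins for roughly
// `cycles` GPU clock ticks so that each launch has a measurable duration.
__global__ void spinKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy wait */ }
}

// Launches the five kernels from the question and returns the elapsed time
// in milliseconds, measured with events. If useTwoStreams is false, every
// kernel goes into the same stream, so the launches are serialized.
float timeLaunches(bool useTwoStreams, long long cyclesPerMs)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaStream_t other = useTwoStreams ? s1 : s0;

    cudaEvent_t start, otherDone, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&otherDone);
    cudaEventCreate(&stop);

    cudaDeviceSynchronize();          // begin from an idle device
    cudaEventRecord(start, s0);

    spinKernel<<<1, 1, 0, s0>>>(10 * cyclesPerMs);    // kernel 0.0, ~10 ms
    spinKernel<<<1, 1, 0, other>>>(1 * cyclesPerMs);  // kernel 1.0, ~1 ms
    spinKernel<<<1, 1, 0, other>>>(3 * cyclesPerMs);  // kernel 1.1, ~3 ms
    spinKernel<<<1, 1, 0, other>>>(1 * cyclesPerMs);  // kernel 1.2, ~1 ms (memory traffic
                                                      // inside a kernel is not a cudaMemcpy)
    spinKernel<<<1, 1, 0, other>>>(6 * cyclesPerMs);  // kernel 1.3, ~6 ms

    // Make s0 wait for the work in the other stream, so that the stop event
    // is only reached once all five kernels have finished.
    cudaEventRecord(otherDone, other);
    cudaStreamWaitEvent(s0, otherDone, 0);
    cudaEventRecord(stop, s0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(otherDone);
    cudaEventDestroy(stop);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return ms;
}

int main()
{
    // The clock rate is reported in kHz, i.e. cycles per millisecond.
    // Treating clock64() ticks as running at this rate is an approximation.
    int clockKHz = 0;
    cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, 0);
    long long cyclesPerMs = clockKHz;

    printf("two streams: %.2f ms\n", timeLaunches(true, cyclesPerMs));
    printf("one stream:  %.2f ms\n", timeLaunches(false, cyclesPerMs));
    return 0;
}
```

Under the idealized assumptions in the question, full overlap would take about max(10, 1 + 3 + 1 + 6) = 11 ms and full serialization about 10 + 1 + 3 + 1 + 6 = 21 ms. Where the measured two-stream time actually lands between those bounds depends on the hardware and driver, since overlap is not guaranteed.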