I created streams in this way:
cudaStream_t stream0;
cudaStream_t stream1;
cudaStreamCreate( &stream0);
cudaStreamCreate( &stream1);
I run the kernel functions like
singlecore<<<1,1>>>(devL2,1000);
singlecore<<<1,1,0,stream0>>>(devL2,1000);
The two kernels are not executed currently. But if I execute the first kernel in stream1 as:
singlecore<<<1,1,0,stream1>>>(devL2,1000);
singlecore<<<1,1,0,stream0>>>(devL2,1000);
they will execute currently.
I wonder if the kernel function in default stream can not be executed currently.
Yes there is a limitation on cuda commands issued to the default stream. Referring to the C programming guide section on implicit synchronization:
“Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
…
•any CUDA command to the default stream,
“
So as a general rule of thumb, for overlapped copy and compute operations, it’s easiest to program all such operations in a set of non-default streams. There’s a bit of a loophole (which you’ve discovered) where it’s possible to get overlap with commands issued in the default stream (and other streams), but it requires careful understanding of the restrictions between the default stream and other streams, as well as careful attention to the order in which you issue commands. A good example is explained in the C programming guide. Read all the way through the section on “overlapping behavior”.
In your first example, the kernel issued to the default stream blocks execution of the kernel issued to the other stream. In your second example, you can have concurrency because the kernel issued to the non-default stream does not block the execution of the kernel issued to the default stream.