I have two tasks. Each of them perform copy to device (D), run kernel

Question

Asked: June 10, 2026 (2026-06-10T11:13:16+00:00)

I have two tasks. Each of them performs copy to device (D), run kernel (R), and copy to host (H) operations. I am overlapping the copy to device of task2 (D2) with the kernel run of task1 (R1). In addition, I am overlapping the kernel run of task2 (R2) with the copy to host of task1 (H1).

I also record start and stop time of D, R, H ops of each task using cudaEventRecord.

I have GeForce GT 555M, CUDA 4.1, and Fedora 16.

I have three scenarios:

Scenario1: I use one stream for each task. I place start/stop events right before/after the ops.

Scenario2: I use one stream for each task. I place the start event of the second of the overlapping ops before the start of first one (i.e. place start R1 before start D2, and place start H1 before start R2).

Scenario3: I use two streams for each task. I use cudaStreamWaitEvent to synchronize between these two streams. One stream is used for the D and H (copy) ops, the other one is used for the R op. I place start/stop events right before/after the ops.

Scenario1 fails to overlap ops (neither D2-R1 nor R2-H1 can be overlapped), whereas Scenario2 and Scenario3 succeed. My question is: why does Scenario1 fail while the other ones succeed?

For each scenario I measure the overall time for performing Task1 and Task2. R1 and R2 each take 5 ms to run. Since Scenario1 fails to overlap ops, its overall time is 10 ms more than that of Scenario2 and Scenario3.

Here is the pseudo-code for the scenarios:

Scenario1 (FAILS): use stream1 for task1, use stream2 for task2

start overall 

start D1 on stream1 
D1 on stream1
stop D1 on stream1 

start D2 on stream2
D2 on stream2
stop D2 on stream2

start R1 on stream1
R1 on stream1
stop R1 on stream1

start R2 on stream2
R2 on stream2
stop R2 on stream2

start H1 on stream1
H1 on stream1
stop H1 on stream1

start H2 on stream2
H2 on stream2
stop H2 on stream2

stop overall

Scenario2 (SUCCEEDS): use stream1 for task1, use stream2 for task2, move up the start event of the second of the overlapping ops.

start overall

start D1 on stream1
D1 on stream1
stop D1 on stream1 

start R1 on stream1 //moved-up

start D2 on stream2
D2 on stream2
stop D2 on stream2

R1 on stream1
stop R1 on stream1

start H1 on stream1 //moved-up

start R2 on stream2
R2 on stream2
stop R2 on stream2

H1 on stream1
stop H1 on stream1

start H2 on stream2
H2 on stream2
stop H2 on stream2

stop overall

Scenario3 (SUCCEEDS): use stream1 and 3 for task1, use stream2 and 4 for task2

start overall

start D1 on stream1
D1 on stream1
stop D1 on stream1 

start D2 on stream2
D2 on stream2
stop D2 on stream2

start R1 on stream3
R1 on stream3
stop R1 on stream3

start R2 on stream4
R2 on stream4
stop R2 on stream4

start H1 on stream1
H1 on stream1
stop H1 on stream1

start H2 on stream2
H2 on stream2
stop H2 on stream2

stop overall

Here is the overall timing info (in ms, as reported by cudaEventElapsedTime) for all scenarios:
Scenario1 = 39.390240
Scenario2 = 29.190241
Scenario3 = 29.298208

I also attach the CUDA code below:

#include <stdio.h>
#include <cuda_runtime.h>
#include <sys/time.h>

__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        {
        C[i] = A[i] + B[N - 1 - i];  // N-1-i keeps the index inside [0, N-1]; B[N-i] would read past the end when i == 0
        C[i] = A[i] + B[i] * 2;
        C[i] = A[i] + B[i] * 3;
        C[i] = A[i] + B[i] * 4;
        C[i] = A[i] + B[i];
        }
}

void overlap()
{

float* h_A;
float *d_A, *d_C;
float* h_A2;
float *d_A2, *d_C2;

int N = 10000000;
size_t size = N * sizeof(float); 

cudaMallocHost((void**) &h_A, size);
cudaMallocHost((void**) &h_A2, size);

// Allocate vector in device memory
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_C, size);
cudaMalloc((void**)&d_A2, size);
cudaMalloc((void**)&d_C2, size);

float fTimCpyDev1, fTimKer1, fTimCpyHst1, fTimCpyDev2, fTimKer2, fTimCpyHst2;
float fTimOverall3, fTimOverall1, fTimOverall2;

for (int i = 0; i<N; ++i)
    {
    h_A[i] = 1;
    h_A2[i] = 5;
    }

int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

cudaStream_t csStream1, csStream2, csStream3, csStream4;
cudaStreamCreate(&csStream1);
cudaStreamCreate(&csStream2);
cudaStreamCreate(&csStream3);
cudaStreamCreate(&csStream4);

cudaEvent_t ceEvStart, ceEvStop; 
cudaEventCreate( &ceEvStart );
cudaEventCreate( &ceEvStop );

cudaEvent_t ceEvStartCpyDev1, ceEvStopCpyDev1, ceEvStartKer1, ceEvStopKer1, ceEvStartCpyHst1, ceEvStopCpyHst1;
cudaEventCreate( &ceEvStartCpyDev1 );
cudaEventCreate( &ceEvStopCpyDev1 );
cudaEventCreate( &ceEvStartKer1 );
cudaEventCreate( &ceEvStopKer1 );
cudaEventCreate( &ceEvStartCpyHst1 );
cudaEventCreate( &ceEvStopCpyHst1 );
cudaEvent_t ceEvStartCpyDev2, ceEvStopCpyDev2, ceEvStartKer2, ceEvStopKer2, ceEvStartCpyHst2, ceEvStopCpyHst2; 
cudaEventCreate( &ceEvStartCpyDev2 );
cudaEventCreate( &ceEvStopCpyDev2 );
cudaEventCreate( &ceEvStartKer2 );
cudaEventCreate( &ceEvStopKer2 );
cudaEventCreate( &ceEvStartCpyHst2 );
cudaEventCreate( &ceEvStopCpyHst2 );


//Scenario1

cudaDeviceSynchronize();

cudaEventRecord(ceEvStart, 0);

cudaEventRecord(ceEvStartCpyDev1, csStream1);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
cudaEventRecord(ceEvStopCpyDev1, csStream1);

cudaEventRecord(ceEvStartCpyDev2, csStream2);
cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
cudaEventRecord(ceEvStopCpyDev2, csStream2);

cudaEventRecord(ceEvStartKer1, csStream1); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream1>>>(d_A, d_A, d_C, N);
cudaEventRecord(ceEvStopKer1, csStream1); 

cudaEventRecord(ceEvStartKer2, csStream2); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream2>>>(d_A2, d_A2, d_C2, N);
cudaEventRecord(ceEvStopKer2, csStream2);

cudaEventRecord(ceEvStartCpyHst1, csStream1);
cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
cudaEventRecord(ceEvStopCpyHst1, csStream1);

cudaEventRecord(ceEvStartCpyHst2, csStream2);
cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
cudaEventRecord(ceEvStopCpyHst2, csStream2);

cudaEventRecord(ceEvStop, 0);
cudaDeviceSynchronize();

cudaEventElapsedTime( &fTimOverall1, ceEvStart, ceEvStop);
printf("Scenario1 overall time= %10f\n", fTimOverall1);


//Scenario2 

cudaDeviceSynchronize();

cudaEventRecord(ceEvStart, 0);

cudaEventRecord(ceEvStartCpyDev1, csStream1);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
cudaEventRecord(ceEvStopCpyDev1, csStream1);

cudaEventRecord(ceEvStartKer1, csStream1); //moved up 

cudaEventRecord(ceEvStartCpyDev2, csStream2);
cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
cudaEventRecord(ceEvStopCpyDev2, csStream2);

VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream1>>>(d_A, d_A, d_C, N);
cudaEventRecord(ceEvStopKer1, csStream1); 

cudaEventRecord(ceEvStartCpyHst1, csStream1); //moved up

cudaEventRecord(ceEvStartKer2, csStream2); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream2>>>(d_A2, d_A2, d_C2, N);
cudaEventRecord(ceEvStopKer2, csStream2);

cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
cudaEventRecord(ceEvStopCpyHst1, csStream1);

cudaEventRecord(ceEvStartCpyHst2, csStream2);
cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
cudaEventRecord(ceEvStopCpyHst2, csStream2);

cudaEventRecord(ceEvStop, 0);
cudaDeviceSynchronize();


cudaEventElapsedTime( &fTimOverall2, ceEvStart, ceEvStop);
printf("Scenario2 overall time= %10f\n", fTimOverall2);

//Scenario3
cudaDeviceSynchronize();

cudaEventRecord(ceEvStart, 0);

cudaEventRecord(ceEvStartCpyDev1, csStream1);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
cudaEventRecord(ceEvStopCpyDev1, csStream1);

cudaEventRecord(ceEvStartCpyDev2, csStream2);
cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
cudaEventRecord(ceEvStopCpyDev2, csStream2);

cudaStreamWaitEvent(csStream3, ceEvStopCpyDev1, 0);
cudaEventRecord(ceEvStartKer1, csStream3); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream3>>>(d_A, d_A, d_C, N);
cudaEventRecord(ceEvStopKer1, csStream3);

cudaStreamWaitEvent(csStream4, ceEvStopCpyDev2, 0);
cudaEventRecord(ceEvStartKer2, csStream4); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream4>>>(d_A2, d_A2, d_C2, N);
cudaEventRecord(ceEvStopKer2, csStream4);

cudaStreamWaitEvent(csStream1, ceEvStopKer1, 0);
cudaEventRecord(ceEvStartCpyHst1, csStream1);
cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
cudaEventRecord(ceEvStopCpyHst1, csStream1);

cudaStreamWaitEvent(csStream2, ceEvStopKer2, 0);
cudaEventRecord(ceEvStartCpyHst2, csStream2);
cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
cudaEventRecord(ceEvStopCpyHst2, csStream2);

cudaEventRecord(ceEvStop, 0);
cudaDeviceSynchronize();

cudaEventElapsedTime( &fTimOverall3, ceEvStart, ceEvStop);
printf("Scenario3 overall time = %10f\n", fTimOverall3);

cudaStreamDestroy(csStream1);
cudaStreamDestroy(csStream2);
cudaStreamDestroy(csStream3);
cudaStreamDestroy(csStream4);

cudaFree(d_A);
cudaFree(d_C);
cudaFreeHost(h_A);
cudaFree(d_A2);
cudaFree(d_C2);
cudaFreeHost(h_A2);

}

int main()
{

  overlap();
}

Thank you very much for your time in advance!

1 Answer

Editorial Team · Answer 1 · 2026-06-10T11:13:18+00:00

(Note: I’m more familiar with the Tesla series devices and don’t actually have a GT 555M to experiment with, so my results refer specifically to a C2070. I don’t know how many copy engines the 555M has, but I expect the issues described below are what’s causing the behavior you are seeing.)
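
If you want to check the copy-engine question on your own device, the runtime API reports it. The following is a minimal sketch, assuming the GPU of interest is device 0; asyncEngineCount and concurrentKernels are fields of cudaDeviceProp, and everything else is illustrative.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // asyncEngineCount: 0 = no copy/kernel overlap,
    //                   1 = a copy can overlap kernel execution,
    //                   2 = copies in both directions can also overlap each other.
    // concurrentKernels: non-zero if several kernels can run at the same time.
    std::printf("%s: asyncEngineCount=%d, concurrentKernels=%d\n",
                prop.name, prop.asyncEngineCount, prop.concurrentKernels);
    return 0;
}
```

With a single copy engine, H2D and D2H copies share one queue, so the separate copyH2D and copyD2H queues discussed below would collapse into one.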

The issue is the lesser-known fact that the cudaEventRecords are CUDA operations too, and they also must be placed in one of the hardware queues before getting launched/executed. (A complicating factor is that, since cudaEventRecord is neither a copy operation, nor a compute kernel, it can actually go in any hardware queue. My understanding is that they usually go in the same hardware queue as the preceding CUDA operation of the same stream, but as this is not specified in the docs the actual operation may be device/driver dependent.)

If I can extend your notation to use ‘E’ for ‘Event record’, and detail how the hardware queues are filled (similar to what is done in the “CUDA C/C++ Streams and Concurrency” webinar) then, in your Scenario 1 example, you have:

Issue order for CUDA operations:
   ED1
   D1
   ED1
   ED2
   D2
   ED2
   ER1
   R1
   ER1
   ...

These fill the queues like:

Hardware Queues:    copyH2D     Kernel
                    -------     ------
                    ED1       * R1
                    D1       /  ER1
                    ED1     /   ...
                    ED2    /
                    D2    /
                    ED2  /
                    ER1 *

and you can see that R1, by virtue of being in stream 1, will not execute until ER1 has completed, which won’t happen until both D1 and D2 have completed since they are all serialized in the H2D copy queue.

By moving the cudaEventRecord, ER1, up in Scenario 2, you avoid this, since all CUDA operations in stream 1 prior to R1 complete before D2. This permits R1 to launch concurrently with D2.

Hardware Queues:    copyH2D     Kernel
                    -------     ------
                    ED1      *  R1
                    D1      /   ER1
                    ED1    /    ...
                    ER1   *
                    ED2    
                    D2    
                    ED2
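
The two queue diagrams above can be replayed with a small host-only model. This is only a sketch under stated assumptions, not a description of what the driver really does: there are three serial hardware queues (copy H2D, kernel, copy D2H); each event record goes into the queue of the preceding operation of its stream (copy H2D if the stream has no earlier operation); and every copy and every kernel takes 5 ms (the kernel time is from the question, the copy time is made up). The names Op, makespan, timed, scenario1 and scenario2 are hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Queue indices double as operation kinds; EVENT has no queue of its own.
enum Kind { H2D = 0, KERNEL = 1, D2H = 2, EVENT = 3 };

struct Op {
    int stream;  // CUDA stream the operation was issued to
    Kind kind;   // which engine it needs, or EVENT for cudaEventRecord
    int ms;      // assumed duration (0 for event records)
};

// Replays the ops in issue order and returns the time the last one finishes.
// An op starts once the previous op in its hardware queue AND the previous
// op in its stream have both finished.
int makespan(const std::vector<Op>& ops) {
    int queueFree[3] = {0, 0, 0};   // time at which each queue becomes free
    std::map<int, int> streamFree;  // time at which each stream becomes free
    std::map<int, int> lastQueue;   // queue used by the stream's previous op
    int end = 0;
    for (const Op& op : ops) {
        int q = op.kind;
        if (op.kind == EVENT) {
            // Assumption: events follow the stream's preceding operation.
            std::map<int, int>::const_iterator it = lastQueue.find(op.stream);
            q = (it == lastQueue.end()) ? H2D : it->second;
        }
        const int start = std::max(queueFree[q], streamFree[op.stream]);
        const int finish = start + op.ms;
        queueFree[q] = finish;
        streamFree[op.stream] = finish;
        lastQueue[op.stream] = q;
        end = std::max(end, finish);
    }
    return end;
}

// Appends "start event, operation, stop event" as in Scenario 1.
void timed(std::vector<Op>& ops, int stream, Kind kind, int ms) {
    ops.push_back({stream, EVENT, 0});
    ops.push_back({stream, kind, ms});
    ops.push_back({stream, EVENT, 0});
}

// Scenario 1: events placed right before/after each operation.
std::vector<Op> scenario1() {
    std::vector<Op> ops;
    timed(ops, 1, H2D, 5);     // D1
    timed(ops, 2, H2D, 5);     // D2
    timed(ops, 1, KERNEL, 5);  // R1
    timed(ops, 2, KERNEL, 5);  // R2
    timed(ops, 1, D2H, 5);     // H1
    timed(ops, 2, D2H, 5);     // H2
    return ops;
}

// Scenario 2: start events of R1 and H1 are issued earlier.
std::vector<Op> scenario2() {
    const Op E1 = {1, EVENT, 0};
    const Op E2 = {2, EVENT, 0};
    std::vector<Op> ops = {
        E1, {1, H2D, 5}, E1,     // D1
        E1,                      // start R1, moved up
        E2, {2, H2D, 5}, E2,     // D2
        {1, KERNEL, 5}, E1,      // R1 and its stop event
        E1,                      // start H1, moved up
        E2, {2, KERNEL, 5}, E2,  // R2
        {1, D2H, 5}, E1,         // H1 and its stop event
        E2, {2, D2H, 5}, E2      // H2
    };
    return ops;
}
```

Under these assumptions, makespan(scenario1()) comes out as 30 and makespan(scenario2()) as 20. The whole 10 ms difference comes from the start events of R1 and H1 waiting behind work that belongs to the other stream.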

In your Scenario 3, the ER1 is replaced with an ER3. As this is the first operation in stream 3, it can go in any queue, and (I’m guessing) it goes in either the Kernel or the copy D2H queue. From there it could be launched immediately if it were not for the

cudaStreamWaitEvent(csStream3, ceEvStopCpyDev1, 0);

that synchronizes it with stream 1, so it does not cause false serialization with D2.

Hardware Queues:    copyH2D     Kernel
                    -------     ------
                    ED1     *   ER3
                    D1     /    R3
                    ED1   *     ER3
                    ED2         ...
                    D2    
                    ED2

My comments would be:

- Issue order for CUDA operations is very important when considering concurrency.
- cudaEventRecord, and similar operations, get placed on hardware queues like everything else and can cause false serialization. Exactly how they get placed in hardware queues is not well described, and could be device/driver dependent. So for optimal concurrency, the use of cudaEventRecord and similar operations should be reduced to the minimum necessary.
- If kernels need to be timed for performance studies, that can be done using events, but it will break concurrency. This is fine for development but should be avoided for production code.
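
One way to follow the last two points is to compile the per-op event records out of production builds. The following is a minimal sketch, assuming a hypothetical build switch PROFILE_OPS and a hypothetical helper recordIfProfiling; only cudaEventRecord itself is part of the CUDA runtime API.

```cpp
#include <cuda_runtime.h>

// Hypothetical build switch: compile with -DPROFILE_OPS to time individual ops.
#ifdef PROFILE_OPS
static const bool kProfileOps = true;
#else
static const bool kProfileOps = false;
#endif

// Hypothetical helper: records the event only in profiling builds, so that
// production builds put no extra event records into the hardware queues.
inline cudaError_t recordIfProfiling(cudaEvent_t ev, cudaStream_t stream) {
    return kProfileOps ? cudaEventRecord(ev, stream) : cudaSuccess;
}
```

In the question's code, a per-op call such as cudaEventRecord(ceEvStartKer1, csStream1) would become recordIfProfiling(ceEvStartKer1, csStream1), while the overall ceEvStart/ceEvStop pair stays as it is. Events that Scenario 3 passes to cudaStreamWaitEvent are needed for synchronization and still have to be recorded unconditionally, and cudaEventElapsedTime should only be called on the per-op events in profiling builds.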

However you should note that the upcoming Kepler GK110 (Tesla K20) devices make significant improvements in reducing false serialization by using 32 hardware queues. See the GK110 Whitepaper for details (page 17).

Hope this helps.
