Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8451371
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 10, 20262026-06-10T11:13:16+00:00 2026-06-10T11:13:16+00:00

I have two tasks. Each of them perform copy to device (D), run kernel

  • 0

I have two tasks. Each of them perform copy to device (D), run kernel (R), and copy to host (H) operations. I am overlapping copy to device of task2 (D2) with run kernel of task1 (R1). In addition, I am overlapping run kernel of task2 (R2) with copy to host of task1 (H1).

I also record start and stop time of D, R, H ops of each task using cudaEventRecord.

I have GeForce GT 555M, CUDA 4.1, and Fedora 16.

I have three scenarios:

Scenario1: I use one stream for each task. I place start/stop events right before/after the ops.

Scenario2: I use one stream for each task. I place the start event of the second of the overlapping ops before the start of first one (i.e. place start R1 before start D2, and place start H1 before start R2).

Scenario3: I use two streams for each task. I use cudaStreamWaitEvents to synchronize between these two streams. One stream is used for D and H (copy) ops, the other one is used for R op. I place start/stop events right before/after the ops.

Scenario1 fails to overlap ops (neither D2-R1 nor R2-H1 can be overlapped), whereas Scenario2 and Scenario3 succeed. And my question is: Why Scenerio1 fails while the other ones succeed?

For each scenario I measure the overall time for performing Task1 and Task2. Running both R1 and R2 takes 5 ms each. Since Scenario1 fails to overlap ops, the overall time is 10ms more than Scenario 2 and 3.

Here are the pseudo-code for scenarios:

Scenario1 (FAILS): use stream1 for task1, use stream2 for task2

start overall 

start D1 on stream1 
D1 on stream1
stop D1 on stream1 

start D2 on stream2
D2 on stream2
stop D2 on stream2

start R1 on stream1
R1 on stream1
stop R1 on stream1

start R2 on stream2
R2 on stream2
stop R2 on stream2

start H1 on stream1
H1 on stream1
stop H1 on stream1

start H2 on stream2
H2 on stream2
stop H2 on stream2

stop overall 

Scenario2 (SUCCEEDS): use stream1 for task1, use stream2 for task2, move-up the start event of the second of the overlaping ops.

start overall

start D1 on stream1
D1 on stream1
stop D1 on stream1 

start R1 on stream1 //moved-up

start D2 on stream2
D2 on stream2
stop D2 on stream2

R1 on stream1
stop R1 on stream1

start H1 on stream1 //moved-up

start R2 on stream2
R2 on stream2
stop R2 on stream2

H1 on stream1
stop H1 on stream1

start H2 on stream2
H2 on stream2
stop H2 on stream2

stop overall 

Scenario3 (SUCCEEDS): use stream1 and 3 for task1, use stream2 and 4 for task2

start overall

start D1 on stream1
D1 on stream1
stop D1 on stream1 

start D2 on stream2
D2 on stream2
stop D2 on stream2

start R1 on stream3
R1 on stream3
stop R1 on stream3

start R2 on stream4
R2 on stream4
stop R2 on stream4

start H1 on stream1
H1 on stream1
stop H1 on stream1

start H2 on stream2
H2 on stream2
stop H2 on stream2

stop overall

Here are the overall timing info for all Scenarios:
Scenario1 = 39.390240
Scenario2 = 29.190241
Scenario3 = 29.298208

I also attach the CUDA code below:

#include <stdio.h>
#include <cuda_runtime.h>
#include <sys/time.h>

__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        {
        C[i] = A[i] + B[N-i];
        C[i] = A[i] + B[i] * 2;
        C[i] = A[i] + B[i] * 3;
        C[i] = A[i] + B[i] * 4;
        C[i] = A[i] + B[i];
        }
}

void overlap()
{

float* h_A;
float *d_A, *d_C;
float* h_A2;
float *d_A2, *d_C2;

int N = 10000000;
size_t size = N * sizeof(float); 

cudaMallocHost((void**) &h_A, size);
cudaMallocHost((void**) &h_A2, size);

// Allocate vector in device memory
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_C, size);
cudaMalloc((void**)&d_A2, size);
cudaMalloc((void**)&d_C2, size);

float fTimCpyDev1, fTimKer1, fTimCpyHst1, fTimCpyDev2, fTimKer2, fTimCpyHst2;
float fTimOverall3, fTimOverall1, fTimOverall2;

for (int i = 0; i<N; ++i)
    {
    h_A[i] = 1;
    h_A2[i] = 5;
    }

int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

cudaStream_t csStream1, csStream2, csStream3, csStream4;
cudaStreamCreate(&csStream1);
cudaStreamCreate(&csStream2);
cudaStreamCreate(&csStream3);
cudaStreamCreate(&csStream4);

cudaEvent_t ceEvStart, ceEvStop; 
cudaEventCreate( &ceEvStart );
cudaEventCreate( &ceEvStop );

cudaEvent_t ceEvStartCpyDev1, ceEvStopCpyDev1, ceEvStartKer1, ceEvStopKer1, ceEvStartCpyHst1, ceEvStopCpyHst1;
cudaEventCreate( &ceEvStartCpyDev1 );
cudaEventCreate( &ceEvStopCpyDev1 );
cudaEventCreate( &ceEvStartKer1 );
cudaEventCreate( &ceEvStopKer1 );
cudaEventCreate( &ceEvStartCpyHst1 );
cudaEventCreate( &ceEvStopCpyHst1 );
cudaEvent_t ceEvStartCpyDev2, ceEvStopCpyDev2, ceEvStartKer2, ceEvStopKer2, ceEvStartCpyHst2, ceEvStopCpyHst2; 
cudaEventCreate( &ceEvStartCpyDev2 );
cudaEventCreate( &ceEvStopCpyDev2 );
cudaEventCreate( &ceEvStartKer2 );
cudaEventCreate( &ceEvStopKer2 );
cudaEventCreate( &ceEvStartCpyHst2 );
cudaEventCreate( &ceEvStopCpyHst2 );


//Scenario1

cudaDeviceSynchronize();

cudaEventRecord(ceEvStart, 0);

cudaEventRecord(ceEvStartCpyDev1, csStream1);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
cudaEventRecord(ceEvStopCpyDev1, csStream1);

cudaEventRecord(ceEvStartCpyDev2, csStream2);
cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
cudaEventRecord(ceEvStopCpyDev2, csStream2);

cudaEventRecord(ceEvStartKer1, csStream1); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream1>>>(d_A, d_A, d_C, N);
cudaEventRecord(ceEvStopKer1, csStream1); 

cudaEventRecord(ceEvStartKer2, csStream2); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream2>>>(d_A2, d_A2, d_C2, N);
cudaEventRecord(ceEvStopKer2, csStream2);

cudaEventRecord(ceEvStartCpyHst1, csStream1);
cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
cudaEventRecord(ceEvStopCpyHst1, csStream1);

cudaEventRecord(ceEvStartCpyHst2, csStream2);
cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
cudaEventRecord(ceEvStopCpyHst2, csStream2);

cudaEventRecord(ceEvStop, 0);
cudaDeviceSynchronize();

cudaEventElapsedTime( &fTimOverall1, ceEvStart, ceEvStop);
printf("Scenario1 overall time= %10f\n", fTimOverall1);


//Scenario2 

cudaDeviceSynchronize();

cudaEventRecord(ceEvStart, 0);

cudaEventRecord(ceEvStartCpyDev1, csStream1);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
cudaEventRecord(ceEvStopCpyDev1, csStream1);

cudaEventRecord(ceEvStartKer1, csStream1); //moved up 

cudaEventRecord(ceEvStartCpyDev2, csStream2);
cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
cudaEventRecord(ceEvStopCpyDev2, csStream2);

VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream1>>>(d_A, d_A, d_C, N);
cudaEventRecord(ceEvStopKer1, csStream1); 

cudaEventRecord(ceEvStartCpyHst1, csStream1); //moved up

cudaEventRecord(ceEvStartKer2, csStream2); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream2>>>(d_A2, d_A2, d_C2, N);
cudaEventRecord(ceEvStopKer2, csStream2);

cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
cudaEventRecord(ceEvStopCpyHst1, csStream1);

cudaEventRecord(ceEvStartCpyHst2, csStream2);
cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
cudaEventRecord(ceEvStopCpyHst2, csStream2);

cudaEventRecord(ceEvStop, 0);
cudaDeviceSynchronize();


cudaEventElapsedTime( &fTimOverall2, ceEvStart, ceEvStop);
printf("Scenario2 overall time= %10f\n", fTimOverall2);

//Scenario3
cudaDeviceSynchronize();

cudaEventRecord(ceEvStart, 0);

cudaEventRecord(ceEvStartCpyDev1, csStream1);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, csStream1);
cudaEventRecord(ceEvStopCpyDev1, csStream1);

cudaEventRecord(ceEvStartCpyDev2, csStream2);
cudaMemcpyAsync(d_A2, h_A2, size, cudaMemcpyHostToDevice, csStream2);
cudaEventRecord(ceEvStopCpyDev2, csStream2);

cudaStreamWaitEvent(csStream3, ceEvStopCpyDev1, 0);
cudaEventRecord(ceEvStartKer1, csStream3); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream3>>>(d_A, d_A, d_C, N);
cudaEventRecord(ceEvStopKer1, csStream3);

cudaStreamWaitEvent(csStream4, ceEvStopCpyDev2, 0);
cudaEventRecord(ceEvStartKer2, csStream4); 
VecAdd<<<blocksPerGrid, threadsPerBlock, 0, csStream4>>>(d_A2, d_A2, d_C2, N);
cudaEventRecord(ceEvStopKer2, csStream4);

cudaStreamWaitEvent(csStream1, ceEvStopKer1, 0);
cudaEventRecord(ceEvStartCpyHst1, csStream1);
cudaMemcpyAsync(h_A, d_C, size, cudaMemcpyDeviceToHost, csStream1);
cudaEventRecord(ceEvStopCpyHst1, csStream1);

cudaStreamWaitEvent(csStream2, ceEvStopKer2, 0);
cudaEventRecord(ceEvStartCpyHst2, csStream2);
cudaMemcpyAsync(h_A2, d_C2, size, cudaMemcpyDeviceToHost, csStream2);
cudaEventRecord(ceEvStopCpyHst2, csStream2);

cudaEventRecord(ceEvStop, 0);
cudaDeviceSynchronize();

cudaEventElapsedTime( &fTimOverall3, ceEvStart, ceEvStop);
printf("Scenario3 overall time = %10f\n", fTimOverall3);

cudaStreamDestroy(csStream1);
cudaStreamDestroy(csStream2);
cudaStreamDestroy(csStream3);
cudaStreamDestroy(csStream4);

cudaFree(d_A);
cudaFree(d_C);
cudaFreeHost(h_A);
cudaFree(d_A2);
cudaFree(d_C2);
cudaFreeHost(h_A2);

}

int main()
{

  overlap();
}

Thank you very much for your time in advance!

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-10T11:13:18+00:00Added an answer on June 10, 2026 at 11:13 am

    (Note, I’m more familiar with the Tesla series devices, and don’t actually have a GT 555M to experiment with, so my results refer specifically to a C2070. I don’t know how many copy engines the 555m has, but I expect the issues described below are what’s causing the behavior you are seeing.)

    The issue is the lesser-known fact that the cudaEventRecords are CUDA operations too, and they also must be placed in one of the hardware queues before getting launched/executed. (A complicating factor is that, since cudaEventRecord is neither a copy operation, nor a compute kernel, it can actually go in any hardware queue. My understanding is that they usually go in the same hardware queue as the preceding CUDA operation of the same stream, but as this is not specified in the docs the actual operation may be device/driver dependent.)

    If I can extend your notation to use ‘E’ for ‘Event record’, and detail how the hardware queues are filled (similar to what is done in the “CUDA C/C++ Streams and Concurrency” webinar) then, in your Scenario 1 example, you have:

    Issue order for CUDA operations:
       ED1
       D1
       ED1
       ED2
       D2
       ED2
       ER1
       R1
       ER1
       ...
    

    These fill the queues like:

    Hardware Queues:    copyH2D     Kernel
                        -------     ------
                        ED1       * R1
                        D1       /  ER1
                        ED1     /   ...
                        ED2    /
                        D2    /
                        ED2  /
                        ER1 *
    

    and you can see that R1, by virtue of being in stream 1, will not execute until ER1 has completed, which won’t happen until both D1 and D2 have completed since they are all serialized in the H2D copy queue.

    By moving the cudaEventRecord, ER1, up in Scenario 2, you avoid this since all CUDA operations in stream 1, prior to R1, complete before D2. This permits R1 to launch concurrently to D2.

    Hardware Queues:    copyH2D     Kernel
                        -------     ------
                        ED1      *  R1
                        D1      /   ER1
                        ED1    /    ...
                        ER1   *
                        ED2    
                        D2    
                        ED2  
    

    In your Scenario 3, the ER1 is replaced with an ER3. As this is the first operation in stream 3, it can go anywhere, and (guessing) goes in either the Kernel or copy D2H queue from which it could get launched immediately, (if you didn’t have the

    cudaStreamWaitEvent(csStream3, ceEvStopCpyDev1, 0);
    

    for synchronization with stream 1) so it does not cause false serialization with D2.

    Hardware Queues:    copyH2D     Kernel
                        -------     ------
                        ED1     *   ER3
                        D1     /    R3
                        ED1   *     ER3
                        ED2         ...
                        D2    
                        ED2 
    

    My comments would be

    1. Issue order for CUDA operations is very important when considering concurrency
    2. cudaEventRecord, and similar operations, get placed on hardware queues like everything else and can cause false serialization. Exactly how they get placed in hardware queues is not well described, and could be device/driver dependent. So for optimal concurrency, the use of cudaEventRecord and similar operations should be reduced to the minimum necessary.
    3. If kernels need to be timed for performance studies, that can be done using events but it will break concurrency. This is fine for development but should be avoided for production code.

    However you should note that the upcoming Kepler GK110 (Tesla K20) devices make significant improvements in reducing false serialization by using 32 hardware queues. See the GK110 Whitepaper for details (page 17).

    Hope this helps.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Imagine I have two (three, four, whatever) tasks that have to run in parallel.
I have two asynctask working with each other. I'm using them for creating Restaurant
We have two tables - Tasks and TasksUsers (users assigned to task). Task has
I have two tables: projects and tasks. A project consists of tasks. I want
I have two queries in an AggregateModel, and I'd like them to be passed
Suppose that I have users, tasks and each task belongs to a batch. Each
I have two tables containing Tasks and Notes, and want to retrieve a list
I have 4 different queries and each of them return individual unique set of
I have two arrays in JSON something like: proj1 proj1 taska taska charge1 charge1
I have an application in which I am starting a timer for two task:

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.