Kernel launches in CUDA are generally asynchronous, which (as I understand) means that once

Question

Asked: May 26, 2026 (2026-05-26T21:07:26+00:00)

Kernel launches in CUDA are generally asynchronous, which (as I understand it) means that once a CUDA kernel is launched, control returns immediately to the CPU. The CPU can continue doing useful work while the GPU is busy number crunching, unless the CPU is forced to wait by a call to cudaThreadSynchronize() or cudaMemcpy().

Now I have just started using the Thrust library for CUDA. Are the function calls in Thrust
synchronous or asynchronous?

In other words, if I invoke thrust::sort(D.begin(),D.end()); where D is a device vector, does it make sense to measure the sorting time using

        start = clock();  // Start

        thrust::sort(D.begin(), D.end());

        diff = (clock() - start) / (double)CLOCKS_PER_SEC;
        std::cout << "\nDevice Time taken is: " << diff << std::endl;

If the function call is asynchronous then diff will be 0 seconds for any vector (which is junk for timings), but if it is synchronous I will indeed get the real time performance.

1 Answer

Editorial Team · Answer 1 · 2026-05-26T21:07:26+00:00

Thrust calls that invoke kernels are asynchronous, just like the underlying CUDA APIs Thrust uses. Thrust calls that copy data are synchronous, again like the underlying CUDA APIs.

So your example would only measure the kernel launch and Thrust host-side setup overhead, not the sort itself. For timing, you can get around this by calling either cudaThreadSynchronize or cudaDeviceSynchronize after the Thrust call (the latter is available in CUDA 4.0 and later, where it supersedes the deprecated cudaThreadSynchronize). Alternatively, if you follow the Thrust call with a copy operation and record the stop time after that, your timing will include setup, execution, and copy time.

In your example this would look something like

    start = clock();  // Start

    thrust::sort(D.begin(), D.end());
    cudaThreadSynchronize();  // block until kernel is finished

    diff = (clock() - start) / (double)CLOCKS_PER_SEC;
    std::cout << "\nDevice Time taken is: " << diff << std::endl;
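For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It is not from the original exchange: the vector size `n`, the random input, and the use of `std::chrono` for wall-clock time are all assumptions made for illustration, and error checking on the CUDA calls is omitted for brevity. It also shows CUDA events as a second way to time the sort, which the answer above does not mention; that part assumes the sort runs on the default stream.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>

int main() {
    const std::size_t n = 1 << 22;  // hypothetical problem size

    // Fill a host vector with pseudo-random keys and upload two copies,
    // one for each timing method.
    thrust::host_vector<int> h(n);
    for (std::size_t i = 0; i < n; ++i) h[i] = std::rand();
    thrust::device_vector<int> D = h;
    thrust::device_vector<int> D2 = h;
    cudaDeviceSynchronize();  // make sure setup work is finished

    // Method 1: host wall clock plus an explicit synchronization.
    auto t0 = std::chrono::steady_clock::now();
    thrust::sort(D.begin(), D.end());
    cudaDeviceSynchronize();  // block until all queued GPU work is done
    auto t1 = std::chrono::steady_clock::now();
    double wall_ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();

    // Method 2: CUDA events recorded around the call (default stream assumed).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    thrust::sort(D2.begin(), D2.end());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has been reached
    float event_ms = 0.0f;
    cudaEventElapsedTime(&event_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // Copying back to the host is itself a blocking operation.
    thrust::host_vector<int> out = D;
    bool sorted = std::is_sorted(out.begin(), out.end());

    std::cout << "Wall-clock time (ms): " << wall_ms << "\n"
              << "CUDA event time (ms): " << event_ms << "\n"
              << "Sorted: " << (sorted ? "yes" : "no") << std::endl;
    return sorted ? 0 : 1;
}
```

The wall-clock figure includes host-side setup overhead as well as the sort, while the event figure covers only the GPU work recorded between the two events, so the two numbers need not match. The first timed call in a process can also include one-time initialization cost, so a warm-up run before timing is worth considering.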
