When timing CUDA kernels, the following doesn’t work because the kernel doesn’t block the

Question

0

Asked: June 15, 20262026-06-15T13:05:03+00:00 2026-06-15T13:05:03+00:00

When timing CUDA kernels, the following doesn’t work because the kernel doesn’t block the

0

When timing CUDA kernels, the following doesn’t work because the kernel doesn’t block the CPU program execution while it executes:

start timer
kernel<<<g,b>>>();
end timer

I’ve seen three basic ways of (successfully) timing CUDA kernels:

(1) Two CUDA eventRecords.

float responseTime; //result will be in milliseconds
cudaEvent_t start; cudaEventCreate(&start); cudaEventRecord(start); cudaEventSynchronize(start);
cudaEvent_t stop;  cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time

(2) One CUDA eventRecord.

float start = read_timer(); //helper function on CPU, in milliseconds
cudaEvent_t stop;  cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
float responseTime = read_timer() - start;

(3) deviceSynchronize instead of eventRecord. (Probably only useful when using programming in a single stream.)

float start = read_timer(); //helper function on CPU, in milliseconds
kernel<<<g,b>>>();
cudaDeviceSynchronize();
float responseTime = read_timer() - start;

I experimentally verified that these three strategies produce the same timing result.

Questions:

What are the tradeoffs of these strategies? Any hidden details here?
Aside from timing many kernels in multiple streams, is there any advantages of using two event records and the cudaEventElapsedTime() function?

You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can’t hurt to provide an example implementation:

double read_timer(){
    struct timeval start;
    gettimeofday( &start, NULL ); //you need to include <sys/time.h>
    return (double)((start.tv_sec) + 1.0e-6 * (start.tv_usec))*1000; //milliseconds
}

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-15T13:05:04+00:00

You seem to have ruled out most of the differences by saying they all produce the same result for the relatively simple case you have shown (probably not exactly true, but I understand what you mean), and “Aside from timing (complex sequences) …” where the first case is clearly better.

One possible difference would be portability between windows and linux. I believe your example read_timer function is linux-oriented. You could probably craft a read_timer function that is “portable” but the cuda event system (method 1) is portable as-is.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

When timing CUDA kernels, the following doesn’t work because the kernel doesn’t block the

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply