Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8948785
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 15, 20262026-06-15T13:05:03+00:00 2026-06-15T13:05:03+00:00

When timing CUDA kernels, the following doesn’t work because the kernel doesn’t block the

  • 0

When timing CUDA kernels, the following doesn’t work because the kernel doesn’t block the CPU program execution while it executes:

start timer
kernel<<<g,b>>>();
end timer

I’ve seen three basic ways of (successfully) timing CUDA kernels:

(1) Two CUDA eventRecords.

float responseTime; //result will be in milliseconds
cudaEvent_t start; cudaEventCreate(&start); cudaEventRecord(start); cudaEventSynchronize(start);
cudaEvent_t stop;  cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time

(2) One CUDA eventRecord.

float start = read_timer(); //helper function on CPU, in milliseconds
cudaEvent_t stop;  cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
float responseTime = read_timer() - start;

(3) deviceSynchronize instead of eventRecord. (Probably only useful when using programming in a single stream.)

float start = read_timer(); //helper function on CPU, in milliseconds
kernel<<<g,b>>>();
cudaDeviceSynchronize();
float responseTime = read_timer() - start;

I experimentally verified that these three strategies produce the same timing result.


Questions:

  • What are the tradeoffs of these strategies? Any hidden details here?
  • Aside from timing many kernels in multiple streams, is there any advantages of using two event records and the cudaEventElapsedTime() function?

You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can’t hurt to provide an example implementation:

double read_timer(){
    struct timeval start;
    gettimeofday( &start, NULL ); //you need to include <sys/time.h>
    return (double)((start.tv_sec) + 1.0e-6 * (start.tv_usec))*1000; //milliseconds
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-15T13:05:04+00:00Added an answer on June 15, 2026 at 1:05 pm

    You seem to have ruled out most of the differences by saying they all produce the same result for the relatively simple case you have shown (probably not exactly true, but I understand what you mean), and “Aside from timing (complex sequences) …” where the first case is clearly better.

    One possible difference would be portability between windows and linux. I believe your example read_timer function is linux-oriented. You could probably craft a read_timer function that is “portable” but the cuda event system (method 1) is portable as-is.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I am timing how long it takes my CUDA program to calculate matrices of
I need to time a CUDA kernel execution. The Best Practices Guide says that
I have the following bit of code that is timing out when I pass
Kernel launches in CUDA are generally asynchronous, which (as I understand) means that once
I have the following code written to make my lazy second CPU core working.
We have a C# program that performs some timing measurements. Unfortunately, the first time
After running my webapp for a while (timing varies between hours and days depending
I'm running some timing code on various OS. I notice the following patterns with
The following query is timing out after 600 seconds. update placed p ,Results r
Consider the following code parts: this is Timing.h: class Timing { public: //creates a

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.