Editorial Team
Asked: May 24, 2026

I am using the following two functions to time different parts (cudaMemcpyHtoD, kernel execution, cudaMemcpyDtoH) of my code, which involves multiple GPUs, concurrent kernels on the same GPU, sequential execution of kernels, and so on. As I understand it, these functions record the time elapsed between the two events, but I suspect that inserting events throughout the lifetime of the code may introduce overhead and inaccuracies. I would like to hear criticism, general advice on improving these functions, and any caveats regarding them.

// Create the start/stop events and record the start event
cudaEvent_t *start_event(int device, cudaEvent_t *events, cudaStream_t streamid = 0)
{
    cutilSafeCall( cudaSetDevice(device) );
    cutilSafeCall( cudaEventCreate(&events[0]) );
    cutilSafeCall( cudaEventCreate(&events[1]) );
    cutilSafeCall( cudaEventRecord(events[0], streamid) );

    return events;
}

// Record the stop event, return the elapsed time in milliseconds, and destroy both events
float end_event(int device, cudaEvent_t *events, cudaStream_t streamid = 0)
{
    float elapsed = 0.0f;
    cutilSafeCall( cudaSetDevice(device) );
    cutilSafeCall( cudaEventRecord(events[1], streamid) );
    cutilSafeCall( cudaEventSynchronize(events[1]) );
    cutilSafeCall( cudaEventElapsedTime(&elapsed, events[0], events[1]) );

    cutilSafeCall( cudaEventDestroy( events[0] ) );
    cutilSafeCall( cudaEventDestroy( events[1] ) );

    return elapsed;
}

Usage:

cudaEvent_t *events;
cudaEvent_t event[2]; //0 for start and 1 for end
...
events = start_event( cuda_device, event, 0 );
<Code to time>
printf("Time taken for the above code... - %f secs\n\n", (end_event(cuda_device, events, 0) / 1000) );
1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 11:02 am

    First, if this is for production code, you may want to be able to do other work between the second cudaEventRecord() and cudaEventSynchronize(). Otherwise the host thread simply blocks until the stop event completes, which could reduce the ability of your app to overlap GPU and CPU work.
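
As a rough sketch of that first point: the stop event can be recorded and then polled with cudaEventQuery() while the host keeps working, instead of blocking immediately. The kernel `busy_kernel` and the functions `do_cpu_work` and `time_async_region` are hypothetical names invented for this illustration, and most error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the GPU work being timed (hypothetical).
__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical stand-in for useful host-side work.
static void do_cpu_work(long *counter) { ++(*counter); }

// Times busy_kernel on `stream`. Returns the elapsed time in milliseconds,
// or a negative value if something went wrong.
float time_async_region(float *d_data, int n, cudaStream_t stream, long *cpu_iterations)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) { cudaEventDestroy(start); return -1.0f; }

    cudaEventRecord(start, stream);
    busy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaEventRecord(stop, stream);

    // Rather than blocking in cudaEventSynchronize() right away, keep the CPU
    // busy until the stop event has actually been reached on the GPU.
    cudaError_t status;
    while ((status = cudaEventQuery(stop)) == cudaErrorNotReady) {
        do_cpu_work(cpu_iterations);
    }

    float ms = -1.0f;
    if (status == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Whether polling or a later cudaEventSynchronize() is the better fit depends on how much independent host work is available; if there is none, blocking is simpler.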

    Next, I would separate event creation and destruction from event recording. I’m not sure of the cost, but in general you might not want to call cudaEventCreate and cudaEventDestroy often.

    What I would do is create a class like this:

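    #include <cassert>
    #include <cuda_runtime.h>

    // Times GPU work between start() and stop() with a pair of CUDA events.
    // The events are created once and reused for every measurement.
    class EventTimer {
    public:
      EventTimer() : mStarted(false), mStopped(false) {
        cudaEventCreate(&mStart);
        cudaEventCreate(&mStop);
      }
      ~EventTimer() {
        cudaEventDestroy(mStart);
        cudaEventDestroy(mStop);
      }
      // Non-copyable: a copy would destroy the same events twice.
      EventTimer(const EventTimer&) = delete;
      EventTimer& operator=(const EventTimer&) = delete;

      void start(cudaStream_t s = 0) { cudaEventRecord(mStart, s);
                                       mStarted = true; mStopped = false; }
      void stop(cudaStream_t s = 0)  { assert(mStarted);
                                       cudaEventRecord(mStop, s);
                                       mStarted = false; mStopped = true; }
      // Blocks until the stop event completes; returns milliseconds.
      float elapsed() {
        assert(mStopped);
        if (!mStopped) return 0;  // still guards release builds, where assert is compiled out
        cudaEventSynchronize(mStop);
        float elapsed = 0;
        cudaEventElapsedTime(&elapsed, mStart, mStop);
        return elapsed;
      }

    private:
      bool mStarted, mStopped;
      cudaEvent_t mStart, mStop;
    };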
    

    Note that I didn’t include cudaSetDevice(). It seems to me that this should be left to the code that uses the class, which makes the class more flexible. The user has to ensure that the same device is active when the timer is created and when start and stop are called.
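
For the multi-GPU case, one way to keep the device handling in the calling code is to hold one event pair per device and select the device before touching it. The sketch below uses a small `DeviceTimer` struct of its own so that it compiles standalone. `DeviceTimer`, `make_timers`, `time_memset_per_device` and `destroy_timers` are hypothetical names, cudaMemsetAsync() merely stands in for the work being timed, and error checking is omitted for brevity.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One start/stop event pair, tied to the device that was current at creation.
struct DeviceTimer {
    int device;
    cudaEvent_t start, stop;
};

// Create one timer per device; the caller decides which devices take part.
std::vector<DeviceTimer> make_timers(int device_count)
{
    std::vector<DeviceTimer> timers(device_count);
    for (int d = 0; d < device_count; ++d) {
        cudaSetDevice(d);               // events belong to the current device
        timers[d].device = d;
        cudaEventCreate(&timers[d].start);
        cudaEventCreate(&timers[d].stop);
    }
    return timers;
}

// Issue a timed cudaMemsetAsync on every device, then collect the times (ms).
std::vector<float> time_memset_per_device(std::vector<DeviceTimer> &timers, std::size_t bytes)
{
    std::vector<void *> buffers(timers.size(), nullptr);

    // Queue the work on all devices first so they can run concurrently.
    for (std::size_t i = 0; i < timers.size(); ++i) {
        cudaSetDevice(timers[i].device);
        cudaMalloc(&buffers[i], bytes);
        cudaEventRecord(timers[i].start, 0);
        cudaMemsetAsync(buffers[i], 0, bytes, 0);
        cudaEventRecord(timers[i].stop, 0);
    }

    // Only now wait for each device and read its elapsed time.
    std::vector<float> ms(timers.size(), 0.0f);
    for (std::size_t i = 0; i < timers.size(); ++i) {
        cudaSetDevice(timers[i].device);
        cudaEventSynchronize(timers[i].stop);
        cudaEventElapsedTime(&ms[i], timers[i].start, timers[i].stop);
        cudaFree(buffers[i]);
    }
    return ms;
}

void destroy_timers(std::vector<DeviceTimer> &timers)
{
    for (DeviceTimer &t : timers) {
        cudaSetDevice(t.device);
        cudaEventDestroy(t.start);
        cudaEventDestroy(t.stop);
    }
    timers.clear();
}
```

Splitting the loop into a "record" pass and a "collect" pass is a design choice: synchronizing inside the first loop would serialize the devices and hide any overlap between them.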

    PS: It is not NVIDIA’s intent for CUTIL to be relied upon for production code — it is used simply for convenience in our examples and is not as rigorously tested or optimized as the CUDA libraries and compilers themselves. I recommend you extract things like cutilSafeCall() into your own libraries and headers.
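
A home-grown replacement for cutilSafeCall() can be very small. The following is only a sketch: the helper `cudaCheck` and the macro `CUDA_CHECK` are hypothetical names chosen for this example, and aborting on the first error is an assumption that may not suit every application.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Reports a failed CUDA runtime call on stderr.
// Returns true on success and false on any error.
inline bool cudaCheck(cudaError_t err, const char *expr, const char *file, int line)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s:%d: %s failed: %s\n",
                 file, line, expr, cudaGetErrorString(err));
    return false;
}

// Wraps a CUDA runtime call and terminates the program if it fails.
#define CUDA_CHECK(expr)                                          \
    do {                                                          \
        if (!cudaCheck((expr), #expr, __FILE__, __LINE__))        \
            std::exit(EXIT_FAILURE);                              \
    } while (0)
```

It would be used the same way as the CUTIL macro, for example `CUDA_CHECK( cudaEventCreate(&event) );`. Keeping the reporting in a plain function makes it possible to handle errors without exiting where that is preferable.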
