The Archive Base


Editorial Team
Asked: June 6, 2026 (2026-06-06T11:21:25+00:00)

I have a CUDA kernel that calls out to a series of device functions.


What is the best way to get the execution time for each of the device functions?

What is the best way to get the execution time for a section of code in one of the device functions?


1 Answer

  1. Editorial Team
    Added an answer on June 6, 2026 at 11:21 am

    In my own code, I use the clock() function to get precise timings. For convenience, I wrap it in the following macros:

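    /* One accumulator slot per timed region; add new entries before tid_count. */
    enum {
        tid_this = 0,
        tid_that,
        tid_count
        };
    /* Accumulated clock ticks per region, summed over all blocks. */
    __device__ float cuda_timers[ tid_count ];
    #ifdef USETIMERS
     /* Thread 0 of each block records the start tick. */
     #define TIMER_TIC clock_t tic = 0; if ( threadIdx.x == 0 ) tic = clock();
     /* Thread 0 adds the elapsed ticks to cuda_timers[tid], allowing for counter wrap-around. */
     #define TIMER_TOC(tid) { clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , (float)( ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) ) ); }
    #else
     #define TIMER_TIC
     #define TIMER_TOC(tid)
    #endif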
    

    These can then be used to instrument the device code as follows:

    __global__ void mykernel ( ... ) {
    
        /* Start the timer. */
        TIMER_TIC
    
        /* Do stuff. */
        ...
    
        /* Stop the timer and add the elapsed ticks to the "tid_this" counter. */
        TIMER_TOC( tid_this );
    
        }
    

    You can then read cuda_timers in the host code, for example by copying it back with cudaMemcpyFromSymbol.

    A few notes:

    • The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
    • Having said that, the timer assumes that the zeroth thread is active, so make sure you do not call these macros in a possibly divergent part of the code.
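    • The timers count the number of clock ticks. To get the number of milliseconds, divide this by the clock rate of your device in Hz and multiply by 1000 (equivalently, divide by the clock rate in kHz, which is the unit cudaDeviceProp::clockRate reports).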
    • The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
    • Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
    • The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around between the two readings.
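
The two arithmetic notes above (wrap-around and tick-to-time conversion) can be sketched as host-side helpers. This is only an illustration, not part of the original macros: the names elapsed_ticks and ticks_to_ms are hypothetical, the counter is assumed to be 32 bits wide (as the 0xffffffff constant suggests), and the clock rate is assumed to be given in kHz.

```cpp
#include <cstdint>

// Hypothetical helper: elapsed ticks between two readings of a counter
// that is ASSUMED to be 32 bits wide. Unsigned subtraction is performed
// modulo 2^32, so a single wrap-around between tic and toc is handled
// without an explicit branch.
inline std::uint32_t elapsed_ticks(std::uint32_t tic, std::uint32_t toc) {
    return toc - tic;
}

// Hypothetical helper: convert an accumulated tick count to milliseconds.
// ASSUMES clock_rate_khz is the device clock rate in kHz, i.e. ticks per
// millisecond, so a plain division yields milliseconds.
inline double ticks_to_ms(double ticks, double clock_rate_khz) {
    return ticks / clock_rate_khz;
}
```

For instance, with an assumed clock rate of 1,500,000 kHz (1.5 GHz), an accumulated count of 1.5e9 ticks corresponds to 1000 ms. The modulo-2^32 subtraction differs from the ternary expression in the macro by a single tick in the wrapped case, which is negligible for timing purposes.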

    P.S. This is a copy of my reply to this question, which didn’t get many points there since the timing required was for the whole kernel.

© 2021 The Archive Base. All Rights Reserved