The Archive Base

Editorial Team
Asked: May 27, 2026

I am timing how long it takes my CUDA program to calculate matrices of

I am timing how long it takes my CUDA program to calculate matrices of a certain size. For example, 10×10, 100×100, 500×500, 1000×1000.

However, the results are not at all what I was expecting. As the matrices get larger, the computation time decreases instead of increasing.

For example, here is the average time (from 1000 runs):
10×10: 0.032768s
100×100: 0.068960s
500×500: 0.006336s
1000×1000: 0.018400s

The time goes down, then up again at 1000. What is going on? Shouldn’t the numbers peak off at a certain point? Why is it going in a roller coaster like this?

Here is how the actual timing code is being run:

int blocksNeeded=0;
cudaError_t cudaStatus;
blocksNeeded=(size/MAXTHREADS)+1;
int threadsPerBlock = MAXTHREADS/blocksNeeded+1;
cudaEvent_t start, stop;
float elapsedtime;
.
.
.
.
.
cudaEventCreate(&start);
cudaEventCreate(&stop); 
cudaEventRecord(start, 0);
addKernel<<<blocksNeeded, size>>>(dev_c, dev_a, dev_b,size);
cudaStatus = cudaDeviceSynchronize();
cudaEventRecord(stop, 0); 
cudaEventSynchronize(stop); 
cudaEventElapsedTime(&elapsedtime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

where MAXTHREADS is 1024 and size is the number of elements in the matrix, i.e. a 10×10 matrix has 100 elements, so size is 100.

Updated with kernel:

__global__ void addKernel(float *c, float *a, float *b,int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if(idx < size) 
        c[idx] = a[idx] + b[idx];

}
1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 2:10 pm

    I ran a test on a recent GPU cluster equipped with NVIDIA Tesla M2090 cards. Basically, I'm performing a vector addition with different sizes. The results are:

    Size     Kernel time (msec)
    ===========================
    2        0.04
    4        0.010912
    8        0.012128
    16       0.012256
    32       0.011296
    64       0.01248
    128      0.012192
    256      0.012576
    512      0.012416
    1024     0.012736
    2048     0.01232
    4096     0.011968
    8192     0.011264
    16384    0.007296
    32768    0.007776
    65536    0.009728
    131072   0.018304
    262144   0.031392
    524288   0.055168
    1048576  0.10352
    

    As you can see, there is a knee at a vector size of 16384, which basically resembles your observations. This is not an error but normal behavior: the GPU has to be sufficiently utilized before its performance shows. On the Tesla M2090, that point of utilization is reached at around 16384 parallel additions.

    The way you are measuring kernel performance is perfectly ok. I assume you’ve taken this from the “Best Practices Guide” for CUDA.
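
One thing that may be worth checking alongside the timing (this goes beyond the measurements above): the launch in the question passes `size` as the number of threads per block. CUDA devices of this generation accept at most 1024 threads per block, so for any matrix with more than 1024 elements that launch would be rejected and the events would time a kernel that never ran, which could contribute to the irregular numbers. Below is a minimal sketch of how to check the launch status. `tryLaunch` is a hypothetical helper, not part of the question's code, and the buffers are only zero-filled placeholders.

```cuda
#include <cuda_runtime.h>

// Same element-wise addition kernel as in the question.
__global__ void addKernel(float *c, const float *a, const float *b, int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < size)
        c[idx] = a[idx] + b[idx];
}

// Hypothetical helper: launch addKernel with the given configuration and
// return the resulting CUDA status instead of ignoring it.
cudaError_t tryLaunch(int blocks, int threadsPerBlock, int size)
{
    float *a = nullptr, *b = nullptr, *c = nullptr;
    size_t bytes = static_cast<size_t>(size) * sizeof(float);

    cudaError_t err = cudaMalloc(&a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&b, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&c, bytes);
    if (err == cudaSuccess) err = cudaMemset(a, 0, bytes);
    if (err == cudaSuccess) err = cudaMemset(b, 0, bytes);

    if (err == cudaSuccess) {
        addKernel<<<blocks, threadsPerBlock>>>(c, a, b, size);
        err = cudaGetLastError();           // invalid launch configurations are reported here
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();  // errors during execution are reported here
    }

    // cudaFree accepts null pointers, so this is safe on every path.
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return err;
}
```

With the question's configuration for a 500×500 matrix, `tryLaunch(245, 250000, 250000)` should return an error, while `tryLaunch((250000 + 255) / 256, 256, 250000)` uses an assumed block size of 256 with a rounded-up block count and should succeed. `cudaGetErrorString` turns the returned status into a readable message.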

    Note: The data shown comes from a single kernel run per size, i.e. it is not representative. For exact time measurements, the kernel should be run multiple times on the same problem, and the kernel time reported as the mean of those runs.
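
The averaging described in the note can be sketched as follows. This is only an illustration under stated assumptions: `meanKernelMs` is a hypothetical helper, the block size of 256 and the untimed warm-up launch are my own choices, and the input buffers are zero-filled placeholders. `cudaEventElapsedTime` reports milliseconds.

```cuda
#include <cuda_runtime.h>

// Same element-wise addition kernel as in the question.
__global__ void addKernel(float *c, const float *a, const float *b, int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < size)
        c[idx] = a[idx] + b[idx];
}

// Hypothetical helper: mean kernel time in milliseconds over `runs` launches.
// Returns a negative value if an allocation or a launch fails.
float meanKernelMs(int size, int runs)
{
    if (size <= 0 || runs <= 0)
        return -1.0f;

    const int threadsPerBlock = 256;  // assumed block size, must not exceed 1024
    const int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;  // round up

    float *a = nullptr, *b = nullptr, *c = nullptr;
    size_t bytes = static_cast<size_t>(size) * sizeof(float);
    if (cudaMalloc(&a, bytes) != cudaSuccess ||
        cudaMalloc(&b, bytes) != cudaSuccess ||
        cudaMalloc(&c, bytes) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(c);
        return -1.0f;
    }
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Untimed warm-up launch (an assumption of this sketch) so that one-time
    // startup costs do not end up in the first measurement.
    addKernel<<<blocks, threadsPerBlock>>>(c, a, b, size);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    float totalMs = 0.0f;
    for (int i = 0; i < runs && err == cudaSuccess; ++i) {
        cudaEventRecord(start, 0);
        addKernel<<<blocks, threadsPerBlock>>>(c, a, b, size);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  // wait until the kernel and the stop event are done
        err = cudaGetLastError();
        if (err == cudaSuccess) {
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            totalMs += ms;
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);

    return (err == cudaSuccess) ? totalMs / runs : -1.0f;
}
```

Calling, for example, `meanKernelMs(100 * 100, 1000)` for each matrix size gives one averaged value per size, which is less sensitive to run-to-run noise than a single launch.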
