I am timing how long it takes my CUDA program to calculate matrices of

Asked: May 27, 2026

I am timing how long it takes my CUDA program to calculate matrices of a certain size, for example 10×10, 100×100, 500×500, and 1000×1000.

However, the results are not at all what I expected. Instead of growing steadily with the size of the matrices, the computation time drops for the larger sizes.

For example, here is the average time (from 1000 runs):
10×10: 0.032768s
100×100: 0.068960s
500×500: 0.006336s
1000×1000: 0.018400s

The time goes down at 500×500, then up again at 1000×1000. What is going on? Shouldn’t the numbers level off at a certain point? Why do they go up and down like a roller coaster?

Here is how the actual timing code is being run:

int blocksNeeded=0;
cudaError_t cudaStatus;
blocksNeeded=(size/MAXTHREADS)+1;
int threadsPerBlock = MAXTHREADS/blocksNeeded+1;
cudaEvent_t start, stop;
float elapsedtime;
.
.
.
.
.
cudaEventCreate(&start);
cudaEventCreate(&stop); 
cudaEventRecord(start, 0);
addKernel<<<blocksNeeded, size>>>(dev_c, dev_a, dev_b,size);
cudaStatus = cudaDeviceSynchronize();
cudaEventRecord(stop, 0); 
cudaEventSynchronize(stop); 
cudaEventElapsedTime(&elapsedtime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

where MAXTHREADS is 1024 and size is the number of elements in the matrix. For example, a 10×10 matrix has 100 elements, so size is 100.

Updated with kernel:

__global__ void addKernel(float *c, float *a, float *b,int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if(idx < size) 
        c[idx] = a[idx] + b[idx];

}

1 Answer

Editorial Team · Answer 1 · 2026-05-27T14:10:21+00:00

I ran a test on a recent GPU cluster equipped with NVIDIA Tesla M2090 GPUs, performing a vector addition with different sizes. The results are:

Size     Kernel time (msec)
===========================
2        0.04
4        0.010912
8        0.012128
16       0.012256
32       0.011296
64       0.01248
128      0.012192
256      0.012576
512      0.012416
1024     0.012736
2048     0.01232
4096     0.011968
8192     0.011264
16384    0.007296
32768    0.007776
65536    0.009728
131072   0.018304
262144   0.031392
524288   0.055168
1048576  0.10352

You can see a knee at a vector size of 16384, which resembles your observations. This is not an error but normal behavior: the GPU has to be sufficiently utilized before its performance shows. From 4 up to 8192 elements the kernel time stays at roughly 0.011 to 0.013 ms regardless of the size, and from 131072 elements upward it grows steadily with the size. On the Tesla M2090, the point of utilization is reached at around 16384 parallel additions.

The way you are measuring kernel performance is perfectly OK. I assume you took it from the CUDA “Best Practices Guide”.
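As a minimal sketch of this event-based timing pattern (not the asker's exact program), the helper below times a single launch of the question's addKernel. The names blocksFor and timeAddKernelMs and the idea of passing the block size as a parameter are assumptions for illustration. The sketch rounds the block count up and checks cudaGetLastError after the launch. That check matters because a launch whose threads-per-block value exceeds the device limit (1024 here, matching MAXTHREADS in the question) fails without running the kernel, and such a failed launch would also show up as a very short measured time. Note that the snippet in the question passes size, not a per-block thread count, as the second launch parameter, so this is worth ruling out.

```cuda
#include <cuda_runtime.h>

// Element-wise addition, as in the question.
__global__ void addKernel(float *c, const float *a, const float *b, int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < size)
        c[idx] = a[idx] + b[idx];
}

// Ceiling division: number of blocks needed to cover `size` elements.
int blocksFor(int size, int threadsPerBlock)
{
    return (size + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: times one launch and returns milliseconds,
// or a negative value if the launch or any CUDA call fails.
float timeAddKernelMs(float *dev_c, const float *dev_a, const float *dev_b,
                      int size, int threadsPerBlock)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);
    addKernel<<<blocksFor(size, threadsPerBlock), threadsPerBlock>>>(
        dev_c, dev_a, dev_b, size);
    // An invalid launch configuration is reported here; the kernel never runs.
    cudaError_t launchStatus = cudaGetLastError();
    cudaEventRecord(stop, 0);
    // Block the host until the stop event (and the kernel before it) is done.
    cudaError_t syncStatus = cudaEventSynchronize(stop);

    float elapsedMs = -1.0f;
    if (launchStatus == cudaSuccess && syncStatus == cudaSuccess) {
        // cudaEventElapsedTime reports milliseconds.
        if (cudaEventElapsedTime(&elapsedMs, start, stop) != cudaSuccess)
            elapsedMs = -1.0f;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsedMs;
}
```

For a 1000×1000 matrix, a call such as timeAddKernelMs(dev_c, dev_a, dev_b, 1000 * 1000, 256) would launch 3907 blocks of 256 threads. The device pointers are assumed to be allocated already, and 256 is just one common choice of block size.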

Note: the data shown comes from a single kernel run per size, so it is not representative. For accurate time measurements, the kernel should be run multiple times on the same problem, and the mean of those runs taken as the kernel time.
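Averaging over several runs might be sketched as follows. The names meanOf and averageKernelTimeMs and the untimed warm-up launch are assumptions for illustration, not part of the measurement that produced the table above. The warm-up is there because the first launch in a program often carries one-time overhead that would otherwise skew the mean.

```cuda
#include <numeric>
#include <vector>
#include <cuda_runtime.h>

// Element-wise addition, as in the question.
__global__ void addKernel(float *c, const float *a, const float *b, int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < size)
        c[idx] = a[idx] + b[idx];
}

// Arithmetic mean of the per-run times; 0 for an empty sample set.
float meanOf(const std::vector<float> &samples)
{
    if (samples.empty()) return 0.0f;
    return std::accumulate(samples.begin(), samples.end(), 0.0f) / samples.size();
}

// Hypothetical helper: times `runs` launches one by one and returns the
// mean in milliseconds, or a negative value if any CUDA call fails.
float averageKernelTimeMs(float *dev_c, const float *dev_a, const float *dev_b,
                          int size, int threadsPerBlock, int runs)
{
    int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;

    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    // Untimed warm-up launch; also catches an invalid launch configuration.
    addKernel<<<blocks, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<float> samples;
    for (int i = 0; ok && i < runs; ++i) {
        float ms = 0.0f;
        cudaEventRecord(start, 0);
        addKernel<<<blocks, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);
        cudaEventRecord(stop, 0);
        // Wait for this run to finish before reading its elapsed time.
        ok = (cudaEventSynchronize(stop) == cudaSuccess) &&
             (cudaEventElapsedTime(&ms, start, stop) == cudaSuccess);
        if (ok) samples.push_back(ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ok ? meanOf(samples) : -1.0f;
}
```

Timing each run separately keeps the individual samples available, so the spread can be inspected as well as the mean. Recording one event pair around the whole loop and dividing by the run count would be a simpler alternative.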
