I am timing how long my CUDA program takes to process matrices of various sizes, for example 10×10, 100×100, 500×500 and 1000×1000.
However, the results are not at all what I expected. The computation time does not grow with the size of the matrices; for some larger sizes it actually decreases.
For example, here is the average time (from 1000 runs):
10×10: 0.032768s
100×100: 0.068960s
500×500: 0.006336s
1000×1000: 0.018400s
The time goes down, then up again at 1000×1000. What is going on? Shouldn’t the numbers level off at a certain point? Why do they go up and down like a roller coaster?
Here is how the actual timing code is being run:
int blocksNeeded=0;
cudaError_t cudaStatus;
blocksNeeded=(size/MAXTHREADS)+1;
int threadsPerBlock = MAXTHREADS/blocksNeeded+1;
cudaEvent_t start, stop;
float elapsedtime;
...
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
addKernel<<<blocksNeeded, size>>>(dev_c, dev_a, dev_b,size);
cudaStatus = cudaDeviceSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedtime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
where MAXTHREADS is 1024 and size is the number of elements in the matrix, i.e. a 10×10 matrix has 100 elements, so size is 100.
Updated with kernel:
__global__ void addKernel(float *c, float *a, float *b, int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < size)
        c[idx] = a[idx] + b[idx];
}
I ran a test on a recent GPU cluster equipped with NVIDIA Tesla M2090 GPUs. Basically, I’m performing a vector addition with different vector sizes.
The results show a knee at a vector size of 16384, which basically resembles your observations. This is not an error but normal behavior: the GPU has to be sufficiently utilized before it shows its performance. On the Tesla M2090, that point of utilization is reached at around 16384 parallel additions.
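To get a rough feel for where such a utilization point might lie on a given card, one can compare the problem size with the number of threads the GPU can keep resident at once. The following is only a sketch under assumptions: the helper residentThreadCapacity is a hypothetical name, device 0 is assumed, and the element counts are the ones from the question. The resulting number is a coarse upper bound, not a prediction of the knee, which also depends on memory bandwidth and launch overhead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): upper bound on the
// number of threads that can be resident on the whole GPU at once.
int residentThreadCapacity(int smCount, int maxThreadsPerSm)
{
    return smCount * maxThreadsPerSm;
}

int main()
{
    cudaDeviceProp prop;
    // Assumption: the GPU of interest is device 0.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    int capacity = residentThreadCapacity(prop.multiProcessorCount,
                                          prop.maxThreadsPerMultiProcessor);
    printf("%s: %d SMs x %d threads per SM = %d resident threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, capacity);

    // Element counts of the 10x10, 100x100, 500x500 and 1000x1000 matrices.
    const int sizes[] = {100, 10000, 250000, 1000000};
    for (int i = 0; i < 4; ++i) {
        printf("%8d elements = %.3f x resident thread capacity\n",
               sizes[i], static_cast<double>(sizes[i]) / capacity);
    }
    return 0;
}
```

Problem sizes well below the printed capacity leave most of the GPU idle, so their timings are dominated by fixed overhead rather than by the additions themselves.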
The way you are measuring kernel performance is perfectly OK. I assume you’ve taken this from the CUDA “Best Practices Guide”.
Note: the data shown here was generated from a single kernel run, i.e. it is not representative. In general, for accurate time measurements the kernel should be run multiple times on the same problem, and the kernel time taken as the mean over those runs.
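A minimal sketch of such a repeated measurement, using the same event-based timing, could look like the following. Several things here are assumptions rather than part of the question or of my test: the helper names blocksFor and meanKernelTimeMs, the CHECK macro, the block size of 256 threads, the warm-up launch, and the explicit error check after each launch. Note that cudaEventElapsedTime reports milliseconds.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative macro).
#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Element-wise addition kernel, as in the question.
__global__ void addKernel(float *c, const float *a, const float *b, int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < size)
        c[idx] = a[idx] + b[idx];
}

// Hypothetical helper: blocks needed to cover `size` elements (ceiling division).
int blocksFor(int size, int threadsPerBlock)
{
    return (size + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: mean kernel time in milliseconds over `runs` launches.
float meanKernelTimeMs(int size, int runs, int threadsPerBlock)
{
    float *a, *b, *c;
    size_t bytes = static_cast<size_t>(size) * sizeof(float);
    CHECK(cudaMalloc(&a, bytes));
    CHECK(cudaMalloc(&b, bytes));
    CHECK(cudaMalloc(&c, bytes));
    CHECK(cudaMemset(a, 0, bytes));
    CHECK(cudaMemset(b, 0, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    int blocks = blocksFor(size, threadsPerBlock);

    // Warm-up launch (an addition of this sketch) so that one-time
    // startup cost is not counted in the first timed run.
    addKernel<<<blocks, threadsPerBlock>>>(c, a, b, size);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    float totalMs = 0.0f;
    for (int r = 0; r < runs; ++r) {
        CHECK(cudaEventRecord(start, 0));
        addKernel<<<blocks, threadsPerBlock>>>(c, a, b, size);
        CHECK(cudaGetLastError());  // reports an invalid launch configuration
        CHECK(cudaEventRecord(stop, 0));
        CHECK(cudaEventSynchronize(stop));
        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        totalMs += ms;
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    CHECK(cudaFree(c));
    return totalMs / runs;
}

int main()
{
    // Element counts of the 10x10, 100x100, 500x500 and 1000x1000 matrices.
    const int sizes[] = {100, 10000, 250000, 1000000};
    for (int i = 0; i < 4; ++i)
        printf("%8d elements: %f ms (mean of 1000 runs)\n",
               sizes[i], meanKernelTimeMs(sizes[i], 1000, 256));
    return 0;
}
```

Checking cudaGetLastError after each launch is a precaution added in this sketch: a launch that fails returns almost immediately, so without the check its time could be mistaken for a very fast kernel.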