I’m writing a function that does a lot of BLAS gemv operations. I would

Editorial Team

Asked: May 28, 2026 (2026-05-28T11:07:12+00:00)

I’m writing a function that does a lot of BLAS gemv operations.

I would like to be able to do this on the GPU, and I’ve tried with cuBLAS.

My problem is that my matrices and vectors are rather small: a 100×100 matrix and a 100-element vector. cuBLAS takes ages compared to the CPU, and I can see why: a mixture of fast cache on the CPU and a large overhead on each call to the GPU.

Therefore I’m trying to figure out a smart way of measuring the time it takes to communicate the call to the GPU.

That is, the time it takes CUDA to set up the call and send it to the graphics processor, not counting the time it actually takes to do the matrix-vector multiplication.

How would I go about doing this?

1 Answer

Editorial Team · Answer 1 · 2026-05-28T11:07:13+00:00

Update: The following results are for a hand-written FFT GPU algorithm on 2005 hardware (nVidia 7800 GTX), but they show the principle of CPU-GPU transfer bottlenecks.

The overhead is not the call per se but the compilation of the GPU program and the transfer of data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache, and the latency of DDR3 memory is far lower than that of the PCI-Express bus that services the GPU. I experienced this myself when writing GPU FFT routines (prior to CUDA). Please see this related question.

N       FFTw (ms)   GPUFFT (ms)     GPUFFT MFLOPS   GPUFFT Speedup
8         0           0.06             3.352705     0.006881
16        0.001       0.065            7.882117     0.010217
32        0.001       0.075           17.10887      0.014695
64        0.002       0.085           36.080118     0.026744
128       0.004       0.093           76.724324     0.040122
256       0.007       0.107          153.739856     0.066754
512       0.015       0.115          320.200892     0.134614
1024      0.034       0.125          657.735381     0.270512
2048      0.076       0.156         1155.151507     0.484331
4096      0.173       0.215         1834.212989     0.804558
8192      0.483       0.32          2664.042421     1.510011
16384     1.363       0.605         3035.4551       2.255411
32768     3.168       1.14          3450.455808     2.780041
65536     8.694       2.464         3404.628083     3.528726
131072   15.363       5.027         3545.850483     3.05604
262144   33.223      12.513         3016.885246     2.655183
524288   72.918      25.879         3079.443664     2.817667
1048576 173.043      76.537         2192.056517     2.260904
2097152 331.553     157.427         2238.01491      2.106081
4194304 801.544     430.518         1715.573229     1.861814

The table above shows timings of a GPU FFT implementation vs a CPU implementation as a function of kernel size N. For smaller sizes, the transfer of data to/from the GPU dominates. Smaller kernels can be performed on the CPU, and for some implementations and sizes entirely in cache. This makes the CPU the best choice for small operations.
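As a small worked example of reading the table, the sketch below recomputes the speedup (CPU time divided by GPU time) for four of its rows and finds the first size at which the GPU wins. The helper names `speedup` and `crossover` are made up for illustration, and because the printed times are rounded, the recomputed ratios differ slightly from the Speedup column.

```python
# (N, FFTw ms, GPUFFT ms) copied from four rows of the table above.
ROWS = [
    (2048, 0.076, 0.156),
    (4096, 0.173, 0.215),
    (8192, 0.483, 0.32),
    (16384, 1.363, 0.605),
]


def speedup(cpu_ms, gpu_ms):
    """Speedup of the GPU over the CPU; values below 1 mean the CPU wins."""
    return cpu_ms / gpu_ms


def crossover(rows):
    """Smallest N at which the GPU is faster, or None if it never is."""
    for n, cpu_ms, gpu_ms in sorted(rows):
        if speedup(cpu_ms, gpu_ms) > 1.0:
            return n
    return None


if __name__ == "__main__":
    for n, cpu_ms, gpu_ms in ROWS:
        print(n, round(speedup(cpu_ms, gpu_ms), 3))
    print("GPU first wins at N =", crossover(ROWS))
```

On these rows the crossover is N = 8192. For the gemv case the equivalent table would have to be measured on your own hardware.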

If, on the other hand, you need to perform large batches of work on data with minimal moves to/from the GPU, then the GPU will beat the CPU hands down.

As for measuring the effect in your example, I would suggest performing an experiment like the one above. Work out the FLOPS computed for each size of matrix and run the test on the CPU and GPU for varying sizes of matrix. Output the size, time and FLOPS for GPU vs CPU to a CSV file. For any profiling, make sure you run several hundred iterations of your code and time the whole thing, then divide the total time by the number of iterations to get the time per loop. Also try differently shaped matrices if your algorithm allows (e.g. 10×100 rather than 100×10).
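A minimal sketch of such an experiment is given here in plain Python so that it runs anywhere. The `gemv` function is only a stand-in for the routine under test (a CPU BLAS or cuBLAS call would be passed as `func` instead), and the names `time_per_call`, `sweep` and `to_csv` are illustrative. The FLOP count assumes roughly one multiply and one add per matrix element.

```python
import csv
import io
import time


def gemv(matrix, vector):
    """Plain-Python matrix-vector product, standing in for the routine under test."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def time_per_call(func, args, iterations=200):
    """Time `iterations` back-to-back calls and return the mean seconds per call."""
    func(*args)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(iterations):
        func(*args)
    return (time.perf_counter() - start) / iterations


def sweep(shapes, func=gemv, iterations=200):
    """Return one (rows, cols, seconds, mflops) tuple per matrix shape."""
    results = []
    for rows, cols in shapes:
        matrix = [[1.0] * cols for _ in range(rows)]
        vector = [1.0] * cols
        seconds = time_per_call(func, (matrix, vector), iterations)
        flops = 2 * rows * cols  # roughly one multiply and one add per element
        results.append((rows, cols, seconds, flops / seconds / 1e6))
    return results


def to_csv(results):
    """Format the sweep results as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["rows", "cols", "seconds_per_call", "mflops"])
    writer.writerows(results)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(sweep([(10, 100), (100, 10), (100, 100)])))
```

When timing a GPU routine this way, the timed call has to wait for the device to finish before the clock is read, otherwise only the cost of enqueueing the work is measured.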

Using this data you can get a feel for what the overheads are. To find out exactly, repeat the same experiment but replace the inner shader code executed on the GPU with a no-op (simply copy from input to output).
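The no-op comparison can be sketched as follows, again in plain Python. Here `full_call` and `noop_call` are hypothetical stand-ins so that the example runs without a GPU. In the real experiment both would go through the same transfers and the same launch path, and only the kernel body would differ.

```python
import time


def mean_seconds(func, data, iterations=500):
    """Mean wall-clock seconds per call over `iterations` calls."""
    func(data)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(iterations):
        func(data)
    return (time.perf_counter() - start) / iterations


def split_overhead(full_call, noop_call, data, iterations=500):
    """Estimate fixed overhead and compute time for one call.

    noop_call must follow the same path as full_call (same transfers,
    same launch) but only copy input to output.
    """
    overhead = mean_seconds(noop_call, data, iterations)
    total = mean_seconds(full_call, data, iterations)
    # Timing noise can make the difference slightly negative, so clamp at zero.
    return {"overhead": overhead, "total": total, "compute": max(total - overhead, 0.0)}


# Hypothetical stand-ins so that the sketch runs without a GPU.
def noop_call(data):
    return list(data)  # copy input to output, nothing else


def full_call(data):
    return [x * x + 1.0 for x in data]  # same copy plus some arithmetic


if __name__ == "__main__":
    print(split_overhead(full_call, noop_call, [float(i) for i in range(100)]))
```

The `overhead` figure is the cost of setting up and issuing the call plus moving the data. What remains after subtracting it from `total` approximates the time spent on the arithmetic itself.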

Hope this helps,
