Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7128259
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 28, 20262026-05-28T11:07:12+00:00 2026-05-28T11:07:12+00:00

I’m writing a function that does a lot of BLAS gemv operations. I would

  • 0

I’m writing a function that does a lot of BLAS gemv operations.

I would like to be able to do this on the GPU, and I’ve tried with cuBlas.

My problem is that my matrix’s and vectors are rather small, 100×100 matrix and 100 vector. CuBlas takes ages compared to a CPU and I see why, a mixture of fast cache on the cpu and a large overhead on doing the calls to the GPU.

Therefore I’m trying to figure out a smart way of measuring the time it takes to communicate the call to the GPU.

That is the time it takes CUDA to setup the call and send it to the graphics processor — not counting the time it actually takes to do the matrix-vector multiplication.

How would I go about doing this?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-28T11:07:13+00:00Added an answer on May 28, 2026 at 11:07 am

    Update: The following results are for a hand-written FFT GPU algorithm on 2005 hardware (nVidia 7800 GTX), but shows the principle of CPU-GPU tranfer bottlenecks

    The overhead is not the call per-se but compilation of the GPU program and transferring the data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache and the latency of DDR3 memory is far lower than the PCI-Express bus which services the GPU. I have experienced this myself when writing GPU FFT routines (prior to CUDA). Please see this related question.

    N       FFTw (ms)   GPUFFT (ms)     GPUFFT MFLOPS   GPUFFT Speedup
    8         0           0.06             3.352705     0.006881
    16        0.001       0.065            7.882117     0.010217
    32        0.001       0.075           17.10887      0.014695
    64        0.002       0.085           36.080118     0.026744
    128       0.004       0.093           76.724324     0.040122
    256       0.007       0.107          153.739856     0.066754
    512       0.015       0.115          320.200892     0.134614
    1024      0.034       0.125          657.735381     0.270512
    2048      0.076       0.156         1155.151507     0.484331
    4096      0.173       0.215         1834.212989     0.804558
    8192      0.483       0.32          2664.042421     1.510011
    16384     1.363       0.605         3035.4551       2.255411
    32768     3.168       1.14          3450.455808     2.780041
    65536     8.694       2.464         3404.628083     3.528726
    131072   15.363       5.027         3545.850483     3.05604
    262144   33.223      12.513         3016.885246     2.655183
    524288   72.918      25.879         3079.443664     2.817667
    1048576 173.043      76.537         2192.056517     2.260904
    2097152 331.553     157.427         2238.01491      2.106081
    4194304 801.544     430.518         1715.573229     1.861814
    

    The table above shows timings of a GPU FFT implementation vs CPU implementation based on kernel size. For smaller sizes, the transfer of data to/from the GPU dominates. Smaller kernels can be performed on the CPU, some implementations/sizes entirely in the cache. This makes the CPU the best choice for small operations.

    If on the other hand you need to perform large batches of work on data with minimal moves to/from the GPU then the GPU will beat the CPU hands down.

    In so far as measuring the effect in your example, I would suggest performing an experiment like the above. Try to work out the FLOPS computed for each size of matrix and run the test on the CPU and GPU for varying sizes of matrix. Output to a CSV file the size, time and FLOPS for GPU vs CPU. For any profiling ensure you run several hundred iterations of your code and time the whole thing, then divide the total time by iterations to get the loop time. Try different shaped matrices also if your algorithm allows (e.g. 10×100 rather than 100×10).

    Using this data you can get a feel for what the overheads are. To find out exactly repeat the same experiment but replace the inner shader code executed on the GPU with no-operation (simply copy from input to output).

    Hope this helps,

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a string like this: La Torre Eiffel paragonata all’Everest What PHP function
I'm parsing an RSS feed that has an ’ in it. SimpleXML turns this
I would like to count the length of a string with PHP. The string
For some reason, after submitting a string like this Jack’s Spindle from a text
I've got a string that has curly quotes in it. I'd like to replace
Does anyone know how can I replace this 2 symbol below from the string
I need a function that will clean a strings' special characters. I do NOT
link Im having trouble converting the html entites into html characters, (&# 8217;) i
That's pretty much it. I'm using Nokogiri to scrape a web page what has
I have just tried to save a simple *.rtf file with some websites and

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.