The Archive Base

Editorial Team
Asked: June 13, 2026

I need to determine both latency and throughput for (unsigned) modular multiplication in CUDA

I need to determine both latency and throughput for (unsigned) modular multiplication in CUDA and on CPU (i5 750).

For the CPU I found this document (pg. 121, Sandy Bridge). I am not really sure which table I should refer to, but for “MUL IMUL r32” I get a latency of 4 cycles and a reciprocal throughput of 2. A “DIV r64” has a latency of 30-94 cycles and a reciprocal throughput of 22-76.

Worst case scenario:

  • latency: 94 + 4 = 98 cycles

  • reciprocal throughput: 76 + 2 = 78 cycles

Right? Although I am using OpenSSL to perform them, I am pretty sure that at the lowest level it always runs simple modular multiplications.

Regarding CUDA, I am currently performing modular multiplications in PTX: I multiply two 32-bit numbers, store the result in a 64-bit register, load a 32-bit modulus into a 64-bit register, and then do a 64-bit modulo.
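For reference, the sequence described above can be sketched in plain C++ (which also compiles as CUDA C++ if the function is marked `__host__ __device__`). This is only an illustration of the 32x32->64 multiply followed by a 64-bit modulo, not the actual PTX; the name `mulmod32` is made up for this sketch.

```cpp
#include <cstdint>

// Hypothetical helper: (a * b) mod m for 32-bit operands.
// Precondition: m != 0 (a modulo by zero is undefined behaviour).
inline std::uint32_t mulmod32(std::uint32_t a, std::uint32_t b, std::uint32_t m) {
    // 32 x 32 -> 64-bit product, held in a 64-bit register.
    std::uint64_t product = static_cast<std::uint64_t>(a) * b;
    // The 32-bit modulus is widened to 64 bits.
    std::uint64_t modulus = m;
    // 64-bit modulo; the remainder is always smaller than m, so it fits in 32 bits.
    return static_cast<std::uint32_t>(product % modulus);
}
```

For example, `mulmod32(7, 8, 5)` evaluates 56 mod 5 and returns 1.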

If you look here, pg 76, they say the throughput on Fermi (compute capability 2.x) for 32-bit integer multiplication is 16 (per clock cycle per MP). Regarding modulo, they just say: “below 20 instructions on devices of compute capability 2.x”…

What does that mean exactly? A worst case of 20 cycles of latency per modulo per MP? And what about throughput? How many modulos per MP?

Edit:

And what if I have a warp where only the first 16 threads have to perform a 32-bit multiplication (16 per cycle per MP)? Will the GPU be busy for one cycle or two, even though the second half of the warp has nothing to do?

1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 11:36 pm

    [Since you also asked the same question on the NVIDIA forums, http://devtalk.nvidia.com, I simply copied the answer I gave there to StackOverflow. In general, cross-references are helpful when questions are asked on multiple platforms.]

    Latency is fairly meaningless with a throughput architecture like the GPU. The easiest way to determine throughput numbers for whatever operation you are interested in is to measure it on the device you plan to target. As far as I know, this is how the tables are generated for the CPU document you referenced.
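The "measure it" approach can be sketched on the host side as follows. This is a minimal illustration rather than a rigorous benchmark: `BenchResult` and `bench_mulmod` are names invented for this sketch, the inputs are derived from the loop counter so that successive operations are independent (which makes this a throughput measurement, not a latency one), and the modulus is passed at run time so the compiler cannot replace the division with a constant-multiplier sequence. On the GPU, the same loop body would go inside a kernel timed with CUDA events, with enough threads to fill the device.

```cpp
#include <chrono>
#include <cstdint>

struct BenchResult {
    double ops_per_second;   // measured rate; depends on machine and compiler flags
    std::uint32_t checksum;  // returned so the compiler cannot discard the loop
};

// Times `iterations` independent (a * b) mod m operations. Precondition: m != 0.
inline BenchResult bench_mulmod(std::uint64_t iterations, std::uint32_t m) {
    std::uint32_t acc = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // Cheap, deterministic pseudo-inputs derived from the loop counter.
        std::uint32_t a = static_cast<std::uint32_t>(i) * 2654435761u + 1u;
        std::uint32_t b = static_cast<std::uint32_t>(i) ^ 0x9E3779B9u;
        acc ^= static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) * b) % m);
    }
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    double rate = seconds > 0.0 ? static_cast<double>(iterations) / seconds : 0.0;
    return BenchResult{rate, acc};
}
```

Running it with a large iteration count (for example `bench_mulmod(100000000, 1000000007u)`) and an optimized build gives a rough operations-per-second figure for the machine at hand; the number also includes the cost of generating the inputs, so treat it as a lower bound.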

    To examine the machine code (SASS) generated for the modulo operation, you can disassemble it using cuobjdump --dump-sass. When I do this for sm_20, I count a total of sixteen instructions for a 32/32->32 bit unsigned modulo. From the instruction mix, I would estimate the throughput to be around 20 billion operations per second on a Tesla C2050, across the entire GPU (note that this is a guesstimate, not a measured number!).

    As for the 64/64->64 bit unsigned modulo, which is a called subroutine, I recently measured a throughput of 6.4 billion operations per second on a C2050 using CUDA 5.0.

    You might want to look into the algorithms of Montgomery and Barrett for modular multiplications, instead of using division.
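As a rough illustration of those two alternatives, here is a sketch for 32-bit moduli. All names (`mulhi64`, `Barrett32`, `Montgomery32`) are invented for this example and the code is untuned. Barrett replaces the division with a multiplication by a precomputed reciprocal plus one correction step. Montgomery requires an odd modulus and only pays off when many multiplications share the same modulus (modular exponentiation, for instance), because converting into Montgomery form still costs a division.

```cpp
#include <cstdint>

// High 64 bits of a 64 x 64-bit product, built from 32-bit pieces for
// portability. In CUDA device code the __umul64hi intrinsic does the same job.
inline std::uint64_t mulhi64(std::uint64_t a, std::uint64_t b) {
    std::uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    std::uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    std::uint64_t hi_lo = a_hi * b_lo;
    std::uint64_t cross = ((a_lo * b_lo) >> 32) + (hi_lo & 0xFFFFFFFFu) + a_lo * b_hi;
    return a_hi * b_hi + (hi_lo >> 32) + (cross >> 32);
}

// Barrett reduction for a modulus 1 <= m < 2^32.
struct Barrett32 {
    std::uint32_t m;
    std::uint64_t mu;  // floor((2^64 - 1) / m), computed once per modulus

    explicit Barrett32(std::uint32_t modulus)
        : m(modulus), mu(~std::uint64_t{0} / modulus) {}

    std::uint32_t mulmod(std::uint32_t a, std::uint32_t b) const {
        std::uint64_t x = static_cast<std::uint64_t>(a) * b;
        std::uint64_t q = mulhi64(x, mu);  // floor(x / m), or one less
        std::uint64_t r = x - q * m;       // therefore 0 <= r < 2m
        if (r >= m) r -= m;                // single correction step
        return static_cast<std::uint32_t>(r);
    }
};

// Montgomery multiplication with R = 2^32 for an odd modulus m < 2^32.
struct Montgomery32 {
    std::uint32_t m;
    std::uint32_t m_inv;  // m^-1 mod 2^32

    explicit Montgomery32(std::uint32_t modulus) : m(modulus), m_inv(modulus) {
        // Newton iteration: m * m = 1 (mod 8) gives 3 correct bits, and each
        // step doubles them (6, 12, 24, 48), so four steps cover 32 bits.
        for (int i = 0; i < 4; ++i) m_inv *= 2u - m * m_inv;
    }

    // REDC: returns t * 2^-32 mod m, valid for t < m * 2^32.
    std::uint32_t reduce(std::uint64_t t) const {
        std::uint32_t q = static_cast<std::uint32_t>(t) * m_inv;
        std::uint32_t qm_hi =
            static_cast<std::uint32_t>((static_cast<std::uint64_t>(q) * m) >> 32);
        std::uint32_t t_hi = static_cast<std::uint32_t>(t >> 32);
        return t_hi >= qm_hi ? t_hi - qm_hi : t_hi - qm_hi + m;
    }

    // One-time conversion into Montgomery form (this step still divides).
    std::uint32_t to_mont(std::uint32_t a) const {
        return static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) << 32) % m);
    }
    // Product of two values already in Montgomery form (both must be < m).
    std::uint32_t mul(std::uint32_t a_mont, std::uint32_t b_mont) const {
        return reduce(static_cast<std::uint64_t>(a_mont) * b_mont);
    }
    std::uint32_t from_mont(std::uint32_t a_mont) const { return reduce(a_mont); }
};
```

With `Barrett32 br(m)`, a call to `br.mulmod(a, b)` gives the same result as `(uint64_t(a) * b) % m`. With `Montgomery32 mg(m)`, the equivalent is `mg.from_mont(mg.mul(mg.to_mont(a), mg.to_mont(b)))`; in practice the values would stay in Montgomery form across a long chain of multiplications. Whether either variant actually beats the compiler-generated modulo on a given GPU or CPU would have to be measured.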

