Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8461149
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 10, 20262026-06-10T13:48:00+00:00 2026-06-10T13:48:00+00:00

From Nvidia release notes: The nvcc compiler switch, –fmad (short name: -fmad), to control

  • 0

From Nvidia release notes:

 The nvcc compiler switch, --fmad (short name: -fmad), to control the contraction of    
 floating-point multiplies and add/subtracts into floating-point multiply-add   
 operations (FMAD, FFMA, or DFMA) has been added: 
 --fmad=true and --fmad=false enables and disables the contraction respectively. 
 This switch is supported only when the --gpu-architecture option is set with     
 compute_20, sm_20, or higher. For other architecture classes, the contraction is     
  always enabled. 
 The --use_fast_math option implies --fmad=true, and enables the contraction.

I have two kernels – one is purely compute bound with lots of multiplications, whereas the other one is memory bound. I notice a consistent improvement in performance (around 5%) for my compute intensive kernel when I do -fmad=false…and around the same percent decline in performance when I turn it off for my memory bound kernel.
So, FMA is working better for my memory bound kernel, but my compute bound kernel could squeeze a little performance by turning it off.
What could be the reason?
My device is M2090 and I am using CUDA 4.2.

Full compilation options:
-arch,sm_20,-ftz=true,-prec-div=false,-prec-sqrt=false,-use_fast_math,-fmad=false (or I just remove fmad=false because that’s the default anyway.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-10T13:48:02+00:00Added an answer on June 10, 2026 at 1:48 pm

    Use of FMA may increase register pressure slightly, because three source operands must be available at the same time. So turning FMA generation on / off can lead to small differences in instruction scheduling and register allocation, which in turn can lead to small performance differences. For a compute-bound kernel with many multiply-add idioms, -fmad=true should make a significant performance difference, but as you say, your kernel is dominated by multiplies and thus will benefit little from use of FMA, and any gains may be offset by the register pressure / instruction scheduling aspects

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Dear CUDA users I am reposting a question from nvidia boards: I am currently
When trying to compile the most recent CUDA SDK from Nvidia (version 4.1.28) for
Was looking into some GPU CUDA samples and trying some samples out from Nvidia's
I'm trying to build code from the nVidia 9.5 SDK but I get the
I'd like to compare a few files from the bazaar branch lp:ubuntu/nvidia-graphics-drivers. I'm mainly
Using NvTriStrip library from Nvidia, http://www.nvidia.in/object/nvtristrip_library.html I was trying to use NVTriStriper for creating
Came across this implementation of FxAA from NVidia. http://developer.download.nvidia.com/assets/gamedev/files/sdk/11/FXAA_WhitePaper.pdf http://timothylottes.blogspot.in/2011/12/fxaa-40-stills-and-features.html The source code as
I see a strange behaviour of NVIDIA NVCC (CUDA 4.0 and 4.1 tested) when
From my website, I have a button which calls a method and then generates
From this context: import itertools lines = itertools.cycle(open('filename')) I'm wondering how I can implement

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.