Editorial Team
Asked: May 31, 2026

Recently I was profiling a program in which the hotspot is definitely this:

double d = somevalue();
double d2 = d * d;
double c = 1.0 / d2;   // HOT SPOT

The value d2 is not used afterwards; I only need the value c. Some time ago I read about Carmack’s fast inverse square root method. That obviously does not apply directly here, but I’m wondering whether a similar algorithm could help me compute 1/x^2.

I need fairly high precision: I’ve checked that my program doesn’t give correct results with gcc’s -ffast-math option (g++-4.5).

1 Answer

  1. Editorial Team
    Added an answer on May 31, 2026 at 3:36 pm

    The tricks for doing fast square roots and the like get their performance by sacrificing precision. (Well, most of them.)

    1. Are you sure you need double precision? You can sacrifice precision easily enough:

      double d = somevalue();
      float c = 1.0f / ((float) d * (float) d);
      

      The 1.0f is absolutely mandatory in this case: if you use 1.0 instead, the float operand is promoted and the division is done in double precision.

    2. Have you tried enabling “sloppy” math on your compiler? On GCC you can use -ffast-math; other compilers have similar options. The sloppy math may be more than good enough for your application. (Edit: I did not see any difference in the resulting assembly.)

    3. If you are using GCC, have you considered using -mrecip? There is a “reciprocal estimate” instruction which has only about 12 bits of precision, but it is much faster than a full division. You can use the Newton-Raphson method to increase the precision of the result. The -mrecip option will cause the compiler to automatically generate the reciprocal estimate and Newton-Raphson steps for you, although you can always write the assembly yourself if you want to fine-tune the performance-precision trade-off. (Newton-Raphson converges very quickly.) (Edit: I was unable to get GCC to generate RCPSS. See below.)
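To make the 1.0f remark in item 1 concrete, here is a minimal sketch that asks the compiler for the size of each result type. The function names are hypothetical, chosen only for this illustration; sizeof does not evaluate its operand, so no division is actually performed.

```c
#include <stddef.h>

/* Result type with a float literal (1.0f): float / float stays float,
   so the division is carried out in single precision. */
static size_t size_with_float_literal(float f)
{
    return sizeof(1.0f / (f * f));
}

/* Result type with a double literal (1.0): the float operand is
   converted to double first, so the division is double precision. */
static size_t size_with_double_literal(float f)
{
    return sizeof(1.0 / (f * f));
}
```

The first function returns sizeof(float) and the second returns sizeof(double), which is why dropping the f suffix silently brings back the double-precision division.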

    I found a blog post (source) discussing the exact problem you are going through, and the author’s conclusion is that the techniques like the Carmack method are not competitive with the RCPSS instruction (which the -mrecip flag on GCC uses).

    Division can be slow because processors generally have only one division unit, and it is often not pipelined. So you can have a few multiplications in the pipe all executing simultaneously, but no division can be issued until the previous division finishes.

    Tricks that don’t work

    1. Carmack’s method: It is obsolete on modern processors, which have reciprocal estimation opcodes. For reciprocals, the best version I’ve seen only gives one bit of precision — nothing compared to the 12 bits of RCPSS. I think it is a coincidence that the trick works so well for reciprocal square roots; a coincidence that is unlikely to be repeated.

    2. Relabeling variables. As far as the compiler is concerned, there is very little difference between 1.0/(x*x) and double x2 = x*x; 1.0/x2. I would be surprised if you found a compiler that generates different code for the two versions with optimizations turned on even to the lowest level.

    3. Using pow. The pow library function is a total monster. With GCC’s -ffast-math turned off, the library call is fairly expensive. With GCC’s -ffast-math turned on, you get the exact same assembly code for pow(x, -2) as you do for 1.0/(x*x), so there is no benefit.

    Update

    Here is an example of a Newton-Raphson approximation for the inverse square of a double-precision floating-point value.

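    /* Number of Newton-Raphson steps; assumed here to default to 1
       when not supplied on the command line with -DRECIP_ITER=n. */
    #ifndef RECIP_ITER
    #define RECIP_ITER 1
    #endif

    static double invsq(double x)
    {
        double y;
        int i;
        /* Narrow x to float, take the ~12-bit RCPSS estimate of 1/x,
           then widen the estimate back to double. */
        __asm__ (
            "cvtpd2ps %1, %0\n\t"
            "rcpss %0, %0\n\t"
            "cvtps2pd %0, %0"
            : "=x"(y)
            : "x"(x));
        /* Each step refines y toward 1/x. */
        for (i = 0; i < RECIP_ITER; ++i)
            y *= 2 - x * y;
        return y * y;   /* (1/x)^2 */
    }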
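
The loop in invsq is Newton's method applied to f(y) = 1/y - x, whose root is y = 1/x. A short derivation of the update, and of why the error is squared at every step, assuming exact arithmetic (rounding is ignored here):

```latex
\begin{aligned}
f(y) &= \frac{1}{y} - x, \qquad f'(y) = -\frac{1}{y^{2}} \\
y_{n+1} &= y_n - \frac{f(y_n)}{f'(y_n)}
         = y_n + y_n^{2}\left(\frac{1}{y_n} - x\right)
         = y_n\,(2 - x\,y_n) \\
\varepsilon_n &:= 1 - x\,y_n
  \;\Longrightarrow\;
  \varepsilon_{n+1} = 1 - x\,y_n\,(1 + \varepsilon_n)
                    = 1 - (1 - \varepsilon_n)(1 + \varepsilon_n)
                    = \varepsilon_n^{2}
\end{aligned}
```

Squaring the relative error roughly doubles the number of correct bits, so a 12-bit estimate gives about 24 bits after one step and about 48 after two. A double has a 53-bit significand, so reaching full double precision from a 12-bit start would take about three steps.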
    

    Unfortunately, with RECIP_ITER=1, benchmarks on my computer put it slightly slower (about 5%) than the simple version 1.0/(x*x). With zero iterations it is about twice as fast, but then you only get 12 bits of precision. I don’t know if 12 bits is enough for you.
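
To see how the precision depends on the iteration count, here is a rough sketch that uses the SSE intrinsics instead of inline assembly and takes the number of steps as a run-time argument. It assumes an x86 target with SSE available; invsq_iters and invsq_rel_error are hypothetical names introduced for this illustration, not part of the code above.

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ss, _mm_rcp_ss, _mm_cvtss_f32 */

/* Approximate 1/(x*x): start from the RCPSS estimate of 1/x and
   refine it with `iters` Newton-Raphson steps in double precision.
   x must be representable as a normal, non-zero float. */
static double invsq_iters(double x, int iters)
{
    double y = (double)_mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss((float)x)));
    int i;
    for (i = 0; i < iters; ++i)
        y *= 2.0 - x * y;
    return y * y;
}

/* Relative error of the approximation against the plain division. */
static double invsq_rel_error(double x, int iters)
{
    double exact = 1.0 / (x * x);
    double err = (invsq_iters(x, iters) - exact) / exact;
    return err < 0.0 ? -err : err;
}
```

Calling invsq_rel_error(x, n) for n = 0, 1, 2, 3 on values from your own data should show the error shrinking roughly quadratically, which lets you pick the smallest iteration count your program tolerates.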

    I think one of the problems here is that this is too small a micro-optimization; at this scale the compiler writers are on nearly equal footing with the assembly hackers. Maybe if we had the bigger picture we could see a way to make it faster.

    For example, you said that -ffast-math caused an undesirable loss of precision; this may indicate a numerical stability problem in the algorithm you are using. With the right choice of algorithm, many problems can be solved with float instead of double. (Of course, you may just need more than 24 bits. I don’t know.)

    I suspect the RCPSS method shines if you want to compute several of these in parallel.
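
As a sketch of what "several in parallel" could look like, the packed form of the instruction (RCPPS) estimates four reciprocals at once. The version below assumes an x86 target with SSE and works entirely in single precision, so it yields somewhat less than float's 24 bits and is only an option if float accuracy is acceptable; invsq4 is a hypothetical name.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_rcp_ps, _mm_mul_ps, ... */

/* Compute out[i] = 1/(in[i]*in[i]) for four floats at once, using
   one RCPPS estimate and one Newton-Raphson step per lane. */
static void invsq4(const float in[4], float out[4])
{
    __m128 x   = _mm_loadu_ps(in);
    __m128 y   = _mm_rcp_ps(x);          /* ~12-bit estimate of 1/x */
    __m128 two = _mm_set1_ps(2.0f);

    /* y *= 2 - x*y, applied to all four lanes */
    y = _mm_mul_ps(y, _mm_sub_ps(two, _mm_mul_ps(x, y)));

    _mm_storeu_ps(out, _mm_mul_ps(y, y)); /* (1/x)^2 */
}
```

None of the four lanes depends on another, so the multiplications can overlap in the pipeline instead of waiting on a single divider.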
