Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6881105
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 27, 20262026-05-27T05:03:55+00:00 2026-05-27T05:03:55+00:00

(EDIT: Let’s title this, Lessons in how measurements can go wrong. I still haven’t

  • 0

(EDIT: Let’s title this, “Lessons in how measurements can go wrong.” I still haven’t figured out exactly what’s causing the discrepancy though.)

I found a very fast integer square root function here by Mark Crowne. At least with GCC on my machine, it’s clearly the fastest integer square root function I’ve tested (including the functions in Hacker’s Delight, this page, and floor(sqrt()) from the standard library).

After cleaning up the formatting a bit, renaming a variable, and using fixed-width types, it looks like this:

static uint32_t mcrowne_isqrt(uint32_t val)
{
    uint32_t temp, root = 0;

    if (val >= 0x40000000)
    {
        root = 0x8000;
        val -= 0x40000000;
    }

    #define INNER_ISQRT(s)                              \
    do                                                  \
    {                                                   \
        temp = (root << (s)) + (1 << ((s) * 2 - 2));    \
        if (val >= temp)                                \
        {                                               \
            root += 1 << ((s)-1);                       \
            val -= temp;                                \
        }                                               \
    } while(0)

    INNER_ISQRT(15);
    INNER_ISQRT(14);
    INNER_ISQRT(13);
    INNER_ISQRT(12);
    INNER_ISQRT(11);
    INNER_ISQRT(10);
    INNER_ISQRT( 9);
    INNER_ISQRT( 8);
    INNER_ISQRT( 7);
    INNER_ISQRT( 6);
    INNER_ISQRT( 5);
    INNER_ISQRT( 4);
    INNER_ISQRT( 3);
    INNER_ISQRT( 2);

    #undef INNER_ISQRT

    temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

The INNER_ISQRT macro isn’t too evil, since it’s local and immediately undefined after it’s no longer needed. Nevertheless, I’d still like to convert it to an inline function as a matter of principle. I’ve read assertions in several places (including the GCC documentation) that inline functions are “just as fast” as macros, but I’ve had trouble converting it without a speed hit.

My current iteration looks like this (note the always_inline attribute, which I threw in for good measure):

static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root) __attribute__((always_inline));
static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root)
{
    const uint32_t temp = (root << s) + (1 << ((s << 1) - 2));
    if(val >= temp)
    {
        root += 1 << (s - 1);
        val -= temp;
    }
}

//  Note that I just now changed the name to mcrowne_inline_isqrt, so people can compile my full test.
static uint32_t mcrowne_inline_isqrt(uint32_t val)
{
    uint32_t root = 0;

    if(val >= 0x40000000)
    {
        root = 0x8000; 
        val -= 0x40000000;
    }

    inner_isqrt(15, val, root);
    inner_isqrt(14, val, root);
    inner_isqrt(13, val, root);
    inner_isqrt(12, val, root);
    inner_isqrt(11, val, root);
    inner_isqrt(10, val, root);
    inner_isqrt(9, val, root);
    inner_isqrt(8, val, root);
    inner_isqrt(7, val, root);
    inner_isqrt(6, val, root);
    inner_isqrt(5, val, root);
    inner_isqrt(4, val, root);
    inner_isqrt(3, val, root);
    inner_isqrt(2, val, root);

    const uint32_t temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

No matter what I do, the inline function is always slower than the macro. The macro version commonly times at around 2.92s for (2^28 – 1) iterations with an -O2 build, whereas the inline version commonly times at around 3.25s. EDIT: I said 2^32 – 1 iterations before, but I forgot that I had changed it. They take quite a bit longer for the full gamut.

It’s possible that the compiler is just being stupid and refusing to inline it (note again the always_inline attribute!), but if so, that would make the macro version generally preferable anyway. (I tried checking the assembly to see, but it was too complicated as part of a program. The optimizer omitted everything when I tried compiling just the functions of course, and I’m having issues compiling it as a library due to noobishness with GCC.)

In short, is there a way to write this as an inline without a speed hit? (I haven’t profiled, but sqrt is one of those fundamental operations that should always be made fast, since I may be using it in many other programs than just the one I’m currently interested in. Besides, I’m just curious.)

I’ve even tried using templates to “bake in” the constant value, but I get the feeling the other two parameters are more likely to be causing the hit (and the macro can avoid that, since it uses local variables directly)…well, either that or the compiler is stubbornly refusing to inline.


UPDATE: user1034749 below is getting the same assembly output from both functions when he puts them in separate files and compiles them. I tried his exact command line, and I’m getting the same result as him. For all intents and purposes, this question is solved.

However, I’d still like to know why my measurements are coming out differently. Obviously, my measurement code or original build process was causing things to be different. I’ll post the code below. Does anyone know what the deal was? Maybe my compiler is actually inlining the whole mcrowne_isqrt() function in the loop of my main() function, but it’s not inlining the entirety of the other version?

UPDATE 2 (squeezed in before testing code): Note that if I swap the order of the tests and make the inline version come first, the inline version comes out faster than the macro version by the same amount. Is this a caching issue, or is the compiler inlining one call but not the other, or what?

#include <iostream>
#include <time.h>      //  Linux high-resolution timer
#include <stdint.h>

/*  Functions go here */

timespec timespecdiff(const timespec& start, const timespec& end)
{
    timespec elapsed;
    timespec endmod = end;
    if(endmod.tv_nsec < start.tv_nsec)
    {
        endmod.tv_sec -= 1;
        endmod.tv_nsec += 1000000000;
    }

    elapsed.tv_sec = endmod.tv_sec - start.tv_sec;
    elapsed.tv_nsec = endmod.tv_nsec - start.tv_nsec;
    return elapsed;
}


int main()
{
    uint64_t inputlimit = 4294967295;
    //  Test a wide range of values
    uint64_t widestep = 16;

    timespec start, end;

    //  Time macro version:
    uint32_t sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowntime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;


    //  Time inline version:
    sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_inline_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowninlinetime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's inline sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;

    //  Results:
    std::cout << "Mark Crowne sqrt variant time:\t" << markcrowntime.tv_sec << "s, " << markcrowntime.tv_nsec << "ns" << std::endl;
    std::cout << "Mark Crowne inline sqrt variant time:\t" << markcrowninlinetime.tv_sec << "s, " << markcrowninlinetime.tv_nsec << "ns" << std::endl;
    std::cout << std::endl;
}

UPDATE 3: I still have no idea how to reliably compare the timing of different functions without the timing depending on the order of the tests. I’d greatly appreciate any tips!

However, if anyone else reading this is interested in fast sqrt implementations, I should mention: Mark Crowne’s code tests faster than any other pure C/C++ version I’ve tried by a decent margin (despite reliability issues with testing), but the following SSE code seems like it might be a little bit faster still for a scalar 32-bit integer sqrt. It can’t be generalized for full-blown 64-bit unsigned integer inputs without losing precision though (and the first signed conversion would also have to be replaced by a load intrinsic to handle values >= 2^63):

uint32_t sse_sqrt(uint64_t num)
{
    //  Uses 64-bit input, because SSE conversion functions treat all
    //  integers as signed (so conversion from a 32-bit value >= 2^31
    //  will be interpreted as negative).  As it stands, this function
    //  will similarly fail for values >= 2^63.
    //  It can also probably be made faster, since it generates a strange/
    //  useless movsd %xmm0,%xmm0 instruction before the sqrtsd.  It clears
    //  xmm0 first too with xorpd (seems unnecessary, but I could be wrong).
    __m128d result;
    __m128d num_as_sse_double = _mm_cvtsi64_sd(result, num);
    result = _mm_sqrt_sd(num_as_sse_double, num_as_sse_double);
    return _mm_cvttsd_si32(result);
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-27T05:03:56+00:00Added an answer on May 27, 2026 at 5:03 am

    I tried your code with gcc 4.5.3.
    I modified your second version of code to match the first one,
    for example:

    (1 << ((s) * 2 - 2)
    

    vs

    (1 << ((s << 1) - 1)
    

    yes, s * 2 == s << 1, but “-2” and “-1”?

    Also I modified your types replace uint32_t with “unsigned long”, because of on my 64 bit machine “long” is not 32bit number.

    And then I run:

    g++ -ggdb -O2 -march=native -c -pipe inline.cpp
    g++ -ggdb -O2 -march=native -c -pipe macros.cpp
    objdump -d inline.o > inline.s
    objdump -d macros.o > macros.s
    

    I could use “-S” instead of “-c” to assembler, but I would like to see assembler without additional info.

    and you know what?
    The assembler completly the same,
    in the first and in the second verison.
    So I think your time measurements are just wrong.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

EDIT: Let me turn this into a straight SQL question... I have a varbinary(max)
Edit: let me include my code so I can get some specific help. import
EDIT : Let's try this again. This time I've used the AdventureWorks sample database
Please let me know if you have any idea about it. Thanks EDIT What
So. Let's say I were to make a hex editor to edit... oh... let's
Edit: This question was written in 2008, which was like 3 internet ages ago.
EDIT: This was formerly more explicitly titled: - Best solution to stop Kontiki's KHOST.EXE
EDIT: This question is more about language engineering than C++ itself. I used C++
What is the string terminator sequence for a UTF-16 string? EDIT: Let me rephrase
EDIT Let's take the following model : The following entities are generated with the

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.