The Archive Base

Editorial Team
Asked: June 2, 2026

While writing an optimized ftol function I found some very odd behaviour in GCC

While writing an optimized ftol function I found some very odd behaviour in GCC 4.6.1. Let me show you the code first (for clarity I marked the differences):

fast_trunc_one, C:

int fast_trunc_one(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = mantissa << -exponent;                       /* diff */
    } else {
        r = mantissa >> exponent;                        /* diff */
    }

    return (r ^ -sign) + sign;                           /* diff */
}

fast_trunc_two, C:

int fast_trunc_two(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = (mantissa << -exponent) ^ -sign;             /* diff */
    } else {
        r = (mantissa >> exponent) ^ -sign;              /* diff */
    }

    return r + sign;                                     /* diff */
}

Seems the same, right? Well, GCC disagrees. After compiling with gcc -O3 -S -Wall -o test.s test.c, this is the assembly output:

fast_trunc_one, generated:

_fast_trunc_one:
LFB0:
    .cfi_startproc
    movl    4(%esp), %eax
    movl    $150, %ecx
    movl    %eax, %edx
    andl    $8388607, %edx
    sarl    $23, %eax
    orl     $8388608, %edx
    andl    $255, %eax
    subl    %eax, %ecx
    movl    %edx, %eax
    sarl    %cl, %eax
    testl   %ecx, %ecx
    js      L5
    rep
    ret
    .p2align 4,,7
L5:
    negl    %ecx
    movl    %edx, %eax
    sall    %cl, %eax
    ret
    .cfi_endproc

fast_trunc_two, generated:

_fast_trunc_two:
LFB1:
    .cfi_startproc
    pushl   %ebx
    .cfi_def_cfa_offset 8
    .cfi_offset 3, -8
    movl    8(%esp), %eax
    movl    $150, %ecx
    movl    %eax, %ebx
    movl    %eax, %edx
    sarl    $23, %ebx
    andl    $8388607, %edx
    andl    $255, %ebx
    orl     $8388608, %edx
    andl    $-2147483648, %eax
    subl    %ebx, %ecx
    js      L9
    sarl    %cl, %edx
    movl    %eax, %ecx
    negl    %ecx
    xorl    %ecx, %edx
    addl    %edx, %eax
    popl    %ebx
    .cfi_remember_state
    .cfi_def_cfa_offset 4
    .cfi_restore 3
    ret
    .p2align 4,,7
L9:
    .cfi_restore_state
    negl    %ecx
    sall    %cl, %edx
    movl    %eax, %ecx
    negl    %ecx
    xorl    %ecx, %edx
    addl    %edx, %eax
    popl    %ebx
    .cfi_restore 3
    .cfi_def_cfa_offset 4
    ret
    .cfi_endproc

That’s an extreme difference. It shows up in the profile too: fast_trunc_one is around 30% faster than fast_trunc_two. Now my question: what is causing this?

1 Answer

    Editorial Team added an answer on June 2, 2026 at 6:00 am

    Updated to sync with the OP’s edit

    By tinkering with the code, I’ve managed to see how GCC optimizes the first case.

    Before we can understand why the two functions compile so differently, we must first understand how GCC optimizes fast_trunc_one().

    Believe it or not, fast_trunc_one() is being optimized to this:

    int fast_trunc_one(int i) {
        int mantissa, exponent;
    
        mantissa = (i & 0x07fffff) | 0x800000;
        exponent = 150 - ((i >> 23) & 0xff);
    
        if (exponent < 0) {
            return (mantissa << -exponent);             /* diff */
        } else {
            return (mantissa >> exponent);              /* diff */
        }
    }
    

    This produces the exact same assembly as the original fast_trunc_one() – register names and everything.

    Notice that there are no xors in the assembly for fast_trunc_one(). That’s what gave it away for me.


    How so?


    Step 1: sign = -sign

    First, let’s take a look at the sign variable. Since sign = i & 0x80000000;, there are only two possible values that sign can take:

    • sign = 0
    • sign = 0x80000000

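    Now recognize that in both cases, sign == -sign: negating 0 gives 0, and in 32-bit two’s complement, negating 0x80000000 wraps around to 0x80000000 again. Therefore, when I change the original code to this: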

    int fast_trunc_one(int i) {
        int mantissa, exponent, sign, r;
    
        mantissa = (i & 0x07fffff) | 0x800000;
        exponent = 150 - ((i >> 23) & 0xff);
        sign = i & 0x80000000;
    
        if (exponent < 0) {
            r = mantissa << -exponent;
        } else {
            r = mantissa >> exponent;
        }
    
        return (r ^ sign) + sign;
    }
    

    It produces the exact same assembly as the original fast_trunc_one(). I’ll spare you the assembly, but it is identical – register names and all.
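
    The identity from Step 1 can be checked directly. The following is a minimal sketch, assuming a 32-bit two’s complement target. It uses uint32_t rather than int, because unsigned negation is defined to wrap modulo 2^32, whereas negating the most negative int is formally undefined in C. The helper names sign_equals_negated_sign and check_step1 are hypothetical and do not appear in the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if negating the extracted sign bit
   leaves it unchanged. Unsigned arithmetic is used so that the
   negation wraps modulo 2^32 instead of overflowing. */
int sign_equals_negated_sign(uint32_t i) {
    uint32_t sign = i & 0x80000000u;  /* either 0 or 0x80000000 */
    uint32_t neg  = 0u - sign;        /* -0 == 0, -(2^31) == 2^31 mod 2^32 */
    return neg == sign;
}

/* Hypothetical check covering both possible values of sign. */
void check_step1(void) {
    assert(sign_equals_negated_sign(0x42F00000u)); /* sign bit clear */
    assert(sign_equals_negated_sign(0xC2F00000u)); /* sign bit set   */
}
```

    Calling check_step1() should pass silently, which matches the claim that -sign can be replaced by sign without changing the result.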


    Step 2: Mathematical reduction: x + (y ^ x) = y

    sign can only take one of two values, 0 or 0x80000000.

    • When x = 0, x + (y ^ x) = y holds trivially.
    • When x = 0x80000000, adding x and xoring with x have the same effect: both flip the sign bit, and the carry out of the top bit is discarded. The two flips cancel, so x + (y ^ x) = y also holds.

    Therefore, x + (y ^ x) reduces to y. And the code simplifies to this:

    int fast_trunc_one(int i) {
        int mantissa, exponent, sign, r;
    
        mantissa = (i & 0x07fffff) | 0x800000;
        exponent = 150 - ((i >> 23) & 0xff);
        sign = i & 0x80000000;
    
        if (exponent < 0) {
            r = (mantissa << -exponent);
        } else {
            r = (mantissa >> exponent);
        }
    
        return r;
    }
    

    Again, this compiles to the exact same assembly – register names and all.
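
    The reduction in Step 2 can also be verified on sample values. This is a minimal sketch in unsigned 32-bit arithmetic, where addition wraps modulo 2^32. The names add_after_xor and check_reduction are hypothetical. It also shows that the identity is specific to the two possible values of sign and does not hold for arbitrary x.

```c
#include <assert.h>
#include <stdint.h>

/* Computes x + (y ^ x) modulo 2^32. */
uint32_t add_after_xor(uint32_t x, uint32_t y) {
    return x + (y ^ x);
}

/* Hypothetical check: for x in {0, 0x80000000} the result equals y. */
void check_reduction(void) {
    static const uint32_t xs[] = { 0u, 0x80000000u };
    static const uint32_t ys[] = {
        0u, 1u, 0x00800000u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu
    };
    for (unsigned a = 0; a < sizeof xs / sizeof xs[0]; ++a) {
        for (unsigned b = 0; b < sizeof ys / sizeof ys[0]; ++b) {
            assert(add_after_xor(xs[a], ys[b]) == ys[b]);
        }
    }
    /* Counterexample for a general x: 1 + (0 ^ 1) is 2, not 0. */
    assert(add_after_xor(1u, 0u) == 2u);
}
```

    The counterexample at the end is why the compiler needs to know that sign is either 0 or 0x80000000 before it can drop the xor and the add.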


    This version finally reduces to this:

    int fast_trunc_one(int i) {
        int mantissa, exponent;
    
        mantissa = (i & 0x07fffff) | 0x800000;
        exponent = 150 - ((i >> 23) & 0xff);
    
        if (exponent < 0) {
            return (mantissa << -exponent);             /* diff */
        } else {
            return (mantissa >> exponent);              /* diff */
        }
    }
    

    which is pretty much exactly what GCC generates in the assembly.


    So why doesn’t the compiler optimize fast_trunc_two() to the same thing?

    The key part in fast_trunc_one() is the x + (y ^ x) = y optimization. In fast_trunc_two() the x + (y ^ x) expression is being split across the branch.

    I suspect that might be enough to keep GCC from making this optimization. (It would need to hoist the ^ -sign out of both branches and merge it into the r + sign at the end.)

    For example, this produces the same assembly as fast_trunc_one():

    int fast_trunc_two(int i) {
        int mantissa, exponent, sign, r;
    
        mantissa = (i & 0x07fffff) | 0x800000;
        exponent = 150 - ((i >> 23) & 0xff);
        sign = i & 0x80000000;
    
        if (exponent < 0) {
            r = ((mantissa << -exponent) ^ -sign) + sign;    /* diff */
        } else {
            r = ((mantissa >> exponent) ^ -sign) + sign;     /* diff */
        }
    
        return r;                                            /* diff */
    }
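
    To confirm that all of these variants compute the same value, and differ only in how GCC compiles them, they can be compared on a few inputs. The following is a sketch under stated assumptions: float is IEEE 754 binary32, the functions are ported to uint32_t so that negation and addition wrap without undefined behaviour, and inputs are limited to 1 <= |f| < 2^31 so that every shift count stays in range. All names here (float_bits, trunc_one_u, trunc_two_u, trunc_reduced_u, variants_agree, check_variants) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as an integer (assumes IEEE 754 binary32). */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Unsigned port of fast_trunc_one: xor and add both happen after the branch. */
uint32_t trunc_one_u(uint32_t i) {
    uint32_t mantissa = (i & 0x007fffffu) | 0x00800000u;
    int exponent = 150 - (int)((i >> 23) & 0xffu);
    uint32_t sign = i & 0x80000000u;
    uint32_t r;
    if (exponent < 0) {
        r = mantissa << -exponent;
    } else {
        r = mantissa >> exponent;
    }
    return (r ^ (0u - sign)) + sign;
}

/* Unsigned port of fast_trunc_two: xor inside each branch, add after it. */
uint32_t trunc_two_u(uint32_t i) {
    uint32_t mantissa = (i & 0x007fffffu) | 0x00800000u;
    int exponent = 150 - (int)((i >> 23) & 0xffu);
    uint32_t sign = i & 0x80000000u;
    uint32_t r;
    if (exponent < 0) {
        r = (mantissa << -exponent) ^ (0u - sign);
    } else {
        r = (mantissa >> exponent) ^ (0u - sign);
    }
    return r + sign;
}

/* Unsigned port of the fully reduced form: no sign handling at all. */
uint32_t trunc_reduced_u(uint32_t i) {
    uint32_t mantissa = (i & 0x007fffffu) | 0x00800000u;
    int exponent = 150 - (int)((i >> 23) & 0xffu);
    return exponent < 0 ? mantissa << -exponent : mantissa >> exponent;
}

/* Returns 1 if all three variants agree on f (1 <= |f| < 2^31 assumed). */
int variants_agree(float f) {
    uint32_t i = float_bits(f);
    uint32_t expected = trunc_reduced_u(i);
    return trunc_one_u(i) == expected && trunc_two_u(i) == expected;
}

/* Hypothetical check over a handful of in-range sample values. */
void check_variants(void) {
    static const float samples[] = { 1.0f, 123.75f, -123.75f, 65536.5f, -1e9f };
    for (unsigned k = 0; k < sizeof samples / sizeof samples[0]; ++k) {
        assert(variants_agree(samples[k]));
    }
}
```

    On these inputs the three variants agree. Note that for a negative input such as -123.75f each of them yields 123, the truncated magnitude, which is consistent with the reduction above: once x + (y ^ x) collapses to y, the sign no longer affects the result.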
    