When viewing the assembly output of the following code (no optimizations, -O2 and -O3

Asked: May 25, 2026 (2026-05-25T15:39:47+00:00)

When viewing the assembly output of the following code (no optimizations, -O2 and -O3 produce very similar results):

#include <stdio.h>

int main(int argc, char **argv)
{
    volatile float f1 = 1.0f;
    volatile float f2 = 2.0f;

    if(f1 > f2)
    {
        puts("+");
    }
    else if(f1 < f2)
    {
        puts("-");
    }

    return 0;
}

GCC does something that I have a hard time following:

.LC2:
    .string "+"
.LC3:
    .string "-"
    .text
.globl main
    .type   main, @function
main:
.LFB2:
    pushq   %rbp
.LCFI0:
    movq    %rsp, %rbp
.LCFI1:
    subq    $32, %rsp
.LCFI2:
    movl    %edi, -20(%rbp)
    movq    %rsi, -32(%rbp)
    movl    $0x3f800000, %eax
    movl    %eax, -4(%rbp)
    movl    $0x40000000, %eax
    movl    %eax, -8(%rbp)
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm0, %xmm1
    jbe .L9
.L7:
    movl    $.LC2, %edi
    call    puts
    jmp .L4
.L9:
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm1, %xmm0
    jbe .L4
.L8:
    movl    $.LC3, %edi
    call    puts
.L4:
    movl    $0, %eax
    leave
    ret

Why does GCC move the float values into xmm0 and xmm1 twice and also run ucomiss twice?

Wouldn’t it be faster to do the following?

.LC2:
    .string "+"
.LC3:
    .string "-"
    .text
.globl main
    .type   main, @function
main:
.LFB2:
    pushq   %rbp
.LCFI0:
    movq    %rsp, %rbp
.LCFI1:
    subq    $32, %rsp
.LCFI2:
    movl    %edi, -20(%rbp)
    movq    %rsi, -32(%rbp)
    movl    $0x3f800000, %eax
    movl    %eax, -4(%rbp)
    movl    $0x40000000, %eax
    movl    %eax, -8(%rbp)
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm0, %xmm1
    jb  .L8 # jump if less than
    je  .L4 # jump if equal
.L7:
    movl    $.LC2, %edi
    call    puts
    jmp .L4
.L8:
    movl    $.LC3, %edi
    call    puts
.L4:
    movl    $0, %eax
    leave
    ret

I’m not at all a real assembly programmer, but it just seemed odd to me to have duplicate instructions running. Is there a problem with my version of the code?

Update

If you remove the volatile which I had originally and replace it with scanf(), you get the same results:

#include <stdio.h>

int main(int argc, char **argv)
{
    float f1;
    float f2;

    scanf("%f", &f1);
    scanf("%f", &f2);

    if(f1 > f2)
    {
        puts("+");
    }
    else if(f1 < f2)
    {
        puts("-");
    }

    return 0;
}

And the corresponding assembler:

.LCFI2:
    movl    %edi, -20(%rbp)
    movq    %rsi, -32(%rbp)
    leaq    -4(%rbp), %rsi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    scanf
    leaq    -8(%rbp), %rsi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    scanf
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm0, %xmm1
    jbe .L9
.L7:
    movl    $.LC1, %edi
    call    puts
    jmp .L4
.L9:
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm1, %xmm0
    jbe .L4
.L8:
    movl    $.LC2, %edi
    call    puts
.L4:
    movl    $0, %eax
    leave
    ret

Final Update

After reviewing some of the follow-up comments, it seems han (who commented under Jonathan Leffler’s post) nailed this problem. GCC skips the optimization not because it can’t do it, but because I hadn’t told it to. It all comes down to IEEE floating-point rules: to honor the strict conditions, GCC can’t simply do a jump-if-above or jump-if-below after the first UCOMISS, because it has to handle the special cases of floating-point numbers, NaN operands in particular, which compare as unordered. With han’s recommendation of the -ffast-math option (none of the -O0 to -O3 levels enable -ffast-math, as it can break some programs), GCC does exactly what I was looking for:

The following assembly was produced by GCC 4.3.2 with “gcc -S -O3 -ffast-math test.c”:

.LC0:
    .string "%f"
.LC1:
    .string "+"
.LC2:
    .string "-"
    .text
    .p2align 4,,15
.globl main
    .type   main, @function
main:
.LFB25:
    subq    $24, %rsp
.LCFI0:
    movl    $.LC0, %edi
    xorl    %eax, %eax
    leaq    20(%rsp), %rsi
    call    scanf
    leaq    16(%rsp), %rsi
    xorl    %eax, %eax
    movl    $.LC0, %edi
    call    scanf
    movss   20(%rsp), %xmm0
    comiss  16(%rsp), %xmm0
    ja  .L11
    jb  .L12
    xorl    %eax, %eax
    addq    $24, %rsp
    .p2align 4,,1
    .p2align 3
    ret
    .p2align 4,,10
    .p2align 3
.L12:
    movl    $.LC2, %edi
    call    puts
    xorl    %eax, %eax
    addq    $24, %rsp
    ret
    .p2align 4,,10
    .p2align 3
.L11:
    movl    $.LC1, %edi
    call    puts
    xorl    %eax, %eax
    addq    $24, %rsp
    ret

Notice that the two UCOMISS instructions are now replaced by a single COMISS, directly followed by JA (jump if above) and JB (jump if below). GCC is able to nail this optimization if you let it, using -ffast-math!

UCOMISS vs COMISS (http://www.softeng.rl.ac.uk/st/archive/SoftEng/SESP/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc315.htm): “The UCOMISS instruction differs from the COMISS instruction in that it signals an invalid SIMD floating-point exception only when a source operand is an SNaN. The COMISS instruction signals invalid if a source operand is either a QNaN or an SNaN.”

Thanks again everyone for the helpful discussion.

1 Answer

Editorial Team · Answer 1 · 2026-05-25T15:39:48+00:00

Here’s another reason:
If you take a close look at it, it’s NOT the same expression.

~~They are not complements of each other. Therefore, you have to do two comparisons anyway.~~ volatile will force the values to be reloaded from memory for each comparison.

EDIT: (see comments, I forgot you can do that with the flags)

To answer the new question:

Combining those two ucomiss instructions is not a completely obvious optimization from the compiler’s perspective.

In order to combine them, the compiler must:

1. Recognize that ucomiss %xmm0, %xmm1 is the “same” as ucomiss %xmm1, %xmm0.
2. Then it must do a common sub-expression elimination pass to pull it out.

All of this needs to be done after the compiler does instruction selection. And most of the optimization passes are done before instruction selection.

What worries me more is why f1 and f2 aren’t being kept in registers after you got rid of the volatiles. -O3 is really giving you this?
