I am writing assembly long addition in GAS inline assembly, template <std::size_t NumBits> void

Asked: June 15, 2026

I am writing a long (multi-word) addition in GAS inline assembly:

template <std::size_t NumBits>
void inline KA_add(vli<NumBits> & x, vli<NumBits> const& y);

If I specialize, I can write:

template <>
void inline KA_add<128>(vli<128> & x, vli<128> const& y){
     asm("addq  %2, %0; adcq  %3, %1;" :"+r"(x[0]),"+r"(x[1]):"g"(y[0]),"g"(y[1]):"cc");
}

Nice, it works. Now I try to generalize it, so that the template can be inlined and the compiler does the work for any length:

template <std::size_t NumBits>
void inline KA_add(vli<NumBits> & x, vli<NumBits> const& y){
    asm("addq  %1, %0;" :"+r"(x[0]):"g"(y[0]):"cc");
    for(int i(1); i < vli<NumBits>::numwords;++i)
        asm("adcq  %1, %0;" :"+r"(x[i]):"g"(y[i]):"cc");
};

Well, this does not work: I have no guarantee that the carry bit (CB) is propagated. It is not preserved between the first asm statement and the second one. That may be logical, because the loop increments i and thereby "deletes" the CB, I think. There should be a GAS constraint that preserves the CB across the two asm statements, but unfortunately I cannot find any such information.

Any ideas?

Thank you!

PS: I rewrote my function to strip out the C++ abstractions:

template <std::size_t NumBits>
inline void KA_add_test(boost::uint64_t* x, boost::uint64_t const* y){
    asm ("addq  %1, %0;" :"+r"(x[0]):"g"(y[0]):"cc");
    for(int i(1); i < vli<NumBits>::numwords;++i)
        asm ("adcq  %1, %0;" :"+r"(x[i]):"g"(y[i]):"cc");
};

The generated assembly (GCC, debug build) is:

#APP
    addq  %rdx, %rax;
#NO_APP
    movq    -24(%rbp), %rdx
    movq    %rax, (%rdx)
.LBB94:
    .loc 9 55 0
    movl    $1, -4(%rbp)
    jmp     .L323
.L324:
    .loc 9 56 0
    movl    -4(%rbp), %eax
    cltq
    salq    $3, %rax
    movq    %rax, %rdx
    addq    -24(%rbp), %rdx    # <-- breaks the carry bit
    movl    -4(%rbp), %eax
    cltq
    salq    $3, %rax
    addq    -32(%rbp), %rax
    movq    (%rax), %rcx
    movq    (%rdx), %rax
#APP
    adcq  %rcx, %rax;
#NO_APP

As the listing shows, the compiler emits additional addq instructions for the address computation between the two asm statements, and they destroy the propagation of the CB.

1 Answer

Editorial Team · Answer 1 · 2026-06-15T13:12:37+00:00

I see no way to tell the compiler explicitly that the loop code must be generated without instructions that affect the carry flag.

Such a loop can certainly be written by hand, though: use lea to advance the array addresses, dec to count the loop down, and test the zero flag for the end condition. Neither lea nor dec modifies the carry flag, so nothing in the loop except the actual array sum changes it.

You would have to write the whole loop manually, inside a single asm statement, like:

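boost::uint64_t tmp;                          // scratch register
boost::uint64_t const* src = &y[0];           // advanced by the loop
boost::uint64_t* dst = &x[0];                 // advanced by the loop
std::size_t n = vli<NumBits>::numwords;       // must be at least 1

__asm__ volatile(
    "clc\n\t"                        // start with the carry flag cleared
    "0:\n\t"
    "movq (%[src]), %[tmp]\n\t"      // load the next word of y
    "leaq 8(%[src]), %[src]\n\t"     // lea leaves the flags untouched
    "adcq %[tmp], (%[dst])\n\t"      // x[i] += y[i] + carry
    "leaq 8(%[dst]), %[dst]\n\t"
    "decq %[n]\n\t"                  // dec sets ZF but does not touch CF
    "jnz 0b"
    : [tmp] "=&r"(tmp), [src] "+r"(src), [dst] "+r"(dst), [n] "+r"(n)
    :
    : "cc", "memory");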
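To cross-check a hand-written loop like this, or as a fallback where inline assembly is not an option, the carry can also be kept in an ordinary variable so that the compiler tracks it. The following is only a minimal sketch: the function name add_words_portable is hypothetical and not part of the original post, and __builtin_add_overflow is a GCC/Clang builtin, so other compilers would need a substitute.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): adds y into x word by
// word and returns the final carry. The carry lives in a C++ variable, so
// no instruction emitted between iterations can lose it.
inline bool add_words_portable(std::uint64_t* x, std::uint64_t const* y,
                               std::size_t n)
{
    bool carry = false;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t partial;
        std::uint64_t sum;
        // GCC/Clang builtin: returns true if the unsigned addition wrapped.
        bool c1 = __builtin_add_overflow(x[i], y[i], &partial);
        bool c2 = __builtin_add_overflow(
            partial, static_cast<std::uint64_t>(carry), &sum);
        x[i] = sum;
        // At most one of the two additions can wrap for a given word.
        carry = c1 || c2;
    }
    return carry;
}
```

Whether the compiler turns this into an add/adc chain depends on the compiler version and optimization level, so it is best treated as a reference for correctness rather than as a guaranteed replacement for the asm loop.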

For hot code, though, a tight loop is not optimal: the instructions depend on one another, and there are significantly more instructions per iteration than in an inlined or unrolled adc sequence. A better sequence would be something like the following, with %rbp and %rsi holding the start addresses of the source and target arrays, respectively:

clc                     # start with the carry flag cleared
0:
lea  64(%rbp), %r13
lea  64(%rsi), %r14
movq   (%rbp), %rax
movq  8(%rbp), %rdx
adcq   (%rsi), %rax
movq 16(%rbp), %rcx
adcq  8(%rsi), %rdx
movq 24(%rbp), %r8
adcq 16(%rsi), %rcx
movq 32(%rbp), %r9
adcq 24(%rsi), %r8
movq 40(%rbp), %r10
adcq 32(%rsi), %r9
movq 48(%rbp), %r11
adcq 40(%rsi), %r10
movq 56(%rbp), %r12
adcq 48(%rsi), %r11
movq %rax,   (%rsi)
adcq 56(%rsi), %r12
movq %rdx,  8(%rsi)
movq %rcx, 16(%rsi)
movq %r8,  24(%rsi)
movq %r13, %rbp     # next src
movq %r9,  32(%rsi)
movq %r10, 40(%rsi)
movq %r11, 48(%rsi)
movq %r12, 56(%rsi)
movq %r14, %rsi     # next tgt
dec  %edi           # counter = word count / 8 (8 words per iteration)
jnz 0b              # loop again if not yet zero

and to loop only around such blocks. The advantage is that the loads are grouped together, and the loop counter and termination condition are handled only once per block of eight words.

Quite honestly, I would not try to make the general bit-width case particularly "neat", but rather special-case explicitly unrolled code for, say, bit widths that are powers of two. Perhaps add a flag or a constructor message to the non-optimized template instantiation that tells the user to "use a power of two"?
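One possible reading of this suggestion, sketched in portable C++ rather than assembly: restrict the type to power-of-two widths with a static_assert, and unroll the addition at compile time so that no run-time loop counter remains. All names here (wide_uint, adc_step, add_unrolled, add_pow2) are hypothetical stand-ins, not the vli type from the question, and the sketch assumes C++14 plus the GCC/Clang builtin __builtin_add_overflow. It also rejects other widths outright instead of falling back to a slower generic version.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical stand-in for the vli<NumBits> type in the question.
template <std::size_t NumBits>
struct wide_uint {
    static_assert(NumBits >= 64 && (NumBits & (NumBits - 1)) == 0,
                  "use a power-of-two bit width of at least 64");
    static constexpr std::size_t numwords = NumBits / 64;
    std::uint64_t w[numwords];
};

// One add-with-carry step; the carry is a variable, not the CPU flag.
inline void adc_step(std::uint64_t& x, std::uint64_t y, bool& carry)
{
    std::uint64_t partial;
    std::uint64_t sum;
    bool c1 = __builtin_add_overflow(x, y, &partial);
    bool c2 = __builtin_add_overflow(
        partial, static_cast<std::uint64_t>(carry), &sum);
    x = sum;
    carry = c1 || c2;
}

// Expands into numwords consecutive steps at compile time.
template <std::size_t NumBits, std::size_t... I>
inline bool add_unrolled(wide_uint<NumBits>& x, wide_uint<NumBits> const& y,
                         std::index_sequence<I...>)
{
    bool carry = false;
    // Elements of a braced initializer list are evaluated left to right,
    // so the steps run in word order, lowest word first.
    int expand[] = {(adc_step(x.w[I], y.w[I], carry), 0)...};
    (void)expand;
    return carry;
}

// Entry point: x += y, returns the carry out of the top word.
template <std::size_t NumBits>
inline bool add_pow2(wide_uint<NumBits>& x, wide_uint<NumBits> const& y)
{
    return add_unrolled(
        x, y, std::make_index_sequence<wide_uint<NumBits>::numwords>{});
}
```

With this layout, instantiating wide_uint<192> fails at compile time with the "use a power-of-two" message, which is a stricter variant of the hint proposed above.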
