I am writing a long (multi-word) addition in GAS inline assembly:
template <std::size_t NumBits>
void inline KA_add(vli<NumBits> & x, vli<NumBits> const& y);
If I specialize, I can write:
template <>
void inline KA_add<128>(vli<128> & x, vli<128> const& y){
asm("addq %2, %0; adcq %3, %1;" :"+r"(x[0]),"+r"(x[1]):"g"(y[0]),"g"(y[1]):"cc");
}
Nice, it works. Now I try to generalize it, so that the template can still be inlined and the compiler does the work for any length:
template <std::size_t NumBits>
void inline KA_add(vli<NumBits> & x, vli<NumBits> const& y){
asm("addq %1, %0;" :"+r"(x[0]):"g"(y[0]):"cc");
for(int i(1); i < vli<NumBits>::numwords;++i)
asm("adcq %1, %0;" :"+r"(x[i]):"g"(y[i]):"cc");
};
Well, it does not work: I have no guarantee that the carry bit (CB) is propagated. It is not preserved between the first asm statement and the second one. That may be logical, because the loop increments i and so “deletes” the CB. I think there should be a GAS constraint that preserves the CB across the two asm statements. Unfortunately, I cannot find any such information.
Any ideas?
Thank you, Merci!
PS: I rewrote my function to remove the C++ ideology:
template <std::size_t NumBits>
inline void KA_add_test(boost::uint64_t* x, boost::uint64_t const* y){
asm ("addq %1, %0;" :"+r"(x[0]):"g"(y[0]):"cc");
for(int i(1); i < vli<NumBits>::numwords;++i)
asm ("adcq %1, %0;" :"+r"(x[i]):"g"(y[i]):"cc");
};
The generated asm is (GCC, debug mode):
APP
addq %rdx, %rax;
NO_APP
movq -24(%rbp), %rdx
movq %rax, (%rdx)
.LBB94:
.loc 9 55 0
movl $1, -4(%rbp)
jmp .L323
.L324:
.loc 9 56 0
movl -4(%rbp), %eax
cltq
salq $3, %rax
movq %rax, %rdx
addq -24(%rbp), %rdx <----------------- Break the carry bit
movl -4(%rbp), %eax
cltq
salq $3, %rax
addq -32(%rbp), %rax
movq (%rax), %rcx
movq (%rdx), %rax
APP
adcq %rcx, %rax;
NO_APP
As we can see, there is an additional addq, and it destroys the propagation of the CB.
I see no way to explicitly tell the compiler that the loop code must be created without instructions affecting the C flag. It’s surely possible to write such a loop, though: use lea to count the array addresses upwards, dec to count the loop downwards, and test Z for the end condition. That way, nothing in the loop except the actual array sum changes the C flag. You’d have to do it manually, writing the whole loop inside a single asm statement.
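A minimal sketch of such a hand-written loop, assuming x86-64 and GCC or Clang extended asm. The function name add_words, the word-count parameter n and the returned carry are hypothetical choices made for illustration; they are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: x += y over n 64-bit words, least significant word first.
// Returns the carry out of the most significant word (0 or 1). Requires n >= 1.
inline std::uint64_t add_words(std::uint64_t* x, std::uint64_t const* y, std::size_t n)
{
    assert(n >= 1);
    std::uint64_t tmp;    // scratch register holding the current word of y
    std::uint64_t carry;  // receives the final carry
    asm volatile(
        "clc\n\t"                      // start with the carry flag cleared
        "1:\n\t"
        "movq (%[y]), %[tmp]\n\t"      // mov does not modify any flag
        "adcq %[tmp], (%[x])\n\t"      // *x += *y + CF
        "leaq 8(%[x]), %[x]\n\t"       // advance both pointers with lea,
        "leaq 8(%[y]), %[y]\n\t"       // which does not modify any flag
        "decq %[n]\n\t"                // dec updates ZF but leaves CF untouched
        "jnz 1b\n\t"                   // loop until the counter reaches zero
        "sbbq %[carry], %[carry]\n\t"  // carry = -CF (0 or all ones)
        "negq %[carry]\n\t"            // carry = CF (0 or 1)
        : [x] "+r"(x), [y] "+r"(y), [n] "+r"(n),
          [tmp] "=&r"(tmp), [carry] "=&r"(carry)
        :
        : "cc", "memory");
    return carry;
}
```

The whole loop lives in one asm statement, so the compiler cannot place address arithmetic between two adcq instructions. The "memory" clobber tells it that the arrays are read and written behind its back.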
For hot code, a tight loop isn’t optimal though; for one, the instructions have dependencies, and there are significantly more instructions per iteration than in inlined / unrolled adc sequences. A better sequence would first load a block of several words and then run the add/adc chain over them (with %rbp and %rsi holding the start addresses of the source and target arrays), looping only around such blocks. The advantage would be that the loads are blocked, and you’d deal with the loop count / termination condition only once per block.
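One such block could look roughly like the following, assuming x86-64, GCC or Clang extended asm and a block of four words (256 bits). The name add_256 and the choice of named operands instead of fixed %rbp / %rsi registers are assumptions for this sketch.

```cpp
#include <cstdint>

// Hypothetical helper: x += y for exactly four 64-bit words, least significant first.
// The carry out of the last word is dropped.
inline void add_256(std::uint64_t* x, std::uint64_t const* y)
{
    std::uint64_t t0, t1, t2, t3;  // scratch registers for the words of y
    asm volatile(
        // All loads first; mov does not modify the flags.
        "movq   (%[y]), %[t0]\n\t"
        "movq  8(%[y]), %[t1]\n\t"
        "movq 16(%[y]), %[t2]\n\t"
        "movq 24(%[y]), %[t3]\n\t"
        // Then one uninterrupted add/adc chain, so the carry propagates.
        "addq %[t0],   (%[x])\n\t"
        "adcq %[t1],  8(%[x])\n\t"
        "adcq %[t2], 16(%[x])\n\t"
        "adcq %[t3], 24(%[x])\n\t"
        : [t0] "=&r"(t0), [t1] "=&r"(t1), [t2] "=&r"(t2), [t3] "=&r"(t3)
        : [x] "r"(x), [y] "r"(y)
        : "cc", "memory");
}
```

The asm is marked volatile because its register outputs are only scratch values; without it the compiler would be free to drop the statement. Chaining several blocks would additionally need the carry to survive the loop control between them, which this sketch does not attempt.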
Quite honestly, I would not try to make the general bit width particularly “neat”, but rather special-case explicitly unrolled code for, say, bit widths that are powers of two. Perhaps add a flag / constructor message to the non-optimized template instantiation telling the user “use a power of two”?
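As a rough illustration of that split, the sketch below pairs a plain C++ fallback for arbitrary widths with one explicitly unrolled specialization. It assumes x86-64 and GCC or Clang for the specialization; the name add_bits, the pointer-based signature and the static_assert are hypothetical, and the "+&r" / "rm" constraints are a conservative variation on the 128-bit code from the question.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical generic fallback: portable C++, carry tracked in a variable.
template <std::size_t NumBits>
inline void add_bits(std::uint64_t* x, std::uint64_t const* y)
{
    static_assert(NumBits % 64 == 0, "width must be a multiple of 64 bits");
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < NumBits / 64; ++i) {
        std::uint64_t const s = x[i] + y[i];      // sum modulo 2^64
        std::uint64_t const c1 = s < x[i];        // wrapped while adding y[i]
        x[i] = s + carry;
        carry = c1 | (x[i] < s);                  // or wrapped while adding carry
    }
}

// Explicitly unrolled special case for 128 bits.
template <>
inline void add_bits<128>(std::uint64_t* x, std::uint64_t const* y)
{
    asm("addq %2, %0\n\t"
        "adcq %3, %1"
        : "+&r"(x[0]), "+r"(x[1])   // early clobber: %0 is written before %3 is read
        : "rm"(y[0]), "rm"(y[1])
        : "cc");
}
```

With this layout, add_bits<128> picks the unrolled asm, while a width such as add_bits<192> silently falls back to the slower loop; a message in the generic version could point users at the optimized widths.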