I’m using a 128-bit integer counter in the innermost loops of my C++ code. (Irrelevant background: the actual application evaluates finite difference equations on a regular grid, which involves repeatedly incrementing large integers, and even 64 bits isn’t enough precision because small rounding errors accumulate enough to affect the answers.)
I’ve represented the integer as two 64-bit unsigned longs. I now need to increment those values by a 128-bit constant. This isn’t hard, but you have to manually catch the carry from the low word to the high word.
I have working code something like this:
inline void increment128(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;
    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry
    hiWord += hiAdd;
}
This is tight and simple code. It works.
Unfortunately this is about 20% of my runtime. The killer line is that loWord test. If I remove it, I obviously get the wrong answers but the runtime overhead drops from 20% to 4%! So that carry test is especially expensive!
My question: Does C++ expose the hardware carry flags, even as an extension to GCC?
It seems like the additions could be done without the test-and-add-carry line above if the compiled code used an add-with-carry instruction for the hiWord addition.
Is there a way to rewrite the test-and-add-carry line to get the compiler to use the intrinsic opcode?
Actually gcc will use the carry automatically if you write your code carefully…
Current GCC can optimize hiWord += (loWord < loAdd); into add/adc (x86’s add-with-carry). This optimization was introduced in GCC 5.3.
With uint64_t chunks in 64-bit mode: https://godbolt.org/z/S2kGRz
With uint32_t chunks: https://godbolt.org/z/9FC9vc
(editor’s note: Of course the hard part is writing a correct full-adder with carry in and carry out; that’s hard in C and GCC doesn’t know how to optimize any that I’ve seen.)
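As a minimal sketch of the branchless form described above, the function below replaces the `if` with an addition of the comparison result. The name `increment128_branchless` and the use of `std::uint64_t` are assumptions made for illustration; the constants are the ones from the question. Whether the compiler actually emits add/adc depends on the GCC version and flags, so it is worth checking the generated assembly.

```cpp
#include <cstdint>

// Branchless 128-bit increment (illustrative name, not from the original post).
// The comparison evaluates to 0 or 1, which is exactly the carry out of the
// low-word addition, so it can be added to the high word directly.
inline void increment128_branchless(std::uint64_t &hiWord, std::uint64_t &loWord)
{
    const std::uint64_t hiAdd = 0x0000062DE49B5241ULL;
    const std::uint64_t loAdd = 0x85DC198BCDD714BAULL;
    loWord += loAdd;              // wraps modulo 2^64 on overflow
    hiWord += (loWord < loAdd);   // carry: the sum wrapped iff it is now below loAdd
    hiWord += hiAdd;
}
```

Starting from hiWord = 0 and loWord = 0xFFFFFFFFFFFFFFFF, the low word wraps around and the high word ends up as hiAdd + 1.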
Also related: GCC’s integer-overflow builtins (https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html) can give you the carry-out from an unsigned addition, or detect signed overflow.
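A rough sketch of how those builtins could be applied to this problem, assuming GCC 5 or later (Clang also provides `__builtin_add_overflow`). The function name `increment128_builtin` is made up for illustration and the constants are taken from the question.

```cpp
#include <cstdint>

// 128-bit increment using GCC's overflow builtin (illustrative name).
// __builtin_add_overflow stores the wrapped sum through its third argument
// and returns true if the mathematically exact result did not fit.
inline void increment128_builtin(std::uint64_t &hiWord, std::uint64_t &loWord)
{
    const std::uint64_t hiAdd = 0x0000062DE49B5241ULL;
    const std::uint64_t loAdd = 0x85DC198BCDD714BAULL;
    const bool carry = __builtin_add_overflow(loWord, loAdd, &loWord);
    hiWord += hiAdd + carry;      // carry converts to 0 or 1
}
```

This expresses the carry-out directly rather than through a comparison, but it is compiler-specific and not standard C++.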
Older GCC, like GCC 4.5, will branch or setc on the carry-out from an add instead of using adc, and only used adc (add-with-carry) on the flag result from an add if you used __int128 (or uint64_t on a 32-bit target). See "Is there a 128 bit integer in gcc?": __int128 is available only on 64-bit targets, and has been supported since GCC 4.1.
I compiled this code with gcc -O2 -Wall -Werror -S:
This is the assembly for increment128_1:
…and this is the assembly for increment128_2:
Note the lack of conditional branches in the second version.
[edit]
Also, references are often bad for performance, because GCC has to worry about aliasing… It is often better to just pass things by value. Consider:
Assembly:
This is actually the tightest code of the three.
…OK so none of them actually used the carry automatically :-). But they do avoid the conditional branch, which I bet is the slow part (since the branch prediction logic will get it wrong half the time).
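The pass-by-value idea mentioned above might look something like the following sketch. The `U128` struct and the function name `increment128_by_value` are hypothetical and are not the code that was originally compiled; the point is only that the function works on local copies, so the compiler does not have to consider the two words aliasing each other.

```cpp
#include <cstdint>

// Hypothetical two-word container for a 128-bit unsigned value.
struct U128 {
    std::uint64_t hi;
    std::uint64_t lo;
};

// Takes and returns the value by copy instead of modifying it through references.
inline U128 increment128_by_value(U128 x)
{
    const std::uint64_t hiAdd = 0x0000062DE49B5241ULL;
    const std::uint64_t loAdd = 0x85DC198BCDD714BAULL;
    x.lo += loAdd;
    x.hi += hiAdd + (x.lo < loAdd);   // branchless carry, as 0 or 1
    return x;
}
```

A caller would write something like `counter = increment128_by_value(counter);` inside the loop.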
[edit 2]
And one more, which I stumbled across doing a little searching. Did you know GCC has built-in support for 128-bit integers?
The assembly for this one is about as good as it gets:
(Not sure where the push/pop of ebx came from, but this is still not bad.)
All of these are with GCC 4.5.2, by the way.
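For reference, a version built on GCC’s 128-bit integer type could be sketched as follows. This assumes a 64-bit target where `unsigned __int128` is available; the function name `increment128_int128` is illustrative, and the 128-bit constant is assembled from the two halves given in the question because there is no 128-bit integer literal syntax.

```cpp
#include <cstdint>

// 128-bit increment using the GCC extension type unsigned __int128
// (illustrative name; requires a 64-bit target).
inline void increment128_int128(unsigned __int128 &value)
{
    // Build the constant from its high and low 64-bit halves.
    const unsigned __int128 add =
        (static_cast<unsigned __int128>(0x0000062DE49B5241ULL) << 64)
        | 0x85DC198BCDD714BAULL;
    value += add;   // the compiler handles the carry between the halves
}
```

The high and low words can be read back with `static_cast<std::uint64_t>(value >> 64)` and `static_cast<std::uint64_t>(value)`.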