For a hobby project I’m working on, I need to emulate certain 64-bit integer operations on an x86 CPU, and it needs to be fast.
Currently, I’m doing this via MMX instructions, but that’s really a pain to work with, because I have to flush the fp register state all the time (and because most MMX instructions deal with signed integers, and I need unsigned behavior).
So I’m wondering if the SSE/optimization gurus here on SO can come up with a better implementation using SSE.
The operations I need are the following (quite specific) ones:
uint64_t X, Y;
X = 0;
X = 1;
X << 1;
X != Y;
X + 1;
X & 0x1 // get lsb
X | 0x1 // set lsb
X > Y;
Specifically, I don’t need general-purpose addition or shifting, for example, just add one and left-shift one. Really, just the exact operations shown here.
Except, of course, on x86, uint64_t is emulated by using two 32-bit scalars, which is slow (and, in my case, simply doesn’t work, because I need loads/stores to be atomic, which they won’t be when loading/storing two separate registers).
Hence, I need a SIMD solution.
Some of these operations are trivial, supported by SSE2 already. Others (!= and <) require a bit more work.
Suggestions?
SSE and SSE2 are fine. It’d take some persuasion to permit SSE3, and SSE4 is probably out of the question (a CPU which supports SSE4 is likely to run 64-bit anyway, so I don’t need these workarounds there).
SSE2 has direct support for some 64-bit integer operations:
Set both elements to 0:
Set both elements to 1:
Set/load the low 64 bits, zero-extending to __m128i
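A minimal sketch of these three operations, plus the matching 64-bit store. The `u64_*` wrapper names are made up for illustration; the intrinsics themselves are standard SSE2. Whether the resulting `movq` load/store is actually atomic depends on the hardware and on the alignment of the memory operand, so treat that part as something to verify for your target.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Both 64-bit elements = 0 (usually compiles to pxor). */
static inline __m128i u64_zero(void) {
    return _mm_setzero_si128();
}

/* Both 64-bit elements = 1. */
static inline __m128i u64_one(void) {
    return _mm_set1_epi64x(1);
}

/* Load 64 bits into the low element; the high element is zeroed (movq). */
static inline __m128i u64_load(const uint64_t *p) {
    return _mm_loadl_epi64((const __m128i *)p);
}

/* Store the low 64-bit element back to memory (movq). */
static inline void u64_store(uint64_t *p, __m128i v) {
    _mm_storel_epi64((__m128i *)p, v);
}
```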
Things based on _mm_set_epi32 can get compiled into a mess by some compilers, so _mm_loadl_epi64 appears to be the best bet across MSVC and ICC as well as gcc/clang, and should actually be safe for your requirement of atomic 64-bit loads in 32-bit mode. See it on the Godbolt compiler explorer.
Vertically add/subtract each 64-bit integer:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_arithmetic.htm#intref_sse2_integer_arithmetic
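As a rough sketch (the wrapper names are hypothetical), add and subtract map directly onto `_mm_add_epi64` and `_mm_sub_epi64`. The `X + 1` from the question is the same add with a vector of ones as the second operand.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* a + b in each 64-bit element (paddq); wraps modulo 2^64. */
static inline __m128i u64_add(__m128i a, __m128i b) {
    return _mm_add_epi64(a, b);
}

/* a - b in each 64-bit element (psubq); wraps modulo 2^64. */
static inline __m128i u64_sub(__m128i a, __m128i b) {
    return _mm_sub_epi64(a, b);
}

/* X + 1: there is no increment instruction, so add a constant 1. */
static inline __m128i u64_inc(__m128i x) {
    return _mm_add_epi64(x, _mm_set1_epi64x(1));
}
```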
Left Shift:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_int_shift.htm
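For the `X << 1` in the question, something along these lines should do (the function name is only illustrative). `_mm_slli_epi64` shifts each 64-bit element independently, so bits carry across the 32-bit boundary as expected.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* X << 1 in each 64-bit element (psllq with an immediate count). */
static inline __m128i u64_shl1(__m128i x) {
    return _mm_slli_epi64(x, 1);
}
```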
Bitwise operators:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_logical.htm
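A small sketch of the two bitwise operations from the question, `X & 0x1` and `X | 0x1`, with made-up wrapper names. The 128-bit logical instructions don't care about element width, so the only 64-bit-specific part is the constant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* X & 0x1 in each 64-bit element: isolates the least significant bit. */
static inline __m128i u64_get_lsb(__m128i x) {
    return _mm_and_si128(x, _mm_set1_epi64x(1));
}

/* X | 0x1 in each 64-bit element: sets the least significant bit. */
static inline __m128i u64_set_lsb(__m128i x) {
    return _mm_or_si128(x, _mm_set1_epi64x(1));
}
```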
SSE doesn’t have increments, so you’ll have to use a constant with 1.
Comparisons are harder since there’s no 64-bit support until SSE4.1 pcmpeqq and SSE4.2 pcmpgtq.
Here’s the one for equality:
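One way to sketch it with SSE2 only (the function name `u64_cmpeq_mask` is hypothetical): compare the 32-bit halves, then AND each half's result with its neighbour's, so a 64-bit element is all-ones only when both halves matched.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* 64-bit equality built from 32-bit compares. */
static inline __m128i u64_cmpeq_mask(__m128i a, __m128i b) {
    /* Each 32-bit element becomes all-ones where a and b match. */
    __m128i t = _mm_cmpeq_epi32(a, b);
    /* Swap the two 32-bit halves inside each 64-bit element. */
    __m128i swapped = _mm_shuffle_epi32(t, _MM_SHUFFLE(2, 3, 0, 1));
    /* Both halves must match for the 64-bit element to be equal. */
    return _mm_and_si128(t, swapped);
}
```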
This will set each 64-bit element to 0xffffffffffffffff (aka -1) if they are equal. If you want it as a 0 or 1 in an int, you can pull it out using _mm_cvtsi128_si32() and add 1 (which yields 0 for equal and 1 for not equal). (But sometimes you can do total -= cmp_result; instead of converting and adding.)
And Less-Than: (not fully tested)
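Here is a hedged sketch of an unsigned 64-bit less-than using only SSE2, since the question asks for unsigned behaviour. It is not necessarily the original code; the name `u64_cmplt_mask` is made up. The idea: flip the sign bit of every 32-bit half so the signed compare orders them as unsigned, then combine as "high half less, or high halves equal and low half less".

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Unsigned a < b per 64-bit element, built from 32-bit compares. */
static inline __m128i u64_cmplt_mask(__m128i a, __m128i b) {
    /* Flipping the top bit makes signed 32-bit compares act as unsigned. */
    const __m128i bias = _mm_set1_epi32((int)0x80000000u);
    __m128i ax = _mm_xor_si128(a, bias);
    __m128i bx = _mm_xor_si128(b, bias);

    __m128i lt = _mm_cmplt_epi32(ax, bx);  /* per-half a < b  */
    __m128i eq = _mm_cmpeq_epi32(ax, bx);  /* per-half a == b */

    /* Broadcast the high-half and low-half results across each element. */
    __m128i lt_hi = _mm_shuffle_epi32(lt, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i eq_hi = _mm_shuffle_epi32(eq, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i lt_lo = _mm_shuffle_epi32(lt, _MM_SHUFFLE(2, 2, 0, 0));

    /* a < b  <=>  hi(a) < hi(b)  ||  (hi(a) == hi(b) && lo(a) < lo(b)) */
    return _mm_or_si128(lt_hi, _mm_and_si128(eq_hi, lt_lo));
}
```

The `X > Y` from the question is then just `u64_cmplt_mask(Y, X)`.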
This will set each 64-bit element to 0xffffffffffffffff if the corresponding element in a is less than the one in b.
Here are versions of “equals” and “less-than” that return a bool. They return the result of the comparison for the bottom 64-bit integer.
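A self-contained sketch of what such bool-returning versions could look like, assuming an unsigned less-than is wanted. The names `u64_equals` and `u64_less` are hypothetical. Both compute the 64-bit mask and then read the low 32 bits with `_mm_cvtsi128_si32`, which is enough because every bit of a mask element is the same.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdbool.h>
#include <stdint.h>

/* True if the bottom 64-bit elements of a and b are equal. */
static inline bool u64_equals(__m128i a, __m128i b) {
    __m128i t = _mm_cmpeq_epi32(a, b);
    /* AND each 32-bit result with its neighbour within the 64-bit element. */
    t = _mm_and_si128(t, _mm_shuffle_epi32(t, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(t) != 0;
}

/* True if the bottom 64-bit element of a is (unsigned) less than that of b. */
static inline bool u64_less(__m128i a, __m128i b) {
    /* Bias by the sign bit so signed 32-bit compares order as unsigned. */
    const __m128i bias = _mm_set1_epi32((int)0x80000000u);
    __m128i ax = _mm_xor_si128(a, bias);
    __m128i bx = _mm_xor_si128(b, bias);
    __m128i lt = _mm_cmplt_epi32(ax, bx);
    __m128i eq = _mm_cmpeq_epi32(ax, bx);
    __m128i lt_hi = _mm_shuffle_epi32(lt, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i eq_hi = _mm_shuffle_epi32(eq, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i lt_lo = _mm_shuffle_epi32(lt, _MM_SHUFFLE(2, 2, 0, 0));
    __m128i m = _mm_or_si128(lt_hi, _mm_and_si128(eq_hi, lt_lo));
    return _mm_cvtsi128_si32(m) != 0;
}
```

With these, the question's `X != Y` is `!u64_equals(X, Y)` and `X > Y` is `u64_less(Y, X)`.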