Left and right shift operators (<< and >>) are already available in C++. However, I couldn’t find out how I could perform circular shift or rotate operations.
How can operations like ‘Rotate Left’ and ‘Rotate Right’ be performed?
Rotating right twice here
Initial --> 1000 0011 0100 0010
should result in:
Final --> 1010 0000 1101 0000
An example would be helpful.
(editor’s note: Many common ways of expressing rotates in C suffer from undefined behaviour if the rotate count is zero, or compile to more than just a single rotate machine instruction. This question’s answer should document best practices.)
See also an earlier version of this answer on another rotate question with some more details about what asm gcc/clang produce for x86.
The most compiler-friendly way to express a rotate in C and C++ that avoids any Undefined Behaviour seems to be John Regehr’s implementation. I’ve adapted it to rotate by the width of the type (using fixed-width types like
uint32_t).Works for any unsigned integer type, not just
uint32_t, so you could make versions for other sizes.See also a C++11 template version with lots of safety checks (including a
static_assertthat the type width is a power of 2), which isn’t the case on some 24-bit DSPs or 36-bit mainframes, for example.I’d recommend only using the template as a back-end for wrappers with names that include the rotate width explicitly. Integer-promotion rules mean that
rotl_template(u16 & 0x11UL, 7)would do a 32 or 64-bit rotate, not 16 (depending on the width ofunsigned long). Evenuint16_t & uint16_tis promoted tosigned intby C++’s integer-promotion rules, except on platforms whereintis no wider thanuint16_t.On x86, this version inlines to a single
rol r32, cl(orrol r32, imm8) with compilers that grok it, because the compiler knows that x86 rotate and shift instructions mask the shift-count the same way the C source does.Compiler support for this UB-avoiding idiom on x86, for
uint32_t xandunsigned int nfor variable-count shifts:rororrolinstruction for variable counts.shld edi,edi,7which is slower and takes more bytes thanrol edi,7on some CPUs (especially AMD, but also some Intel), when BMI2 isn’t available forrorx eax,edi,25to save a MOV._rotl/_rotrintrinsics from<intrin.h>on x86 (including x86-64).gcc for ARM uses an
and r1, r1, #31for variable-count rotates, but still does the actual rotate with a single instruction:ror r0, r0, r1. So gcc doesn’t realize that rotate-counts are inherently modular. As the ARM docs say, "ROR with shift length,n, more than 32 is the same as ROR with shift lengthn-32". I think gcc gets confused here because left/right shifts on ARM saturate the count, so a shift by 32 or more will clear the register. (Unlike x86, where shifts mask the count the same as rotates). It probably decides it needs an AND instruction before recognizing the rotate idiom, because of how non-circular shifts work on that target.Current x86 compilers still use an extra instruction to mask a variable count for 8 and 16-bit rotates, probably for the same reason they don’t avoid the AND on ARM. This is a missed optimization, because performance doesn’t depend on the rotate count on any x86-64 CPU. (Masking of counts was introduced with 286 for performance reasons because it handled shifts iteratively, not with constant-latency like modern CPUs.)
BTW, prefer rotate-right for variable-count rotates, to avoid making the compiler do
32-nto implement a left rotate on architectures like ARM and MIPS that only provide a rotate-right. (This optimizes away with compile-time-constant counts.)Fun fact: ARM doesn’t really have dedicated shift/rotate instructions, it’s just MOV with the source operand going through the barrel-shifter in ROR mode:
mov r0, r0, ror r1. So a rotate can fold into a register-source operand for an EOR instruction or something.Make sure you use unsigned types for
nand the return value, or else it won’t be a rotate. (gcc for x86 targets does arithmetic right shifts, shifting in copies of the sign-bit rather than zeroes, leading to a problem when youORthe two shifted values together. Right-shifts of negative signed integers is implementation-defined behaviour in C.)Also, make sure the shift count is an unsigned type, because
(-n)&31with a signed type could be one’s complement or sign/magnitude, and not the same as the modular 2^n you get with unsigned or two’s complement. (See comments on Regehr’s blog post).unsigned intdoes well on every compiler I’ve looked at, for every width ofx. Some other types actually defeat the idiom-recognition for some compilers, so don’t just use the same type asx.Some compilers provide intrinsics for rotates, which is far better than inline-asm if the portable version doesn’t generate good code on the compiler you’re targeting. There aren’t cross-platform intrinsics for any compilers that I know of. These are some of the x86 options:
<immintrin.h>provides_rotland_rotl64intrinsics, and same for right shift. MSVC requires<intrin.h>, while gcc require<x86intrin.h>. An#ifdeftakes care of gcc vs. icc. Clang 9.0 also has it, but before that it doesn’t seem to provide them anywhere, except in MSVC compatibility mode with-fms-extensions -fms-compatibility -fms-compatibility-version=17.00. And the asm it emits for them sucks (extra masking and a CMOV)._rotr8and_rotr16.<x86intrin.h>also provides__rolb/__rorbfor 8-bit rotate left/right,__rolw/__rorw(16-bit),__rold/__rord(32-bit),__rolq/__rorq(64-bit, only defined for 64-bit targets). For narrow rotates, the implementation uses__builtin_ia32_rolhior...qi, but the 32 and 64-bit rotates are defined using shift/or (with no protection against UB, because the code inia32intrin.honly has to work on gcc for x86). GNU C appears not to have any cross-platform__builtin_rotatefunctions the way it does for__builtin_popcount(which expands to whatever’s optimal on the target platform, even if it’s not a single instruction). Most of the time you get good code from idiom-recognition.__builtin_rotateleft8,__builtin_rotateright8, and similar for other widths. See Clang’s language extensions documentation. There are also builtins / intrinsics for other bit-manipulation things like__builtin_bitreverse32(which can compile torbiton ARM / AArch64). You can test for them with__has_builtin.Presumably some non-x86 compilers have intrinsics, too, but let’s not expand this community-wiki answer to include them all. (Maybe do that in the existing answer about intrinsics).
(The old version of this answer suggested MSVC-specific inline asm (which only works for 32bit x86 code), or http://www.devx.com/tips/Tip/14043 for a C version. The comments are replying to that.)
Inline asm defeats many optimizations, especially MSVC-style because it forces inputs to be stored/reloaded. A carefully-written GNU C inline-asm rotate would allow the count to be an immediate operand for compile-time-constant shift counts, but it still couldn’t optimize away entirely if the value to be shifted is also a compile-time constant after inlining. https://gcc.gnu.org/wiki/DontUseInlineAsm.