I was curious if java.lang.Integer.rotateLeft gets optimized by using a rotation instruction and wrote a benchmark for it. The results were inconclusive: It was much faster than two shifts but a bit slower than a single one. So I rewrote it in C++ and got about the same results. When compiling via g++ -S -Wall -O3 I can see the instruction in the generated assembler. My CPU is Intel Core i5.
The benchmark is quite long and surely not the nicest piece of code, but I don’t think it’s broken. Or is it? According to the documentation the rotations take one cycle, just like shifts. Can anybody explain the results?
rotations: 6860
shift: 5100
The first two answers are wrong. Both gcc and java’s JIT know the rotation instructions and use them. Concerning gcc see the link above, concerning java see my java benchmark and its results
benchmark ns linear runtime
Rotate 3.48 ====================
NonRotate 5.05 ==============================
Shift 2.16 ============
I did not know that gcc and the java jit were capable of recognizing that a sequence of SHIFT and OR operators can be reduced to a ROTATE instruction, very interesting.
The g++ compiler unrolls your loops and uses
SHIFT immediateandROTATE immediateinstructions (since you shift and rotate by constant values).Here’s the six instruction sequence that is repeated in the TimeShift loop unroll case:
Here’s the six instruction sequence that is repeated in the TimeRotate loop unroll case:
They differ mainly in the use of salq/sarq for
SHIFTand rorq forROTATEso you are correct in wondering why the timing differs.The answer lies deep in the micro-architecture of Sandy Bridge (your Core i5 processor) and is found in INTEL® 64 and IA-32 Processor Architectures Optimization Reference Manual
The latest is
Order Number: 248966-026 April 2012The
SHIFTinstruction has 1 cycle latency whether you use theby 1opcode orby immediate. It can dispatch from eitherPort 0orPort 1and for this reason has a 0.5 cycle throughput – the processor can dispatch and retire twoSHIFT immediateinstructions per cycle. TheROTATEinstruction needs three micro-ops if the results of the condition flags are needed (they aren’t in the code generated by gcc) and two micro-ops if not (so two micro-ops in your case). TheROTATEinstruction, however, can only be dispatched fromPort 1and therefore has a 1 cycle throughput – the processor can dispatch and retire only oneROTATE immediateper cycle.I’ve copied the relevant image and section below.
3.5.1.5 Bitwise Rotation
Bitwise rotation can choose between rotate with count specified in the CL register, an
immediate constant and by 1 bit. Generally, The rotate by immediate and rotate by
register instructions are slower than rotate by 1 bit. The rotate by 1 instruction has
the same latency as a shift.
Assembly/Compiler Coding Rule 35. (ML impact, L generality) Avoid ROTATE
by register or ROTATE by immediate instructions. If possible, replace with a
ROTATE by 1 instruction.
In Intel microarchitecture code name Sandy Bridge, ROL/ROR by immediate has 1-
cycle throughput, SHLD/SHRD using the same register as source and destination by
an immediate constant has 1-cycle latency with 0.5 cycle throughput. The “ROL/ROR
reg, imm8” instruction has two micro-ops with the latency of 1-cycle for the rotate
register result and 2-cycles for the flags, if used.
In Intel microarchitecture code name Ivy Bridge, The “ROL/ROR reg, imm8” instruction with immediate greater than 1, is one micro-op with one-cycle latency when the
overflow flag result is used. When the immediate is one, dependency on the overflow
flag result of ROL/ROR by a subsequent instruction will see the ROL/ROR instruction
with two-cycle latency.
2.4.4.2 Execution Units and Issue Ports
At each cycle, the core may dispatch µops to one or more of four issue ports. At the
microarchitecture level, store operations are further divided into two parts: store
data and store address operations. The four ports through which μops are dispatched
to execution units and to load and store operations are shown in Figure 2-6. Some
ports can dispatch two µops per clock. Those execution units are marked Double
Speed.
Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point
move µop (a floating-point stack move, floating-point exchange or floating-point
store data) or one arithmetic logical unit (ALU) µop (arithmetic, logic, branch or store
data). In the second half of the cycle, it can dispatch one similar ALU µop.
Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point
execution (all floating-point operations except moves, all SIMD operations) µop or
one normal-speed integer (multiply, shift and rotate) µop or one ALU (arithmetic)
µop. In the second half of the cycle, it can dispatch one similar ALU µop.
Port 2. This port supports the dispatch of one load operation per cycle.
Port 3. This port supports the dispatch of one store address operation per cycle.
The total issue bandwidth can range from zero to six µops per cycle. Each pipeline
contains several execution units. The µops are dispatched to the pipeline that corresponds to the correct type of operation. For example, an integer arithmetic logic unit
and the floating-point execution units (adder, multiplier, and divider) can share a
pipeline.