Editorial Team
Asked: June 12, 2026

I was curious whether java.lang.Integer.rotateLeft gets optimized into a rotation instruction, so I wrote a benchmark for it. The results were inconclusive: it was much faster than two shifts but a bit slower than a single one. So I rewrote it in C++ and got about the same results. When compiling via g++ -S -Wall -O3, I can see the instruction in the generated assembly. My CPU is an Intel Core i5.

The benchmark is quite long and surely not the nicest piece of code, but I don’t think it’s broken. Or is it? According to the documentation, rotations take one cycle, just like shifts. Can anybody explain the results?

rotations:  6860
shift:      5100

The first two answers are wrong. Both gcc and Java’s JIT know the rotation instructions and use them. Concerning gcc, see the link above; concerning Java, see my Java benchmark and its results:

benchmark   ns linear runtime
   Rotate 3.48 ====================
NonRotate 5.05 ==============================
    Shift 2.16 ============
1 Answer

  1. Editorial Team
    Added an answer on June 12, 2026 at 1:03 pm

    I did not know that gcc and the Java JIT were capable of recognizing that a sequence of SHIFT and OR operations can be reduced to a ROTATE instruction. Very interesting.
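To make the shift-and-OR pattern concrete, here is a minimal Java sketch. The class and method names (RotateIdiom, rotlByShifts) are made up for this illustration; Integer.rotateLeft is the standard library method from the question. The snippet only checks that the two forms compute the same value. It cannot show whether a particular compiler or JIT version emits a single rotate instruction for the hand-written form.

```java
class RotateIdiom {
    // Hypothetical helper: rotate-left written as two shifts and an OR.
    // Java masks int shift counts to their low 5 bits, so -n acts like 32 - n
    // and the expression stays well defined for n == 0.
    static int rotlByShifts(int x, int n) {
        return (x << n) | (x >>> -n);
    }

    public static void main(String[] args) {
        int x = 0x80000001;
        // Both lines print the same hex value.
        System.out.println(Integer.toHexString(rotlByShifts(x, 1)));
        System.out.println(Integer.toHexString(Integer.rotateLeft(x, 1)));
    }
}
```

Because both forms give identical results, a compiler that recognizes the pattern is free to pick whichever machine instruction it considers cheaper.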

    The g++ compiler unrolls your loops and uses SHIFT immediate and ROTATE immediate instructions (since you shift and rotate by constant values).

    Here’s the six-instruction sequence that is repeated in the unrolled TimeShift loop:

    movq    %rax, %rbx
    salq    $13, %rbx
    leaq    (%rbp,%rbx), %rbx
    movq    %rdi, %rbp
    sarq    $27, %rbp
    xorq    %rbx, %rdx
    

    Here’s the six-instruction sequence that is repeated in the unrolled TimeRotate loop:

    movq    %rdx, %rbx
    rorq    $45, %rbx
    leaq    (%rbp,%rbx), %rbx
    movq    %r8, %rbp
    rorq    $49, %rbp
    xorq    %rbx, %r9
    

    They differ mainly in the use of salq/sarq for SHIFT and rorq for ROTATE, so you are right to wonder why the timing differs.
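The rotate counts 45 and 49 in the listing above may look surprising. On a 64-bit register, a rotate right by r is the same operation as a rotate left by 64 - r, so rorq $45 and rorq $49 match left rotations by 19 and 15. Whether the original benchmark was written with left rotations is an assumption here, since its source is not shown. The sketch below only demonstrates the equivalence; the names RotateCounts and rorViaRol are invented for the example.

```java
class RotateCounts {
    // Hypothetical helper: express a 64-bit rotate right through rotate left.
    static long rorViaRol(long x, int r) {
        return Long.rotateLeft(x, 64 - r);
    }

    public static void main(String[] args) {
        long x = 0x0123456789ABCDEFL;
        // rorq $45 corresponds to a left rotation by 19.
        System.out.println(Long.rotateRight(x, 45) == rorViaRol(x, 45));
        // rorq $49 corresponds to a left rotation by 15.
        System.out.println(Long.rotateRight(x, 49) == Long.rotateLeft(x, 15));
    }
}
```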

    The answer lies deep in the microarchitecture of Sandy Bridge (your Core i5 processor) and can be found in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
    The latest version is Order Number 248966-026, April 2012.

    The SHIFT instruction has 1-cycle latency whether you use the by-1 opcode or the by-immediate form. It can dispatch from either Port 0 or Port 1 and therefore has a 0.5-cycle throughput: the processor can dispatch and retire two SHIFT immediate instructions per cycle.

    The ROTATE instruction needs three micro-ops if the resulting condition flags are needed (they aren’t in the code generated by gcc) and two micro-ops if not, so two micro-ops in your case. The ROTATE instruction, however, can only be dispatched from Port 1 and therefore has a 1-cycle throughput: the processor can dispatch and retire only one ROTATE immediate per cycle.
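The throughput figures above can be turned into rough arithmetic. The sketch below is a deliberately simplified model and not a measurement: it assumes a stream of fully independent instructions and ignores every other instruction in the loop. ThroughputBound and minCycles are hypothetical names chosen for this example.

```java
class ThroughputBound {
    // Lower bound on the cycles needed to issue `count` independent
    // instructions, given a reciprocal throughput in cycles per instruction.
    static double minCycles(long count, double reciprocalThroughput) {
        return count * reciprocalThroughput;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        // 0.5 cycles per SHIFT immediate versus 1 cycle per ROTATE immediate,
        // as quoted in the answer.
        System.out.println("shift:  " + minCycles(n, 0.5));
        System.out.println("rotate: " + minCycles(n, 1.0));
    }
}
```

In this idealized model the rotate stream takes twice as long. The measured ratio in the question (6860 / 5100, about 1.35) is smaller, which is plausible because each unrolled iteration also contains mov, lea and xor instructions that cost the same in both variants.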

    I’ve copied the relevant image and section below.

    3.5.1.5 Bitwise Rotation

    Bitwise rotation can choose between rotate with count specified in the CL register, an
    immediate constant and by 1 bit. Generally, The rotate by immediate and rotate by
    register instructions are slower than rotate by 1 bit. The rotate by 1 instruction has
    the same latency as a shift.
    Assembly/Compiler Coding Rule 35. (ML impact, L generality) Avoid ROTATE
    by register or ROTATE by immediate instructions. If possible, replace with a
    ROTATE by 1 instruction.

    In Intel microarchitecture code name Sandy Bridge, ROL/ROR by immediate has 1-cycle
    throughput, SHLD/SHRD using the same register as source and destination by
    an immediate constant has 1-cycle latency with 0.5 cycle throughput. The “ROL/ROR
    reg, imm8” instruction has two micro-ops with the latency of 1-cycle for the rotate
    register result and 2-cycles for the flags, if used.

    In Intel microarchitecture code name Ivy Bridge, The “ROL/ROR reg, imm8” instruction
    with immediate greater than 1, is one micro-op with one-cycle latency when the
    overflow flag result is used. When the immediate is one, dependency on the overflow
    flag result of ROL/ROR by a subsequent instruction will see the ROL/ROR instruction
    with two-cycle latency.

    2.4.4.2 Execution Units and Issue Ports

    At each cycle, the core may dispatch µops to one or more of four issue ports. At the
    microarchitecture level, store operations are further divided into two parts: store
    data and store address operations. The four ports through which µops are dispatched
    to execution units and to load and store operations are shown in Figure 2-6. Some
    ports can dispatch two µops per clock. Those execution units are marked Double
    Speed.

    Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point
    move µop (a floating-point stack move, floating-point exchange or floating-point
    store data) or one arithmetic logical unit (ALU) µop (arithmetic, logic, branch or store
    data). In the second half of the cycle, it can dispatch one similar ALU µop.

    Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point
    execution (all floating-point operations except moves, all SIMD operations) µop or
    one normal-speed integer (multiply, shift and rotate) µop or one ALU (arithmetic)
    µop. In the second half of the cycle, it can dispatch one similar ALU µop.

    Port 2. This port supports the dispatch of one load operation per cycle.

    Port 3. This port supports the dispatch of one store address operation per cycle.

    The total issue bandwidth can range from zero to six µops per cycle. Each pipeline
    contains several execution units. The µops are dispatched to the pipeline that
    corresponds to the correct type of operation. For example, an integer arithmetic
    logic unit and the floating-point execution units (adder, multiplier, and divider)
    can share a pipeline.

    Figure 2-11. Execution Units and Ports in Out-Of-Order Core

© 2021 The Archive Base. All Rights Reserved