The Archive Base

Editorial Team
Asked: May 12, 2026 (2026-05-12T18:55:53+00:00)

I’m trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.

The particular PowerPC processor I’m working on has the quirk that a shift-by-constant-immediate, like

    int ShiftByConstant( int x ) { return x << 3; }

is fast, single-op, and superscalar, whereas a shift-by-variable, like

    int ShiftByVar( int x, int y ) { return x << y; }

is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stops dead.

What I’d like to do is figure out which non-microcoded integer PPC ops the sraw decodes into and then issue them individually. This won’t help with the latency of the sraw itself — it’ll replace one op with six — but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.

I can’t seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won’t work because the branch penalty is even bigger than the microcode penalty, even for correctly-predicted branches.)

This needn’t be answered in assembly; I’m hoping to learn the algorithm rather than the particular code, so an answer in C or a high level language or even pseudo code would be perfectly helpful.

Edit: A couple of clarifications that I should add:

  1. I’m not even a little bit worried about portability
  2. PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function

    int isel(int a, int b, int c)  { return a >= 0 ? b : c; }
    

    (if you write out a ternary that does the same thing I’ll get what you mean)

  3. integer multiplication is also microcoded and even slower than sraw. 🙁
  4. On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24 cycle stall.
1 Answer

  1. Editorial Team
    Added an answer on May 12, 2026 at 6:55 pm (2026-05-12T18:55:53+00:00)

    Here you go…

    I decided to try these out as well, since Mike Acton recommends avoiding the indirect shift on his CellPerformance site and claims a replacement is faster than the CELL/PS3 microcoded shift. However, in all my tests the microcoded version was not only faster than a full generic branch-free replacement for the indirect shift, it also takes far less code memory (1 instruction).

    The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.
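
To illustrate why the element type matters, here is a minimal sketch of the two right-shift behaviours. The function names are hypothetical, and it assumes a compiler where right-shifting a negative signed value is arithmetic (the usual case, and guaranteed from C++20 on).

```cpp
#include <cstdint>

// Signed operand: the sign bit is replicated into the vacated high bits
// (arithmetic shift).
inline int32_t ShiftRightSigned(int32_t v) { return v >> 4; }

// Unsigned operand: zeros are shifted into the vacated high bits
// (logical shift).
inline uint32_t ShiftRightUnsigned(uint32_t v) { return v >> 4; }
```

The same bit pattern therefore gives different results depending on the type, which is what the template parameter T selects in the functions below.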

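    // FORCEINLINE is assumed to come from the platform headers;
    // fall back to plain inline when it is not defined.
    #ifndef FORCEINLINE
    #define FORCEINLINE inline
    #endif

    // Branch-free variable shift left. Each of the low 5 bits of nShift
    // selects whether the matching constant shift (1, 2, 4, 8, 16) is applied.
    template <typename T> FORCEINLINE T VariableShiftLeft(T nVal, int nShift)
    {   // Handles shifts of 0-31 (only the low 5 bits of nShift are used, so 32 wraps to 0)
        const int bMask1=-(1&nShift);        // all ones if bit 0 is set, else 0
        const int bMask2=-(1&(nShift>>1));   // bit 1
        const int bMask3=-(1&(nShift>>2));   // bit 2
        const int bMask4=-(1&(nShift>>3));   // bit 3
        const int bMask5=-(1&(nShift>>4));   // bit 4
        // Adding nVal to itself is a shift by 1; same result as
        // nVal=((nVal<<1)&bMask1) | (nVal&(~bMask1));
        nVal=(nVal&bMask1) + nVal;
        // For each remaining bit: keep the shifted value if the mask is set,
        // otherwise keep the unshifted value.
        nVal=((nVal<<(1<<1))&bMask2) | (nVal&(~bMask2));
        nVal=((nVal<<(1<<2))&bMask3) | (nVal&(~bMask3));
        nVal=((nVal<<(1<<3))&bMask4) | (nVal&(~bMask4));
        nVal=((nVal<<(1<<4))&bMask5) | (nVal&(~bMask5));
        return(nVal);
    }

    // Branch-free variable shift right. Arithmetic when T is signed,
    // logical when T is unsigned.
    template <typename T> FORCEINLINE T VariableShiftRight(T nVal, int nShift)
    {   // Handles shifts of 0-31 (only the low 5 bits of nShift are used, so 32 wraps to 0)
        const int bMask1=-(1&nShift);        // all ones if bit 0 is set, else 0
        const int bMask2=-(1&(nShift>>1));   // bit 1
        const int bMask3=-(1&(nShift>>2));   // bit 2
        const int bMask4=-(1&(nShift>>3));   // bit 3
        const int bMask5=-(1&(nShift>>4));   // bit 4
        nVal=((nVal>>1)&bMask1) | (nVal&(~bMask1));
        nVal=((nVal>>(1<<1))&bMask2) | (nVal&(~bMask2));
        nVal=((nVal>>(1<<2))&bMask3) | (nVal&(~bMask3));
        nVal=((nVal>>(1<<3))&bMask4) | (nVal&(~bMask4));
        nVal=((nVal>>(1<<4))&bMask5) | (nVal&(~bMask5));
        return(nVal);
    }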
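
For comparison, the same bit-by-bit decomposition can be sketched with the isel() described in the question instead of explicit masks. A shift by 13, for example, is a shift by 8, then 4, then 1, because 13 = 8 + 4 + 1. This is only an illustrative sketch and has not been timed on any hardware: ShiftLeftIsel is a hypothetical name, and isel here is a plain C++ stand-in (with an unsigned payload to avoid casts) for the real intrinsic.

```cpp
#include <cstdint>

// Stand-in for the PPC isel intrinsic from the question:
// returns b when a >= 0, otherwise c.
inline uint32_t isel(int32_t a, uint32_t b, uint32_t c) { return a >= 0 ? b : c; }

// Hypothetical helper: logical shift left by n (0..31) built only from
// constant shifts and isel.
inline uint32_t ShiftLeftIsel(uint32_t v, int n)
{
    const uint32_t u = static_cast<uint32_t>(n);
    // Move bit k of the count into the sign position with a constant shift
    // by (31 - k); the selector is negative exactly when that bit is set.
    const int32_t s0 = static_cast<int32_t>(u << 31);
    const int32_t s1 = static_cast<int32_t>(u << 30);
    const int32_t s2 = static_cast<int32_t>(u << 29);
    const int32_t s3 = static_cast<int32_t>(u << 28);
    const int32_t s4 = static_cast<int32_t>(u << 27);
    // isel returns its third argument when the selector is negative,
    // so the shifted value goes last.
    v = isel(s0, v, v << 1);
    v = isel(s1, v, v << 2);
    v = isel(s2, v, v << 4);
    v = isel(s3, v, v << 8);
    v = isel(s4, v, v << 16);
    return v;
}
```

The five selector computations do not depend on v, so in principle they can be issued alongside other work, while the five isel stages form a serial dependency chain.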
    

    EDIT: Note on isel()
    I saw your isel() code on your website.

    // if a >= 0, return x, else y
    int isel( int a, int x, int y )
    {
        int mask = a >> 31; // arithmetic shift right, splat out the sign bit
        // mask is 0xFFFFFFFF if (a < 0) and 0 otherwise.
        return x + ((y - x) & mask);
    }
    

    FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an ‘andc’ opcode. It’s the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:

    return (x & (~mask)) + (y & mask);
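
Written out in full, the two variants look like the sketch below. The names isel_sub and isel_andc are hypothetical and only there to tell the two apart, and the code assumes a 32-bit int with an arithmetic right shift on signed values.

```cpp
#include <cstdint>

// Original form: if a >= 0, return x, else y.
inline int32_t isel_sub(int32_t a, int32_t x, int32_t y)
{
    const int32_t mask = a >> 31;   // all ones when a < 0, zero otherwise
    return x + ((y - x) & mask);    // subtract, and, add: each waits on the previous
}

// Mask-and-complement form: same result, but the two AND terms are
// independent of each other.
inline int32_t isel_andc(int32_t a, int32_t x, int32_t y)
{
    const int32_t mask = a >> 31;   // all ones when a < 0, zero otherwise
    return (x & ~mask) + (y & mask);
}
```

Since exactly one of the two masked terms is ever non-zero, the final + could equally be written as |.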
    