The Archive Base

Editorial Team
Asked: May 15, 2026

I’m working on converting a bit of code to SSE, and while I get the correct output, it turns out to be slower than the standard C++ code.

The bit of code that I need to do this for is:

float ox = p2x - (px * c - py * s)*m;
float oy = p2y - (px * s - py * c)*m;

What I’ve got for SSE code is:

void assemblycalc(vector4 &p, vector4 &sc, float &m, vector4 &xy)
{
    vector4 r;
    __m128 scale = _mm_set1_ps(m);

    __asm
    {
        mov     eax,    p           // load the address of p into a CPU register
        mov     ebx,    sc          // load the address of sc
        movups  xmm0,   [eax]       // move the vectors to SSE registers
        movups  xmm1,   [ebx]

        mulps   xmm0,   xmm1        // multiply the elements

        movaps  xmm2,   xmm0        // make a copy of the vector
        shufps  xmm2,   xmm0, 0x1B  // reverse the element order of the copy

        subps   xmm0,   xmm2        // subtract the elements

        mulps   xmm0,   scale       // multiply the vector by the scale

        mov     ecx,    xy          // load the address of xy into a CPU register
        movups  xmm3,   [ecx]       // move the vector to an SSE register

        subps   xmm3,   xmm0        // xmm3 = xmm3 - xmm0

        movups  [r],    xmm3        // save the return vector; elements 0 and 3 are used
    }
}

Since it’s very difficult to read the code, I’ll explain what I did:

load vector4,        xmm0 = [px, py, px, py]
multiply by vector4, xmm1 = [c, c, s, s]
------------------------- multiply -------------------------
result,              xmm0 = [px*c, py*c, px*s, py*s]

reuse result,   xmm0 = [px*c, py*c, px*s, py*s]
shuffle result, xmm2 = [py*s, px*s, py*c, px*c]
------------------------- subtract -------------------------
result,         xmm0 = [px*c - py*s, py*c - px*s, px*s - py*c, py*s - px*c]

reuse result,   xmm0  = [px*c - py*s, py*c - px*s, px*s - py*c, py*s - px*c]
load m vector4, scale = [m, m, m, m]
------------------------- multiply -------------------------
result,         xmm0  = [(px*c - py*s)*m, (py*c - px*s)*m, (px*s - py*c)*m, (py*s - px*c)*m]

load xy vector4, xmm3 = [p2x, p2x, p2y, p2y]
reuse,           xmm0 = [(px*c - py*s)*m, (py*c - px*s)*m, (px*s - py*c)*m, (py*s - px*c)*m]
------------------------- subtract -------------------------
result,          xmm3 = [p2x - (px*c - py*s)*m, p2x - (py*c - px*s)*m, p2y - (px*s - py*c)*m, p2y - (py*s - px*c)*m]

Then ox = xmm3[0] and oy = xmm3[3], so I essentially don’t use xmm3[1] or xmm3[2].

I apologize for the difficulty reading this, but I’m hoping someone might be able to provide some guidance, as the standard C++ code runs in 0.001444 ms and the SSE code runs in 0.00198 ms.

Let me know if there is anything I can do to further explain/clean this up a bit. The reason I’m trying to use SSE is because I run this calculation millions of times, and it is a part of what is slowing down my current code.

Thanks in advance for any help!
Brett

1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 1:27 am

    The usual way to do this sort of vectorization is to turn the problem “on its side”. Instead of computing a single value of ox and oy, you compute four ox values and four oy values simultaneously. This minimizes wasted computation and shuffles.

    In order to do this, you bundle up several x, y, p2x and p2y values into contiguous arrays (i.e. you might have an array of four values of x, an array of four values of y, etc). Then you can just do:

    movups  xmm0,  [x]
    movups  xmm1,  [y]
    movaps  xmm2,  xmm0
    mulps   xmm0,  [c]    // cx
    movaps  xmm3,  xmm1
    mulps   xmm1,  [s]    // sy
    mulps   xmm2,  [s]    // sx
    mulps   xmm3,  [c]    // cy
    subps   xmm0,  xmm1   // cx - sy
    subps   xmm2,  xmm3   // sx - cy
    mulps   xmm0,  scale  // (cx - sy)*m
    mulps   xmm2,  scale  // (sx - cy)*m
    movaps  xmm1,  [p2x]
    movaps  xmm3,  [p2y]
    subps   xmm1,  xmm0   // p2x - (cx - sy)*m
    subps   xmm3,  xmm2   // p2y - (sx - cy)*m
    movups  [ox],  xmm1
    movups  [oy],  xmm3
    

    Using this approach, we compute 4 results simultaneously in 18 instructions, vs. a single result in 13 instructions with your approach. We’re also not wasting any results.

    It could still be improved on; since you would have to rearrange data structures anyway to use this approach, you should align the arrays and use aligned loads and stores instead of unaligned ones. You should load c and s into registers and use them to process many vectors of x and y, instead of reloading them for each vector. For the best performance, two or more vectors’ worth of computation should be interleaved to make sure the processor has enough work to do and to prevent pipeline stalls.

    (On a side note: should one of the signs be different, e.g. sx + cy instead of sx - cy? That would give you a standard rotation matrix.)
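For reference, the standard counterclockwise rotation can be written next to the expressions from the question. The symbol θ is an assumption here, with c = cos θ and s = sin θ; the question does not say how c and s are obtained or whether its signs are intentional.

```latex
% Standard counterclockwise rotation by \theta, with c = \cos\theta, s = \sin\theta
\[
\begin{pmatrix} c & -s \\ s & \phantom{-}c \end{pmatrix}
\begin{pmatrix} p_x \\ p_y \end{pmatrix}
=
\begin{pmatrix} c\,p_x - s\,p_y \\ s\,p_x + c\,p_y \end{pmatrix}
\]

% Expressions used in the question
\[
\begin{pmatrix} c & -s \\ s & -c \end{pmatrix}
\begin{pmatrix} p_x \\ p_y \end{pmatrix}
=
\begin{pmatrix} c\,p_x - s\,p_y \\ s\,p_x - c\,p_y \end{pmatrix}
\]
```

The first rows agree; the second rows differ only in the sign of the c·py term. A clockwise rotation would instead use (c·px + s·py, c·py - s·px).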

    Edit

    Your comment on what hardware you’re doing your timings on pretty much clears everything up: “Pentium 4 HT, 2.79GHz”. That’s a very old microarchitecture, on which unaligned moves and shuffles are quite slow; you don’t have enough work in the pipeline to hide the latency of the arithmetic operations, and the reorder engine isn’t nearly as clever as it is on newer microarchitectures.

    I expect that your vector code would prove to be faster than the scalar code on i7, and probably on Core2 as well. On the other hand, doing four at a time, if you could, would be much faster still.

© 2021 The Archive Base. All Rights Reserved