The Archive Base

Editorial Team
Asked: June 1, 2026 (2026-06-01T17:42:59+00:00)

I am making a Julia set visualisation using SSE.
Here is my code.
The class and operators:

class vec4 {
    public:
        inline vec4(void) {}
        inline vec4(__m128 val) :v(val) {}

        __m128 v;

        inline void operator=(float *a) {v=_mm_load_ps(a);}
        inline vec4(float *a) {(*this)=a;} 
        inline vec4(float a) {(*this)=a;}

        inline void operator=(float a) {v=_mm_load1_ps(&a);}

};

inline vec4 operator+(const vec4 &a,const vec4 &b) { return _mm_add_ps(a.v,b.v); }
inline vec4 operator-(const vec4 &a,const vec4 &b) { return _mm_sub_ps(a.v,b.v); }
inline vec4 operator*(const vec4 &a,const vec4 &b) { return _mm_mul_ps(a.v,b.v); }
inline vec4 operator/(const vec4 &a,const vec4 &b) { return _mm_div_ps(a.v,b.v); }
inline vec4 operator++(const vec4 &a)
{
    __declspec(align(16)) float b[4]={1.0f,1.0f,1.0f,1.0f};
    vec4 B(b);
    return _mm_add_ps(a.v,B.v); 
}

The function itself:

vec4 TWO(2.0f);
vec4 FOUR(4.0f);
vec4 ZER(0.0f);

vec4 CR(cR);
vec4 CI(cI);

for (int i=0; i<320; i++) //H
{
    float *pr = (float*) _aligned_malloc(4 * sizeof(float), 16); //dynamic

    __declspec(align(16)) float pi=i*ratioY + startY;

    for (int j=0; j<420; j+=4) //W
    {

        pr[0]=j*ratioX + startX;
        for(int x=1;x<4;x++)
        {
            pr[x]=pr[x-1]+ratioX;
        }

        vec4 ZR(pr);
        vec4 ZI(pi);

        __declspec(align(16)) float color[4]={0.0f,0.0f,0.0f,0.0f};

        vec4 COLOR(color);
        vec4 COUNT(0.0f);

        __m128 MASK=ZER.v;

        int _count;
        enum {max_count=100};
        for (_count=0;_count<=max_count;_count++) 
        {

            vec4 tZR=ZR*ZR-ZI*ZI+CR;
            vec4 tZI=TWO*ZR*ZI+CI;
            vec4 LEN=tZR*tZR+tZI*tZI;

            __m128 MASKOLD=MASK;
            MASK=_mm_cmplt_ps(LEN.v,FOUR.v);

            ZR=_mm_or_ps(_mm_and_ps(MASK,tZR.v),_mm_andnot_ps(MASK,ZR.v));
            ZI=_mm_or_ps(_mm_and_ps(MASK,tZI.v),_mm_andnot_ps(MASK,ZI.v));

            __m128 CHECKNOTEQL=_mm_cmpneq_ps(MASK,MASKOLD);    
            COLOR=_mm_or_ps(_mm_and_ps(CHECKNOTEQL,COUNT.v),_mm_andnot_ps(CHECKNOTEQL,COLOR.v));

            COUNT=COUNT++;
            operations+=17;

            if (_mm_movemask_ps((LEN-FOUR).v)==0) break; 
        }
        _mm_store_ps(color,COLOR.v);

The SSE version needs 553k operations (mul, add, if) and takes ~320 ms to finish the task.
The regular function, on the other hand, takes 1428k operations but needs only ~90 ms to compute.
I used the VS2010 performance analyser, and it seems that all the maths operators are running really slowly.
What am I doing wrong?


1 Answer

  1. Editorial Team
    Added an answer on June 1, 2026 at 5:43 pm (2026-06-01T17:43:00+00:00)

    The problem is that the SSE intrinsics version performs far more memory operations than the non-SSE version. Using your vector class, I wrote this test:

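    #include <iostream>
    using namespace std;
    // vec4 and operator++ are the definitions from the question (MSVC, <xmmintrin.h>).

    int main (int argc, char *argv [])
    {
      vec4 a (static_cast <float> (argc)); // broadcast argc to all four lanes
      cout << "argc = " << argc << endl;
      a=++a; // the free operator++ returns a new vec4, so the result is assigned back
      // m128_f32 is the MSVC-specific float view of __m128
      cout << "a = (" << a.v.m128_f32 [0] << ", " << a.v.m128_f32 [1] << ", " << a.v.m128_f32 [2] << ", " << a.v.m128_f32 [3] << ")" << endl;
    }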
    

    which produced the following operations in a release build (I’ve edited this to show the SSE only):

    fild        dword ptr [ebp+8] // load argc into FPU
    fstp        dword ptr [esp+10h] // save argc as a float
    
    movss       xmm0,dword ptr [esp+10h] // load argc into SSE
    shufps      xmm0,xmm0,0 // copy argc to all values in SSE register
    movaps      xmmword ptr [esp+20h],xmm0 // save to stack frame
    
    fld1 // load 1 into FPU
    fst         dword ptr [esp+20h] 
    fst         dword ptr [esp+28h] 
    fst         dword ptr [esp+30h] 
    fstp        dword ptr [esp+38h] // create a (1,1,1,1) vector
    movaps      xmm0,xmmword ptr [esp+2Ch] // load above vector into SSE
    addps       xmm0,xmmword ptr [esp+1Ch] // add to vector a
    movaps      xmmword ptr [esp+38h],xmm0 // save back to a
    

    Note: the addresses are relative to ESP and there are a few pushes which explains the weird changes of offset for the same value.
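
The `fld1`/`fst` sequence in the listing above comes from the `float b[4]` array that `operator++` fills and then reloads. As a minimal sketch, assuming a hypothetical helper named `increment_lanes` (it is not part of the question's class), the constant can instead be expressed with `_mm_set1_ps`, which leaves the compiler free to materialise it however it chooses. Whether that actually removes the four stores depends on the compiler and build settings, so the generated code should be checked rather than assumed.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper, not part of the question's vec4 class:
// adds 1.0f to each lane using a constant built with _mm_set1_ps
// instead of a float[4] array that is written and then reloaded.
inline __m128 increment_lanes(__m128 v)
{
    const __m128 one = _mm_set1_ps(1.0f);
    return _mm_add_ps(v, one);
}

// Hypothetical helper: copies the four lanes out so they can be inspected.
// The unaligned store is used so that `out` needs no special alignment.
inline void store_lanes(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}
```

Calling `store_lanes(increment_lanes(x), out)` on a register holding (1, 2, 3, 4) in lanes 0 to 3 leaves (2, 3, 4, 5) in `out`.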

    Now, compare the vec4 test program to this version, which uses a plain float array:

    #include <iostream>
    using namespace std;

    int main (int argc, char *argv [])
    {
      float a[4];
      for (int i = 0 ; i < 4 ; ++i)
      {
        a [i] = static_cast <float> (argc + i);
      }
      cout << "argc = " << argc << endl;
      for (int i = 0 ; i < 4 ; ++i)
      {
        a [i] += 1.0f; // scalar increment, one element at a time
      }
      cout << "a = (" << a [0] << ", " << a [1] << ", " << a [2] << ", " << a [3] << ")" << endl;
    }
    

    The compiler generated this code for the above (again, edited, and with inconsistent offsets):

    fild        dword ptr [argc] // converting argc to floating point values
    fstp        dword ptr [esp+8] 
    fild        dword ptr [esp+4] // the argc+i is done in the integer unit
    fstp        dword ptr [esp+0Ch] 
    fild        dword ptr [esp+8] 
    fstp        dword ptr [esp+18h]
    fild        dword ptr [esp+10h]
    fstp        dword ptr [esp+24h] // array a now initialised
    
    fld         dword ptr [esp+8] // load a[0]
    fld1 // load 1 into FPU
    fadd        st(1),st // increment a[0]
    fxch        st(1)
    fstp        dword ptr [esp+14h] // save a[0]
    fld         dword ptr [esp+1Ch] // load a[1]
    fadd        st,st(1) // increment a[1]
    fstp        dword ptr [esp+24h] // save a[1]
    fld         dword ptr [esp+28h] // load a[2]
    fadd        st,st(1) // increment a[2]
    fstp        dword ptr [esp+28h]  // save a[2]
    fadd        dword ptr [esp+2Ch] // increment a[3]
    fstp        dword ptr [esp+2Ch] // save a[3]
    

    In terms of memory access, the increment requires:

    SSE                  FPU
    4xfloat write        1xfloat read
    1xsse read           1xfloat write
    1xsse read+add       1xfloat read
    1xsse write          1xfloat write
                         1xfloat read
                         1xfloat write
                         1xfloat read
                         1xfloat write
    
    total
    8 float reads        4 float reads
    8 float writes       4 float writes
    

    This shows that the SSE version uses twice the memory bandwidth of the FPU version, and memory bandwidth is a major bottleneck.
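
The totals in the table can be re-derived by counting every 128-bit SSE access as four floats. The sketch below does only that bookkeeping; the names `Traffic`, `sse_increment_traffic` and `fpu_increment_traffic` are made up for illustration, and the per-row counts are copied from the table rather than measured.

```cpp
// Memory traffic of one increment, counted in 32-bit floats.
struct Traffic {
    int float_reads;
    int float_writes;
};

const int kLanes = 4; // floats held by one 128-bit SSE register

// SSE column: four scalar writes build (1,1,1,1), followed by two
// 128-bit reads (one plain load, one folded into addps) and one
// 128-bit write of the result.
inline Traffic sse_increment_traffic()
{
    Traffic t = {0, 0};
    t.float_writes += 4;      // 4 x float write
    t.float_reads  += kLanes; // 1 x sse read
    t.float_reads  += kLanes; // 1 x sse read + add
    t.float_writes += kLanes; // 1 x sse write
    return t;
}

// FPU column: each of the four elements is read once and written once.
inline Traffic fpu_increment_traffic()
{
    Traffic t = {0, 0};
    for (int i = 0; i < 4; ++i) {
        t.float_reads  += 1;
        t.float_writes += 1;
    }
    return t;
}
```

Both functions reproduce the totals row: 8 reads and 8 writes for the SSE listing against 4 and 4 for the FPU listing.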

    If you want to seriously maximise the SSE performance, you need to write the whole algorithm in a single SSE assembler function so that you can eliminate as many of the memory reads and writes as possible. Using the intrinsics is not an ideal solution for optimisation.
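
To make the goal concrete, here is a minimal sketch of an inner loop whose working values are all `__m128` locals, with a single store after the loop. It is written with intrinsics rather than assembler, so it only illustrates the shape of such a loop; whether a given compiler keeps everything in registers has to be checked in the disassembly. The function names `julia4` and `julia1` are hypothetical, and the escape rule (count an iteration while |z|^2 < 4) is an assumption that is simpler than the colouring logic in the question.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical kernel: iterates z = z*z + c for four points at once and
// writes the per-lane iteration counts to out[0..3].
inline void julia4(__m128 zr, __m128 zi, float cr, float ci,
                   int max_count, float out[4])
{
    const __m128 CR   = _mm_set1_ps(cr);
    const __m128 CI   = _mm_set1_ps(ci);
    const __m128 FOUR = _mm_set1_ps(4.0f);
    const __m128 ONE  = _mm_set1_ps(1.0f);
    __m128 count = _mm_setzero_ps();

    for (int n = 0; n < max_count; ++n) {
        const __m128 zr2 = _mm_mul_ps(zr, zr);
        const __m128 zi2 = _mm_mul_ps(zi, zi);
        // Lanes with |z|^2 < 4 are still iterating (all-ones mask).
        const __m128 active = _mm_cmplt_ps(_mm_add_ps(zr2, zi2), FOUR);
        if (_mm_movemask_ps(active) == 0) break; // every lane has escaped
        // Count this iteration for active lanes only.
        count = _mm_add_ps(count, _mm_and_ps(active, ONE));
        const __m128 zrzi   = _mm_mul_ps(zr, zi);
        const __m128 new_zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), CR);
        const __m128 new_zi = _mm_add_ps(_mm_add_ps(zrzi, zrzi), CI);
        // Escaped lanes keep their old value so they stay escaped.
        zr = _mm_or_ps(_mm_and_ps(active, new_zr), _mm_andnot_ps(active, zr));
        zi = _mm_or_ps(_mm_and_ps(active, new_zi), _mm_andnot_ps(active, zi));
    }
    _mm_storeu_ps(out, count); // the only store in the kernel
}

// Scalar reference with the same escape rule, for checking a single point.
inline int julia1(float zr, float zi, float cr, float ci, int max_count)
{
    int count = 0;
    for (int n = 0; n < max_count; ++n) {
        const float zr2 = zr * zr;
        const float zi2 = zi * zi;
        if (!(zr2 + zi2 < 4.0f)) break;
        ++count;
        const float zrzi = zr * zi;
        zr = zr2 - zi2 + cr;
        zi = zrzi + zrzi + ci;
    }
    return count;
}
```

With c = 0 and `max_count` = 10, the starting points 0, 3, 1.5 and 1 on the real axis give counts of 10, 0, 1 and 10, and `julia1` agrees lane by lane. The constants are built once before the loop, and the blend with `_mm_and_ps`/`_mm_andnot_ps` follows the same masking idea as the question's code.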

© 2021 The Archive Base. All Rights Reserved