The Archive Base


The Archive Base Latest Questions

Editorial Team
Asked: June 5, 2026


I am working with SSE intrinsics for the first time, and I am encountering a segmentation fault even after ensuring 16-byte memory alignment. This post is an extension of my earlier question:

How to allocate 16byte memory aligned data

This is how I have declared my array:

  float *V = (float*) memalign(16,dx*sizeof(float));

When I try to do this:

  __m128 v_i = _mm_load_ps(&V[i]); //It works

But when I do this:

  __m128 u1 = _mm_load_ps(&V[(i-1)]); //There is a segmentation fault

But if I do:

  __m128 u1 = _mm_loadu_ps(&V[(i-1)]); //It works again

However, I want to avoid _mm_loadu_ps and make this work using only _mm_load_ps.

I am working with the Intel icc compiler.

How do I resolve this issue?

UPDATE:

I currently use both load operations in the SSE version of the following code:

  void FDTD_base (float *V, float *U, int dx, float c0, float c1, float c2, float c3, float c4)
  {
      int i, j, k;
      for (i = 4; i < dx-4; i++)
      {
          U[i] = (c0 * (V[i]) //center
                  + c1 * (V[(i-1)] + V[(i+1)] )
                  + c2 * (V[(i-2)] + V[(i+2)] )
                  + c3 * (V[(i-3)] + V[(i+3)] )
                  + c4 * (V[(i-4)] + V[(i+4)] ));
      }
  }

SSE version:

  for (i = 4; i < dx-4; i += 4)
  {
      v_i = _mm_load_ps(&V[i]);
      __m128 center = _mm_mul_ps(v_i,c0_i);

      __m128 u1 = _mm_loadu_ps(&V[(i-1)]);
      u2 = _mm_loadu_ps(&V[(i+1)]);

      u3 = _mm_loadu_ps(&V[(i-2)]);
      u4 = _mm_loadu_ps(&V[(i+2)]);

      u5 = _mm_loadu_ps(&V[(i-3)]);
      u6 = _mm_loadu_ps(&V[(i+3)]);

      u7 = _mm_load_ps(&V[(i-4)]);
      u8 = _mm_load_ps(&V[(i+4)]);

      __m128 tmp1 = _mm_add_ps(u1,u2);
      __m128 tmp2 = _mm_add_ps(u3,u4);
      __m128 tmp3 = _mm_add_ps(u5,u6);
      __m128 tmp4 = _mm_add_ps(u7,u8);

      __m128 tmp5 = _mm_mul_ps(tmp1,c1_i);
      __m128 tmp6 = _mm_mul_ps(tmp2,c2_i);
      __m128 tmp7 = _mm_mul_ps(tmp3,c3_i);
      __m128 tmp8 = _mm_mul_ps(tmp4,c4_i);

      __m128 tmp9 = _mm_add_ps(tmp5,tmp6);
      __m128 tmp10 = _mm_add_ps(tmp7,tmp8);

      __m128 tmp11 = _mm_add_ps(tmp9,tmp10);
      __m128 tmp12 = _mm_add_ps(center,tmp11);

      _mm_store_ps(&U[i], tmp12);
  }

Is there a more efficient way of doing this using only _mm_load_ps()?


1 Answer

  1. Editorial Team
    Added an answer on June 5, 2026 at 9:49 pm

    Since sizeof(float) is 4, only every fourth entry in V will be properly aligned. Remember that _mm_load_ps loads four floats at a time. The argument, i.e. the pointer to the first float, needs to be aligned to 16 bytes. When &V[i] is aligned, &V[i-1] sits 4 bytes before a 16-byte boundary, so the aligned load faults there.

    I’m assuming that in your example i is a multiple of four, otherwise _mm_load_ps(&V[i]) would fail.
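The alignment rule above can be checked directly. The following is a minimal sketch; the helper names is_aligned16 and count_aligned16 are hypothetical and only serve the illustration. It tests whether an address is a multiple of 16 and counts how many elements of a float array start on such a boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary, which is what
   _mm_load_ps requires of its argument. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Counts the indices i in [0, n) for which &base[i] is 16-byte aligned.
   With a 16-byte aligned base and 4-byte floats, this is every fourth index. */
static size_t count_aligned16(const float *base, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (is_aligned16(&base[i]))
            count++;
    return count;
}
```

With a 16-byte aligned array, is_aligned16(&V[i]) holds for i = 0, 4, 8, ... and fails for i-1, i-2 and i-3, which matches the behaviour described in the question.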

    Update

    This is how I would suggest implementing the above sliding window example using aligned loads and shuffles:

    __m128 v_im1;
    __m128 v_i = _mm_load_ps( &V[0] );
    __m128 v_ip1 = _mm_load_ps( &V[4] );
    
    /* Same bounds as the loop in the question, so the load of V[i+4]
       stays inside V when dx is a multiple of 4. */
    for ( i = 4 ; i < dx - 4 ; i += 4 ) {
    
        /* Get the three vectors in this 'frame'. */
        v_im1 = v_i; v_i = v_ip1; v_ip1 = _mm_load_ps( &V[i+4] );
    
        /* Get the u1..u8 from the example code. Each selector field is
           2 bits wide: the two low fields index the first argument,
           the two high fields index the second. */
        __m128 u3 = _mm_shuffle_ps( v_im1 , v_i , 2 + (3<<2) + (0<<4) + (1<<6) );
        __m128 u4 = _mm_shuffle_ps( v_i , v_ip1 , 2 + (3<<2) + (0<<4) + (1<<6) );
    
        __m128 u1 = _mm_shuffle_ps( u3 , v_i , 1 + (2<<2) + (1<<4) + (2<<6) );
        __m128 u2 = _mm_shuffle_ps( v_i , u4 , 1 + (2<<2) + (1<<4) + (2<<6) );
    
        __m128 u5 = _mm_shuffle_ps( v_im1 , u3 , 1 + (2<<2) + (1<<4) + (2<<6) );
        __m128 u6 = _mm_shuffle_ps( u4 , v_ip1 , 1 + (2<<2) + (1<<4) + (2<<6) );
    
        __m128 u7 = v_im1;
        __m128 u8 = v_ip1;
    
        /* Do your computation and store. */
        ...
    
        }
    

    Note that this is a bit tricky since _mm_shuffle_ps can only take two values from each argument, which is why we first need to make u3 and u4 in order to re-use them for the other values with different overlaps.
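To make the selector layout concrete, here is a small sketch that builds a vector straddling two aligned vectors. The function name upper_a_lower_b is hypothetical. It relies on the _MM_SHUFFLE(z, y, x, w) macro from xmmintrin.h, where w and x index the first argument (result lanes 0 and 1) and y and z index the second argument (result lanes 2 and 3).

```c
#include <xmmintrin.h>

/* Returns { a[2], a[3], b[0], b[1] }: the upper half of a followed by
   the lower half of b. With a = V[i-4..i-1] and b = V[i..i+3] this is
   V[i-2..i+1], i.e. a window shifted by two floats. */
static __m128 upper_a_lower_b(__m128 a, __m128 b)
{
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 3, 2));
}
```

A shift by one or three floats needs three lanes from one source and one from the other, which a single _mm_shuffle_ps cannot deliver from the two aligned vectors alone. That is why the two-float shift is computed first and then combined with an aligned vector.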

    Note also that the values u1, u3, and u5 can be recovered from the previous iteration instead of being recomputed: u3 equals the previous u4, u1 equals the previous u6, and u5 equals the previous u2.

    Note, finally, that I have not verified the above code! Read the documentation for _mm_shuffle_ps and check that the third argument, the selector, is correct for each case.
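For anyone who wants to check the approach end to end, the following is one possible complete sketch, not the only way to write it. The function names fdtd_scalar and fdtd_sse_aligned and the coefficient array c are assumptions made for this example, as is the requirement that V and U are 16-byte aligned and dx is a multiple of 4 and at least 8. The selectors are written with _MM_SHUFFLE and reflect my reading of the lane order each window needs.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Scalar reference with the same arithmetic as the question's FDTD_base. */
static void fdtd_scalar(const float *V, float *U, int dx, const float c[5])
{
    for (int i = 4; i < dx - 4; i++)
        U[i] = c[0] * V[i]
             + c[1] * (V[i - 1] + V[i + 1])
             + c[2] * (V[i - 2] + V[i + 2])
             + c[3] * (V[i - 3] + V[i + 3])
             + c[4] * (V[i - 4] + V[i + 4]);
}

/* Aligned-load variant. Assumes (for this sketch) that V and U are
   16-byte aligned and that dx is a multiple of 4 and at least 8. */
static void fdtd_sse_aligned(const float *V, float *U, int dx, const float c[5])
{
    assert(((uintptr_t)V & 15u) == 0 && ((uintptr_t)U & 15u) == 0);
    assert(dx >= 8 && dx % 4 == 0);

    const __m128 c0 = _mm_set1_ps(c[0]);
    const __m128 c1 = _mm_set1_ps(c[1]);
    const __m128 c2 = _mm_set1_ps(c[2]);
    const __m128 c3 = _mm_set1_ps(c[3]);
    const __m128 c4 = _mm_set1_ps(c[4]);

    __m128 v_im1;
    __m128 v_i   = _mm_load_ps(&V[0]);
    __m128 v_ip1 = _mm_load_ps(&V[4]);

    for (int i = 4; i < dx - 4; i += 4) {
        /* Slide the frame: one new aligned load per iteration. */
        v_im1 = v_i;
        v_i   = v_ip1;
        v_ip1 = _mm_load_ps(&V[i + 4]);

        /* Windows shifted by two floats, built from halves of two vectors. */
        __m128 u3 = _mm_shuffle_ps(v_im1, v_i, _MM_SHUFFLE(1, 0, 3, 2)); /* V[i-2..i+1] */
        __m128 u4 = _mm_shuffle_ps(v_i, v_ip1, _MM_SHUFFLE(1, 0, 3, 2)); /* V[i+2..i+5] */

        /* Windows shifted by one or three floats, built from the above. */
        __m128 u1 = _mm_shuffle_ps(u3, v_i,   _MM_SHUFFLE(2, 1, 2, 1));  /* V[i-1..i+2] */
        __m128 u2 = _mm_shuffle_ps(v_i, u4,   _MM_SHUFFLE(2, 1, 2, 1));  /* V[i+1..i+4] */
        __m128 u5 = _mm_shuffle_ps(v_im1, u3, _MM_SHUFFLE(2, 1, 2, 1));  /* V[i-3..i]   */
        __m128 u6 = _mm_shuffle_ps(u4, v_ip1, _MM_SHUFFLE(2, 1, 2, 1));  /* V[i+3..i+6] */

        /* Accumulate in the same order as the scalar expression. */
        __m128 sum = _mm_mul_ps(v_i, c0);
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(u1, u2), c1));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(u3, u4), c2));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(u5, u6), c3));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(v_im1, v_ip1), c4));

        _mm_store_ps(&U[i], sum);
    }
}
```

Comparing the output of fdtd_sse_aligned against fdtd_scalar on a small aligned array is a quick way to confirm the selectors. Whether the shuffles actually beat _mm_loadu_ps depends on the CPU, so it is worth timing both variants before settling on one.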

© 2021 The Archive Base. All Rights Reserved