The Archive Base


Editorial Team
Asked: May 19, 2026

My (simd) implementation takes varied amount of time, though it is run for fixed


My SIMD implementation takes a varying amount of time, even though it is run on fixed input. The running time varies between roughly 100 million and 120 million clock cycles. The program calls a function around 600 times, and in the most expensive part of that function memory is accessed about 2000 times. Thus, overall memory involvement is quite high in my program.

Is the variation in running time due to memory access patterns/initial memory contents?

I used valgrind to profile my program. It shows that each memory access takes about 8 instructions. Is this normal?

The following is the piece of code (a function) that is called 600 times. mulprev[32][20] is the array that is accessed most often.

j = 15;
u3v = _mm_set_epi64x (0xF, 0xF);        /* mask: low 4 bits of each 64-bit lane */
while (j + 1)
{
    l = j << 2;                         /* bit offset of the j-th 4-bit nibble */
    for (i = 0; i < 20; i++)
    {
        /* Extract nibble j from both 64-bit lanes of elm1v[i]. */
        val1v   = _mm_load_si128 ((__m128i *) &elm1v[i]);
        uv      = _mm_and_si128 (_mm_srli_epi64 (val1v, l), u3v);
        u1      = _mm_extract_epi16 (uv, 0);        /* low lane:  rows 0..15  */
        u2      = _mm_extract_epi16 (uv, 4) + 16;   /* high lane: rows 16..31 */

        /* XOR the two selected table rows into res[i] .. res[i + 19]. */
        for (ival = i, ival1 = i + 1, k = 0; k < 20; k += 2, ival += 2, ival1 += 2)
        {
            temp11v = _mm_load_si128 ((__m128i *) &mulprev[u1][k]);
            temp12v = _mm_load_si128 ((__m128i *) &mulprev[u2][k]);

            val1v   = _mm_load_si128 ((__m128i *) &res[ival]);
            val2v   = _mm_load_si128 ((__m128i *) &res[ival1]);

            bv      = _mm_xor_si128 (val1v, _mm_unpacklo_epi64 (temp11v, temp12v));
            av      = _mm_xor_si128 (val2v, _mm_unpackhi_epi64 (temp11v, temp12v));

            _mm_store_si128 ((__m128i *) &res[ival], bv);
            _mm_store_si128 ((__m128i *) &res[ival1], av);
        }
    }

    if (j == 0)
        break;

    /* Shift res left by 4 bits per 64-bit lane, carrying the top 4 bits
       of each element into the next one. */
    val0v = _mm_setzero_si128 ();
    for (i = 0; i < 40; i++)
    {
        testv   = _mm_load_si128 ((__m128i *) &res[i]);
        val1v   = _mm_srli_epi64 (testv, 60);
        val2v   = _mm_xor_si128 (val0v, _mm_slli_epi64 (testv, 4));
        _mm_store_si128 ((__m128i *) &res[i], val2v);
        val0v   = val1v;
    }
    j--;
}

I want to reduce the computation time of my program. Any suggestions?


1 Answer

  1. Editorial Team
    Added an answer on May 19, 2026 at 4:52 pm

    You are performing almost no computation between loads and stores, so your execution time will most likely be dominated by the cost of moving data to and from cache and memory. Even worse, your data set appears to be relatively small. Probably the only way to optimise this further is to improve the memory access pattern (make accesses sequential where possible, ensure that cache lines are not wasted, etc.) and/or combine these operations with other code that operates on the same data set before or after this routine, so that the cost of the loads and stores is amortised somewhat.
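To illustrate what combining operations on the same data can look like, here is a minimal sketch in plain C. It is a simplified scalar model, not the routine from the question: the names `xor_then_shift`, `xor_shift_fused` and `tab` are hypothetical, and it uses a single stream of 64-bit words rather than two interleaved lanes. The first function makes two passes over `res` (an XOR pass, then a 4-bit shift with carry), while the second does both steps while each word is still in a register. Whether the same fusion is possible in the actual routine depends on its data dependencies, since the XOR windows there overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Two passes: every word of res is loaded and stored twice. */
void xor_then_shift(uint64_t *res, const uint64_t *tab, size_t n)
{
    for (size_t i = 0; i < n; i++)
        res[i] ^= tab[i];

    uint64_t carry = 0;                 /* top 4 bits of the previous word */
    for (size_t i = 0; i < n; i++) {
        uint64_t w = res[i];
        res[i] = (w << 4) ^ carry;
        carry = w >> 60;
    }
}

/* Fused: one load and one store per word, same result. */
void xor_shift_fused(uint64_t *res, const uint64_t *tab, size_t n)
{
    uint64_t carry = 0;                 /* top 4 bits of the previous word */
    for (size_t i = 0; i < n; i++) {
        uint64_t w = res[i] ^ tab[i];   /* XOR while the word is in a register */
        res[i] = (w << 4) ^ carry;
        carry = w >> 60;
    }
}
```

Both functions produce identical contents in `res`; the fused version simply halves the number of loads and stores to `res`, which is the kind of saving that matters when memory traffic dominates.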

    EDIT: note that I gave a very similar answer when you asked much the same question about an apparently earlier version of this routine ("How to make the following code faster"). You seem to have missed the point that your main performance problem here is memory access, not computation.
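On the run-to-run variation mentioned in the question, one way to see how much of the spread is measurement noise is to time the routine many times and compare the minimum, median and maximum. The following is a minimal sketch under stated assumptions: `timing_summary`, `summarize`, `time_runs` and the callback `fn` are hypothetical names, and the timer is the C11 wall-clock call `timespec_get`, which could be swapped for a cycle counter on a given platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef struct { uint64_t min, median, max; } timing_summary;

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place and returns min, median and max. n must be > 0. */
timing_summary summarize(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_u64);
    timing_summary s = { samples[0], samples[n / 2], samples[n - 1] };
    return s;
}

/* Current wall-clock time in nanoseconds (C11 timespec_get). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Times `runs` calls of fn(arg); one extra warm-up call is not recorded. */
timing_summary time_runs(void (*fn)(void *), void *arg,
                         uint64_t *samples, size_t runs)
{
    fn(arg);                            /* warm-up call, result discarded */
    for (size_t r = 0; r < runs; r++) {
        uint64_t t0 = now_ns();
        fn(arg);
        samples[r] = now_ns() - t0;
    }
    return summarize(samples, runs);
}
```

If the minimum and median are close and only the maximum is far off, the spread is more likely to come from outside interference (interrupts, other processes, cold caches) than from the routine itself.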


© 2021 The Archive Base. All Rights Reserved