Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 3849564
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 19, 20262026-05-19T16:52:26+00:00 2026-05-19T16:52:26+00:00

My (simd) implementation takes varied amount of time, though it is run for fixed

  • 0

My (simd) implementation takes varied amount of time, though it is run for fixed input. The running time varies between say 100 million clock cycles to 120 million clock cycles. The program calls a function around 600 times, and the most expensive part of the function is in it memory is accessed ~2000 times. Thus, overall memory involvement in quite high in my program.

Is the variation in running time due to memory access patterns/initial memory contents?

I used valgrind to analyze profile my program. It shows each memory access takes about 8 instructions. Is this normal?

Following is the piece of code (function) that is called 600 times. Mulprev[32][20] is the array which is accessed most number of times.

j = 15;  
u3v = _mm_set_epi64x (0xF, 0xF);
while (j + 1)  
{

    l = j << 2;  
    for (i = 0; i < 20; i++)
    {
        val1v   = _mm_load_si128 ((__m128i *) &elm1v[i]);       
        uv  = _mm_and_si128 (_mm_srli_epi64 (val1v, l), u3v);
        u1  = _mm_extract_epi16 (uv, 0);
        u2  = _mm_extract_epi16 (uv, 4) + 16;

        for (ival = i, ival1 = i + 1, k = 0; k < 20; k += 2, ival += 2, ival1 += 2)
        {
            temp11v = _mm_load_si128 ((__m128i *) &mulprev[u1][k]); 
            temp12v = _mm_load_si128 ((__m128i *) &mulprev[u2][k]);

            val1v   = _mm_load_si128 ((__m128i *) &res[ival]);
            val2v   = _mm_load_si128 ((__m128i *) &res[ival1]); 

            bv  = _mm_xor_si128 (val1v, _mm_unpacklo_epi64 (temp11v, temp12v));
            av  = _mm_xor_si128 (val2v, _mm_unpackhi_epi64 (temp11v, temp12v));

            _mm_store_si128 ((__m128i *) &res[ival], bv);                                   
            _mm_store_si128 ((__m128i *) &res[ival1], av); 
        }
    }

    if (j == 0)
        break;
    val0v = _mm_setzero_si128 ();

    for (i = 0; i < 40; i++)
    {
        testv   = _mm_load_si128 ((__m128i *)  &res[i]);
        val1v   = _mm_srli_epi64 (testv, 60);
        val2v   = _mm_xor_si128  (val0v, _mm_slli_epi64 (testv, 4));
        _mm_store_si128 (&res[i], val2v);
        val0v   = val1v;
    }
    j--;
}       

I want to reduce the computation time of my program. Any suggestions?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-19T16:52:26+00:00Added an answer on May 19, 2026 at 4:52 pm

    You are performing almost no computation in between loads and stores, hence your execution time will most likely be dominated by the cost of I/O to/from cache/memory. Even worse, your data set appears to be relatively small. Probably the only way you can optimise this further is to improve the memory access pattern (make accesses sequential where possible, and ensure that cache lines are not wasted, etc) and/or combine these operations with other code which operates on the same data set before/after this routine (so that the cost of loads/stores in amortised somewhat).

    EDIT: note that I gave a very similar answer when you asked much the same question for an apparently earlier version of this routine: How to make the following code faster – you seem to have missed the point that your main performance problem here is memory access, not computation.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

If you are writing some SIMD code that will be run by another program,
This is specifically related to ARM Neon SIMD coding. I am using ARM Neon
Heya, I'm trying to use Mono's SIMD to handle coordinates(X,Y,Z) in my project, but
I am designing a series of Vector classes in C++ that support SSE(SIMD). The
We know that on NEON, the SIMD registers q0 ~ q7 are shared with
I'm making an vector/matrix library for Game which utilizes SIMD unit on iPhone (3GS
I am using an engine that allows SIMD code to be written, and it
Considering the fact that openmp uses simd model i.e. each instruction is executed by
I'm very new to SIMD/SSE and I'm trying to do some simple image filtering
Is it possible to get GHC to produce SIMD code for the various SSE

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.