The Archive Base
Editorial Team
Asked: May 15, 2026

I have an inline assembler loop that cumulatively adds elements from an int32 data

I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the eight 64-bit MMX registers can together hold 16 int32s to calculate 16 different cumulative sums in parallel.

I would now like to convert this piece of code to MMX intrinsics, but I am afraid that I will suffer a performance penalty because one cannot explicitly instruct the compiler to use all 8 MMX registers to accumulate the 16 independent sums.

Can anybody comment on this and maybe propose a solution on how to convert the piece of code below to use intrinsics?

== inline assembler (only part within the loop) ==

paddd   mm0, [esi+edx+8*0]  ; add 1st & 2nd int32 elements
paddd   mm1, [esi+edx+8*1]  ; add 3rd & 4th int32 elements ...
paddd   mm2, [esi+edx+8*2]
paddd   mm3, [esi+edx+8*3]
paddd   mm4, [esi+edx+8*4]
paddd   mm5, [esi+edx+8*5]
paddd   mm6, [esi+edx+8*6]
paddd   mm7, [esi+edx+8*7]  ; add 15th & 16th int32 elements
  • esi points to the beginning of the data array
  • edx provides the offset in the data array for the current loop iteration
  • the data array is arranged such that the elements for the 16 independent sums are interleaved.
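
For reference, the interleaved layout described in the bullets above can be written as a plain scalar loop. This is only a sketch of what the assembler computes, not the original code: the function name `sumInterleavedScalar` and the `blocks` parameter (the number of 16-element groups) are assumptions made for illustration. Unsigned arithmetic is used so that overflow wraps the way `paddd` does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference for the interleaved layout: element i of the array
// belongs to running sum number (i % 16).
// NOTE: sumInterleavedScalar and `blocks` are hypothetical names.
void sumInterleavedScalar(const std::int32_t *data, std::size_t blocks,
                          std::uint32_t sums[16])
{
    assert(data != nullptr || blocks == 0);

    for (int k = 0; k < 16; ++k) {
        sums[k] = 0;
    }

    for (std::size_t b = 0; b < blocks; ++b) {
        for (int k = 0; k < 16; ++k) {
            // Unsigned addition wraps modulo 2^32, matching paddd and
            // avoiding signed-overflow undefined behaviour.
            sums[k] += static_cast<std::uint32_t>(data[16 * b + k]);
        }
    }
}
```

Any intrinsic-based version should produce the same 16 values, which makes this loop a convenient correctness check while experimenting.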
1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 6:05 am

    VS2010 does a decent optimization job on the equivalent code using intrinsics. In most cases, it compiles the intrinsic call:

    sum = _mm_add_pi32(sum, *(__m64 *) &intArray[i + offset]);
    

    into something like:

    movq  mm0, mmword ptr [eax+8*offset]
    paddd mm1, mm0
    

    This isn’t as concise as your paddd mm1, [esi+edx+8*offset], but it arguably comes pretty close. The execution time is likely dominated by the memory fetch.

    The catch is that VS seems to like adding MMX registers only to other MMX registers. The above scheme works only for the first 7 sums. The 8th sum requires that some register be saved temporarily to memory.

    Here’s a complete program and its corresponding compiled assembly (release build):

    #include <stdio.h>
    #include <stdlib.h>
    #include <xmmintrin.h>
    
    void addWithInterleavedIntrinsics(int *interleaved, int count)
    {
        // sum up the numbers
        __m64 sum0 = _mm_setzero_si64(), sum1 = _mm_setzero_si64(),
              sum2 = _mm_setzero_si64(), sum3 = _mm_setzero_si64(),
              sum4 = _mm_setzero_si64(), sum5 = _mm_setzero_si64(),
              sum6 = _mm_setzero_si64(), sum7 = _mm_setzero_si64();
    
        for (int i = 0; i < 16 * count; i += 16) {
            sum0 = _mm_add_pi32(sum0, *(__m64 *) &interleaved[i]);
            sum1 = _mm_add_pi32(sum1, *(__m64 *) &interleaved[i + 2]);
            sum2 = _mm_add_pi32(sum2, *(__m64 *) &interleaved[i + 4]);
            sum3 = _mm_add_pi32(sum3, *(__m64 *) &interleaved[i + 6]);
            sum4 = _mm_add_pi32(sum4, *(__m64 *) &interleaved[i + 8]);
            sum5 = _mm_add_pi32(sum5, *(__m64 *) &interleaved[i + 10]);
            sum6 = _mm_add_pi32(sum6, *(__m64 *) &interleaved[i + 12]);
            sum7 = _mm_add_pi32(sum7, *(__m64 *) &interleaved[i + 14]);
        }
    
        // reset the MMX/floating-point state
        _mm_empty();
    
        // write out the sums; we have to do something with the sums so that
        // the optimizer doesn't just decide to avoid computing them.
        printf("%.8x %.8x\n", ((int *) &sum0)[0], ((int *) &sum0)[1]);
        printf("%.8x %.8x\n", ((int *) &sum1)[0], ((int *) &sum1)[1]);
        printf("%.8x %.8x\n", ((int *) &sum2)[0], ((int *) &sum2)[1]);
        printf("%.8x %.8x\n", ((int *) &sum3)[0], ((int *) &sum3)[1]);
        printf("%.8x %.8x\n", ((int *) &sum4)[0], ((int *) &sum4)[1]);
        printf("%.8x %.8x\n", ((int *) &sum5)[0], ((int *) &sum5)[1]);
        printf("%.8x %.8x\n", ((int *) &sum6)[0], ((int *) &sum6)[1]);
        printf("%.8x %.8x\n", ((int *) &sum7)[0], ((int *) &sum7)[1]);
    }
    
    int main()
    {
        int count        = 10000;
        int *interleaved = new int[16 * count];
    
        // create some random numbers to add up
        // (note that on VS2010, RAND_MAX is just 32767)
        for (int i = 0; i < 16 * count; ++i) {
            interleaved[i] = rand();
        }
    
        addWithInterleavedIntrinsics(interleaved, count);

        // release the buffer allocated with new[]
        delete[] interleaved;
    }
    

    Here’s the generated assembly code for the inner portion of the sum loop (without its prolog and epilog). Note how most sums are kept efficiently in mm1-mm6. Contrast that with mm0, which carries the value to add to each sum, and with mm7: between them these two registers also have to serve the last two sums, which are reloaded from and written back to the stack ([esp+10h] and [esp+18h]) on every iteration. The 7-sum version of this program doesn’t seem to have the mm7 problem.

    012D1070  movq        mm7,mmword ptr [esp+18h]  
    012D1075  movq        mm0,mmword ptr [eax-10h]  
    012D1079  paddd       mm1,mm0  
    012D107C  movq        mm0,mmword ptr [eax-8]  
    012D1080  paddd       mm2,mm0  
    012D1083  movq        mm0,mmword ptr [eax]  
    012D1086  paddd       mm3,mm0  
    012D1089  movq        mm0,mmword ptr [eax+8]  
    012D108D  paddd       mm4,mm0  
    012D1090  movq        mm0,mmword ptr [eax+10h]  
    012D1094  paddd       mm5,mm0  
    012D1097  movq        mm0,mmword ptr [eax+18h]  
    012D109B  paddd       mm6,mm0  
    012D109E  movq        mm0,mmword ptr [eax+20h]  
    012D10A2  paddd       mm7,mm0  
    012D10A5  movq        mmword ptr [esp+18h],mm7  
    012D10AA  movq        mm0,mmword ptr [esp+10h]  
    012D10AF  movq        mm7,mmword ptr [eax+28h]  
    012D10B3  add         eax,40h  
    012D10B6  dec         ecx  
    012D10B7  paddd       mm0,mm7  
    012D10BA  movq        mmword ptr [esp+10h],mm0  
    012D10BF  jne         main+70h (12D1070h)  
    

    So what can you do?

    1. Profile the 7-sum and 8-sum intrinsic-based programs. Choose the one that executes quicker.

    2. Profile the version that adds just one MMX register at a time. It should still be able to take advantage of the fact that modern processors fetch 64 to 128 bytes into the cache at a time. It is not obvious that the 8-sum version would be faster than the 1-sum one. The 1-sum version fetches the exact same amount of memory, and does the exact same number of MMX additions. You will need to interleave the inputs accordingly though.

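    3. If your target hardware allows it, consider using SSE instructions. SSE itself has been available in Intel CPUs since the Pentium III in 1999, but it only widened floating-point operations; adding four 32-bit integers at a time in an XMM register requires SSE2, which arrived with the Pentium 4.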
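
As a rough illustration of the last option, the 16 interleaved sums fit into four 128-bit accumulators, so no register pressure arises. This is a minimal sketch under stated assumptions, not a drop-in replacement: the function name `sumInterleavedSse2` and the `blocks` parameter are hypothetical, SSE2 support is assumed, and unaligned loads (`_mm_loadu_si128`) are used because nothing is known about the alignment of the array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i, _mm_add_epi32, _mm_loadu_si128

// Sums 16 interleaved int32 streams using four XMM accumulators.
// NOTE: sumInterleavedSse2 and `blocks` (number of 16-element groups)
// are hypothetical names chosen for this sketch.
void sumInterleavedSse2(const std::int32_t *data, std::size_t blocks,
                        std::int32_t sums[16])
{
    assert(data != nullptr || blocks == 0);

    __m128i acc0 = _mm_setzero_si128(), acc1 = _mm_setzero_si128(),
            acc2 = _mm_setzero_si128(), acc3 = _mm_setzero_si128();

    for (std::size_t b = 0; b < blocks; ++b) {
        const __m128i *p =
            reinterpret_cast<const __m128i *>(data + 16 * b);

        // Each load brings in four consecutive int32 values; the adds
        // wrap modulo 2^32, like paddd on MMX registers.
        acc0 = _mm_add_epi32(acc0, _mm_loadu_si128(p + 0));
        acc1 = _mm_add_epi32(acc1, _mm_loadu_si128(p + 1));
        acc2 = _mm_add_epi32(acc2, _mm_loadu_si128(p + 2));
        acc3 = _mm_add_epi32(acc3, _mm_loadu_si128(p + 3));
    }

    // Write the 16 sums back in the same interleaved order.
    __m128i *out = reinterpret_cast<__m128i *>(sums);
    _mm_storeu_si128(out + 0, acc0);
    _mm_storeu_si128(out + 1, acc1);
    _mm_storeu_si128(out + 2, acc2);
    _mm_storeu_si128(out + 3, acc3);
}
```

Because XMM registers do not alias the x87 floating-point stack, this version needs no `_mm_empty()` call. Whether it actually beats the MMX loop is something to confirm by profiling, since the memory fetch is likely to dominate either way.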
