The Archive Base

Editorial Team
Asked: June 1, 2026 (2026-06-01T20:05:55+00:00)


I’m learning to use SIMD capabilities by re-writing my personal image processing library using vector intrinsics. One basic function is a simple “array +=,” i.e.

void arrayAdd(unsigned char* A, unsigned char* B, size_t n) {
    for(size_t i=0; i < n; i++) { B[i] += A[i]; }
}

For arbitrary array lengths, the obvious SIMD code (assuming both arrays are 16-byte aligned) is something like:

size_t i = 0;
__m128i xmm0, xmm1;
size_t n16 = n - (n % 16);  // largest multiple of 16 that is <= n
for (; i < n16; i+=16) {
    xmm0 = _mm_load_si128( (__m128i*) (A + i) );
    xmm1 = _mm_load_si128( (__m128i*) (B + i) );
    xmm1 = _mm_add_epi8( xmm0, xmm1 );
    _mm_store_si128( (__m128i*) (B + i), xmm1 );
}
for (; i < n; i++) { B[i] += A[i]; }

But is it possible to do all the additions with SIMD instructions? I thought of trying this:

__m128i mask = (0x100<<8*(n - n16))-1;
_mm_maskmoveu_si128( xmm1, mask, (__m128i*) (B + i) );

for the extra elements, but will that result in undefined behavior? The mask should guarantee no access is actually made past the array bounds (I think). The alternative is to do the extra elements first, but then the array needs to be aligned by n-n16, which doesn’t seem right.

Is there another, more optimal pattern for such vectorized loops?

1 Answer

  1. Editorial Team
    Added an answer on June 1, 2026 at 8:05 pm

    One option is to pad your arrays to a multiple of 16 bytes. Then you can do 128-bit load/add/store and simply ignore the results past the last element you care about.
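    As a rough sketch of the padding idea, assuming the buffers are allocated with `_mm_malloc` (provided by GCC, Clang, and MSVC alongside the SSE headers): the helpers `padded_size`, `alloc_padded`, and `array_add_padded` are hypothetical names introduced here for illustration, not part of the question's library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_malloc/_mm_free */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: round n up to the next multiple of 16. */
static size_t padded_size(size_t n) {
    return (n + 15) & ~(size_t)15;
}

/* Hypothetical helper: allocate a zero-filled, 16-byte-aligned buffer whose
   capacity is padded to a multiple of 16. Release it with _mm_free(). */
static unsigned char *alloc_padded(size_t n) {
    size_t cap = padded_size(n);
    if (cap == 0) cap = 16;               /* avoid a zero-size request */
    unsigned char *p = (unsigned char *)_mm_malloc(cap, 16);
    if (p) memset(p, 0, cap);
    return p;
}

/* B[i] += A[i] for i < n. Assumes both buffers are 16-byte aligned and have
   padded capacity (e.g. from alloc_padded). The padding bytes of B are
   modified as well; the caller simply ignores them. */
static void array_add_padded(const unsigned char *A, unsigned char *B, size_t n) {
    size_t end = padded_size(n);
    for (size_t i = 0; i < end; i += 16) {
        __m128i a = _mm_load_si128((const __m128i *)(A + i));
        __m128i b = _mm_load_si128((const __m128i *)(B + i));
        _mm_store_si128((__m128i *)(B + i), _mm_add_epi8(a, b));
    }
}
```

    The trade-off is that every image buffer has to be allocated through the padded allocator; the loop itself then needs no tail handling at all.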

    For large arrays though the overhead of the byte by byte “epilog” is going to be very small. Unrolling the loop may improve performance more, something like:

    // Assumes n32 = n - (n % 32) and xmm0..xmm3 declared as __m128i
    for (; i < n32; i+=32) {
        xmm0 = _mm_load_si128( (__m128i*) (A + i) );
        xmm1 = _mm_load_si128( (__m128i*) (B + i) );
        xmm2 = _mm_load_si128( (__m128i*) (A + i + 16) );
        xmm3 = _mm_load_si128( (__m128i*) (B + i + 16) );
        xmm1 = _mm_add_epi8( xmm0, xmm1 );
        xmm3 = _mm_add_epi8( xmm2, xmm3 );
        _mm_store_si128( (__m128i*) (B + i), xmm1 );
        _mm_store_si128( (__m128i*) (B + i + 16), xmm3 );
    }
    // Do another 128-bit load/add/store here if at least 16 bytes remain
    

    But it’s hard to say without doing some profiling.

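    You could also do an unaligned load/store at the end (assuming you have at least 16 bytes), though this will probably not make a big difference. E.g. if you have 20 bytes, you do one load/add/store at offset 0 and another unaligned load/add/store (_mm_loadu_si128, _mm_storeu_si128) at offset 4. Because the addition is done in place, bytes 4 to 15 are covered by both vectors, so load the final 16 bytes of B before the first store modifies them; otherwise A is added to those bytes twice.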
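    One way this overlapping tail might look, as a sketch only: the function name `array_add_overlap` is made up for illustration, A and B are assumed not to overlap each other, and unaligned loads/stores are used throughout so that no alignment assumption is needed. The tail sum is computed before the main loop touches B and stored last, which keeps the overlapping bytes from being added twice.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* B[i] += A[i] for i < n, finishing with one overlapping unaligned vector. */
static void array_add_overlap(const unsigned char *A, unsigned char *B, size_t n) {
    if (n < 16) {                          /* too short for even one vector */
        for (size_t i = 0; i < n; i++) B[i] += A[i];
        return;
    }

    /* Sum of the last 16 bytes, computed while B is still unmodified. */
    size_t tail = n - 16;
    __m128i ta = _mm_loadu_si128((const __m128i *)(A + tail));
    __m128i tb = _mm_loadu_si128((const __m128i *)(B + tail));
    __m128i tsum = _mm_add_epi8(ta, tb);

    /* Whole 16-byte blocks. */
    size_t n16 = n - (n % 16);
    for (size_t i = 0; i < n16; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(B + i));
        _mm_storeu_si128((__m128i *)(B + i), _mm_add_epi8(a, b));
    }

    /* Bytes already written by the loop are rewritten with the same value;
       the remaining n % 16 bytes receive their sum here. */
    _mm_storeu_si128((__m128i *)(B + tail), tsum);
}
```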

    You could use _mm_maskmoveu_si128 but you need to get the mask into an xmm register, and your sample code isn’t going to work. You probably want to set the mask register to all FF’s and then use a shift to align it. At the end of the day, it will probably be slower than the unaligned load/add/store.

    This would be something like:

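    mask = _mm_cmpeq_epi8(mask, mask); // Set to all FF's
    mask = _mm_srli_si128(mask, 16-(n%16)); // Align mask; the shift count must be a compile-time constant
    _mm_maskmoveu_si128(xmm, mask, (char*)(B + i)); // Write only the masked bytes of the result to B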
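    For a tail length known only at run time, a possible variant is sketched below. It swaps the shift for a byte-index comparison to build the mask, and it copies the tail bytes into small stack buffers so that nothing past the end of either array is read. The function name `array_add_masked_tail` is hypothetical, and whether this beats a plain scalar epilogue would need profiling.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* B[i] += A[i] for i < n; the final n % 16 bytes are written with a single
   masked store instead of a scalar loop. */
static void array_add_masked_tail(const unsigned char *A, unsigned char *B, size_t n) {
    size_t i = 0;
    size_t n16 = n - (n % 16);
    for (; i < n16; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(B + i));
        _mm_storeu_si128((__m128i *)(B + i), _mm_add_epi8(a, b));
    }

    size_t rem = n - n16;                  /* 0..15 bytes left over */
    if (rem == 0) return;

    /* Stage the tail in zero-padded buffers to avoid out-of-bounds reads. */
    unsigned char ta[16] = {0}, tb[16] = {0};
    memcpy(ta, A + i, rem);
    memcpy(tb, B + i, rem);
    __m128i sum = _mm_add_epi8(_mm_loadu_si128((const __m128i *)ta),
                               _mm_loadu_si128((const __m128i *)tb));

    /* Mask byte k is 0xFF when k < rem, 0x00 otherwise. */
    const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    __m128i mask = _mm_cmplt_epi8(idx, _mm_set1_epi8((char)rem));

    /* Only bytes whose mask byte has its high bit set are written. */
    _mm_maskmoveu_si128(sum, mask, (char *)(B + i));
    _mm_sfence();  /* the masked store is non-temporal; order it before later stores */
}
```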
    
© 2021 The Archive Base. All Rights Reserved