The Archive Base

Editorial Team
Asked: May 11, 2026

I am using C++ and want to do alpha blending with the following code.

#define CLAMPTOBYTE(color) \
    if ((color) & (~255)) { \
        color = (BYTE)((-(color)) >> 31); \
    } else { \
        color = (BYTE)(color); \
    }
#define GET_BYTE(accessPixel, x, y, scanline, bpp) \
    ((BYTE*)((accessPixel) + (y) * (scanline) + (x) * (bpp))) 

    for (int y = top ; y < bottom; ++y)
    {
        BYTE* resultByte = GET_BYTE(resultBits, left, y, stride, bytepp);
        BYTE* srcByte = GET_BYTE(srcBits, left, y, stride, bytepp);
        BYTE* srcByteTop = GET_BYTE(srcBitsTop, left, y, stride, bytepp);
        BYTE* maskCurrent = GET_GREY(maskSrc, left, y, width);
        int alpha = 0;
        int red = 0;
        int green = 0;
        int blue = 0;
        for (int x = left; x < right; ++x)
        {
            alpha = *maskCurrent;
            red = (srcByteTop[R] * alpha + srcByte[R] * (255 - alpha)) / 255;
            green = (srcByteTop[G] * alpha + srcByte[G] * (255 - alpha)) / 255;
            blue = (srcByteTop[B] * alpha + srcByte[B] * (255 - alpha)) / 255;
            CLAMPTOBYTE(red);
            CLAMPTOBYTE(green);
            CLAMPTOBYTE(blue);
            resultByte[R] = red;
            resultByte[G] = green;
            resultByte[B] = blue;
            srcByte += bytepp;
            srcByteTop += bytepp;
            resultByte += bytepp;
            ++maskCurrent;
        }
    }

However, I find it is still slow: it takes about 40-60 ms to compose two 600 * 600 images.
Is there any way to improve the speed to less than 16 ms?

Can anybody help me speed up this code? Many thanks!


1 Answer

  1. Editorial Team
    Added an answer on May 11, 2026 at 10:49 pm

    Use SSE – start around page 131.

    The basic workflow

    1. Load 4 pixels from src (16 one-byte numbers): RGBA RGBA RGBA RGBA (streaming load)

    2. Load the 4 pixels you want to blend them with from srcByteTop: RGBx RGBx RGBx RGBx

    3. Do some swizzling so that the A term in 1 fills every slot, i.e.

      xxxA xxxB xxxC xxxD -> AAAA BBBB CCCC DDDD

      In my solution below I opted instead to re-use your existing "maskCurrent" array, but having alpha integrated into the "A" field of 1 will require fewer loads from memory and thus be faster. Swizzling in this case would probably be: AND with a mask to select A, B, C, D; shift right 8 and OR with the original; then shift right 16 and OR again.

    4. Subtract the above from a vector that is 255 in every slot, giving 255 - alpha

    5. Multiply 1 * 4 (source with 255 - alpha) and 2 * 3 (top with alpha).

      SSE2 has no 8-bit multiply, so widen the bytes to 16-bit lanes first, multiply those, and discard the bottom 8 bits of each product with a shift. Note that this divides by 256 rather than 255.

    6. Add the two products from step 5 together

    7. Store the result somewhere else (if possible) or on top of your destination (if you must)
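The AND/shift/OR swizzle from step 3 can be sketched roughly as below. This assumes each pixel is a little-endian 32-bit value with alpha in the top byte (bytes R, G, B, A in memory); the names broadcast_alpha and broadcast_alpha4 are made up for this illustration and are not part of the original code.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Broadcast the alpha byte of each 32-bit pixel into all four bytes of that
// pixel. Assumption: alpha is the highest byte of each 32-bit lane.
inline __m128i broadcast_alpha(__m128i pixels)
{
    const __m128i alphaOnly = _mm_set1_epi32(static_cast<int>(0xFF000000u));
    __m128i a = _mm_and_si128(pixels, alphaOnly);  // 0xAA000000 per lane
    a = _mm_or_si128(a, _mm_srli_epi32(a, 8));     // 0xAAAA0000 per lane
    a = _mm_or_si128(a, _mm_srli_epi32(a, 16));    // 0xAAAAAAAA per lane
    return a;
}

// Hypothetical helper: run the swizzle on 4 packed pixels in memory.
// Unaligned load/store are used so any buffer works.
inline void broadcast_alpha4(const std::uint32_t* in, std::uint32_t* out)
{
    assert(in != nullptr && out != nullptr);
    const __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), broadcast_alpha(px));
}
```

The result can then be used directly as the per-byte alpha vector in steps 4 and 5, without a separate mask array.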

    Here is a starting point for you:

        // Requires #include <emmintrin.h> (SSE2 intrinsics).
        // Define your image with __declspec(align(16)), i.e. char __declspec(align(16)) image[640*480],
        // so the first byte is aligned correctly for SIMD.
        // Stride must be a multiple of 16. This sketch also assumes bytepp == 4 and
        // that left and (right - left) are multiples of 4.
    
        const __m128i zero = _mm_setzero_si128();
        for (int y = top ; y < bottom; ++y)
        {
            BYTE* resultByte = GET_BYTE(resultBits, left, y, stride, bytepp);
            BYTE* srcByte = GET_BYTE(srcBits, left, y, stride, bytepp);
            BYTE* srcByteTop = GET_BYTE(srcBitsTop, left, y, stride, bytepp);
            BYTE* maskCurrent = GET_GREY(maskSrc, left, y, width);
            for (int x = left; x < right; x += 4)
            {
                // If you can't align, use _mm_loadu_si128()
                // Step 1
                __m128i src = _mm_load_si128(reinterpret_cast<const __m128i*>(srcByte));
                // Step 2
                __m128i srcTop = _mm_load_si128(reinterpret_cast<const __m128i*>(srcByteTop));
    
                // Step 3
                // Fill the 4 byte positions of the first pixel with maskCurrent[0], etc.
                // _mm_setr_epi8 takes its arguments in memory order (lowest byte first).
                // Could do better with shifts and so on, but this is clear
                __m128i mask = _mm_setr_epi8(maskCurrent[0],maskCurrent[0],maskCurrent[0],maskCurrent[0],
                                             maskCurrent[1],maskCurrent[1],maskCurrent[1],maskCurrent[1],
                                             maskCurrent[2],maskCurrent[2],maskCurrent[2],maskCurrent[2],
                                             maskCurrent[3],maskCurrent[3],maskCurrent[3],maskCurrent[3]);
    
                // Step 4: 255 - alpha in every byte
                __m128i maskInv = _mm_subs_epu8(_mm_set1_epi8((char)255), mask);
    
                // Steps 5-6: widen the bytes to 16-bit lanes, multiply and add.
                // The sum is at most 255 * 255, so it fits in an unsigned 16-bit lane.
                // (_mm_madd_epi16 on interleaved values is an alternative.)
                __m128i lo = _mm_add_epi16(
                    _mm_mullo_epi16(_mm_unpacklo_epi8(srcTop, zero), _mm_unpacklo_epi8(mask, zero)),
                    _mm_mullo_epi16(_mm_unpacklo_epi8(src, zero), _mm_unpacklo_epi8(maskInv, zero)));
                __m128i hi = _mm_add_epi16(
                    _mm_mullo_epi16(_mm_unpackhi_epi8(srcTop, zero), _mm_unpackhi_epi8(mask, zero)),
                    _mm_mullo_epi16(_mm_unpackhi_epi8(src, zero), _mm_unpackhi_epi8(maskInv, zero)));
    
                // Discard the bottom 8 bits (divide by 256, not 255, so a value can come
                // out one level lower than the scalar code) and pack back to bytes.
                // Unlike the scalar loop, this also blends the 4th byte of each pixel.
                __m128i result = _mm_packus_epi16(_mm_srli_epi16(lo, 8), _mm_srli_epi16(hi, 8));
    
                // Step 7 - store result.
                // Store aligned if output is aligned on a 16-byte boundary
                _mm_store_si128(reinterpret_cast<__m128i*>(resultByte), result);
                // Slow version if you can't guarantee alignment
                //_mm_storeu_si128(reinterpret_cast<__m128i*>(resultByte), result);
    
                // Move pointers forward 4 pixels
                srcByte += bytepp * 4;
                srcByteTop += bytepp * 4;
                resultByte += bytepp * 4;
                maskCurrent += 4;
            }
        }
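Discarding the bottom 8 bits of the products divides by 256, while the scalar code divides by 255. If the output has to match the scalar result exactly, one option is the identity floor(x / 255) = (x + 1 + (x >> 8)) >> 8, which holds for x up to 255 * 255. Below is a minimal sketch of a 4-pixel blend built on it, assuming 4 bytes per pixel and one mask byte per pixel. The names div255_epu16 and blend4_sse2 are hypothetical, and unaligned loads are used so it can be tried on any buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics

// Exact floor(x / 255) in eight unsigned 16-bit lanes, valid for x <= 65025.
inline __m128i div255_epu16(__m128i x)
{
    const __m128i t = _mm_add_epi16(_mm_add_epi16(x, _mm_set1_epi16(1)),
                                    _mm_srli_epi16(x, 8));
    return _mm_srli_epi16(t, 8);
}

// Blend 4 pixels (16 bytes): out = (top * a + src * (255 - a)) / 255 per byte.
// All four bytes of each pixel are blended with that pixel's mask value.
inline void blend4_sse2(const std::uint8_t* src, const std::uint8_t* top,
                        const std::uint8_t* mask, std::uint8_t* out)
{
    assert(src && top && mask && out);
    const __m128i zero = _mm_setzero_si128();
    const __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    const __m128i t = _mm_loadu_si128(reinterpret_cast<const __m128i*>(top));

    // Load 4 mask bytes and replicate each one 4 times:
    // a0 a1 a2 a3 -> a0 a0 a0 a0  a1 a1 a1 a1  a2 a2 a2 a2  a3 a3 a3 a3
    std::uint32_t packed;
    std::memcpy(&packed, mask, sizeof packed);
    __m128i a = _mm_cvtsi32_si128(static_cast<int>(packed));
    a = _mm_unpacklo_epi8(a, a);
    a = _mm_unpacklo_epi16(a, a);
    const __m128i ia = _mm_subs_epu8(_mm_set1_epi8(static_cast<char>(0xFF)), a);

    // Widen to 16-bit lanes, multiply and add; each sum is at most 255 * 255.
    const __m128i lo = _mm_add_epi16(
        _mm_mullo_epi16(_mm_unpacklo_epi8(t, zero), _mm_unpacklo_epi8(a, zero)),
        _mm_mullo_epi16(_mm_unpacklo_epi8(s, zero), _mm_unpacklo_epi8(ia, zero)));
    const __m128i hi = _mm_add_epi16(
        _mm_mullo_epi16(_mm_unpackhi_epi8(t, zero), _mm_unpackhi_epi8(a, zero)),
        _mm_mullo_epi16(_mm_unpackhi_epi8(s, zero), _mm_unpackhi_epi8(ia, zero)));

    const __m128i r = _mm_packus_epi16(div255_epu16(lo), div255_epu16(hi));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
}
```

Calling blend4_sse2 once per 4 pixels in the inner loop gives the same values as the original scalar formula, at the cost of a few extra adds and shifts per vector.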
    

    To find out which AMD processors will run this code (it currently uses SSE2 instructions), see Wikipedia’s List of AMD Turion microprocessors. You could also look at other lists of processors on Wikipedia, but my research shows that AMD CPUs from around 4 years ago all support at least SSE2.

    You should expect a good SSE2 implementation to run around 8-16 times faster than your current code. That is because we eliminate branches in the loop, process 4 pixels (or 12 channels) at once, and improve cache performance by using streaming instructions. As an alternative to SSE, you could probably make your existing code run much faster by eliminating the if checks you are using for saturation. Beyond that, I would need to run a profiler on your workload.
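On the saturation checks: the blended value is a weighted average of two bytes whose weights (alpha and 255 - alpha) sum to 255, so it always lies between the two inputs and can never leave the 0-255 range. Under that reasoning the clamp can simply be dropped. A rough scalar sketch follows; blend_channel and blend_row are hypothetical names, and the first three bytes of each pixel are assumed to be the colour channels.

```cpp
#include <cassert>
#include <cstdint>

// Blend one channel without any clamping. The result is bounded by
// max(src, top), so it always fits in a byte. Compilers typically turn the
// division by the constant 255 into a multiply and shift.
inline std::uint8_t blend_channel(std::uint8_t src, std::uint8_t top,
                                  std::uint8_t alpha)
{
    const unsigned sum = top * alpha + src * (255u - alpha);
    return static_cast<std::uint8_t>(sum / 255u);
}

// Hypothetical row loop: `count` pixels of `bytepp` bytes each, one mask byte
// per pixel. Only the first three bytes of each pixel are written.
inline void blend_row(const std::uint8_t* src, const std::uint8_t* top,
                      const std::uint8_t* mask, std::uint8_t* out,
                      int count, int bytepp)
{
    assert(bytepp >= 3);
    for (int x = 0; x < count; ++x) {
        const std::uint8_t a = mask[x];
        out[0] = blend_channel(src[0], top[0], a);
        out[1] = blend_channel(src[1], top[1], a);
        out[2] = blend_channel(src[2], top[2], a);
        src += bytepp;
        top += bytepp;
        out += bytepp;
    }
}
```

Whether this alone reaches the 16 ms target would need measuring on the real workload.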

    Of course, the best solution is to use hardware support (i.e., code your problem up in DirectX) and have it done on the video card.
