I’m very new to SIMD/SSE and I’m trying to do some simple image filtering

Asked: May 15, 2026 (2026-05-15T13:05:40+00:00)

I’m very new to SIMD/SSE and I’m trying to do some simple image filtering (blurring).
The code below filters each pixel of an 8-bit grayscale bitmap with a simple [1 2 1] weighting in the horizontal direction. I compute the sums for 16 pixels at a time.

What seems very bad about this code, at least to me, is that there is a lot of insert/extract in it, which is not very elegant and probably slows everything down as well. Is there a better way to carry data from one register into the next when shifting?

buf is the image data, 16-byte aligned.
w/h are width and height, multiples of 16.

__m128i *p = (__m128i *) buf;
__m128i cur1, cur2, sum1, sum2, zeros, tmp1, tmp2, saved;
zeros = _mm_setzero_si128();
short shifted, last = 0, next;

// preload first row
cur1 = _mm_load_si128(p);
for (x = 1; x < (w * h) / 16; x++) {
    // unpack
    sum1 = sum2 = saved = cur1;
    sum1 = _mm_unpacklo_epi8(sum1, zeros);
    sum2 = _mm_unpackhi_epi8(sum2, zeros);
    cur1 = tmp1 = sum1;
    cur2 = tmp2 = sum2;
    // "middle" pixel
    sum1 = _mm_add_epi16(sum1, sum1);
    sum2 = _mm_add_epi16(sum2, sum2);
    // left pixel
    cur2 = _mm_slli_si128(cur2, 2);
    shifted = _mm_extract_epi16(cur1, 7);
    cur2 = _mm_insert_epi16(cur2, shifted, 0);
    cur1 = _mm_slli_si128(cur1, 2);
    cur1 = _mm_insert_epi16(cur1, last, 0);
    sum1 = _mm_add_epi16(sum1, cur1);
    sum2 = _mm_add_epi16(sum2, cur2);
    // right pixel
    tmp1 = _mm_srli_si128(tmp1, 2);
    shifted = _mm_extract_epi16(tmp2, 0);
    tmp1 = _mm_insert_epi16(tmp1, shifted, 7);
    tmp2 = _mm_srli_si128(tmp2, 2);
    // preload next row
    cur1 = _mm_load_si128(p + x);
    // we need the first pixel of the next row for the "right" pixel
    next = _mm_extract_epi16(cur1, 0) & 0xff;
    tmp2 = _mm_insert_epi16(tmp2, next, 7);
    // and the last pixel of last row for the next "left" pixel
    last = ((uint16_t) _mm_extract_epi16(saved, 7)) >> 8;
    sum1 = _mm_add_epi16(sum1, tmp1);
    sum2 = _mm_add_epi16(sum2, tmp2);
    // divide
    sum1 = _mm_srli_epi16(sum1, 2);
    sum2 = _mm_srli_epi16(sum2, 2);
    sum1 = _mm_packus_epi16(sum1, sum2);
    _mm_store_si128(p + x - 1, sum1);
}


1 Answer

Editorial Team · Answer 1 · 2026-05-15T13:05:41+00:00

I suggest keeping the neighbouring pixels in SSE registers. That is, keep the results of _mm_slli_si128 / _mm_srli_si128 in SSE variables and eliminate all of the inserts and extracts. My reasoning is that on older CPUs the insert/extract instructions require communication between the SSE units and the general-purpose units, which is much slower than keeping the calculation within SSE, even if some values spill over to the L1 cache.

Once that is done, there should be only four shifts by one 16-bit element (_mm_slli_si128 and _mm_srli_si128, not counting the division shift). My suggestion is to benchmark your code at that point, because it may already have hit the memory bandwidth limit, in which case optimizing the arithmetic any further won't help.
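To make the idea concrete, here is a minimal sketch of one way to build the left and right neighbours entirely in SSE registers. It is not the code from the question: it shifts the packed bytes before unpacking, writes to a separate destination buffer, and treats pixels outside the buffer as 0. The function name blur121_sse2 and its parameters are made up for illustration. The byte that crosses a register boundary is shifted into place from the previous or next block and merged with _mm_or_si128, so no insert or extract is needed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the question): horizontal [1 2 1]/4 blur
   over n bytes. Assumes src and dst are distinct, 16-byte aligned, and
   n is a multiple of 16. Pixels outside the buffer count as 0. */
static void blur121_sse2(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(n % 16 == 0);
    assert((uintptr_t)src % 16 == 0 && (uintptr_t)dst % 16 == 0);
    if (n == 0)
        return;

    const __m128i zeros = _mm_setzero_si128();
    __m128i prev = zeros;
    __m128i cur = _mm_load_si128((const __m128i *)src);

    for (size_t i = 0; i < n; i += 16) {
        __m128i next = (i + 16 < n)
            ? _mm_load_si128((const __m128i *)(src + i + 16))
            : zeros;

        /* left[k] = pixel k-1; byte 0 is the last byte of prev */
        __m128i left = _mm_or_si128(_mm_slli_si128(cur, 1),
                                    _mm_srli_si128(prev, 15));
        /* right[k] = pixel k+1; byte 15 is the first byte of next */
        __m128i right = _mm_or_si128(_mm_srli_si128(cur, 1),
                                     _mm_slli_si128(next, 15));

        /* widen to 16 bits and form left + 2*cur + right */
        __m128i clo = _mm_unpacklo_epi8(cur, zeros);
        __m128i chi = _mm_unpackhi_epi8(cur, zeros);
        __m128i lo = _mm_add_epi16(_mm_unpacklo_epi8(left, zeros),
                                   _mm_unpacklo_epi8(right, zeros));
        __m128i hi = _mm_add_epi16(_mm_unpackhi_epi8(left, zeros),
                                   _mm_unpackhi_epi8(right, zeros));
        lo = _mm_add_epi16(lo, _mm_add_epi16(clo, clo));
        hi = _mm_add_epi16(hi, _mm_add_epi16(chi, chi));

        /* divide by 4 and pack back to bytes */
        lo = _mm_srli_epi16(lo, 2);
        hi = _mm_srli_epi16(hi, 2);
        _mm_store_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));

        prev = cur;
        cur = next;
    }
}
```

This variant uses four whole-register byte shifts per block (two per neighbour) and everything stays in the SSE domain. Whether it is actually faster than the original on your CPU is something only a benchmark can tell.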

If the image is large (bigger than the L2 cache) and the output won't be read back soon, try using MOVNTDQ (_mm_stream_si128) for writing back. MOVNTDQ is an SSE2 instruction, so it needs nothing beyond what the rest of this code already requires.
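As a rough illustration of the streaming store, the sketch below writes 16-byte blocks to a destination with _mm_stream_si128 instead of _mm_store_si128. The helper name stream_blocks is made up, and the plain copy stands in for whatever the filter loop computes. It assumes both pointers are 16-byte aligned and n is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128; also pulls in _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n bytes in 16-byte blocks using non-temporal
   stores, so the written data bypasses the cache hierarchy.
   Assumes dst and src are 16-byte aligned and n is a multiple of 16. */
static void stream_blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n % 16 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);

    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(src + i));
        /* in a filter loop, v would be the packed result for this block */
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
    /* order the non-temporal stores before any later stores */
    _mm_sfence();
}
```

Non-temporal stores tend to pay off only when the output really is too large to stay in cache, so this is worth measuring rather than assuming.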

