I’m very new to SIMD/SSE and I’m trying to do some simple image filtering (blurring).
The code below filters each pixel of an 8-bit gray bitmap with a simple [1 2 1] weighting in the horizontal direction. I’m computing the sums for 16 pixels at a time.
What seems very bad about this code, at least to me, is that there is a lot of insert/extract in it, which is not very elegant and probably slows everything down as well. Is there a better way to wrap data from one reg into another when shifting?
buf is the image data, 16-byte aligned.
w/h are width and height, multiples of 16.
__m128i *p = (__m128i *) buf;
__m128i cur1, cur2, sum1, sum2, zeros, tmp1, tmp2, saved;
zeros = _mm_setzero_si128();
short shifted, last = 0, next;
// preload first row
cur1 = _mm_load_si128(p);
for (x = 1; x < (w * h) / 16; x++) {
    // unpack
    sum1 = sum2 = saved = cur1;
    sum1 = _mm_unpacklo_epi8(sum1, zeros);
    sum2 = _mm_unpackhi_epi8(sum2, zeros);
    cur1 = tmp1 = sum1;
    cur2 = tmp2 = sum2;
    // "middle" pixel
    sum1 = _mm_add_epi16(sum1, sum1);
    sum2 = _mm_add_epi16(sum2, sum2);
    // left pixel
    cur2 = _mm_slli_si128(cur2, 2);
    shifted = _mm_extract_epi16(cur1, 7);
    cur2 = _mm_insert_epi16(cur2, shifted, 0);
    cur1 = _mm_slli_si128(cur1, 2);
    cur1 = _mm_insert_epi16(cur1, last, 0);
    sum1 = _mm_add_epi16(sum1, cur1);
    sum2 = _mm_add_epi16(sum2, cur2);
    // right pixel
    tmp1 = _mm_srli_si128(tmp1, 2);
    shifted = _mm_extract_epi16(tmp2, 0);
    tmp1 = _mm_insert_epi16(tmp1, shifted, 7);
    tmp2 = _mm_srli_si128(tmp2, 2);
    // preload next row
    cur1 = _mm_load_si128(p + x);
    // we need the first pixel of the next row for the "right" pixel
    next = _mm_extract_epi16(cur1, 0) & 0xff;
    tmp2 = _mm_insert_epi16(tmp2, next, 7);
    // and the last pixel of last row for the next "left" pixel
    last = ((uint16_t) _mm_extract_epi16(saved, 7)) >> 8;
    sum1 = _mm_add_epi16(sum1, tmp1);
    sum2 = _mm_add_epi16(sum2, tmp2);
    // divide
    sum1 = _mm_srli_epi16(sum1, 2);
    sum2 = _mm_srli_epi16(sum2, 2);
    sum1 = _mm_packus_epi16(sum1, sum2);
    _mm_store_si128(p + x - 1, sum1);
}
I suggest keeping the neighbouring pixels in SSE registers. That is, keep the results of _mm_slli_si128 / _mm_srli_si128 in SSE variables and eliminate all of the inserts and extracts. My reasoning is that on older CPUs the insert/extract instructions require communication between the SSE units and the general-purpose units, which is much slower than keeping the calculation within SSE, even if it spills over to the L1 cache.
Once that is done, only four register shifts by 16 bits should remain (_mm_slli_si128 / _mm_srli_si128, not counting the division shift). I suggest benchmarking your code at that point, because it may already have hit the memory bandwidth limit, in which case further optimization of the arithmetic won’t help.
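The suggestion above can be sketched as follows. This is only one possible reading of it, not necessarily the exact code the answer had in mind: the neighbours are built with whole-register byte shifts on the packed 8-bit data (before unpacking), and the byte that crosses a block boundary is taken from the previous or next block with another shift plus _mm_or_si128, so nothing leaves the SSE registers. The function names blur121_sse2 and blur121_scalar are hypothetical, unaligned loads/stores are used so the sketch does not depend on buffer alignment, and pixels beyond either end of the buffer are assumed to be 0.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: [1 2 1]/4 filter over n bytes (n a multiple of 16).
   The buffer is treated as one continuous stream, as in the question;
   pixels outside the buffer count as 0. dst may equal src (in place). */
static void blur121_sse2(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(n % 16 == 0);
    const __m128i zero = _mm_setzero_si128();
    __m128i prev = zero;
    __m128i cur  = n ? _mm_loadu_si128((const __m128i *)src) : zero;

    for (size_t i = 0; i < n; i += 16) {
        /* Load the next block before storing, so in-place use is safe. */
        __m128i next = (i + 16 < n)
            ? _mm_loadu_si128((const __m128i *)(src + i + 16)) : zero;

        /* left[k] = pixel k-1: shift up one byte, bring in prev[15]. */
        __m128i left  = _mm_or_si128(_mm_slli_si128(cur, 1),
                                     _mm_srli_si128(prev, 15));
        /* right[k] = pixel k+1: shift down one byte, bring in next[0]. */
        __m128i right = _mm_or_si128(_mm_srli_si128(cur, 1),
                                     _mm_slli_si128(next, 15));

        /* Widen to 16 bits and accumulate left + 2*cur + right. */
        __m128i c_lo = _mm_unpacklo_epi8(cur, zero);
        __m128i c_hi = _mm_unpackhi_epi8(cur, zero);
        __m128i s_lo = _mm_add_epi16(_mm_add_epi16(c_lo, c_lo),
                       _mm_add_epi16(_mm_unpacklo_epi8(left, zero),
                                     _mm_unpacklo_epi8(right, zero)));
        __m128i s_hi = _mm_add_epi16(_mm_add_epi16(c_hi, c_hi),
                       _mm_add_epi16(_mm_unpackhi_epi8(left, zero),
                                     _mm_unpackhi_epi8(right, zero)));

        /* Divide by 4 and pack back to 8 bits. */
        s_lo = _mm_srli_epi16(s_lo, 2);
        s_hi = _mm_srli_epi16(s_hi, 2);
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(s_lo, s_hi));

        prev = cur;
        cur  = next;
    }
}

/* Scalar reference with the same boundary assumption, for checking. */
static void blur121_scalar(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned l = (i > 0)     ? src[i - 1] : 0;
        unsigned r = (i + 1 < n) ? src[i + 1] : 0;
        dst[i] = (uint8_t)((l + 2u * src[i] + r) >> 2);
    }
}
```

With 16-byte-aligned data, as in the question, the unaligned loads and stores could be swapped for _mm_load_si128 / _mm_store_si128. Comparing the output against the scalar version on a few test buffers is a cheap way to confirm the boundary handling before benchmarking.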
If the image is large (bigger than the L2 cache) and the output won’t be read back soon, try using MOVNTDQ (_mm_stream_si128) for writing back. MOVNTDQ on XMM registers is an SSE2 instruction, so it is available wherever the rest of this code runs.
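As a rough illustration of the write-back suggestion, here is a minimal sketch of storing 16-byte blocks with _mm_stream_si128. The helper name store_streaming is hypothetical; the destination is assumed to be 16-byte aligned (the intrinsic requires it), and the _mm_sfence at the end is there so the non-temporal stores are ordered before any later stores. Whether this actually helps depends on the image size and cache behaviour, so it is worth measuring.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_loadu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write nblocks 16-byte blocks from src to dst
   using non-temporal stores that bypass the cache. */
static void store_streaming(__m128i *dst, const void *src, size_t nblocks)
{
    /* _mm_stream_si128 requires a 16-byte-aligned destination. */
    assert(((uintptr_t)dst & 15) == 0);

    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < nblocks; i++) {
        __m128i v = _mm_loadu_si128(s + i);  /* source may be unaligned */
        _mm_stream_si128(dst + i, v);        /* MOVNTDQ */
    }
    _mm_sfence();  /* order the streaming stores before later stores */
}
```

In the filter loop from the question, this would amount to replacing the final _mm_store_si128 call with _mm_stream_si128 and issuing one _mm_sfence after the loop.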