I’m working on some image processing app for an iOS and thresholding is really a huge bottleneck. So I’m trying to optimize it using NEON. Here is the C version of function. Is there any way to rewrite this using NEON (unfortunately I have absolutely no experience in this) ?
static void thresh_8u( const Image& _src, Image& _dst, uchar thresh, uchar maxval, int type ) {
int i, j;
uchar tab[256];
Size roi = _src.size();
roi.width *= _src.channels();
memset(&tab[0], 0, thresh);
memset(&tab[thresh], maxval, 256-thresh);
for( i = 0; i < roi.height; i++ ) {
const uchar* src = (const uchar*)(_src.data + _src.step*i);
uchar* dst = (uchar*)(_dst.data + _dst.step*i);
j = 0;
for(; j <= roi.width; ++j) {
dst[j] = tab[src[j]];
}
}
}
It’s actually pretty easy, if you can make sure your rows are always a multiple of 16 bytes wide, because the compiler (clang) has special types representing the NEON vector registers, and knows how to apply the normal C operators to them. Here’s my little test function:
This operates on 16 bytes at a time. A
uint8x16_tis a vector register containing 16 8-bit unsigned ints. Thevdupq_n_u8returns a vectoruint8x16_tfilled with 16 copies of its argument.The
>operator, applied to twouint8x16_tvalues, does 16 comparisons between pairs of 8-bit unsigned ints. Where the left input is greater than the right input, it returns 0xff (which is different from a normal C>, which just returns 0x01). Where the left input is less than or equal to the right input, it returns 0. (It compiles into the VCGT.U8 instruction.)The
&operator, applied to twouint8x16_tvalues, computes the boolean AND of 128 pairs of bits.The loop compiles to this in a release build: