I’m working on some image processing app for an iOS and thresholding is really

Question

Asked: June 8, 2026 (2026-06-08T10:55:59+00:00)

I’m working on an image processing app for iOS, and thresholding is a huge bottleneck, so I’m trying to optimize it using NEON. Here is the C version of the function. Is there any way to rewrite it using NEON? (Unfortunately, I have absolutely no experience with it.)

static void thresh_8u( const Image& _src, Image& _dst, uchar thresh, uchar maxval, int type ) {
    int i, j;
    uchar tab[256];
    Size roi = _src.size();
    roi.width *= _src.channels();

    // Build a lookup table: values below thresh map to 0,
    // values at or above thresh map to maxval.
    memset(&tab[0], 0, thresh);
    memset(&tab[thresh], maxval, 256-thresh);

    for( i = 0; i < roi.height; i++ ) {
        const uchar* src = (const uchar*)(_src.data + _src.step*i);
        uchar* dst = (uchar*)(_dst.data + _dst.step*i);
        j = 0;

        for(; j < roi.width; ++j) {
            dst[j] = tab[src[j]];
        }
    }
}

1 Answer

Editorial Team · Answer 1 · 2026-06-08T10:56:01+00:00

It’s actually pretty easy if you can make sure your rows are always a multiple of 16 bytes wide. The compiler (clang) has special types representing the NEON vector registers and knows how to apply the normal C operators to them. Here’s my little test function:

#ifdef __ARM_NEON__

#include <arm_neon.h>

void computeThreshold(void *input, void *output, int count, uint8_t threshold, uint8_t highValue) {
    uint8x16_t thresholdVector = vdupq_n_u8(threshold);
    uint8x16_t highValueVector = vdupq_n_u8(highValue);
    uint8x16_t *__restrict inputVector = (uint8x16_t *)input;
    uint8x16_t *__restrict outputVector = (uint8x16_t *)output;
    for ( ; count > 0; count -= 16, ++inputVector, ++outputVector) {
        *outputVector = (*inputVector > thresholdVector) & highValueVector;
    }
}

#endif

This operates on 16 bytes at a time, so count must be a multiple of 16. A uint8x16_t is a vector register containing sixteen 8-bit unsigned integers. vdupq_n_u8 returns a uint8x16_t filled with 16 copies of its argument.

The > operator, applied to two uint8x16_t values, does 16 comparisons between pairs of 8-bit unsigned ints. Where the left input is greater than the right input, it returns 0xff (which is different from a normal C >, which just returns 0x01). Where the left input is less than or equal to the right input, it returns 0. (It compiles into the VCGT.U8 instruction.)

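The & operator, applied to two uint8x16_t values, computes the bitwise AND of 128 pairs of bits. Because each lane of the comparison result is either 0xff or 0, ANDing it with highValueVector leaves highValue in the lanes that passed the comparison and 0 in the rest.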
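
To make the compare-then-mask step concrete, here is a rough scalar sketch of what happens in a single byte lane. The names threshold_lane and threshold_scalar_model are made up for illustration and are not part of NEON or of the function above; the real instructions do this for all 16 lanes at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one vector lane: compare, then AND with the high value.
   Mirrors what (*inputVector > thresholdVector) & highValueVector does
   to each of the 16 bytes. Hypothetical helper, for illustration only. */
static uint8_t threshold_lane(uint8_t pixel, uint8_t threshold, uint8_t high_value) {
    /* A vector compare yields all-ones (0xff) or all-zeros per lane, not 1 or 0. */
    uint8_t mask = (pixel > threshold) ? 0xff : 0x00;
    return (uint8_t)(mask & high_value);
}

/* Apply the lane model to a whole buffer, one byte at a time. */
static void threshold_scalar_model(const uint8_t *in, uint8_t *out, size_t count,
                                   uint8_t threshold, uint8_t high_value) {
    for (size_t i = 0; i < count; ++i)
        out[i] = threshold_lane(in[i], threshold, high_value);
}
```

If the comparison returned 0x01 like a normal C >, the AND would keep only the lowest bit of highValue, which is why the all-ones mask matters here.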

The loop compiles to this in a release build:

0x6e668:  vldmia r2, {d4, d5}
0x6e66c:  subs   r0, #16
0x6e66e:  vcgt.u8 q10, q10, q8
0x6e672:  adds   r2, #16
0x6e674:  cmp    r0, #0
0x6e676:  vand   q10, q10, q9
0x6e67a:  vstmia r1, {d4, d5}
0x6e67e:  add.w  r1, r1, #16
0x6e682:  bgt    0x6e668
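
If you can't guarantee that rows are a multiple of 16 bytes wide, one possible approach is to process the full 16-byte blocks with explicit intrinsics and finish the remainder with a scalar loop. The sketch below assumes that layout. threshold_row and threshold_image are hypothetical names, and the code falls back to plain C when NEON isn't available. Note that it uses >= (vcgeq_u8) because the lookup table in the question maps values equal to thresh to maxval, whereas the function above uses a strict >.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Threshold one row of `width` bytes: out = (in >= threshold) ? high_value : 0.
   Uses >= to match the question's lookup table; swap vcgeq_u8 for vcgtq_u8
   (and >= for > in the tail) to get the strict comparison instead. */
void threshold_row(const uint8_t *in, uint8_t *out, size_t width,
                   uint8_t threshold, uint8_t high_value) {
    size_t j = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x16_t threshold_vec = vdupq_n_u8(threshold);
    const uint8x16_t high_vec = vdupq_n_u8(high_value);
    /* Full 16-byte blocks; vld1q_u8/vst1q_u8 take plain byte pointers. */
    for (; j + 16 <= width; j += 16) {
        uint8x16_t pixels = vld1q_u8(in + j);
        uint8x16_t mask = vcgeq_u8(pixels, threshold_vec); /* 0xff or 0x00 per lane */
        vst1q_u8(out + j, vandq_u8(mask, high_vec));
    }
#endif
    /* Scalar tail (and the whole row when NEON is unavailable). */
    for (; j < width; ++j)
        out[j] = (in[j] >= threshold) ? high_value : 0;
}

/* Hypothetical driver for a strided image: consecutive rows start
   src_stride / dst_stride bytes apart, like _src.step in the question. */
void threshold_image(const uint8_t *src, size_t src_stride,
                     uint8_t *dst, size_t dst_stride,
                     size_t width, size_t height,
                     uint8_t threshold, uint8_t high_value) {
    for (size_t i = 0; i < height; ++i)
        threshold_row(src + i * src_stride, dst + i * dst_stride,
                      width, threshold, high_value);
}
```

For the question's function, width would be roi.width after multiplying by the channel count, and the strides would be _src.step and _dst.step.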
