I’m using a compact struct of 2 unsigned shorts indicating a start and end position.
I need to be able to quickly determine if there are any Range objects with a length (difference from start to end) past a threshold value.
I’m going to have a huge quantity of objects each with their own Range array, so it is not feasible to track which Range objects are above the threshold in a list or something. This code is also going to be run very often (many times a second for each array), so it needs to be efficient.
struct Range
{
unsigned short start;
unsigned short end;
}
I will always have an array of Range sized 2^n. While I would like to abort as soon as I find something over the threshold, I’m pretty sure it’d be faster to simply OR it all together and check at the end… assuming I can vectorize the loop. Although if I could do an if statement on the chunk of results for each vector, that would be grand.
size_t rangecount = 1 << resolution;
Range* ranges = new Range[rangecount];
...
bool result = false;
for (size_t i = 0; i < ranges; ++i)
{
result |= (range[i].end - range[i].start) > 4;
}
Not surprisingly, the auto-vectorizer gives the 1202 error because my data type isn’t 32 or 64 bits wide. I really don’t want to double my data size and make each field an unsigned int. So I’m guessing the auto-vectorizer approach is out for this.
Are there vector instructions that can handle 16 bit variables? If there are, how could I use them in c++ to vectorize my loop?
You are wondering if any value is greater than 4?
Yes, there are SIMD instructions for this. It’s unfortunate that the auto-vectorized isn’t able to handle this scenario. Here’s a vectorized algorithm:
Use
_mm_sub_epi16with a structure of arrays or_mm_hsub_epi16with an array of structures.Actually since
startis stored first in memory, you will be working onstart_v - end_v, so use_mm_min_epi16and a vector of-4.Each SSE3 instruction will perform 8 comparisons at a time. It will still be fastest to return early instead of looping. However, unrolling the loop a bit more may buy you additional speed (pass the first set of results into the packed min/max function to combine them).
So you end up with (approximately):