I have the below snippet which is a hotspot in an application.
The for loop is not vectorized due to vector dependance.
Is there a way to rewrite this loop to make it run faster.
#define NUM_KEYS (1L << 20)
#define NUM_BUCKETS (1L << 10)
int i,k;
int shift = (1L << 11);
int key_array[NUM_KEYS],key_buff[NUM_KEYS];
int bucket_ptrs[NUM_BUCKETS];
for( i=0; i<NUM_KEYS; i++ )
{
k = key_array[i];
key_buff[bucket_ptrs[k >> shift]++] = k;
}
One approach I tried was to create a temporary array to hold the shifted values of key_array.
for( i=0; i<NUM_KEYS; i++ )
{
key_arrays[i] = key_array[i] >> shift;
}
for( i=0; i<NUM_KEYS; i++ )
{
k = key_array[i];
j = key_arrays[i];
key_buff[bucket_ptrs[j]++] = k;
}
Here the first loop gets vectorized. But overall there is no improvement in performance.
This is because you have non-sequential memory access here:
bucket_ptrsdetermines the index for accessingkey_buff. Since these indices are all over the place, the memory access is non-sequential.Currently, x86 processors only support SIMD loads/stores to contiguous chunks of memory. (ideally aligned as well)
If you want it to vectorize, you will need AVX2’s gather/scatter instructions. Those don’t exist it, but should be coming out in the next generation of Intel processors.
This is because you’re adding additional loop-overhead. So you’re now making two passes over
key_array. If anything, I’m surprised that it isn’t slower.I doubt it – at least not without altering the algorithm. At the very least, you will want
key_buffto fit comfortably into your L1 cache.AVX2 will let it vectorize, but the problem is that
key_buffis 4MB. That won’t fit into the lower level caches. So even AVX2 might not help much. You will be bound completely by memory access.