I have the below snippet which is a hotspot in an application. The for

Question

0

Asked: June 15, 20262026-06-15T14:41:15+00:00 2026-06-15T14:41:15+00:00

I have the below snippet which is a hotspot in an application. The for

0

I have the below snippet which is a hotspot in an application.
The for loop is not vectorized due to vector dependance.
Is there a way to rewrite this loop to make it run faster.

#define  NUM_KEYS  (1L << 20)
#define  NUM_BUCKETS (1L << 10)    
int i,k;
int shift = (1L << 11);
int key_array[NUM_KEYS],key_buff[NUM_KEYS];
int bucket_ptrs[NUM_BUCKETS];

for( i=0; i<NUM_KEYS; i++ )  
{
    k = key_array[i];
    key_buff[bucket_ptrs[k >> shift]++] = k;
}

One approach I tried was to create a temporary array to hold the shifted values of key_array.

for( i=0; i<NUM_KEYS; i++ )  
{
        key_arrays[i] = key_array[i] >> shift;
}
for( i=0; i<NUM_KEYS; i++ )  
{
    k = key_array[i];
    j = key_arrays[i];
    key_buff[bucket_ptrs[j]++] = k;
}

Here the first loop gets vectorized. But overall there is no improvement in performance.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-15T14:41:16+00:00

Why is the loop not vectorizable?

This is because you have non-sequential memory access here:

key_buff[bucket_ptrs[k >> shift]++] = k;

bucket_ptrs determines the index for accessing key_buff. Since these indices are all over the place, the memory access is non-sequential.

Currently, x86 processors only support SIMD loads/stores to contiguous chunks of memory. (ideally aligned as well)

If you want it to vectorize, you will need AVX2’s gather/scatter instructions. Those don’t exist it, but should be coming out in the next generation of Intel processors.

Here the first loop gets vectorized. But overall there is no
improvement in performance.

This is because you’re adding additional loop-overhead. So you’re now making two passes over key_array. If anything, I’m surprised that it isn’t slower.

Is there a way to rewrite this loop to make it run faster.

I doubt it – at least not without altering the algorithm. At the very least, you will want key_buff to fit comfortably into your L1 cache.

AVX2 will let it vectorize, but the problem is that key_buff is 4MB. That won’t fit into the lower level caches. So even AVX2 might not help much. You will be bound completely by memory access.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have the below snippet which is a hotspot in an application. The for

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply