I have some code in a loop
for(int i = 0; i < n; i++)
{
u[i] = c * u[i] + s * b[i];
}
So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup?
UPDATE
I learnt vectorization (turns out it’s not so hard if you use intrinsics) and implemented my loop in SSE. However, when setting the SSE2 flag in the VC++ compiler, I get about the same performance as with my own SSE code. The Intel compiler on the other hand was much faster than my SSE code or the VC++ compiler.
Here is the code I wrote for reference
double *u = (double*) _aligned_malloc(n * sizeof(double), 16);
for(int i = 0; i < n; i++)
{
u[i] = 0;
}
int j = 0;
__m128d *uSSE = (__m128d*) u;
__m128d cStore = _mm_set1_pd(c);
__m128d sStore = _mm_set1_pd(s);
for (j = 0; j <= i - 2; j+=2)
{
__m128d uStore = _mm_set_pd(u[j+1], u[j]);
__m128d cu = _mm_mul_pd(cStore, uStore);
__m128d so = _mm_mul_pd(sStore, omegaStore);
uSSE[j/2] = _mm_add_pd(cu, so);
}
for(; j <= i; ++j)
{
u[j] = c * u[j] + s * omegaCache[j];
}
Yes, this is an excellent candidate for vectorization. But, before you do so, make sure you’ve profiled your code to be sure that this is actually worth optimizing. That said, the vectorization would go something like this:
For even more performance, you can consider prefetching further array elements, and/or unrolling the loop and using software pipelining to interleave the computation in one loop with the memory accesses from a different iteration.