I’m currently trying to most efficiently do an in-place multiplication of an array of complex numbers (memory aligned the same way the std::complex would be but currently using our own ADT) by an array of scalar values that is the same size as the complex number array.
The algorithm is already parallelized, i.e. the calling object splits the work up into threads. This calculation is done on arrays in the 100s of millions – so, it can take some time to complete. CUDA is not a solution for this product, although I wish it was. I do have access to boost and thus have some potential to use BLAS/uBLAS.
I’m thinking, however, that SIMD might yield much better results, but I’m not familiar enough with how to do this with complex numbers. The code I have now is as follows (remember this is chunked up into threads which correspond to the number of cores on the target machine). The target machine is also unknown. So, a generic approach is probably best.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (register int idx = start; idx < end; ++idx)
{
values[idx].real *= scalar[idx];
values[idx].imag *= scalar[idx];
}
}
fcomplex is defined as follows:
struct fcomplex
{
float real;
float imag;
};
I’ve tried manually unrolling the loop, as my finally loop count will always be a power of 2, but the compiler is already doing that for me (I’ve unrolled as far as 32). I’ve tried a const float reference to the scalar – in thinking I’d save one access – and that proved to be equal to the what the compiler was already doing. I’ve tried STL and transform, which game close results, but still worse. I’ve also tried casting to std::complex and allow it to use the overloaded operator for scalar * complex for the multiplication but this ultimately produced the same results.
So, anyone with any ideas? Much appreciation is given for your time in considering this! Target platform is Windows. I’m using Visual Studio 2008. Product cannot contain GPL code as well! Thanks so much.
You can do this fairly easily with SSE, e.g.
Note that
valuesandscalarneed to be 16 byte aligned.Or you could just use the Intel ICC compiler and let it do the hard work for you.
UPDATE
Here is an improved version which unrolls the loop by a factor of 2 and uses a single load instruction to get 4 scalar values which are then unpacked into two vectors: