I am new to optimization. I have been reading references on how to optimize C++ code, but I have a hard time applying them to real code. So I would like to gather some real-world optimization techniques for squeezing as much as possible out of the CPU and memory in the loop below:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
for (int t = 0; t < T; ++t) {
    sum += fun(a, b, c, d, e, f, sum);
    array[t] = sum;
}
where a, b, c, d, e, f are double and T is int. Anything is welcome, including but not limited to memory alignment, parallelism, OpenMP/MPI, and SSE instructions. The compiler is standard GCC, Microsoft, or another commonly available compiler. If the solution is compiler-specific, please specify the compiler and any option flags associated with your solution.
Thanks!
PS: I forgot to mention the properties of fun. Please assume that it is a simple function with no loop inside, composed only of basic arithmetic operations. Simply think of it as an inline function.
EDIT2: Since the details of fun are important, please forget the parameters c, d, e, f and assume fun is defined as
inline double fun(double a, double b, double sum) {
    return sum + a * (b - sum);
}
Since sum depends on its previous values in a non-trivial way, it is impossible to parallelize the code as written (so OpenMP and MPI are out). Memory alignment and SSE should be enforced/used automatically with appropriate compiler settings.
Besides inlining fun and unrolling the loop (by compiling with -O3), there is not much we can do if fun is completely general.
Since fun(a,b,sum) = sum + a*(b-sum), each iteration computes sum = (2-a)*sum + a*b starting from sum = 0, so we have the closed form
array[t] = a*b/(1-a) * ((2-a)^(t+1) - 1)    (for a != 1)
which can be vectorized and parallelized, but the division and exponentiation can be very expensive.
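As a sanity check on that closed form, here is a minimal sketch that compares it with the original loop. Each iteration does sum = (2-a)*sum + a*b from sum = 0, which is a geometric series, so array[t] = a*b/(1-a) * ((2-a)^(t+1) - 1). The helper names reference_loop and closed_form are mine, not part of the question, and a != 1 is assumed (for a == 1 the loop simply gives array[t] = (t+1)*b).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// fun as defined in the question's EDIT2.
inline double fun(double a, double b, double sum) {
    return sum + a * (b - sum);
}

// The original sequential loop: array[t] holds sum after t+1 updates.
std::vector<double> reference_loop(double a, double b, int T) {
    std::vector<double> array(T);
    double sum = 0.0;
    for (int t = 0; t < T; ++t) {
        sum += fun(a, b, sum);   // equivalent to sum = (2 - a) * sum + a * b
        array[t] = sum;
    }
    return array;
}

// Closed form for array[t] (hypothetical helper name); assumes a != 1.
double closed_form(double a, double b, int t) {
    assert(a != 1.0);
    return a * b / (1.0 - a) * (std::pow(2.0 - a, t + 1) - 1.0);
}
```

The two agree only up to floating-point rounding, and (2-a)^(t+1) grows or decays geometrically, so for large T it is worth checking the range of values before relying on either version.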
Nevertheless, with the closed form we can start the loop from any index, e.g. create 2 threads, one running from t = 0 to T/2-1 and another from t = T/2 to T-1, each performing the original loop but with the initial sum computed from the closed-form solution above. Also, if only a few values from the array are needed, they can be computed lazily.
And for SSE, you could first fill the array with (2-a)^(t+1), then apply x -> x - 1 to the whole array, and then apply x -> x * c to the whole array, where c = a*b/(1-a), but the compiler may already vectorize this automatically.
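Both ideas can be sketched roughly as follows. fill_chunked seeds each chunk's sum from the closed form and then runs the original recurrence, so the chunks do not depend on each other; fill_three_pass follows the fill, subtract, scale recipe so that each pass is a plain elementwise loop. All function names here are illustrative, a != 1 is assumed, and whether either variant actually beats the simple loop is something to measure rather than take for granted.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Value of sum just before iteration t of the original loop (0 at t = 0).
// Assumes a != 1.
double sum_before(double a, double b, int t) {
    return a * b / (1.0 - a) * (std::pow(2.0 - a, t) - 1.0);
}

// Run the original recurrence on [begin, end), seeded by the closed form.
void fill_chunk(std::vector<double>& array, int begin, int end,
                double a, double b) {
    double sum = sum_before(a, b, begin);
    for (int t = begin; t < end; ++t) {
        sum += sum + a * (b - sum);   // same as sum += fun(a, b, sum)
        array[t] = sum;
    }
}

// Split [0, T) into num_chunks independent chunks (2 in the text above).
std::vector<double> fill_chunked(double a, double b, int T, int num_chunks) {
    assert(a != 1.0 && num_chunks > 0);
    std::vector<double> array(T);
    // Each chunk writes a disjoint range, so the iterations are independent.
    #pragma omp parallel for
    for (int k = 0; k < num_chunks; ++k) {
        int begin = static_cast<int>(static_cast<long long>(T) * k / num_chunks);
        int end = static_cast<int>(static_cast<long long>(T) * (k + 1) / num_chunks);
        fill_chunk(array, begin, end, a, b);
    }
    return array;
}

// Three elementwise passes: powers, subtract 1, scale by c = a*b/(1-a).
std::vector<double> fill_three_pass(double a, double b, int T) {
    assert(a != 1.0);
    std::vector<double> array(T);
    const double c = a * b / (1.0 - a);
    for (int t = 0; t < T; ++t) array[t] = std::pow(2.0 - a, t + 1);
    for (int t = 0; t < T; ++t) array[t] -= 1.0;
    for (int t = 0; t < T; ++t) array[t] *= c;
    return array;
}
```

With GCC or Clang, compiling with -fopenmp enables the pragma; without it the pragma is ignored and the chunks run one after another with the same result. Calling fill_chunked(a, b, T, 2) corresponds to the two-thread split described above.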