
Question

Asked: May 13, 2026 (2026-05-13T21:47:01+00:00)

I am new to optimization. I have been reading references on how to optimize C++ code, but I have a hard time applying them to real code. So I would like to gather some real-world optimization techniques for squeezing as much as possible out of the CPU and memory in the loop below:

double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
for(int t = 0; t < T; ++t){
    sum += fun(a,b,c,d,e,f,sum);
    *(array+t) = sum;
}

where a, b, c, d, e, f are double and T is int. Anything is welcome, including but not limited to memory alignment, parallelism, OpenMP/MPI, and SSE instructions. The compiler is standard gcc, Microsoft, or another commonly available compiler. If the solution is compiler-specific, please specify the compiler and any option flags associated with your solution.

Thanks!

PS: I forgot to mention the properties of fun. Please assume that it is a simple function with no loop inside, composed only of basic arithmetic operations. Simply think of it as an inline function.

EDIT2: Since the details of fun are important, please forget the parameters c, d, e, f and assume fun is defined as

inline double fun(double a, double b, double sum){
    return sum + a * (b - sum);
}

1 Answer

Editorial Team · Answer 1 · 2026-05-13T21:47:01+00:00

Since each value of sum depends on the previous one in a non-trivial way, the loop as written cannot be parallelized for a general fun (so OpenMP and MPI are out).

Memory alignment and SSE are normally handled by the compiler automatically when the appropriate optimization settings are enabled.

Besides inlining fun and unrolling the loop (by compiling with -O3), there is not much we can do if fun is completely general.

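With fun(a,b,sum) = sum + a*(b-sum), each iteration of sum += fun(a,b,sum) computes sum <- (2-a)*sum + a*b. Starting from sum = 0 this is a geometric series, so for a != 1 we have the closed form

array[t] = a*b/(a-1) * (1 - (2-a)^(t+1))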
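As a sanity check on the closed form, here is a minimal sketch that puts the original loop next to the formula. The names fill_loop and closed_form are made up for illustration, std::vector stands in for the malloc'ed buffer, and the formula assumes a != 1 (for a == 1 the recurrence degenerates to sum <- sum + b).

```cpp
#include <cmath>
#include <vector>

// fun as defined in the question's EDIT2.
inline double fun(double a, double b, double sum) {
    return sum + a * (b - sum);
}

// Reference: the original sequential loop (hypothetical helper name).
// Assumes T >= 0.
std::vector<double> fill_loop(int T, double a, double b) {
    std::vector<double> array(T);
    double sum = 0.0;
    for (int t = 0; t < T; ++t) {
        sum += fun(a, b, sum);  // sum <- (2 - a) * sum + a * b
        array[t] = sum;
    }
    return array;
}

// Closed form for a single index t (hypothetical helper name).
// Assumption: a != 1, otherwise this divides by zero.
inline double closed_form(int t, double a, double b) {
    return a * b / (a - 1.0) * (1.0 - std::pow(2.0 - a, t + 1));
}
```

The two agree only up to floating-point rounding, so compare them with a tolerance rather than with ==.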

which can be vectorized and parallelized, but the division and exponentiation can be very expensive.

Nevertheless, with the closed form we can start the loop from any index. For example, create 2 threads, one running from t = 0 to T/2-1 and the other from t = T/2 to T-1; each performs the original loop, but its initial sum is computed using the closed form above. Also, if only a few values from the array are needed, they can be computed lazily.
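A rough sketch of the "start from any index" idea, under the same a != 1 assumption. The helper names sum_before, fill_chunk and fill_in_two_chunks are hypothetical. The two chunks are called one after the other here to keep the example free of threading dependencies; since they write to disjoint ranges and share no state, each call could instead be handed to its own thread.

```cpp
#include <cmath>
#include <vector>

// Value of sum just before iteration `begin`, i.e. array[begin - 1],
// taken from the closed form. Assumption: a != 1.
inline double sum_before(int begin, double a, double b) {
    if (begin == 0) return 0.0;  // the loop starts from sum = 0
    return a * b / (a - 1.0) * (1.0 - std::pow(2.0 - a, begin));
}

// Fill array[begin, end) with the original recurrence, seeded from the
// closed form so that no earlier elements are needed.
void fill_chunk(double* array, int begin, int end, double a, double b) {
    double sum = sum_before(begin, a, b);
    for (int t = begin; t < end; ++t) {
        sum += sum + a * (b - sum);  // same as sum += fun(a, b, sum)
        array[t] = sum;
    }
}

// Fill the whole array as two independent halves. Assumes T >= 0.
std::vector<double> fill_in_two_chunks(int T, double a, double b) {
    std::vector<double> array(T);
    const int mid = T / 2;
    fill_chunk(array.data(), 0, mid, a, b);  // candidate for thread 1
    fill_chunk(array.data(), mid, T, a, b);  // candidate for thread 2
    return array;
}
```

Only one std::pow call per chunk is needed, so the cost of the exponentiation is paid once per thread rather than once per element.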

For SSE, you could first fill the array with (2-a)^(t+1), then apply x -> x - 1 to the whole array, and then apply x -> x * c to the whole array, where c = a*b/(1-a); the compiler may already vectorize these passes automatically.
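One possible shape for these passes, as a sketch rather than a tuned implementation. The function name fill_vectorized is made up, a != 1 is assumed, and the subtract and multiply passes are fused into one sweep. The SSE2 intrinsics from <emmintrin.h> are used only when the compiler defines __SSE2__ (gcc and clang do on x86-64); otherwise the plain scalar loop handles everything and is left to the auto-vectorizer.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Assumptions: T >= 0 and a != 1.
std::vector<double> fill_vectorized(int T, double a, double b) {
    std::vector<double> array(T);
    const double r = 2.0 - a;
    const double c = a * b / (1.0 - a);

    // Pass 1: array[t] = (2 - a)^(t + 1). Every element is independent,
    // but std::pow is the expensive part.
    for (int t = 0; t < T; ++t) {
        array[t] = std::pow(r, t + 1);
    }

    // Pass 2: x -> (x - 1) * c, two doubles at a time where SSE2 is available.
    const std::size_t n = array.size();
    double* p = array.data();
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128d one = _mm_set1_pd(1.0);
    const __m128d vc = _mm_set1_pd(c);
    for (; i + 2 <= n; i += 2) {
        __m128d x = _mm_loadu_pd(p + i);  // unaligned load of 2 doubles
        x = _mm_mul_pd(_mm_sub_pd(x, one), vc);
        _mm_storeu_pd(p + i, x);
    }
#endif
    // Scalar tail, and the whole range when SSE2 is not available.
    for (; i < n; ++i) {
        p[i] = (p[i] - 1.0) * c;
    }
    return array;
}
```

Whether this beats the original sequential loop depends on how expensive std::pow is on the target, so it is worth measuring both.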
