I have the following code in the innermost loop of my program:
struct V {
    float val[200];  // 0 <= val[i] <= 1
};
V a[600];
V b[250];
V c[250];
V d[350];
V e[350];
// ... init values in a,b,c,d,e ...
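int findmax(int ai, int bi, int ci, int di, int ei) {
    float best_val = 0.0f;
    int best_ii = -1;  // stays -1 if no sum is greater than 0
    for (int ii = 0; ii < 200; ii++) {
        // Sum the ii-th component of the five selected vectors.
        float act_val =
            a[ai].val[ii] +
            b[bi].val[ii] +
            c[ci].val[ii] +
            d[di].val[ii] +
            e[ei].val[ii];
        // Strict comparison: the first index reaching the maximum wins.
        if (act_val > best_val) {
            best_val = act_val;
            best_ii = ii;
        }
    }
    return best_ii;
}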
I don’t care whether the answer is a clever algorithm (that would be the most interesting), C++ tricks, intrinsics, or assembler. I just need to make the findmax function more efficient.
Big thanks in advance.
Edit:
It seems that the branch is the slowest operation (misprediction?).
Well, I see no obvious room for algorithmic optimizations. Theoretically, one could compute the sum of the five vectors only until it is obvious that the maximum cannot be reached, but this would add way too much overhead for just summing five numbers. You could try using multiple threads and assigning index ranges to them, but you have to consider the thread creation overhead when you have only 200 very short work items.
So I tend to say that using assembler with SSE instructions on x86 (MMX only handles integers, so it would not help with float data), or maybe a (machine-specific) C++ library providing access to these instructions, is your best bet.