I have some CPU-intensive code in which a function containing a loop is executed many times. Every optimization in this loop brings a noticeable performance gain. Question: how would you optimize this loop (there is not much more to optimize, though…)?

    void theloop(int64_t in[], int64_t out[], size_t N)
    {
        for(uint32_t i = 0; i < N; i++) {
            int64_t v = in[i];
            max += v;
            if (v > max) max = v;
            out[i] = max;
        }
    }

I tried a few things, e.g. I replaced the array indexing with pointers that were incremented on every iteration, but (surprisingly) I lost some performance instead of gaining…
Edit:
- changed the name of one variable (`itsMaximums`, error)
- the function is a method of a class
- `in` and `out` are `int64_t`, so they can be negative and positive
- `(v > max)` can evaluate to true: consider the situation when the actual max is negative
- the code runs on a 32-bit PC (development) and 64-bit (production)
- `N` is unknown at compile time
- I tried some SIMD, but I failed to increase performance… (the overhead of moving the variables to `__m128i`, executing and storing back was higher than the SSE speed gain. Yet I am not an expert on SSE, so maybe my code was poor)
Results:
I added some loop unrolling, and a nice hack from Alex’s post. Below I paste some results:
1. original: 14.0s
2. unrolled loop (4 iterations): 10.44s
3. Alex’s trick: 10.89s
4. 2) and 3) at once: 11.71s

Strange that 4) is not faster than 2) and 3). Below is the code for 4):

    for(size_t i = 1; i < N; i+=CHUNK) {
        int64_t t_in0 = in[i+0];
        int64_t t_in1 = in[i+1];
        int64_t t_in2 = in[i+2];
        int64_t t_in3 = in[i+3];
        max &= -max >> 63;
        max += t_in0;
        out[i+0] = max;
        max &= -max >> 63;
        max += t_in1;
        out[i+1] = max;
        max &= -max >> 63;
        max += t_in2;
        out[i+2] = max;
        max &= -max >> 63;
        max += t_in3;
        out[i+3] = max;
    }

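
For comparison, here is a minimal sketch of how variant 4) could be written so that it starts at index 0 and copes with an `N` that is not a multiple of the chunk size. The function names `reference_loop` and `unrolled_loop`, and passing the running value `max` as a parameter instead of using the class member, are assumptions made for illustration. The mask trick relies on `>>` being an arithmetic shift for negative values (guaranteed since C++20, implementation-defined but typical before), and the sketch assumes the sums never overflow.

```cpp
#include <cstddef>
#include <cstdint>

// Reference: the original loop, with the running value passed explicitly
// (hypothetical signature; in the question `max` is a class member).
int64_t reference_loop(const int64_t in[], int64_t out[], std::size_t N, int64_t max) {
    for (std::size_t i = 0; i < N; ++i) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
    return max;
}

// Unrolled by 4 using the mask trick: `max &= -max >> 63` keeps max when it is
// positive and zeroes it otherwise. A scalar tail loop handles N % 4 != 0.
int64_t unrolled_loop(const int64_t in[], int64_t out[], std::size_t N, int64_t max) {
    std::size_t i = 0;
    for (; i + 4 <= N; i += 4) {
        int64_t t0 = in[i + 0];
        int64_t t1 = in[i + 1];
        int64_t t2 = in[i + 2];
        int64_t t3 = in[i + 3];
        max &= -max >> 63; max += t0; out[i + 0] = max;
        max &= -max >> 63; max += t1; out[i + 1] = max;
        max &= -max >> 63; max += t2; out[i + 2] = max;
        max &= -max >> 63; max += t3; out[i + 3] = max;
    }
    for (; i < N; ++i) {  // remaining 0..3 elements
        max &= -max >> 63;
        max += in[i];
        out[i] = max;
    }
    return max;
}
```

Running both functions on the same input and comparing the two output arrays element by element is a cheap way to check that an unrolled variant has not changed the results.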
> **Announcement**: see [chat](https://chat.stackoverflow.com/rooms/5056/discussion-between-sehe-and-jakub-m)
>
> **Edit**: ironically… that testbed had a fatal flaw, which rendered the results invalid: the heuristic version was in fact skipping parts of the input, but because existing output wasn’t being cleared, it appeared to have the correct output… (still editing…)
>
> > _Hi Jakub, what would you say if I have found a version that uses a heuristic optimization that, for random data distributed uniformly will result in ~3.2x speed increase for `int64_t` (10.56x effective using `float`s)?_
>
> I have yet to find the time to update the post, but the explanation and code can be found through the chat.
>
> I used the same test-bed code (below) to verify that the results are correct and exactly match the original implementation from your OP
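
The test-bed flaw mentioned in the edit (stale output hiding skipped input) can be guarded against by poisoning the output buffer before every run. The following is only a sketch of that idea, not the code from the gist; the helper name `matches_reference` and the choice of poison value are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: runs `candidate(in, out, n)` into a freshly poisoned
// output buffer and compares the result with the expected output.
// Any element the candidate fails to write keeps the poison value, so a
// version that skips input can no longer pass by reusing old results.
template <typename Candidate>
bool matches_reference(Candidate candidate,
                       const std::vector<int64_t>& in,
                       const std::vector<int64_t>& expected) {
    // Assumed never to occur as a legitimate result in the test data.
    const int64_t poison = std::numeric_limits<int64_t>::min();
    std::vector<int64_t> out(in.size(), poison);
    candidate(in.data(), out.data(), in.size());
    return out == expected;
}
```

Because the buffer is allocated and filled inside the helper, no state leaks from one algorithm run to the next.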
OK, I have published a benchmark based on your code versions, and also my proposed use of `partial_sum`. Find all the code here: https://gist.github.com/1368992#file_test.cpp
Features
For a default config of
It will (see output fragment here):
int64_t)

Results
There are a number of (surprising or unsurprising) results:

1. There is no significant performance difference between any of the algorithms whatsoever (for integer data), provided you are compiling with optimizations enabled. (See Makefile; my arch is 64-bit, Intel Core Q9550 with gcc-4.6.1)
2. The algorithms are not equivalent (you’ll see the hash sums differ): notably, the bit fiddle proposed by Alex doesn’t handle integer overflow in quite the same way. (This can be hidden by a define that limits the input data so overflows won’t occur. Note that the `partial_sum_incorrect` version shows equivalent C++ non-bitwise arithmetic operations that yield the same different results. Perhaps it is OK for your purpose?)
3. Surprisingly, it is not more expensive to calculate both definitions of the max algorithm at once. You can see this being done inside `partial_sum_correct`: it calculates both ‘formulations’ of max in the same loop. This is really no more than trivia here, because neither of the two methods is significantly faster…
4. Even more surprisingly, a big performance boost can be had when you are able to use `float` instead of `int64_t`. A quick and dirty hack can be applied to the benchmark, showing that the STL-based algorithm (`partial_sum_incorrect`) runs approximately 2.5x faster when using `float` instead of `int64_t` (!!!). Note:
   - the "incorrect" in `partial_sum_incorrect` only relates to integer overflow, which doesn’t apply to floats; this can be seen from the fact that the hashes match up, so in fact it is partial_sum_float_correct 🙂
   - `partial_sum_correct` is doing double work that causes it to perform badly in floating point mode. See bullet 3.
5. (And there was that off-by-1 bug in the loop-unrolled version from the OP I mentioned before.)
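
To make the overflow difference from the results concrete, here is a small sketch of one step of each formulation. The names `wrap_add`, `step_original` and `step_clamped` are hypothetical and not taken from the gist. Signed overflow is undefined behaviour in C++, so the sketch emulates wrap-around through unsigned arithmetic; the conversion back to `int64_t` is assumed to be modular (guaranteed since C++20, and what mainstream compilers do before that).

```cpp
#include <cstdint>

// Wrapping addition emulated with unsigned arithmetic, so the sketch does not
// rely on signed overflow (assumption: modular conversion back to int64_t).
int64_t wrap_add(int64_t a, int64_t b) {
    return static_cast<int64_t>(static_cast<uint64_t>(a) + static_cast<uint64_t>(b));
}

// One step of the loop from the question: add first, then compare.
int64_t step_original(int64_t max, int64_t v) {
    max = wrap_add(max, v);
    if (v > max) max = v;
    return max;
}

// One step of the non-bitwise equivalent of the bit trick:
// clamp a negative running value to zero, then add.
int64_t step_clamped(int64_t max, int64_t v) {
    if (max < 0) max = 0;
    return wrap_add(max, v);
}
```

For inputs that stay well inside the `int64_t` range both steps return the same value. With `max == INT64_MAX` and `v == 1`, the sum wraps to `INT64_MIN`: `step_original` then replaces it with `v` and returns `1`, while `step_clamped` returns the wrapped value, which is why the hash sums differ once overflow occurs.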
Partial sum
For your interest, the partial sum application looks like this in C++11: