I have a cpu-consuming code, where some function with a loop is executed many

Question

Asked: May 26, 2026 (2026-05-26T19:40:05+00:00)

I have CPU-intensive code in which a function containing a loop is executed many times. Every optimization in this loop brings a noticeable performance gain. Question: how would you optimize this loop (there is not much left to optimize, though…)?

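// max is a member of the enclosing class (see the Edit below)
void theloop(int64_t in[], int64_t out[], size_t N)
{
    for(uint32_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;               // extend the running sum
        if (v > max) max = v;   // restart from v when v alone is larger
        out[i] = max;
    }
}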

I tried a few things, e.g. I replaced the array indexing with pointers that were incremented in every iteration, but (surprisingly) I lost some performance instead of gaining any…

Edit:

changed the name of one variable (itsMaximums, an error)
the function is a method of a class
in and out are int64_t, so the values can be negative or positive
`(v > max)` can evaluate to true: consider the case where the current max is negative
the code runs on a 32-bit PC (development) and a 64-bit machine (production)
N is unknown at compile time
I tried some SIMD, but I failed to increase performance… (the overhead of moving the variables to _m128i, executing and storing back was higher than the SSE speed gain. I am not an expert on SSE, though, so maybe my code was poor)

Results:

I added some loop unrolling and a nice hack from Alex’s post. Here are some results:

1) original: 14.0s
2) unrolled loop (4 iterations): 10.44s
3) Alex’s trick: 10.89s
4) 2) and 3) at once: 11.71s

Strange that 4) is not faster than 2) and 3). Below is the code for 4):

for(size_t i = 1; i < N; i+=CHUNK) {
    int64_t t_in0 = in[i+0];
    int64_t t_in1 = in[i+1];
    int64_t t_in2 = in[i+2];
    int64_t t_in3 = in[i+3];


    max &= -max >> 63;
    max += t_in0;
    out[i+0] = max;

    max &= -max >> 63;
    max += t_in1;
    out[i+1] = max;

    max &= -max >> 63;
    max += t_in2;
    out[i+2] = max;

    max &= -max >> 63;
    max += t_in3;
    out[i+3] = max;

}
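
For comparison, here is a minimal sketch of a 4-way unrolled loop that starts at index 0 and handles a length that is not a multiple of the chunk size. It is not the code that was benchmarked above. The names `step` and `theloop_unrolled` are hypothetical, the running value is passed in and returned rather than kept in a class member, and the inputs are assumed not to overflow `int64_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One step of the recurrence from the question (branchy form).
static inline int64_t step(int64_t max, int64_t v)
{
    max += v;
    if (v > max) max = v;
    return max;
}

// Hypothetical free-function variant: the running value is a parameter and
// the return value, instead of a class member as in the original.
int64_t theloop_unrolled(const int64_t in[], int64_t out[], std::size_t N, int64_t max)
{
    assert(N == 0 || (in != nullptr && out != nullptr));
    const std::size_t CHUNK = 4;
    std::size_t i = 0;  // start at 0 so that in[0] is processed

    // Unrolled body: runs only while a full chunk of 4 elements remains.
    for (; i + CHUNK <= N; i += CHUNK) {
        const int64_t v0 = in[i + 0];
        const int64_t v1 = in[i + 1];
        const int64_t v2 = in[i + 2];
        const int64_t v3 = in[i + 3];

        max = step(max, v0); out[i + 0] = max;
        max = step(max, v1); out[i + 1] = max;
        max = step(max, v2); out[i + 2] = max;
        max = step(max, v3); out[i + 3] = max;
    }

    // Tail: the remaining N % CHUNK elements, one at a time.
    for (; i < N; ++i) {
        max = step(max, in[i]);
        out[i] = max;
    }
    return max;
}
```

Calling it with the six inputs 3, -5, 2, -1, 4, -10 and an initial value of 0 runs the unrolled body once and the tail loop twice.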

1 Answer

Editorial Team · Answer 1 · 2026-05-26T19:40:05+00:00

**Announcement**: see [chat](https://chat.stackoverflow.com/rooms/5056/discussion-between-sehe-and-jakub-m)

> _Hi Jakub, what would you say if I have found a version that uses a heuristic optimization that, for uniformly distributed random data, will result in a ~3.2x speed increase for `int64_t` (10.56x effective using `float`s)?_

I have yet to find the time to update the post, but the explanation and code can be found through the chat.

> I used the same test-bed code (below) to verify that the results are correct and exactly match the original implementation from your OP.

**Edit**: ironically, that testbed had a fatal flaw which rendered the results invalid: the heuristic version was in fact skipping parts of the input, but because the existing output wasn’t being cleared, it appeared to produce the correct output… (still editing…)

Ok, I have published a benchmark based on your code versions, and also my proposed use of partial_sum.

Find all the code here https://gist.github.com/1368992#file_test.cpp

Features

For a default config of

#define MAGNITUDE     20
#define ITERATIONS    1024
#define VERIFICATION  1
#define VERBOSE       0

#define LIMITED_RANGE 0    // hide difference in output due to absence of overflows
#define USE_FLOATS    0

It will (see output fragment here):

run 100 x 1024 iterations (i.e. 100 different random seeds)
for data length 1048576 (2^20).
The random input data is uniformly distributed over the full range of the element data type (int64_t)
Verify output by generating a hash digest of the output array and comparing it to the reference implementation from the OP.
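
The verification step can be sketched roughly as follows. This is not the code from the gist: the hash used there is not shown here, so FNV-1a is swapped in as an assumption, and `digest` and `digest_after_run` are hypothetical names. The sketch clears the output buffer before every run, so that results left over from an earlier run cannot mask a candidate that skips part of the input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 64-bit FNV-1a over the bytes of each element, low byte first.
// (Assumption: any reasonable digest works; FNV-1a keeps the sketch short.)
uint64_t digest(const std::vector<int64_t>& data)
{
    uint64_t h = 14695981039346656037ULL;        // FNV offset basis
    for (int64_t v : data) {
        const uint64_t u = static_cast<uint64_t>(v);
        for (int b = 0; b < 8; ++b) {
            h ^= (u >> (8 * b)) & 0xffu;
            h *= 1099511628211ULL;               // FNV prime
        }
    }
    return h;
}

// Runs `algorithm(in, out, n)` on a cleared output buffer and returns the
// digest of what it wrote.
template <typename F>
uint64_t digest_after_run(F algorithm,
                          const std::vector<int64_t>& in,
                          std::vector<int64_t>& out)
{
    assert(out.size() == in.size());
    std::fill(out.begin(), out.end(), 0);        // drop results of earlier runs
    algorithm(in.data(), out.data(), in.size());
    return digest(out);
}
```

A candidate is then accepted only if its digest equals the digest produced by the reference loop on the same input.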

Results

There are a number of (surprising or unsurprising) results:

There is no significant performance difference between any of the algorithms (for integer data), provided you compile with optimizations enabled. (See the Makefile; my architecture is 64-bit, Intel Core Q9550 with gcc-4.6.1.)
The algorithms are not equivalent (you’ll see that the hash sums differ): notably, the bit fiddle proposed by Alex doesn’t handle integer overflow in quite the same way (this can be hidden by defining
```
#define LIMITED_RANGE 1
```
which limits the input data so that overflows won’t occur. Note that the partial_sum_incorrect version shows equivalent non-bitwise C++ arithmetic operations that yield the same (different) results:
```
return max<0 ? v :  max + v; 
```
Perhaps it is OK for your purpose?)
Surprisingly, it is not more expensive to calculate both definitions of the max algorithm at once. You can see this being done inside partial_sum_correct: it calculates both ‘formulations’ of max in the same loop. This is really no more than trivia here, because neither of the two methods is significantly faster…
Even more surprisingly, a big performance boost can be had when you are able to use float instead of int64_t. A quick and dirty hack can be applied to the benchmark
```
#define USE_FLOATS    1
```
showing that the STL-based algorithm (partial_sum_incorrect) runs approximately 2.5x faster when using float instead of int64_t (!!!).
Note:
- that the naming of partial_sum_incorrect only relates to integer overflow, which doesn’t apply to floats; this can be seen from the fact that the hashes match up, so in fact it is partial_sum_float_correct 🙂
- that the current implementation of partial_sum_correct is doing double work that causes it to perform badly in floating point mode. See bullet 3.
(And there was that off-by-1 bug in the loop-unrolled version from the OP I mentioned before)
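
To make the relationship between the formulations concrete, here is a small sketch that puts the branchy update from the question, the mask trick and the ternary form side by side and checks that they agree on a small range where no overflow can occur. The function names are hypothetical. The mask version assumes that right-shifting a negative `int64_t` is an arithmetic shift, which is implementation-defined before C++20 but is what common compilers do. The sketch deliberately does not exercise overflow, since signed overflow is undefined behaviour in C++.

```cpp
#include <cassert>
#include <cstdint>

// Branchy update from the question.
inline int64_t step_branch(int64_t max, int64_t v)
{
    max += v;
    if (v > max) max = v;
    return max;
}

// Mask trick: -max >> 63 is all ones when max > 0 and zero otherwise
// (assuming an arithmetic right shift), so a non-positive max is cleared.
inline int64_t step_mask(int64_t max, int64_t v)
{
    max &= -max >> 63;
    return max + v;
}

// Non-bitwise form quoted in the results above.
inline int64_t step_ternary(int64_t max, int64_t v)
{
    return max < 0 ? v : max + v;
}

// Exhaustive check over a small range, far away from any overflow.
void check_small_range()
{
    for (int64_t max = -50; max <= 50; ++max) {
        for (int64_t v = -50; v <= 50; ++v) {
            assert(step_mask(max, v) == step_branch(max, v));
            assert(step_ternary(max, v) == step_branch(max, v));
        }
    }
}
```

Away from overflow all three are the same function; the differing hash sums come only from inputs whose sums do not fit in `int64_t`.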

Partial sum

For your interest, the partial sum application looks like this in C++11:

std::partial_sum(data.begin(), data.end(), output.begin(), 
        [](int64_t max, int64_t v) -> int64_t
        { 
            max += v;
            if (v > max) max = v;
            return max;
        });
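
A self-contained version of the call above might look like the sketch below. The wrapper name `running_max_sum` is hypothetical. Note that `std::partial_sum` copies the first element to the output unchanged, which matches the original loop only when the running value starts at zero or below; that starting state is an assumption here, since the original keeps it in a class member.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Writes the running value for each prefix of `data` into `output`.
void running_max_sum(const std::vector<int64_t>& data, std::vector<int64_t>& output)
{
    assert(output.size() >= data.size());  // partial_sum writes data.size() elements

    std::partial_sum(data.begin(), data.end(), output.begin(),
        [](int64_t max, int64_t v) -> int64_t
        {
            max += v;               // extend the running sum
            if (v > max) max = v;   // restart from v when v alone is larger
            return max;
        });
}
```

For the input 3, -5, 2, -1, 4, -10 this writes 3, -2, 2, 1, 5, -5.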
