Editorial Team
Asked: May 26, 2026

I have CPU-intensive code in which a function containing a loop is executed many times. Every optimization in this loop brings a noticeable performance gain. Question: how would you optimize this loop (although there is not much more to optimize)?

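// max is a class member (this function is a method; see the edits below)
void theloop(int64_t in[], int64_t out[], size_t N)
{
    // size_t index, so the comparison with N cannot wrap for large N
    for(size_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
}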

I tried a few things, e.g. I replaced the arrays with pointers that were incremented on every iteration, but (surprisingly) I lost some performance instead of gaining it.

Edit:

  • changed the name of one variable (itsMaximums, an error)
  • the function is a method of a class
  • in and out are int64_t, so values can be negative or positive
  • `(v > max)` can evaluate to true: consider the situation when the current max is negative
  • the code runs on a 32-bit PC (development) and a 64-bit one (production)
  • N is unknown at compile time
  • I tried some SIMD, but I failed to increase performance (the overhead of moving the variables to __m128i, executing, and storing back was higher than the SSE speed gain). I am not an expert on SSE, though, so maybe my code was poor.

Results:

I added some loop unrolling and a nice hack from Alex's post. Below are some results:

  1. original: 14.0s
  2. unrolled loop (4 iterations): 10.44s
  3. Alex's trick: 10.89s
  4. 2) and 3) at once: 11.71s

Strange that 4) is not faster than 2) and 3). Below is the code for 4):

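// As posted: i starts at 1 (element 0 is skipped) and there is no tail
// handling, so the last chunk may read past N.
for(size_t i = 1; i < N; i+=CHUNK) {
    int64_t t_in0 = in[i+0];
    int64_t t_in1 = in[i+1];
    int64_t t_in2 = in[i+2];
    int64_t t_in3 = in[i+3];

    // Branch-free clamp: keeps max if it is positive, otherwise resets it to 0.
    max &= -max >> 63;
    max += t_in0;
    out[i+0] = max;

    max &= -max >> 63;
    max += t_in1;
    out[i+1] = max;

    max &= -max >> 63;
    max += t_in2;
    out[i+2] = max;

    max &= -max >> 63;
    max += t_in3;
    out[i+3] = max;
}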
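For comparison, here is a minimal sketch of a 4-way unrolled loop that starts at index 0 and finishes the leftover elements in a scalar tail loop. It is only an illustration: `step` and `theloop_unrolled` are hypothetical names, and the class member `max` is replaced by an explicit `running` parameter so that the function stands alone.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Recurrence from the question: running = max(running + v, v).
inline std::int64_t step(std::int64_t running, std::int64_t v) {
    running += v;
    if (v > running) running = v;
    return running;
}

// Hypothetical free-function variant; `running` stands in for the class member `max`.
void theloop_unrolled(const std::int64_t* in, std::int64_t* out,
                      std::size_t N, std::int64_t& running) {
    constexpr std::size_t CHUNK = 4;
    std::size_t i = 0;

    // Main loop: only runs while a full chunk of CHUNK elements remains.
    for (; i + CHUNK <= N; i += CHUNK) {
        running = step(running, in[i + 0]); out[i + 0] = running;
        running = step(running, in[i + 1]); out[i + 1] = running;
        running = step(running, in[i + 2]); out[i + 2] = running;
        running = step(running, in[i + 3]); out[i + 3] = running;
    }

    // Tail loop: handles the remaining N % CHUNK elements.
    for (; i < N; ++i) {
        running = step(running, in[i]);
        out[i] = running;
    }
}
```

Each step depends on the previous value of `running`, so unrolling removes loop overhead but not the dependency chain; whether it helps is something only a measurement on the target machine can tell.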
1 Answer

Editorial Team
Added an answer on May 26, 2026 at 7:40 pm


    > **Announcement**: see [chat](https://chat.stackoverflow.com/rooms/5056/discussion-between-sehe-and-jakub-m)
    > > _Hi Jakub, what would you say if I have found a version that uses a heuristic optimization that, for uniformly distributed random data, will result in a ~3.2x speed increase for `int64_t` (10.56x effective using `float`s)?_
    >
    > I have yet to find the time to update the post, but the explanation and code can be found through the chat.
    >
    > I used the same test-bed code (below) to verify that the results are correct and exactly match the original implementation from your OP.
    >
    > **Edit**: ironically, that testbed had a fatal flaw which rendered the results invalid: the heuristic version was in fact skipping parts of the input, but because the existing output wasn't being cleared, it appeared to produce the correct output (still editing).


    Ok, I have published a benchmark based on your code versions, and also my proposed use of partial_sum.

    Find all the code here https://gist.github.com/1368992#file_test.cpp

    Features

    For a default config of

    #define MAGNITUDE     20
    #define ITERATIONS    1024
    #define VERIFICATION  1
    #define VERBOSE       0
    
    #define LIMITED_RANGE 0    // hide difference in output due to absence of overflows
    #define USE_FLOATS    0
    

    It will:

    • run 100 x 1024 iterations (i.e. 100 different random seeds)
    • for data length 1048576 (2^20).
    • The random input data is uniformly distributed over the full range of the element data type (int64_t)
    • Verify output by generating a hash digest of the output array and comparing it to the reference implementation from the OP.
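The verification step above can be sketched roughly as follows. This is not the code from the gist: the FNV-1a hash, the names `reference`, `digest` and `matches_reference`, and the initial running value of 0 are all assumptions made for illustration. The candidate writes into a freshly filled buffer, which guards against the stale-output flaw described in the announcement.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference recurrence from the question; the initial running value of 0 is an assumption.
std::vector<std::int64_t> reference(const std::vector<std::int64_t>& in) {
    std::vector<std::int64_t> out(in.size());
    std::int64_t running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        if (in[i] > running) running = in[i];
        out[i] = running;
    }
    return out;
}

// 64-bit FNV-1a over the bytes of each element, low byte first (illustrative hash choice).
std::uint64_t digest(const std::vector<std::int64_t>& data) {
    std::uint64_t h = 0xcbf29ce484222325ULL;      // FNV offset basis
    for (std::int64_t v : data) {
        const std::uint64_t u = static_cast<std::uint64_t>(v);
        for (int b = 0; b < 8; ++b) {
            h ^= (u >> (8 * b)) & 0xffu;
            h *= 0x100000001b3ULL;                // FNV prime
        }
    }
    return h;
}

// Hypothetical check: run `candidate(in, out)` on a fresh buffer and compare digests.
template <typename F>
bool matches_reference(F candidate, const std::vector<std::int64_t>& in) {
    // Fresh sentinel-filled buffer, so results left over from an earlier run cannot pass.
    std::vector<std::int64_t> out(in.size(), -1);
    candidate(in, out);
    return digest(out) == digest(reference(in));
}
```

A digest comparison only shows that two outputs very probably agree; for debugging a mismatch, an element-by-element comparison is more informative.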

    Results

    There are a number of (surprising or unsurprising) results:

    1. There is no significant performance difference between any of the algorithms whatsoever (for integer data), provided you are compiling with optimizations enabled. (See the Makefile; my arch is 64-bit, Intel Core Q9550 with gcc-4.6.1.)

    2. The algorithms are not equivalent (you'll see the hash sums differ): notably, the bit fiddle proposed by Alex doesn't handle integer overflow in quite the same way. This can be hidden by defining

      #define LIMITED_RANGE 1

      which limits the input data so that overflows won't occur. Note that the partial_sum_incorrect version shows equivalent C++ non-bitwise arithmetic operations that yield the same (differing) results as the bit fiddle:

      return max<0 ? v :  max + v;

      Perhaps that is OK for your purpose?

    3. Surprisingly, it is not more expensive to calculate both definitions of the max algorithm at once. You can see this being done inside partial_sum_correct: it calculates both 'formulations' of max in the same loop. This is really no more than trivia here, because neither of the two methods is significantly faster.

    4. Even more surprisingly, a big performance boost can be had when you are able to use float instead of int64_t. A quick and dirty hack can be applied to the benchmark

      #define USE_FLOATS    1

      showing that the STL-based algorithm (partial_sum_incorrect) runs approximately 2.5x faster when using float instead of int64_t (!!!).
      Note:

      • that the naming of partial_sum_incorrect only relates to integer overflow, which doesn’t apply to floats; this can be seen from the fact that the hashes match up, so in fact it is partial_sum_float_correct 🙂
      • that the current implementation of partial_sum_correct is doing double work that causes it to perform badly in floating point mode. See bullet 3.
    5. (And there was that off-by-1 bug in the loop-unrolled version from the OP I mentioned before)
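To make result 2 concrete, the three formulations discussed here can be put side by side. The function names are hypothetical. Without overflow, max(m + v, v) equals v when m < 0 and m + v otherwise, so all three agree; they can only diverge once `running + v` overflows. Signed overflow is undefined behaviour in C++, so this sketch deliberately stays within a small range. It also assumes that `>>` on a negative `int64_t` is an arithmetic shift, which is implementation-defined before C++20.

```cpp
#include <cstdint>

// Formulation from the question: add first, then compare.
inline std::int64_t step_branch(std::int64_t running, std::int64_t v) {
    running += v;
    if (v > running) running = v;
    return running;
}

// Arithmetic formulation quoted in the answer (partial_sum_incorrect).
inline std::int64_t step_select(std::int64_t running, std::int64_t v) {
    return running < 0 ? v : running + v;
}

// Bit trick from the question's unrolled loop.
// Assumes an arithmetic right shift and running != INT64_MIN.
inline std::int64_t step_bits(std::int64_t running, std::int64_t v) {
    running &= -running >> 63;   // all-ones mask if running > 0, else 0
    return running + v;
}
```

For inputs where overflow cannot happen, any of the three can be swapped in; for full-range inputs the choice changes the output, which is what the differing hash sums reflect.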

    Partial sum

    For your interest, the partial sum application looks like this in C++11:

    std::partial_sum(data.begin(), data.end(), output.begin(), 
            [](int64_t max, int64_t v) -> int64_t
            { 
                max += v;
                if (v > max) max = v;
                return max;
            });
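A self-contained version of the snippet above might look like the following sketch; the wrapper name `running_max_sum` is hypothetical. One detail worth knowing: `std::partial_sum` copies the first element to the output unchanged, which matches the original loop only when the running maximum starts at a value less than or equal to 0.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical wrapper around the std::partial_sum call from the answer.
std::vector<std::int64_t> running_max_sum(const std::vector<std::int64_t>& data) {
    std::vector<std::int64_t> output(data.size());
    std::partial_sum(data.begin(), data.end(), output.begin(),
        [](std::int64_t max, std::int64_t v) -> std::int64_t {
            max += v;                 // extend the running value
            if (v > max) max = v;     // or restart from the current element
            return max;
        });
    return output;
}
```

For example, the input `{-4, -3, 5, 2}` yields `{-4, -3, 5, 7}`.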
    