I have an application where part of the inner loop was basically: double sum

Question

0

Asked: May 13, 20262026-05-13T09:51:04+00:00 2026-05-13T09:51:04+00:00

I have an application where part of the inner loop was basically: double sum

0

I have an application where part of the inner loop was basically:

double sum = 0;
for (int i = 0; i != N; ++i, ++data, ++x) sum += *data * x;

If x is an unsigned int, then the code takes 3 times as long as with int!

This was part of a larger code-base, but I got it down to the essentials:

#include <iostream>                                      
#include <cstdlib>                                       
#include <vector>
#include <time.h>

typedef unsigned char uint8;

template<typename T>
double moments(const uint8* data, int N, T wrap) {
    T pos = 0;
    double sum = 0.;
    for (int i = 0; i != N; ++i, ++data) {
        sum += *data * pos;
        ++pos;
        if (pos == wrap) pos = 0;
    }
    return sum;
}

template<typename T>
const char* name() { return "unknown"; }

template<>
const char* name<int>() { return "int"; }

template<>
const char* name<unsigned int>() { return "unsigned int"; }

const int Nr_Samples = 10 * 1000;

template<typename T>
void measure(const std::vector<uint8>& data) {
    const uint8* dataptr = &data[0];
    double moments_results[Nr_Samples];
    time_t start, end;
    time(&start);
    for (int i = 0; i != Nr_Samples; ++i) {
        moments_results[i] = moments<T>(dataptr, data.size(), 128);
    }
    time(&end);
    double avg = 0.0;
    for (int i = 0; i != Nr_Samples; ++i) avg += moments_results[i];
    avg /= Nr_Samples;
    std::cout << "With " << name<T>() << ": " << avg << " in " << (end - start) << "secs" << std::endl;
}


int main() {
    std::vector<uint8> data(128*1024);
    for (int i = 0; i != data.size(); ++i) data[i] = std::rand();
    measure<int>(data);
    measure<unsigned int>(data);
    measure<int>(data);
    return 0;
}

Compiling with no optimization:

luispedro@oakeshott:/home/luispedro/tmp/so §g++  test.cpp    
luispedro@oakeshott:/home/luispedro/tmp/so §./a.out
With int: 1.06353e+09 in 9secs
With unsigned int: 1.06353e+09 in 14secs
With int: 1.06353e+09 in 9secs

With optimization:

luispedro@oakeshott:/home/luispedro/tmp/so §g++  -O3  test.cpp
luispedro@oakeshott:/home/luispedro/tmp/so §./a.out
With int: 1.06353e+09 in 3secs
With unsigned int: 1.06353e+09 in 12secs
With int: 1.06353e+09 in 4secs

I don’t understand why such a large difference in speed. I tried figuring it out from the generated assembly, but I got nowhere. Anyone have any thoughts?

Is this something to do with the hardware or is it a limitation of gcc’s optimisation machinery? I’m betting the second.

My machine is an Intel 32 bit running Ubuntu 9.10.

Edit: Since Stephen asked, here is the de-compiled source (from a -O3 compilation). I believe I got the main loops:

int version:

40: 0f b6 14 0b             movzbl (%ebx,%ecx,1),%edx
     sum += *data * pos;
44: 0f b6 d2                movzbl %dl,%edx
47: 0f af d0                imul   %eax,%edx
      ++pos;
4a: 83 c0 01                add    $0x1,%eax
      sum += *data * pos;
4d: 89 95 54 c7 fe ff       mov    %edx,-0x138ac(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
53: 31 d2                   xor    %edx,%edx
55: 3d 80 00 00 00          cmp    $0x80,%eax
5a: 0f 94 c2                sete   %dl
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
5d: 83 c1 01                add    $0x1,%ecx
      sum += *data * pos;
60: db 85 54 c7 fe ff       fildl  -0x138ac(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
66: 83 ea 01                sub    $0x1,%edx
69: 21 d0                   and    %edx,%eax
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
6b: 39 f1                   cmp    %esi,%ecx
      sum += *data * pos;
6d: de c1                   faddp  %st,%st(1)
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
6f: 75 cf                   jne    40

unsigned version:

50: 0f b6 34 13             movzbl (%ebx,%edx,1),%esi
      sum += *data * pos;
54: 81 e6 ff 00 00 00       and    $0xff,%esi
5a: 31 ff                   xor    %edi,%edi
5c: 0f af f0                imul   %eax,%esi
      ++pos;
5f: 83 c0 01                add    $0x1,%eax
      if (pos == wrap) pos = 0;
62: 3d 80 00 00 00          cmp    $0x80,%eax
67: 0f 94 c1                sete   %cl
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
6a: 83 c2 01                add    $0x1,%edx
      sum += *data * pos;
6d: 89 bd 54 c7 fe ff       mov    %edi,-0x138ac(%ebp)
73: 89 b5 50 c7 fe ff       mov    %esi,-0x138b0(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
79: 89 ce                   mov    %ecx,%esi
7b: 81 e6 ff 00 00 00       and    $0xff,%esi
      sum += *data * pos;
81: df ad 50 c7 fe ff       fildll -0x138b0(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
87: 83 ee 01                sub    $0x1,%esi
8a: 21 f0                   and    %esi,%eax
  for (int i = 0; i != N; ++i, ++data) {
8c: 3b 95 34 c7 fe ff       cmp    -0x138cc(%ebp),%edx
      sum += *data * pos;
92: de c1                   faddp  %st,%st(1)
  for (int i = 0; i != N; ++i, ++data) {
94: 75 ba                   jne    50

This is the -O3 version, which is why the source lines jump up and down.
Thank you.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-13T09:51:04+00:00

Here’s why: many common architectures (including x86) have a hardware instruction to convert signed int to doubles, but do not have a hardware conversion from unsigned to double, so the compiler needs to synthesize the conversion in software. Furthermore, the only unsigned multiply on Intel is a full width multiply, whereas signed multiplies can use the signed multiply low instruction.

GCC’s software conversion from unsigned int to double may very well be suboptimal (it almost certainly is, given the magnitude of the slowdown that you observed), but it is expected behavior for the code to be faster when using signed integers.

Assuming a smart compiler, the difference should be much smaller on a 64-bit system, because a 64-bit signed integer -> double conversion can be used to efficiently do a 32-bit unsigned conversion.

Edit: to illustrate, this:

sum += *data * x;

if the integer variables are signed, should compile into something along these lines:

mov       (data),   %eax
imul      %ecx,     %eax
cvtsi2sd  %eax,     %xmm1
addsd     %xmm1,    %xmm0

on the other hand, if the integer variables are unsigned, cvtsi2sd can’t be used to do the conversion, so a software workaround is required. I would expect to see something like this:

    mov       (data),   %eax
    mul       %ecx            // might be slower than imul
    cvtsi2sd  %eax,     %xmm1 // convert as though signed integer
    test      %eax,     %eax  // check if high bit was set
    jge       1f              // if it was, we need to adjust the converted
    addsd     (2^32),   %xmm1 // value by adding 2^32
1:  addsd     %xmm1,    %xmm0

That would be “acceptable” codegen for the unsigned -> double conversion; it could easily be worse.

All of this is assuming floating-point code generation to SSE (I believe this is the default on the Ubuntu tools, but I could be wrong).

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have an application where part of the inner loop was basically: double sum

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply