I wrote a program and compiled it for both the x64 and x86 platforms in Visual Studio 2010 on an Intel Core i5-2500. The x64 version takes about 19 seconds to execute and the x86 version takes about 17 seconds. What could be the reason for this behavior?
#include "timer.h"
#include <vector>
#include <iostream>
#include <algorithm>
#include <string>
#include <sstream>
/********************DECLARATIONS************************************************/
class Vector
{
public:
    Vector():x(0),y(0),z(0){}
    Vector(double x, double y, double z)
        : x(x)
        , y(y)
        , z(z)
    {
    }
    double x;
    double y;
    double z;
};
double Dot(const Vector& a, const Vector& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
class Vector2
{
public:
    typedef double value_type;
    Vector2():x(0),y(0){}
    Vector2(double x, double y)
        : x(x)
        , y(y)
    {
    }
    double x;
    double y;
};
/******************************TESTS***************************************************/
void Test(const std::vector<Vector>& m, std::vector<Vector2>& m2)
{
    Vector axisX(0.3f, 0.001f, 0.25f);
    Vector axisY(0.043f, 0.021f, 0.45f);
    std::vector<Vector2>::iterator i2 = m2.begin();
    std::for_each(m.begin(), m.end(),
        [&](const Vector& v)
    {
        Vector2 r(0,0);
        r.x = Dot(axisX, v);
        r.y = Dot(axisY, v);
        (*i2) = r;
        ++i2;
    });
}
int main()
{
    cpptask::Timer timer;
    int len2 = 300;
    size_t len = 5000000;
    std::vector<Vector> m;
    m.reserve(len);
    for (size_t i = 0; i < len; ++i)
    {
        m.push_back(Vector(i * 0.2345, i * 2.67, i * 0.98));
    }
    /***********************************************************************************/
    {
        std::vector<Vector2> m2(m.size());
        double time = 0;
        for (int i = 0; i < len2; ++i)
        {
            timer.Start();
            Test(m, m2);
            time += timer.End();
        }
        std::cout << "Dot product double - " << time / len2 << std::endl;
    }
    /***********************************************************************************/
    return 0;
}
Short Answer: It’s a compiler hiccup. The x64 optimizer fails on this loop.
Long Answer:
The x86 version is very slow if SSE2 is disabled, but with SSE2 enabled in x86 I’m able to reproduce your results.
If you dive into the assembly of the innermost loop, you’ll see that the x64 version has two extra memory copies at the end.
x86:
x64:
The x64 version has a lot more (unexplained) moves at the end of the loop. They look like some sort of memory-to-memory data copy.
EDIT:
It turns out that the x64 optimizer isn’t able to optimize out the following copy:
This is why the inner loop has two extra memory copies. If you change the loop to this:
This eliminates the copies. Now the x64 version is just as fast as the x86 version:
Lesson Learned: Compilers aren’t perfect.