I have the following code that I use to compute the distance between two vectors:
double dist(vector<double> & vecA, vector<double> & vecB){
    double curDist = 0.0;
    for (size_t i = 0; i < vecA.size(); i++){
        double dif = vecA[i] - vecB[i];
        curDist += dif * dif;
    }
    return curDist;
}
This function is a major bottleneck in my application, which relies on a large number of distance calculations: it consumes more than 60% of CPU time on a typical input. Additionally, the following line:
double dif = vecA[i] - vecB[i];
is responsible for more than 77% of CPU time in this function. My question is: is it possible to somehow optimize this function?
Notes:
- To profile my application I have used Intel Amplifier XE;
- Reducing the number of distance computations is not a feasible solution for me.
There are two possible issues I can think of right now:
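- This computation is memory bound.
- There is an iteration-to-iteration dependency on curDist.

This computation is memory bound.
If your dataset is larger than your CPU cache, the loop is limited by memory bandwidth rather than by arithmetic. In that case, no amount of optimization is going to help unless you can restructure your algorithm.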
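One rough way to sanity-check the memory-bound hypothesis is to compare the size of the data the distance loop touches with the size of your cache. The sketch below is only a back-of-the-envelope estimate: the function names and the parameters numVectors, dim, and cacheBytes are hypothetical and not from the original post, and the real cache size has to come from your own CPU's specifications.

```cpp
#include <cstddef>

// Bytes read when every vector in the dataset is visited once.
// numVectors and dim are hypothetical parameters for illustration.
std::size_t workingSetBytes(std::size_t numVectors, std::size_t dim) {
    return numVectors * dim * sizeof(double);
}

// True if the whole dataset could stay resident in a cache of cacheBytes.
// cacheBytes is an assumption you supply (for example, your last-level cache size).
bool fitsInCache(std::size_t numVectors, std::size_t dim, std::size_t cacheBytes) {
    return workingSetBytes(numVectors, dim) <= cacheBytes;
}
```

For instance, 100,000 vectors of 128 doubles occupy 102,400,000 bytes (roughly 98 MiB), which is far more than an assumed 8 MiB last-level cache. In a case like that, the loop would probably spend most of its time waiting on memory.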
There is an iteration-to-iteration dependency on curDist.
You have a dependency on curDist. This will block vectorization by the compiler. (Also, don’t always trust the profiler numbers to the line. They can be inaccurate, especially after compiler optimizations.)

Normally, the compiler's vectorizer can split curDist into multiple partial sums and unroll/vectorize the loop. But it can’t do that under strict floating-point behavior. You can try relaxing your floating-point mode if you haven’t already. Or you can split the sum and unroll it yourself.

For example, this kind of optimization is something the compiler can do with integers, but not necessarily with floating-point: