I have a C program that performs n multiplications (a single multiplication in each of n iterations), and I found another approach that uses n/2 iterations of (1 multiplication + 2 additions). I know that both are O(n) in complexity, but in terms of CPU cycles, which is faster?
First of all, follow Dietrich Epp’s first advice: measuring is (at least for complex optimization problems) the only way to be sure.
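As a rough illustration of what "measuring" could look like, here is a minimal sketch. The two loops are hypothetical stand-ins for the question's variants (the actual algorithm was not posted), and the names scale_sum_mul, scale_sum_paired and time_variant are made up for this example. It uses unsigned arithmetic so both variants return the same value even on wrap-around, and the standard clock() function for timing.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical variant A: n iterations, one multiplication each. */
uint64_t scale_sum_mul(const uint64_t *a, size_t n, uint64_t k)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * k;
    return acc;
}

/* Hypothetical variant B: n/2 iterations, each with 1 multiplication
 * and 2 additions. Assumes n is even. */
uint64_t scale_sum_paired(const uint64_t *a, size_t n, uint64_t k)
{
    uint64_t acc = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        acc += (a[i] + a[i + 1]) * k;
    return acc;
}

/* Calls f `reps` times and returns the CPU time used, in seconds.
 * The accumulated result is written to *out so the compiler cannot
 * discard the work; k is varied per repetition for the same reason. */
double time_variant(uint64_t (*f)(const uint64_t *, size_t, uint64_t),
                    const uint64_t *a, size_t n, uint64_t k,
                    int reps, uint64_t *out)
{
    uint64_t sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += f(a, n, k + (uint64_t)r);
    clock_t end = clock();
    *out = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call time_variant once per variant on the same large array and compare the two times. Treat the result as specific to your compiler, flags and CPU: an optimizing compiler may vectorize or otherwise rewrite either loop, so the machine code can differ a lot from the source.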
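Now if you want to figure out why one is faster than the other, we can try. There are two different important performance measures: latency and reciprocal throughput. A short summary of the two:
Latency: the number of clock cycles that pass before an instruction's result is available to a dependent instruction.
Reciprocal throughput: the average number of clock cycles per instruction when a series of independent instructions of the same kind executes.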
For Sandy Bridge, the rec. throughput of an add r, r/i (from here on r = register, i = immediate, m = memory) is 0.33 while the latency is 1. An imul r, r has a latency of 3 and a rec. throughput of 1.
So as you see, it completely depends on your specific algorithm. If you can replace one imul with two independent adds, the adds cost about 2 × 0.33 ≈ 0.67 cycles against 1 cycle for the imul, so this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case, where the imul was bound by its 3-cycle latency, a speedup of ~350%). On the other hand, if your adds introduce a problematic dependency, they run at their latency of 1 cycle each, and one imul could be just as fast as one add.
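To make the dependency point concrete, here is a small sketch with two hypothetical functions (product_chain and product_two_chains are names invented for this example). Both compute the same product modulo 2^64, but the first forms one long dependency chain while the second keeps two independent chains in flight.

```c
#include <stddef.h>
#include <stdint.h>

/* One dependency chain: every multiplication has to wait for the
 * previous result, so the loop is limited by imul latency. */
uint64_t product_chain(const uint64_t *a, size_t n)
{
    uint64_t p = 1;
    for (size_t i = 0; i < n; i++)
        p *= a[i];
    return p;
}

/* Two independent chains: the multiplications into p0 and p1 do not
 * depend on each other, so the CPU can overlap them. */
uint64_t product_two_chains(const uint64_t *a, size_t n)
{
    uint64_t p0 = 1, p1 = 1;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        p0 *= a[i];
        p1 *= a[i + 1];
    }
    if (i < n)          /* leftover element when n is odd */
        p0 *= a[i];
    return p0 * p1;
}
```

With the Sandy Bridge numbers above, and assuming the compiler emits the loops roughly as written, the first version would need about 3 cycles per multiplication, while the second could approach 1.5 cycles per multiplication. That estimate ignores loads and loop overhead, so measure before relying on it.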
Also note that we’ve ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers – it’s much simpler to just measure the result of their efforts 😉
Anyways if you want a good listing of this stuff see this here (the above description of latency/rec. throughput is also from that particular document).