This guy makes the extraordinary claim that a binary search (in compiled C) is slower than hard-coded if-branches emitted by a code generator.
(Please excuse the Clojure code and the wacky title; the claims this guy makes relate to compilers in general.)
He writes:
I have seen this sort of code occasionally in dark corners. When a man knows
how his processor works, knows how his C compiler works, knows his data
structures, and really, really needs his loops to be fast then he will
occasionally write this sort of thing. This is the sort of code that Real Programmers write.
This is the binary search example (please excuse the Clojure), in which a sorted list of keys is split into a balanced tree:
Start: (1 2 3 4 6 8 9 10 11 12)
Finish: ((((1) (2)) ((3) ((4) (6)))) (((8) (9)) ((10) ((11) (12)))))
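For readers who do not speak Clojure, here is a rough C sketch of that Start-to-Finish step. The split rule (cut each list at n / 2, rounding down) is only inferred from the example above, and the names `nest` and `demo_keys` are made up for illustration; this is not the linked post's code.

```c
#include <stdio.h>
#include <string.h>

/* The keys from the Start line above. */
const int demo_keys[] = {1, 2, 3, 4, 6, 8, 9, 10, 11, 12};

/* Appends the nested form of keys[0..n-1] to the string in `out`.
   Assumes n >= 1 and that `out` is NUL-terminated and large enough. */
void nest(const int *keys, int n, char *out) {
    if (n == 1) {                       /* a single key becomes a leaf: (k) */
        sprintf(out + strlen(out), "(%d)", keys[0]);
        return;
    }
    int half = n / 2;                   /* assumed split rule: floor(n / 2) */
    strcat(out, "(");
    nest(keys, half, out);              /* left subtree: first half */
    strcat(out, " ");
    nest(keys + half, n - half, out);   /* right subtree: the rest */
    strcat(out, ")");
}
```

Called with an empty buffer and the ten keys, this should reproduce the Finish line. Each pair of parentheses marks one comparison that the generated if-tree later makes.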
Then he replaces the binary search with generated code, a tree of nested ifs that branch on hard-coded values:
(defn lookup-fn-handwritten [x]
  (if (< x 6)
    (if (< x 3)                ; x is < 6
      (if (< x 2)              ; x is < 3
        (if (< x 1)            ; x is < 2
          0                    ; < 1
          1)                   ; 1 <= x < 2
        3)                     ; 2 <= x < 3
      (if (< x 4)              ; 3 <= x < 6
        4                      ; 3 <= x < 4
        2))                    ; 4 <= x < 6
    (if (< x 10)               ; 6 <= x < 10
      (if (< x 9)              ; 6 <= x < 9
        (if (< x 8)
          2                    ; 6 <= x < 8
          3)                   ; 8 <= x < 9
        3)                     ; 9 <= x < 10
      (if (< x 11)             ; 10 < x
        (if (< x 12)           ; 11 <= x
          1                    ; 11 <= x < 12
          0)
        0))))                  ; 12 <= x
http://www.learningclojure.com/2010/09/clojure-faster-than-machine-code.html
My question is: will generated code that consists of nested if branches on hard-coded values be more efficient than a binary search? (In any language; the author claims this works in C, but then seems to demonstrate it only on the JVM.)
(Please again excuse the wacky title of the linked post – that’s just craziness.)
Well, the if-cascade probably does the same thing as the binary search, which means that it does the same comparisons, but without the associated “binary search management”: no index arithmetic, no loading keys from an array, no loop test. It’s an unrolled loop with the keys baked in as constants, and there is indeed a reason why compilers unroll loops. So yes, in principle it will be faster.
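To make the comparison concrete, here is a minimal C sketch of both variants over the keys from the question. The return value (the number of keys less than or equal to x) and the names `rank_loop` and `rank_unrolled` are assumptions chosen for illustration; they are not the lookup table from the linked post.

```c
/* Sorted keys, taken from the question's example list. */
static const int keys[] = {1, 2, 3, 4, 6, 8, 9, 10, 11, 12};
enum { NKEYS = (int)(sizeof keys / sizeof keys[0]) };

/* Ordinary binary search: returns how many keys are <= x.
   The loop maintains lo/hi/mid and loads keys[mid] on every step. */
int rank_loop(int x) {
    int lo = 0, hi = NKEYS;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (x < keys[mid])
            hi = mid;        /* answer is in [lo, mid] */
        else
            lo = mid + 1;    /* keys[mid] <= x, so answer is in [mid + 1, hi] */
    }
    return lo;
}

/* The same search as an if-cascade: same kind of comparisons,
   but the keys are immediate constants and there is no loop state. */
int rank_unrolled(int x) {
    if (x < 6) {
        if (x < 3) {
            if (x < 2)
                return (x < 1) ? 0 : 1;
            return 2;
        }
        return (x < 4) ? 3 : 4;
    }
    if (x < 10) {
        if (x < 9)
            return (x < 8) ? 5 : 6;
        return 7;
    }
    if (x < 11)
        return 8;
    return (x < 12) ? 9 : 10;
}
```

Both functions return the same result for every int; the only difference is how much bookkeeping surrounds each comparison.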
But will it REALLY be faster? Now there’s a problem called “cache”. Whether you unroll a loop or anything else, your code gets larger, so the benefit might be offset by more instruction-cache misses while fetching that code.
In addition, you never quite know what kind of instructions the compiler might be using to optimize the code, which it might not be using when you manually unroll the loop. Or the other way ’round, who knows. Even more so in languages that have a binary search “built in” so the compiler knows what it is dealing with.
Thus, just counting operations like “I have all the compares and none of the other stuff” may not be enough; there are other factors that affect execution time. And if you profile on one CPU to find out “my unrolled version is faster”, another CPU might still disagree.
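If you want to measure rather than count, a rough timing harness could look like the sketch below. Everything in it is an assumption made for illustration: the names, the input range 0..15, the use of `clock()` from `<time.h>`, and the two lookup variants, which return the number of keys less than or equal to x rather than the linked post's table. The calls go through a function pointer, which itself hampers inlining, so treat any numbers as a hint about one machine and one compiler, nothing more.

```c
#include <time.h>

typedef int (*lookup_fn)(int);

static const int keys[] = {1, 2, 3, 4, 6, 8, 9, 10, 11, 12};
enum { NKEYS = (int)(sizeof keys / sizeof keys[0]) };

/* Loop-based binary search: number of keys <= x. */
int rank_loop(int x) {
    int lo = 0, hi = NKEYS;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (x < keys[mid]) hi = mid; else lo = mid + 1;
    }
    return lo;
}

/* Hand-unrolled equivalent with hard-coded keys. */
int rank_unrolled(int x) {
    if (x < 6) {
        if (x < 3) {
            if (x < 2) return (x < 1) ? 0 : 1;
            return 2;
        }
        return (x < 4) ? 3 : 4;
    }
    if (x < 10) {
        if (x < 9) return (x < 8) ? 5 : 6;
        return 7;
    }
    if (x < 11) return 8;
    return (x < 12) ? 9 : 10;
}

/* Runs `reps` passes over the inputs 0..15 and returns elapsed CPU seconds.
   The checksum is accumulated in a volatile and stored in *sum so that the
   compiler cannot discard the calls. */
double time_lookup(lookup_fn f, long reps, long *sum) {
    volatile long acc = 0;
    clock_t start = clock();
    for (long r = 0; r < reps; ++r)
        for (int x = 0; x < 16; ++x)
            acc += f(x);
    clock_t stop = clock();
    *sum = acc;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Call `time_lookup` once per variant with the same `reps`, check that the two checksums agree, and then repeat on a different CPU or at a different optimization level before believing the result.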
Optimizing is a b-word, not sure whether I’m allowed to spell it out here 🙂