The following code truncates a Double to a Word16 (I suspect the other word types behave similarly, but I had to choose one for the example).
truncate1 :: Double -> Word16
truncate1 = fromIntegral . (truncate :: Double -> Int)
As you can see, I first truncate to Int and only then convert to Word16. I benchmarked this function against a direct truncation:
truncate2 :: Double -> Word16
truncate2 = truncate
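Before comparing timings, it may be worth confirming that the two definitions agree on inputs inside the Word16 range. The following is only a minimal sketch: the names sampleInputs and agreeOnSamples, and the sample values themselves, are made up for illustration, and values outside 0..65535 are deliberately not covered.

```haskell
import Data.Word (Word16)

-- Via Int first, then a narrowing conversion (as defined above).
truncate1 :: Double -> Word16
truncate1 = fromIntegral . (truncate :: Double -> Int)

-- Direct truncation (as defined above).
truncate2 :: Double -> Word16
truncate2 = truncate

-- Hypothetical sample inputs, all within the Word16 range 0..65535.
sampleInputs :: [Double]
sampleInputs = [0, 0.99, 3.7, 1234.5, 65535.9]

-- True when both definitions give the same result on every sample.
agreeOnSamples :: Bool
agreeOnSamples = all (\x -> truncate1 x == truncate2 x) sampleInputs
```

Evaluating agreeOnSamples in GHCi is enough to run the check; it says nothing about performance, only that the benchmark compares like with like.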
To my surprise, the first version (going through Int first) performed much better, or equivalently, the second performed much worse. According to the criterion output:
benchmarking truncate/truncate1
mean: 25.42399 ns, lb -47.40484 ps, ub 67.87578 ns, ci 0.950
std dev: 145.5661 ns, lb 84.90195 ns, ub 244.2057 ns, ci 0.950
found 197 outliers among 100 samples (197.0%)
97 (97.0%) low severe
100 (100.0%) high severe
variance introduced by outliers: 99.000%
variance is severely inflated by outliers
benchmarking truncate/truncate2
mean: 781.0604 ns, lb 509.3264 ns, ub 1.086767 us, ci 0.950
std dev: 1.436660 us, lb 1.218997 us, ub 1.592479 us, ci 0.950
found 177 outliers among 100 samples (177.0%)
77 (77.0%) low severe
100 (100.0%) high severe
variance introduced by outliers: 98.995%
variance is severely inflated by outliers
To be honest, I have only just started using Criterion, so I am no expert, but I understand that 25.42399 ns is a shorter execution time than 781.0604 ns. I suspect that some specialization is playing a role here. Is truncate2 too slow? If so, can truncate be improved? And does anybody know an even faster way? It feels wrong to cast through a type I don’t really use.
Thanks in advance.
I am compiling with GHC-7.4.2, optimizations enabled (-O2).
First, note that the module GHC.Word includes a RULE pragma for truncate at type Double -> Word16. It is a simple rewrite rule that performs precisely the optimization your truncate1 provides. So we have a few questions to consider:
Why is this an optimization at all?
Because the default implementation of truncate is generic, to support any Integral instance. The speed difference you see is the cost of that generality; in the specific case of truncating one primitive type to another, there are much faster methods available.
So it seems that truncate1 is benefiting from a specialized form, while truncate2 is not.
Why is truncate1 faster?
In GHC.Float, where the RealFrac instance for Double is defined, there is a RULE pragma that rewrites truncate at type Double -> Int to double2Int, which is the optimized form we want. Compare this to the RULE mentioned earlier: apparently there’s no similar primitive operation specifically for converting Double to Word16.
Why doesn’t truncate2 get rewritten as well?
Quoth the GHC User’s Guide:
"Expressions being matched are not eta-expanded, which is to say that a rule matching on forall x. foo x will match in bar y = foo y but not in bar = foo."
Since your definitions are all written point-free, the RULE for Double -> Int matches, but the RULE for Double -> Word16 does not.