Still working on my SHA1 implementation in Haskell. I’ve now got a working implementation and this is the inner loop:
iterateBlock' :: Int -> [Word32] -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
iterateBlock' 80 ws a b c d e = [a, b, c, d, e]
iterateBlock' t (w:ws) a b c d e = iterateBlock' (t+1) ws a' b' c' d' e'
    where
    a' = rotate a 5 + f t b c d + e + w + k t
    b' = a
    c' = rotate b 30
    d' = c
    e' = d
The profiler tells me that this function takes 1/3 of the runtime of my implementation. I can think of no way to further optimize it other than maybe inlining the temp variables but I believe -O2 will do that for me anyway.
Can anyone see a significant optimization that can be further applied?
FYI the k and f calls are below. They are so simple I don’t think there is a way to optimize these either. Unless the Data.Bits module is slow?
f :: Int -> Word32 -> Word32 -> Word32 -> Word32
f t b c d
    | t <= 19 = (b .&. c) .|. ((complement b) .&. d)
    | t <= 39 = b `xor` c `xor` d
    | t <= 59 = (b .&. c) .|. (b .&. d) .|. (c .&. d)
    | otherwise = b `xor` c `xor` d
k :: Int -> Word32
k t
    | t <= 19 = 0x5A827999
    | t <= 39 = 0x6ED9EBA1
    | t <= 59 = 0x8F1BBCDC
    | otherwise = 0xCA62C1D6
Looking at the core produced by ghc-7.2.2, the inlining works out well. What doesn’t work so well is that in each iteration a couple of Word32 values are first unboxed, to perform the work, and then reboxed for the next iteration. Unboxing and re-boxing can cost a surprisingly large amount of time (and allocation).
You can probably avoid that by using Word instead of Word32. You couldn’t use rotate from Data.Bits then, but would have to implement it yourself (not hard) to have it work also on 64-bit systems. For a' you would have to manually mask out the high bits.
Another point that looks suboptimal is that in each iteration t is compared to 19, 39 and 59 (if it’s large enough), so that the loop body contains four branches. It will probably be faster if you split iterateBlock' into four loops (0-19, 20-39, 40-59, 60-79) and use constants k1, …, k4, and four functions f1, …, f4 (without the t parameter) to avoid branches and have smaller code-size for each loop.
And, as Thomas said, using a list for the block data isn’t optimal, an unboxed Word array/vector would probably help too.
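To make the two suggestions above concrete, here is a rough sketch of how they could be combined. It is not the asker's code: the names mask32, rotl32, loop1 to loop4 and iterateBlockW are made up for illustration, and the bang patterns are an extra assumption to keep the state strict. It still takes the schedule as a list, so only the Word and four-loop changes are shown. Since rounds 60-79 use the same function as rounds 20-39, f2 is reused where the text speaks of f4.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Bits (complement, shiftL, shiftR, xor, (.&.), (.|.))

-- Keep only the low 32 bits; a no-op where Word is 32 bits wide.
mask32 :: Word -> Word
mask32 x = x .&. 0xFFFFFFFF

-- Rotate a 32-bit value held in a Word left by n bits (0 < n < 32).
-- Assumes x already fits in 32 bits.
rotl32 :: Word -> Int -> Word
rotl32 x n = mask32 ((x `shiftL` n) .|. (x `shiftR` (32 - n)))

-- Round functions without the t parameter. Their results stay within
-- 32 bits as long as the inputs do.
f1, f2, f3 :: Word -> Word -> Word -> Word
f1 b c d = (b .&. c) .|. (complement b .&. d)
f2 b c d = b `xor` c `xor` d
f3 b c d = (b .&. c) .|. (b .&. d) .|. (c .&. d)

k1, k2, k3, k4 :: Word
k1 = 0x5A827999
k2 = 0x6ED9EBA1
k3 = 0x8F1BBCDC
k4 = 0xCA62C1D6

-- Each loop counts down 20 rounds and then hands over to the next one,
-- so no loop body has to compare t against 19, 39 or 59.
loop1, loop2, loop3, loop4
  :: Int -> [Word] -> Word -> Word -> Word -> Word -> Word -> [Word]

loop1 0 ws !a !b !c !d !e = loop2 20 ws a b c d e
loop1 n (w:ws) !a !b !c !d !e =
  loop1 (n - 1) ws (mask32 (rotl32 a 5 + f1 b c d + e + w + k1)) a (rotl32 b 30) c d
loop1 _ [] _ _ _ _ _ = error "loop1: fewer than 80 schedule words"

loop2 0 ws !a !b !c !d !e = loop3 20 ws a b c d e
loop2 n (w:ws) !a !b !c !d !e =
  loop2 (n - 1) ws (mask32 (rotl32 a 5 + f2 b c d + e + w + k2)) a (rotl32 b 30) c d
loop2 _ [] _ _ _ _ _ = error "loop2: fewer than 80 schedule words"

loop3 0 ws !a !b !c !d !e = loop4 20 ws a b c d e
loop3 n (w:ws) !a !b !c !d !e =
  loop3 (n - 1) ws (mask32 (rotl32 a 5 + f3 b c d + e + w + k3)) a (rotl32 b 30) c d
loop3 _ [] _ _ _ _ _ = error "loop3: fewer than 80 schedule words"

loop4 0 _ !a !b !c !d !e = [a, b, c, d, e]
loop4 n (w:ws) !a !b !c !d !e =
  loop4 (n - 1) ws (mask32 (rotl32 a 5 + f2 b c d + e + w + k4)) a (rotl32 b 30) c d
loop4 _ [] _ _ _ _ _ = error "loop4: fewer than 80 schedule words"

-- Drop-in counterpart of iterateBlock' 0, working on Word throughout.
iterateBlockW :: [Word] -> Word -> Word -> Word -> Word -> Word -> [Word]
iterateBlockW = loop1 20
```

Whether this actually gets rid of the boxing and the branches depends on the GHC version and flags, so it is worth checking the core and profiling again rather than taking it on faith.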
With the bang patterns, the core looks much better. Two or three less-than-ideal points remain.
See all these narrow32Word#? They’re cheap, but not free. Only the outermost is needed, there may be a bit to harvest by hand-coding the steps and using Word.
Then the comparisons of t with 19, …, they appear twice, once to determine the k constant, and once for the f transform. The comparisons alone are cheap, but they cause branches and without them, further inlining may be possible. I expect a bit could be gained here too.
And still, the list. That means w can’t be unboxed, the core could be simpler if w were unboxable.
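As a sketch of the last point, the schedule could be passed as an unboxed array and indexed by t instead of being consumed as a list. The version below assumes Data.Array.Unboxed from the array package that ships with GHC; the name iterateBlockA and the local worker go are made up for illustration, and f and k are kept exactly as in the question so that only the list-to-array change is shown.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Bits (complement, rotate, xor, (.&.), (.|.))
import Data.Word (Word32)

-- f and k as in the question.
f :: Int -> Word32 -> Word32 -> Word32 -> Word32
f t b c d
  | t <= 19 = (b .&. c) .|. (complement b .&. d)
  | t <= 39 = b `xor` c `xor` d
  | t <= 59 = (b .&. c) .|. (b .&. d) .|. (c .&. d)
  | otherwise = b `xor` c `xor` d

k :: Int -> Word32
k t
  | t <= 19 = 0x5A827999
  | t <= 39 = 0x6ED9EBA1
  | t <= 59 = 0x8F1BBCDC
  | otherwise = 0xCA62C1D6

-- The 80 schedule words live in an unboxed array indexed 0..79,
-- so each w is read with (ws ! t) rather than taken off a list.
iterateBlockA
  :: UArray Int Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
iterateBlockA ws = go 0
  where
    go :: Int -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
    go 80 !a !b !c !d !e = [a, b, c, d, e]
    go t !a !b !c !d !e = go (t + 1) a' a (rotate b 30) c d
      where
        a' = rotate a 5 + f t b c d + e + (ws ! t) + k t

-- Build the array from an existing 80-element list.
scheduleArray :: [Word32] -> UArray Int Word32
scheduleArray = listArray (0, 79)
```

The (!) operator is bounds-checked, which costs a little on every read; whether that matters here is something to measure rather than assume.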