I have developed a cumulative sum function as defined below in the Haskell library Repa. However, I have run into an issue when combining this function with the transpose operation. All 3 of the following operations take well under a second:
cumsum $ cumsum $ cumsum x
transpose $ transpose $ transpose x
transpose $ cumsum x
However, if I write:
cumsum $ transpose x
performance degrades horrendously. While each individual operation in isolation takes well under a second on a 1920×1080 image, when combined they now take 30+ seconds…
Any ideas on what could be causing this? My gut tells me it has something to do with delayed arrays, not forcing at the right time, etc… But I do not have enough experience to track this down quite yet.
{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
import Data.Array.Repa as Repa
{-# INLINE indexSlice #-}
indexSlice :: (Shape sh, Elt a) => Int -> Array (sh :. Int) a -> (sh :. Int) -> a
indexSlice from arr (z :. ix) = arr `unsafeIndex` (z :. (ix + from))
{-# INLINE sliceRange #-}
sliceRange :: (Slice sh, Shape sh, Elt a) => Int -> Int -> Array (sh :. Int) a -> Array (sh :. Int) a
sliceRange from to arr = fromFunction (z :. (to - from + 1)) $ indexSlice from arr
  where (z :. _) = extent arr
{-# INLINE cumsum' #-}
cumsum' :: (Slice (SliceShape sh), Slice sh, Shape (FullShape sh), Shape (SliceShape sh), Elt a, Num a) =>
  Array (FullShape sh :. Int) a -> t -> (sh :. Int) -> a
cumsum' arr f (sh :. outer) = Repa.sumAll $ sliceRange 0 outer $ Repa.slice arr (sh :. All)
{-# INLINE cumsum #-}
cumsum :: (FullShape sh ~ sh, Slice sh, Slice (SliceShape sh), Shape sh, Shape (SliceShape sh), Elt a, Num a) =>
  Array (sh :. Int) a -> Array (sh :. Int) a
cumsum arr = Repa.force $ unsafeTraverse arr id $ cumsum' arr
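For reference, the result that cumsum is meant to produce along the innermost dimension can be modelled with plain lists. This is only a sketch and does not use Repa; cumsumRowSpec and cumsumRowScan are hypothetical names introduced for illustration. The first follows the shape of cumsum' (each output element sums a prefix of its row), the second is the usual single-pass formulation of the same result.

```haskell
-- List model of a cumulative sum along one row.
-- Illustrative only: these names are hypothetical and this is not Repa code.

-- Follows the structure of cumsum': element i is the sum of elements 0..i.
cumsumRowSpec :: Num a => [a] -> [a]
cumsumRowSpec xs = [ sum (take (i + 1) xs) | i <- [0 .. length xs - 1] ]

-- Single pass: carry the running total along the row.
cumsumRowScan :: Num a => [a] -> [a]
cumsumRowScan = scanl1 (+)
```

Both functions agree on every input; the list version is only useful as a specification to check a Repa implementation against on small inputs.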
From a library implementor’s perspective, the way to debug this is to create a wrapper for the suspect operation, then look at the core code to see if fusion has worked.
I’ve put the “solver” code in a separate module, so we only have to wade through the core code for the definitions we care about.
Compile like:
Go to the definition of cumsumBMP and search for the letrec keyword. Searching for letrec is a quick way to find the inner loops.
Not too far down I see this (slightly reformatted):
Disaster! The x3_a1x6 binding is clearly doing some useful work (multiplications, additions and suchlike), but it’s wrapped in a long series of unboxing operations that are also executed on every loop iteration. What’s worse, it’s unboxing the length and width (shape) of the array at every iteration, and this information will always be the same. GHC should really float these case expressions out of the loop, but it doesn’t yet. This is an instance of Issue #4081 on the GHC trac, which hopefully will be fixed sometime soon.
The workaround is to apply deepSeqArray to the incoming array. This places a demand on its value at the top level (outside the loop), which lets GHC know it’s ok to move the case matches further up. For a function like cumsumBMP, we also expect the incoming array to already be manifest, so we can add an explicit case match for this:
Compiling again, the inner loop now looks much better:
That’s a tight, tail-recursive loop that only uses primitive operations. Provided you compile with -fllvm -optlo-O3, there’s no reason that won’t run as fast as an equivalent C program.
There’s a slight hiccup when running it though:
This just reminds us that we need to force the array before calling cumsumBMP.
In summary:
Add some deepSeqArray and pattern matching goop to your top level functions to work around a current infelicity in GHC. This is demonstrated by the final version of the cumsumBMP function above. If you want GHC HQ to fix this soon then add yourself as a cc to Issue #4081 on the GHC trac. Repa programs will be much prettier when this is fixed.
The goop is not needed on indexSlice and friends. The general rule is to add the goop to functions that use force, fold or sumAll. These functions instantiate the actual loops that operate over the array data, that is, they convert a delayed array to a manifest value.
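The distinction between delayed and manifest arrays, and the shape of the workaround, can be sketched with a toy model. This is not the Repa API: Arr, extentArr, indexArr, transposeArr, forceArr and sumTransposed are hypothetical names, and a list stands in for the unboxed buffer a real library would use. The sketch only assumes what is described above: a delayed array is an element function, forcing runs the loop that produces a manifest value, and the top-level consumer matches on the manifest constructor and demands the extent once, before its loop starts.

```haskell
-- Toy model of delayed vs. manifest arrays. NOT the Repa API; all names
-- here are hypothetical and exist only to illustrate the pattern.

data Arr a
  = Delayed  Int Int (Int -> Int -> a)  -- rows, columns, element function
  | Manifest Int Int [a]                -- rows, columns, row-major elements

extentArr :: Arr a -> (Int, Int)
extentArr (Delayed  r c _) = (r, c)
extentArr (Manifest r c _) = (r, c)

indexArr :: Arr a -> Int -> Int -> a
indexArr (Delayed  _ _ f)  i j = f i j
indexArr (Manifest _ c xs) i j = xs !! (i * c + j)  -- O(n) lookup; fine for a toy

-- Delayed transpose: swaps the extent and the index arguments.
-- No element is computed here.
transposeArr :: Arr a -> Arr a
transposeArr arr = Delayed c r (\i j -> indexArr arr j i)
  where (r, c) = extentArr arr

-- Forcing instantiates the loop: every element is evaluated (to weak head
-- normal form) and stored, turning a delayed array into a manifest one.
forceArr :: Arr a -> Arr a
forceArr arr = foldr seq () xs `seq` Manifest r c xs
  where
    (r, c) = extentArr arr
    xs     = [ indexArr arr i j | i <- [0 .. r - 1], j <- [0 .. c - 1] ]

-- Top-level consumer carrying the "goop": it matches on the Manifest
-- constructor and demands the extent once, before the loop starts.
sumTransposed :: Arr Int -> Int
sumTransposed arr@(Manifest r c _) = r `seq` c `seq` go 0 0
  where
    n = r * c
    t = transposeArr arr               -- extent of t is (c, r)
    go acc k
      | k >= n    = acc
      | otherwise =
          let acc' = acc + indexArr t (k `div` r) (k `mod` r)
          in  acc' `seq` go acc' (k + 1)
sumTransposed (Delayed _ _ _) =
  error "sumTransposed: force the array before calling"
```

In this model, calling sumTransposed on a delayed array fails immediately, which plays the same role as the reminder to force the array before calling cumsumBMP. Whether a given GHC version needs the explicit demand to hoist the unboxing is not something the toy shows; it only illustrates where the demand and the pattern match are placed.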