The Archive Base


Editorial Team
Asked: May 23, 2026 (2026-05-23T03:08:32+00:00)


I have developed a cumulative sum function as defined below in the Haskell library Repa. However, I have run into an issue when combining this function with the transpose operation. All 3 of the following operations take well under a second:

cumsum $ cumsum $ cumsum x
transpose $ transpose $ transpose x
transpose $ cumsum x

However, if I write:

cumsum $ transpose x

performance degrades horrendously. While each individual operation in isolation takes well under a second on a 1920×1080 image, when combined they now take 30+ seconds…

Any ideas on what could be causing this? My gut tells me it has something to do with delayed arrays, not forcing at the right time, etc… But I do not have enough experience to track this down quite yet.

{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}

import Data.Array.Repa as Repa

{-# INLINE indexSlice #-}
indexSlice :: (Shape sh, Elt a) => Int -> Array (sh :. Int) a -> (sh :. Int) -> a
indexSlice from arr (z :. ix) = arr `unsafeIndex` (z :. (ix + from))

{-# INLINE sliceRange #-}
sliceRange :: (Slice sh, Shape sh, Elt a) => Int -> Int -> Array (sh :. Int) a -> Array (sh :. Int) a
sliceRange from to arr = fromFunction (z :. (to - from + 1)) $ indexSlice from arr
    where (z :. _) = extent arr

{-# INLINE cumsum' #-}
cumsum' :: (Slice (SliceShape sh), Slice sh, Shape (FullShape sh), Shape (SliceShape sh), Elt a, Num a) =>
     Array (FullShape sh :. Int) a -> t -> (sh :. Int) -> a
cumsum' arr f (sh :. outer) = Repa.sumAll $ sliceRange 0 outer $ Repa.slice arr (sh :. All)

{-# INLINE cumsum #-}
cumsum :: (FullShape sh ~ sh, Slice sh, Slice (SliceShape sh), Shape sh, Shape (SliceShape sh), Elt a, Num a) =>
    Array (sh :. Int) a -> Array (sh :. Int) a
cumsum arr = Repa.force $ unsafeTraverse arr id $ cumsum' arr
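
For reference, the following is a minimal sketch of what the cumsum above computes along the innermost dimension, written with plain lists rather than Repa arrays. The names cumsumRowSpec, cumsumRowScan and cumsum2D are hypothetical and are not part of the posted code or of Repa; the sketch only pins down the expected result, it says nothing about performance.

```haskell
-- Reference model only: plain lists stand in for Repa arrays.
module Main where

-- | Prefix sums of one row, written the same way as cumsum' above:
-- output element i is the sum of input elements 0..i.
cumsumRowSpec :: Num a => [a] -> [a]
cumsumRowSpec xs = [ sum (take (i + 1) xs) | i <- [0 .. length xs - 1] ]

-- | The same result computed in a single pass over the row.
cumsumRowScan :: Num a => [a] -> [a]
cumsumRowScan = scanl1 (+)

-- | Apply the row-wise cumulative sum to every row of a 2D "image".
cumsum2D :: Num a => [[a]] -> [[a]]
cumsum2D = map cumsumRowScan

main :: IO ()
main = do
  let img = [[1, 2, 3], [4, 5, 6]] :: [[Int]]
  print (cumsum2D img)                           -- [[1,3,6],[4,9,15]]
  print (map cumsumRowSpec img == cumsum2D img)  -- True
```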

1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 3:08 am

    From a library implementor’s perspective, the way to debug this is to create a wrapper for the suspect operation, then inspect the GHC Core output to see whether fusion has worked.

    -- Main.hs ---------------------------------------------------
    import Solver
    import Data.Array.Repa.IO.BMP
    
    main 
     = do   Right img       <- readImageFromBMP "whatever.bmp"
            print $ cumsumBMP img
    
    -- Solver.hs --------------------------------------------------
    {-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
    module Solver (cumsumBMP) where
    import Data.Array.Repa  as Repa
    import Data.Word
    
    {- all your defs -}
    
    {-# NOINLINE cumsumBMP #-}
    cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
    cumsumBMP img = cumsum $ transpose img
    

    I’ve put the “solver” code in a separate module, so we only have to wade through the core code for the definitions we care about.

    Compile like:

    touch Solver.hs ; ghc -O2 --make Main.hs \
     -ddump-simpl -dsuppress-module-prefixes -dsuppress-coercions  > dump
    

    Go to the definition of cumsumBMP and search for the letrec keyword. Searching for letrec is a quick way to find the inner loops.

    Not too far down I see this (slightly reformatted):

    case gen_a1tr
    of _ {
      GenManifest vec_a1tv ->
        case sh2_a1tc  `cast` ... of _ { :. sh3_a1iu  sh4_a1iv ->
        case ix'_a1t9  `cast` ... of _ { :. sh1'_a1iz sh2'_a1iA ->
        case sh3_a1iu  `cast` ... of _ { :. sh5_X1n0  sh6_X1n2 ->
        case sh1'_a1iz `cast` ... of _ { :. sh1'1_X1n9 sh2'1_X1nb ->
        case sh5_X1n0             of _ { :. sh7_X1n8   sh8_X1na ->
        ...
        case sh2'1_X1nb           of _ { I# y3_X1nO ->
        case sh4_a1iv             of _ { I# y4_X1nP ->
        case sh2'_a1iA            of _ { I# y5_X1nX ->
        ...
        let { x3_a1x6 :: Int# [LclId]
          x3_a1x6 =
            +#
              (*#
                 (+#
                    (*#
                       y1_a1iM
                       y2_X1nG)
                    y3_X1nO)
                 y4_X1nP)
              y5_X1nX } in
        case >=#
               x3_a1x6
               0
        of ...
    

    Disaster! The x3_a1x6 binding is clearly doing some useful work (multiplications, additions and suchlike), but it’s wrapped in a long series of unboxing operations that are also executed on every loop iteration. What’s worse, it’s unboxing the shape (the length and width) of the array at every iteration, even though this information never changes. GHC should really float these case expressions out of the loop, but it doesn’t yet. This is an instance of Issue #4081 on the GHC Trac, which hopefully will be fixed sometime soon.
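
To make the shape of the problem concrete, here is a minimal sketch in plain Haskell, without Repa. Shape2, linearIndex, sumColumnNaive and sumColumnHoisted are hypothetical names invented for this illustration. The first loop takes the boxed shape apart inside the loop body; the second matches on it strictly once, before the loop starts, which is the transformation we would like GHC to perform. For a toy this small GHC may well optimise both versions identically, so treat it as a picture of the idea rather than a benchmark.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Hypothetical stand-in for an array shape: boxed height and width.
data Shape2 = Shape2 Int Int

-- | Row-major linear index. The pattern match on the shape lives here,
-- so as written it is repeated by every caller on every call.
linearIndex :: Shape2 -> Int -> Int -> Int
linearIndex (Shape2 _ w) row col = row * w + col

-- | Sum the linear indices of one column, passing the boxed shape
-- into the loop body on every iteration.
sumColumnNaive :: Shape2 -> Int -> Int
sumColumnNaive sh col = go 0 0
  where
    Shape2 h _ = sh            -- lazy pattern binding
    go !acc !row
      | row >= h  = acc
      | otherwise = go (acc + linearIndex sh row col) (row + 1)

-- | Same computation, but the shape is matched strictly once, up front,
-- so the loop body only touches the already-evaluated h and w.
sumColumnHoisted :: Shape2 -> Int -> Int
sumColumnHoisted (Shape2 !h !w) !col = go 0 0
  where
    go !acc !row
      | row >= h  = acc
      | otherwise = go (acc + row * w + col) (row + 1)

main :: IO ()
main = do
  let sh = Shape2 3 4
  print (sumColumnNaive sh 1)    -- 15
  print (sumColumnHoisted sh 1)  -- 15
```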

    The workaround is to apply deepSeqArray to the incoming array. This places a demand on its value at the top level (outside the loop), which lets GHC know it’s OK to move the case matches further up. For a function like cumsumBMP, we also expect the incoming array to already be manifest, so we can add an explicit case match for this:

    {-# NOINLINE cumsumBMP #-}
    cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
    cumsumBMP img@(Array _ [Region RangeAll (GenManifest _)])
      = img `deepSeqArray` cumsum $ transpose img
    

    Compiling again, the inner loop now looks much better:

    letrec {
    $s$wfoldlM'_loop_s2mW [...]
      :: Int# -> Word# -> Word# [...]
    $s$wfoldlM'_loop_s2mW =
      \ (sc_s2mA :: Int#) (sc1_s2mB :: Word#) ->
        case <=# sc_s2mA a_s2ji of _ {
          False -> sc1_s2mB;
          True ->
            $s$wfoldlM'_loop_s2mW
              (+# sc_s2mA 1)
              (narrow8Word#
                 (plusWord#
                    sc1_s2mB
                    (indexWord8Array#
                       rb3_a2gZ
                       (+#
                          rb1_a2gX
                          (+#
                             (*#
                                (+#
                                   (*#
                                      wild19_X1zO
                                      ipv1_X1m5)
                                   sc_s2mA)
                                ipv2_X1m0)
                             wild20_X1Ct)))))
        }; } in
    

    That’s a tight, tail-recursive loop that only uses primitive operations. Provided you compile with -fllvm -optlo-O3, there’s no reason that won’t run as fast as an equivalent C program.

    There’s a slight hiccup when running it though:

    desire:tmp benl$ ./Main 
    Main: Solver.hs:(50,1)-(51,45): Non-exhaustive patterns in function cumsumBMP
    

    This just reminds us that we need to force the array before calling cumsumBMP.

    -- Main.hs ---------------------------------------------------
    ...
    import Data.Array.Repa as Repa
    main 
     = do   Right img       <- readImageFromBMP "whatever.bmp"
            print $ cumsumBMP $ Repa.force img
    

    In summary:

    1. You need to add some deepSeqArray and pattern matching goop to your top-level functions to work around a current infelicity in GHC. This is demonstrated by the final version of the cumsumBMP function above. If you want GHC HQ to fix this soon, add yourself as a cc to Issue #4081 on the GHC Trac. Repa programs will be much prettier when this is fixed.
    2. You don’t need to add the goop to every function. In this example I didn’t need to touch indexSlice and friends. The general rule is to add the goop to functions that use force, fold or sumAll. These functions instantiate the actual loops that operate over the array data; that is, they convert a delayed array to a manifest value.
    3. The performance of a piece of Repa code is determined as much by the context in which it’s used as by the actual code. If you pass your top-level functions delayed arrays, they will run very slowly. There is more discussion of this in The Repa Tutorial.
    4. BMP files read with the repa-io library aren’t pre-forced, so you need to force them before use. This is probably the wrong default, so I’ll change it in the next version.
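
The difference between delayed and manifest arrays in point 3 can be pictured with a toy model. This is a sketch under stated assumptions, not Repa's real representation or API: Arr, index, force, reverseD, cumsumD and arrToList are hypothetical names defined here (this force is a local function, not Repa.force), and reverseD merely stands in for an index-shuffling operation such as transpose.

```haskell
module Main where

import Data.Array (Array, listArray, (!))

-- Toy model only; Repa's actual array type is more involved.
data Arr a
  = Manifest Int (Array Int a)  -- length and elements stored in memory
  | Delayed  Int (Int -> a)     -- length and a function run on every index

size :: Arr a -> Int
size (Manifest n _) = n
size (Delayed  n _) = n

index :: Arr a -> Int -> a
index (Manifest _ v) i = v ! i
index (Delayed  _ f) i = f i

-- | Evaluate every element once and store the results.
force :: Arr a -> Arr a
force arr = Manifest n (listArray (0, n - 1) [index arr i | i <- [0 .. n - 1]])
  where n = size arr

-- | Reverse an array without copying; the result is delayed.
reverseD :: Arr a -> Arr a
reverseD arr = Delayed n (\i -> index arr (n - 1 - i))
  where n = size arr

-- | Prefix sums in the style of the question: element i re-reads inputs 0..i.
cumsumD :: Num a => Arr a -> Arr a
cumsumD arr = Delayed (size arr) (\i -> sum [index arr j | j <- [0 .. i]])

arrToList :: Arr a -> [a]
arrToList arr = [index arr i | i <- [0 .. size arr - 1]]

main :: IO ()
main = do
  let xs   = Manifest 4 (listArray (0, 3) [1, 2, 3, 4]) :: Arr Int
      slow = cumsumD (reverseD xs)          -- every read goes through reverseD's function
      fast = cumsumD (force (reverseD xs))  -- reads hit stored elements
  print (arrToList slow)                    -- [4,7,9,10]
  print (arrToList slow == arrToList fast)  -- True
```

Both pipelines produce the same values; the only difference is whether each read inside cumsumD re-runs the function of the delayed input or looks up an element that was computed once by force.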

© 2021 The Archive Base. All Rights Reserved