I have written the following program using Parallel Haskell to find the divisors of

Asked: June 11, 2026

I have written the following program using Parallel Haskell to find the divisors of 1 billion.

import Control.Parallel

parfindDivisors :: Integer->[Integer]
parfindDivisors n = f1 `par` (f2 `par` (f1 ++ f2))
              where f1=filter g [1..(quot n 4)]
                    f2=filter g [(quot n 4)+1..(quot n 2)]
                    g z = n `rem` z == 0

main = print (parfindDivisors 1000000000)

I’ve compiled the program with ghc -rtsopts -threaded findDivisors.hs and I run it with:
findDivisors.exe +RTS -s -N2 -RTS

This gives a 50% speedup over the simple sequential version:

findDivisors :: Integer->[Integer]
findDivisors n = filter g [1..(quot n 2)] 
      where  g z = n `rem` z == 0

My processor is an Intel Core 2 Duo (dual core).
Can the code above be improved any further? I ask because the statistics the program prints include:
Parallel GC work balance: 1.01 (16940708 / 16772868, ideal 2)
and
SPARKS: 2 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
What do converted, overflowed, dud, GC'd and fizzled mean, and how can they help improve the running time?

1 Answer

Editorial Team · Answer 1 · 2026-06-11T17:35:45+00:00

IMO, the Par monad makes parallelism easier to reason about. It's a little higher-level than dealing with par and pseq directly.

Here’s a rewrite of parfindDivisors using the Par monad. Note that this is essentially the same as your algorithm:

import Control.Monad.Par

findDivisors :: Integer -> [Integer]
findDivisors n = runPar $ do
    -- One IVar per half of the search range.
    f0 <- new
    f1 <- new
    fork $ put f0 (filter g [1..(quot n 4)])
    fork $ put f1 (filter g [(quot n 4)+1..(quot n 2)])
    -- Block until each half has been fully evaluated.
    f0' <- get f0
    f1' <- get f1
    return $ f0' ++ f1'
  where g z  = n `rem` z == 0
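
The same shape extends to more than two pieces of work. Below is a minimal sketch, assuming monad-par's spawnP (which forks a computation and returns the IVar holding its fully evaluated result). The names chunkBounds and findDivisorsChunked are hypothetical and only for illustration; the chunk count k is a parameter you would tune, for example to the number of cores.

```haskell
import Control.Monad.Par (runPar, spawnP, get)

-- Hypothetical helper: split [1 .. n `quot` 2] into at most k contiguous
-- (lo, hi) ranges of roughly equal size.
chunkBounds :: Int -> Integer -> [(Integer, Integer)]
chunkBounds k n =
    [ (lo, min upper (lo + size - 1)) | lo <- [1, 1 + size .. upper] ]
  where
    k'    = toInteger (max 1 k)
    upper = n `quot` 2
    -- Ceiling division, clamped to 1 so the step is never zero.
    size  = max 1 ((upper + k' - 1) `quot` k')

-- Hypothetical variant of findDivisors: one spawned task per range.
findDivisorsChunked :: Int -> Integer -> [Integer]
findDivisorsChunked k n = runPar $ do
    ivars <- mapM (spawnP . divisorsIn) (chunkBounds k n)
    -- Results are collected in range order, so the output stays sorted.
    concat <$> mapM get ivars
  where
    divisorsIn (lo, hi) = filter (\z -> n `rem` z == 0) [lo .. hi]
```

With k = 2 this does essentially the same work as the two-fork version above. Using more chunks than cores may help balance the load, though that is something to check in ThreadScope rather than assume.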

Compiling that with -O2 -threaded -rtsopts -eventlog and running with +RTS -N2 -s yields the following relevant runtime stats:

  36,000,130,784 bytes allocated in the heap
       3,165,440 bytes copied during GC
          48,464 bytes maximum residency (1 sample(s))

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     35162 colls, 35161 par    0.39s    0.32s     0.0000s    0.0006s
  Gen  1         1 colls,     1 par    0.00s    0.00s     0.0002s    0.0002s

  Parallel GC work balance: 1.32 (205296 / 155521, ideal 2)

  MUT     time   42.68s  ( 21.48s elapsed)
  GC      time    0.39s  (  0.32s elapsed)
  Total   time   43.07s  ( 21.80s elapsed)

  Alloc rate    843,407,880 bytes per MUT second

  Productivity  99.1% of total user, 195.8% of total elapsed

The productivity is very high. To improve the GC work balance slightly we can increase the GC allocation area size; run with +RTS -N2 -s -A128M, for example:

  36,000,131,336 bytes allocated in the heap
          47,088 bytes copied during GC
          49,808 bytes maximum residency (1 sample(s))

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       135 colls,   134 par    0.19s    0.10s     0.0007s    0.0009s
  Gen  1         1 colls,     1 par    0.00s    0.00s     0.0010s    0.0010s

  Parallel GC work balance: 1.62 (2918 / 1801, ideal 2)

  MUT     time   42.65s  ( 21.49s elapsed)
  GC      time    0.20s  (  0.10s elapsed)
  Total   time   42.85s  ( 21.59s elapsed)

  Alloc rate    843,925,806 bytes per MUT second

  Productivity  99.5% of total user, 197.5% of total elapsed

But this is really just nitpicking. The real story comes from ThreadScope:

[ThreadScope screenshot: lots of utilisation]

The utilisation is essentially maxed out for two cores, so additional significant parallelization (for two cores) is probably not going to happen.

Some good notes on the Par monad are here.

UPDATE

A rewrite of the alternative algorithm using Par looks something like this:

findDivisors :: Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
    a <- new
    b <- new
    -- Quotients n/x for every x <= sqrt n that divides n (the "large" divisors).
    fork $ put a [q | (q, r) <- [quotRem n x | x <- [1..sqrtn]], r == 0]
    firstDivs  <- get a
    -- Matching "small" divisors; skip sqrtn so a perfect square is not counted twice.
    fork $ put b [n `quot` x | x <- firstDivs, x /= sqrtn]
    secondDivs <- get b
    return $ firstDivs ++ secondDivs

But you're right that this will not gain anything from parallelism: secondDivs depends on firstDivs, so the second fork cannot start until the first has finished.
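
For comparison, here is a purely sequential sketch of the square-root idea with no Par at all. It only tests candidates up to the square root of n, so for 1 billion that is roughly 31,623 trial divisions instead of 500 million. The names isqrt and divisorsSqrt are hypothetical; the integer square root is my own substitution for floor (sqrt (fromInteger n)), made on the assumption that you would want to avoid Double rounding for very large n. Unlike the original findDivisors, this version also returns n itself.

```haskell
-- Hypothetical helper: largest r with r * r <= n, for n >= 0,
-- found by binary search using only Integer arithmetic.
isqrt :: Integer -> Integer
isqrt n = go 0 (n + 1)          -- invariant: lo * lo <= n < hi * hi
  where
    go lo hi
      | hi - lo <= 1   = lo
      | mid * mid <= n = go mid hi
      | otherwise      = go lo mid
      where mid = (lo + hi) `quot` 2

-- All divisors of n in ascending order.
divisorsSqrt :: Integer -> [Integer]
divisorsSqrt n = small ++ reverse large
  where
    -- Divisors up to the square root, found by trial division.
    small = [x | x <- [1 .. isqrt n], n `rem` x == 0]
    -- Their cofactors; skip x when x * x == n to avoid a duplicate.
    large = [n `quot` x | x <- small, x * x /= n]
```

Calling divisorsSqrt 36 gives [1,2,3,4,6,9,12,18,36]. Because the whole computation is this small, there is very little left for a second core to do.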

You can still incorporate parallelism here, by getting Strategies involved to evaluate the elements of the list comprehensions in parallel. Something like:

import Control.Monad.Par
import Control.Parallel.Strategies

findDivisors ::  Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
    [a, b] <- sequence [new, new]
    fork $ put a 
        ([a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0] `using` parListChunk 2 rdeepseq)
    firstDivs  <- get a
    fork $ put b 
        ([n `quot` x | x <- firstDivs, x /= sqrtn] `using` parListChunk 2 rdeepseq)
    secondDivs <- get b
    return $ firstDivs ++ secondDivs

Running this gives stats like:

       3,388,800 bytes allocated in the heap
          43,656 bytes copied during GC
          68,032 bytes maximum residency (1 sample(s))

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         5 colls,     4 par    0.00s    0.00s     0.0000s    0.0001s
  Gen  1         1 colls,     1 par    0.00s    0.00s     0.0002s    0.0002s

  Parallel GC work balance: 1.22 (2800 / 2290, ideal 2)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  2 (bound)  :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)

  SPARKS: 50 (49 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  MUT     time    0.01s  (  0.00s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  Total   time    0.01s  (  0.01s elapsed)

  Alloc rate    501,672,834 bytes per MUT second

  Productivity  85.0% of total user, 95.2% of total elapsed

Here 49 of the 50 sparks were converted, that is, meaningful parallel work was being done, but the computations were too small to show any wall-clock gain. Any gains were probably offset by the overhead of scheduling computations in the threaded runtime.
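
On the spark counters asked about in the question: a spark is created each time par is called. As I understand the runtime's accounting, "converted" means an idle core picked the spark up and evaluated it, "fizzled" means the value had already been evaluated by another thread by the time the spark was picked up, "dud" means the expression was already evaluated when par was called, "GC'd" means the sparked expression was no longer referenced and was discarded, and "overflowed" means the spark pool was full. In the original program the main thread most likely starts evaluating f1 itself as soon as it reaches f1 ++ f2, which would explain the one fizzled spark. A minimal sketch of the usual par/pseq arrangement follows; the use of force (from the deepseq package) is my assumption about what you want, since par on its own only evaluates a list to its first constructor.

```haskell
import Control.DeepSeq (force)
import Control.Parallel (par, pseq)

-- Sketch: spark f1 for another core, evaluate f2 on the current thread,
-- and only then combine the two halves.
parFindDivisors :: Integer -> [Integer]
parFindDivisors n = f1 `par` (f2 `pseq` (f1 ++ f2))
  where
    -- force makes evaluating f1 or f2 compute the entire filtered list,
    -- not just its first cons cell.
    f1 = force (filter g [1 .. n `quot` 4])
    f2 = force (filter g [n `quot` 4 + 1 .. n `quot` 2])
    g z = n `rem` z == 0
```

Compiled with -threaded and run with +RTS -N2 -s, this creates a single spark. Whether it shows up as converted still depends on the scheduler, so treat the counter as a diagnostic rather than a guarantee.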
