I have written the following program using Parallel Haskell to find the divisors of 1 billion.
import Control.Parallel
parfindDivisors :: Integer->[Integer]
parfindDivisors n = f1 `par` (f2 `par` (f1 ++ f2))
  where f1 = filter g [1..(quot n 4)]
        f2 = filter g [(quot n 4)+1..(quot n 2)]
        g z = n `rem` z == 0
main = print (parfindDivisors 1000000000)
I’ve compiled the program with ghc -rtsopts -threaded findDivisors.hs and I run it with:
findDivisors.exe +RTS -s -N2 -RTS
I have found a 50% speedup compared to the simple version which is this:
findDivisors :: Integer->[Integer]
findDivisors n = filter g [1..(quot n 2)]
  where g z = n `rem` z == 0
My processor is an Intel Core 2 Duo (dual core).
I was wondering whether the above code can be improved, because the statistics that the program prints say:
Parallel GC work balance: 1.01 (16940708 / 16772868, ideal 2)
and SPARKS: 2 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
What do converted, overflowed, dud, GC’d and fizzled mean, and how can they help to improve the running time?
IMO, the Par monad helps for reasoning about parallelism. It’s a little higher-level than dealing with par and pseq.

parfindDivisors can be rewritten using the Par monad; the result is essentially the same as your algorithm. Compiling that with -O2 -threaded -rtsopts -eventlog and running with +RTS -N2 -s gives runtime stats in which the productivity is very high. To improve the GC work balance slightly we can increase the GC allocation area size; run with +RTS -N2 -s -A128M, for example.

But this is really just nitpicking. The real story comes from ThreadScope: the utilisation is essentially maxed out for two cores, so additional significant parallelization (for two cores) is probably not going to happen.
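As a rough sketch of the Par monad version described above: this assumes the monad-par package and its Control.Monad.Par module (runPar, spawnP, get). The names parFindDivisors, q and isDivisor are only illustrative; the split into two ranges mirrors your parfindDivisors, so treat it as one possible shape rather than the definitive rewrite.

```haskell
import Control.Monad.Par (runPar, spawnP, get)

-- Sketch only: the same two-way split as parfindDivisors,
-- expressed with the Par monad (assumes the monad-par package).
parFindDivisors :: Integer -> [Integer]
parFindDivisors n = runPar $ do
    -- spawnP forks a pure computation and evaluates it fully (needs NFData).
    lo <- spawnP (filter isDivisor [1 .. q])
    hi <- spawnP (filter isDivisor [q + 1 .. n `quot` 2])
    -- get blocks until the corresponding result is available.
    xs <- get lo
    ys <- get hi
    return (xs ++ ys)
  where
    q = n `quot` 4
    isDivisor z = n `rem` z == 0
```

Unlike par, spawnP evaluates each half completely inside its own task, so there is no risk of sparking a list that only gets evaluated to its first cons cell. To try it, add main = print (parFindDivisors 1000000000) and build with -O2 -threaded -rtsopts as above.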
Some good notes on the Par monad are here.

UPDATE

The alternative algorithm can be rewritten using Par as well. But you’re right in that this will not get any gains from parallelism due to the dependence on firstDivs.

You can still incorporate parallelism here by getting Strategies involved to evaluate the elements of the list comprehensions in parallel. When I ran a version like that, almost 50 sparks were converted (that is, meaningful parallel work was being done), but the computations were not large enough to observe any wall-clock gains from parallelism. Any gains were probably offset by the overhead of scheduling computations in the threaded runtime.
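To make the Strategies idea concrete, here is a minimal sketch. It assumes the alternative algorithm collects the divisors up to the square root of n in firstDivs and then derives the remaining ones as cofactors; that structure, and the names divisorsStrat and secondDivs, are my assumptions for illustration. It uses parList, rseq and using from Control.Parallel.Strategies (parallel package).

```haskell
import Control.Parallel.Strategies (parList, rseq, using)

-- Hypothetical reconstruction: small divisors first, then their cofactors.
-- Note that this version also returns n itself.
divisorsStrat :: Integer -> [Integer]
divisorsStrat n = firstDivs ++ secondDivs
  where
    -- Divisors d with d * d <= n, found by trial division.
    firstDivs = [d | d <- takeWhile (\k -> k * k <= n) [1 ..], n `rem` d == 0]
    -- Cofactors n/d; parList creates one spark per list element.
    -- They all depend on firstDivs, so the sequential scan still dominates.
    secondDivs = [n `quot` d | d <- reverse firstDivs, d * d /= n]
                   `using` parList rseq
```

Each spark here does a single division, which is far too little work to pay for its scheduling. That fits the picture above: sparks get converted, but the wall-clock time does not improve.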