My aim is to have a parallel foldr function. At first, it seemed
rather straight forward to achieve and this is what I had in mind:
First break up the input list into partitions based on the number of
cores (numCapabilities). Then apply foldr to each partition, which
will result in a list of folded values for each partition. Then do a
foldr again on that list to obtain the final value.
listChunkSize = numCapabilities
chunk n [] = []
chunk n xs = ys : chunk n zs
where (ys,zs) = splitAt n xs
parfoldr f z [] = z
parfoldr f z xs = res
where
parts = chunk listChunkSize xs
partsRs = map (foldr f z) parts `using` parList rdeepseq
res = foldr f z partsRs
The above code does not work because obviously the definition of
foldr, (a -> b -> b) -> b -> [a] -> b, implies that the input list
type is (well, can be) different from the accumulator and result type.
For example,
1) foldr (+) 0 [1..10] => list type = accumulator type (Integer)
2) foldr (\i acc -> (i>5) && acc) True [1..10] => list type (Integer) !
= accumulator type (Bool)
So, looking at my code above, the map will generate a list of type b
which is then passed as argument to the second foldr. But the second
foldr accepts list of type a. So, that won’t work.
An ugly solution would be to provide a different type signature for
the parfoldr, e.g.
parfoldr :: (NFData a) => (a -> a -> a) -> a -> [a] -> a
This will work but then it is not exactly equivalent to foldr. Example
1 above will do just fine, but not example 2.
So, question 1 is: how to define parfoldr to have same type signature
as foldr?
Comparing the 2 folds:
input = [1..1000000]
seqfold = foldr (+) 0
parfold = parfoldr (+) 0
I get the foll. times on a dual core machine:
(no -threaded flag)
$ ./test
seqfold: 4.99s
parfold: 25.16s
(-threaded flag on)
$ ./test
seqfold: 5.32s
parfold: 25.55s
$ ./test +RTS -N1
seqfold: 5.32s
parfold: 25.53s
$ ./test +RTS -N2
seqfold: 3.48s
parfold: 3.68s
$ ./test +RTS -N3
seqfold: 3.57s
parfold: 2.36s
$ ./test +RTS -N4
seqfold: 3.03s
parfold: 1.70s
Observations from these measurements:
-
foldr seems to give lower runtime when num of cores is increased.
why is that? -
parfold gives better runtimes for N => 3.
Any suggestions and ideas for improvement is appreciated 🙂
foldris not in general parallelizable, as its interface allows sequential dependencies. In order to be able to rearrange the computations in the way you described you’ll need to limit yourself to associative operators with an identity element. This is known as a monoid, and what you’ve implemented is essentially a parallelmconcat.