I started looking at Haskell yesterday with the goal of actually learning it. I’ve written some trivial programs with it in programming language courses, but none of them really cared about efficiency. I’m trying to understand how to improve the running time of the following program.
My program solves the following toy problem (I know it’s simple to compute the answer by hand if you know what a factorial is, but I’m doing it the brute force way with a successor function):
http://projecteuler.net/problem=24
My algorithm for the successor function for lexicographic ordering given a list of finite length is the following:
- If the list is already in decreasing order, then we have the maximal element in the lexicographic ordering, so there’s no successor.
- Given a list h : t, either t is maximal in the lexicographic ordering or it’s not. In the latter case compute the successor of t. In the former case proceed as follows.
- Pick the smallest element d in t larger than h.
- Replace d with h in t giving a new list t’. The next element in the ordering is d : (sort t’).
My program that implements this is the following (lots of these functions are probably in the standard library):
max_list :: (Ord a) => [a] -> a
max_list [] = error "Empty list has no maximum!"
max_list (h:[]) = h
max_list (h:t) = max h (max_list t)
min_list :: (Ord a) => [a] -> a
min_list [] = error "Empty list has no minimum!"
min_list (h:[]) = h
min_list (h:t) = min h (min_list t)
-- replaces first occurrence of x in list with y
replace :: (Eq a) => a -> a -> [a] -> [a]
replace _ _ [] = []
replace x y (h:t)
    | h == x = y : t
    | otherwise = h : (replace x y t)
-- sort in increasing order
sort_list :: (Ord a) => [a] -> [a]
sort_list [] = []
sort_list (h:t) = (sort_list (filter (\x -> x <= h) t))
    ++ [h]
    ++ (sort_list (filter (\x -> x > h) t))
-- checks if list is in descending order
descending :: (Ord a) => [a] -> Bool
descending [] = True
descending (h:[]) = True
descending (h:t)
    | h > (max_list t) = descending t
    | otherwise = False
succ_list :: (Ord a) => [a] -> [a]
succ_list [] = []
succ_list (h:[]) = [h]
succ_list (h:t)
    | descending (h:t) = (h:t)
    | not (descending t) = h : succ_list t
    | otherwise = next_h : sort_list (replace next_h h t)
    where next_h = min_list (filter (\x -> x > h) t)
-- apply function n times
apply_times :: (Integral n) => n -> (a -> a) -> a -> a
apply_times n _ a
    | n <= 0 = a
apply_times n f a = apply_times (n-1) f (f a)
main = putStrLn (show (apply_times 999999 succ_list [0,1,2,3,4,5,6,7,8,9]))
Now the actual question. After noticing that my program took a while to run, I wrote an equivalent C program for comparison. My guess is that the lazy evaluation of Haskell causes the apply_times function to build a huge list in memory before it actually starts evaluating the result. I had to increase the runtime stack size for it to run. Since efficient Haskell programming seems to be about tricks, are there any nice tricks that could be used to minimize memory consumption? What about ways to minimize copying and garbage collection, since lists keep getting created over and over while a C implementation would do everything in place?
Since Haskell is supposedly efficient, I guess there has to be a way? One cool thing that I have to say about Haskell though is that the program worked correctly the first time it compiled, so that part of the language does seem to fulfill its promise.
Indeed, many of these functions are in the standard library. If you import Data.List, that makes sort available; maximum and minimum are available from the Prelude. The sort from Data.List is all in all more efficient than the quasi-quicksort, in particular since you have a lot of sorted chunks in the lists here.
Your descending is inefficient, O(n²), since it traverses the entire remaining tail in each step, although if the list is descending, the maximum of the tail must be its head. But that has a nice consequence here. It prevents the build-up of thunks, since the first guard of the third equation of succ_list forces the list to be completely evaluated. However, that could be done more efficiently with an explicit forcing of the list once. Comparing each element only with the head of its tail would make descending linear.
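A linear check along those lines might look like the following. This is a minimal sketch, not necessarily the code the answer originally showed; the name descending' is made up here to avoid clashing with the question's descending.

```haskell
-- Linear-time check (hypothetical name): a list is strictly descending
-- exactly when every element is greater than its immediate successor,
-- so there is no need to compute the maximum of the whole tail.
descending' :: Ord a => [a] -> Bool
descending' (x : rest@(y : _)) = x > y && descending' rest
descending' _                  = True   -- empty and one-element lists
```

Note that this version stops at the first pair that is out of order, so unlike the quadratic one it no longer forces the whole list as a side effect. The forcing then has to be done explicitly somewhere else.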
A truly equivalent C program would be unusual. Few would even go so far as to use a linked list in C; implementing lazy evaluation on top of that would be quite an undertaking.
Writing an equivalent programme in C would be extremely unidiomatic. In C, the natural way to implement the algorithm would use an array and in-place mutation. That is automatically much more efficient here.
Not quite. What apply_times builds is not a huge list but a huge thunk, a chain of nested, unevaluated applications of succ_list, and after that thunk has been built, it must be evaluated. To evaluate the outermost call, the next must be evaluated far enough to find out which pattern matches in the outermost call. So the outermost call is pushed on a stack, and evaluation of the next call is started. For that, it must be determined which pattern matches, so part of the result of the third call is needed. Thus the second call is pushed on the stack, and so on. At the end, you have 999998 calls on the stack and start to evaluate the innermost call. Then you play a bit of ping-pong between each call and the next outer call (at least; the dependencies might spread a bit further) while bubbling up and popping calls from the stack.
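To make the thunk build-up concrete, here is the question's lazy apply_times again, with a hand-written reduction trace in the comments. The trace uses a generic f and a count of 3 purely for illustration.

```haskell
-- The lazy definition from the question, laid out with one guard per case.
apply_times :: (Integral n) => n -> (a -> a) -> a -> a
apply_times n f a
  | n <= 0    = a
  | otherwise = apply_times (n - 1) f (f a)

-- Only the counter is inspected while recursing, so the accumulator
-- grows as a chain of unevaluated applications:
--
--   apply_times 3 f x
--   ~> apply_times 2 f (f x)
--   ~> apply_times 1 f (f (f x))
--   ~> apply_times 0 f (f (f (f x)))
--   ~> f (f (f x))        -- only now does any call of f start to run
```

With 999999 in place of 3, the final expression is a chain of 999999 nested calls, which is what ends up on the stack.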
Yes, force the intermediate lists to be evaluated before they become the argument of apply_times. You need complete evaluation here, so the vanilla seq is not good enough; deepseq each intermediate list instead. That prevents the build-up of thunks, and thus you don’t need more memory than for a few short lists constructed in succ_list, and the counter.
Right, that would still allocate (and garbage collect) a lot. Now, GHC is very good at allocating and garbage collecting short-lived data (on my box, it can easily allocate at a rate of 2GB per MUT second without being slow), but still, not allocating all those lists would be faster.
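A sketch of the forcing described above, assuming the deepseq package (which ships with GHC) is available. The name apply_times' is hypothetical, and this is not necessarily the exact code that was benchmarked.

```haskell
import Control.DeepSeq (NFData, deepseq)

-- Like apply_times, but each intermediate result is evaluated completely
-- (to normal form) before the next step, so no chain of thunks builds up.
apply_times' :: (Integral n, NFData a) => n -> (a -> a) -> a -> a
apply_times' n f a
  | n <= 0    = a
  | otherwise =
      let a' = f a
      in  a' `deepseq` apply_times' (n - 1) f a'
```

Plain seq would only evaluate the list to its first constructor, leaving the elements and the tail as thunks, which is why deepseq is used here.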
So, if you want to push it, use in-place mutation. Work on an unboxed mutable array or an unboxed mutable Vector (I prefer the interface provided by the array package, but most prefer the vector interface; in terms of performance, the vector package has a lot of optimisations built in for you, whereas if you use the array package, you have to write the fast code yourself, but well-written code performs equally for all practical purposes).
I’ve done a bit of testing now. I have not tested the original lazy apply_times, only the one deepseqing each application of f, and have fixed the type of all involved entities as Int.
With that set-up, replacing sort_list with Data.List.sort reduced the running time from 1.82 seconds to 1.65 (but increased the number of allocated bytes). Not too much of a difference, but the lists are not long enough to make the bad cases for the quasi-quicksort really bite.
The big difference then comes from changing descending as proposed; that brought the time down to 0.48 seconds, with an alloc rate of 2,170,566,037 bytes per MUT second and 0.01 seconds GC time (and then using sort_list instead of sort brings the time up to 0.58 seconds).
Replacing the sorting of the ending segment of the list with a simpler reverse (the algorithm guarantees that it is sorted in descending order when it is sorted) brings the time down to 0.43 seconds.
A fairly direct translation of the algorithm to use unboxed mutable arrays completes in 0.15 seconds. Replacing the sorting with a simpler reversing of the part brings it down to 0.11.
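For the list version, the reverse substitution can be sketched as follows. This is an illustration under the assumption that all elements are distinct (as in the problem); the names descending', replaceFirst and succList are made up here, and this is not the benchmarked code.

```haskell
-- Linear check: each element must exceed its immediate successor.
descending' :: Ord a => [a] -> Bool
descending' (x : rest@(y : _)) = x > y && descending' rest
descending' _                  = True

-- Replace the first occurrence of x with y.
replaceFirst :: Eq a => a -> a -> [a] -> [a]
replaceFirst _ _ [] = []
replaceFirst x y (h : t)
  | h == x    = y : t
  | otherwise = h : replaceFirst x y t

-- Successor in lexicographic order; the last permutation maps to itself.
succList :: Ord a => [a] -> [a]
succList []  = []
succList [h] = [h]
succList l@(h : t)
  | descending' l       = l
  | not (descending' t) = h : succList t
  -- Here t is descending. Swapping h in for nextH keeps it descending,
  -- so reversing it yields the ascending order that sorting would give.
  | otherwise           = nextH : reverse (replaceFirst nextH h t)
  where
    -- Only demanded in the last branch, where some element of t exceeds h.
    nextH = minimum (filter (> h) t)
```

The reverse is valid because the smallest element larger than h sits between a larger left neighbour and a right neighbour smaller than h, so putting h in its place preserves the descending order.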
Splitting the algorithm into small top-level functions that each do one task makes it more readable, but that comes at a price. More parameters need to be passed between the functions, consequently not all can be passed in registers, and some of the passed parameters, the array bounds and element count, are not used at all, so that’s dead weight being passed. Making all other functions local functions in solution reduces the overall allocation and running time somewhat (0.13 seconds with sorting, 0.09 with reversing), since now only the necessary parameters need to be passed.
Deviating further from the given algorithm and making it work back to front, we can complete the task in 0.02 seconds.
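One way a back-to-front, in-place version could look is the conventional next-permutation scheme on an STUArray from the array package. This is a sketch under that assumption, not the answer's original code, and the names reverseRange, nextPerm and permAfter are hypothetical. No timing is claimed for it.

```haskell
import Control.Monad (replicateM_, when)
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, getBounds, newListArray, readArray,
                      runSTUArray, writeArray)
import Data.Array.Unboxed (elems)

-- Reverse the slice [lo..hi] of the array in place.
reverseRange :: STUArray s Int Int -> Int -> Int -> ST s ()
reverseRange arr lo hi = when (lo < hi) $ do
  a <- readArray arr lo
  b <- readArray arr hi
  writeArray arr lo b
  writeArray arr hi a
  reverseRange arr (lo + 1) (hi - 1)

-- Advance to the next permutation in place, scanning from the back.
-- The last permutation is left unchanged.
nextPerm :: STUArray s Int Int -> ST s ()
nextPerm arr = do
  (lo, hi) <- getBounds arr
  let -- rightmost index i whose element is smaller than its successor
      findPivot i
        | i < lo    = return Nothing
        | otherwise = do
            x <- readArray arr i
            y <- readArray arr (i + 1)
            if x < y then return (Just i) else findPivot (i - 1)
      -- rightmost element larger than the pivot value p
      findSwap p j = do
        y <- readArray arr j
        if y > p then return j else findSwap p (j - 1)
  pivot <- findPivot (hi - 1)
  case pivot of
    Nothing -> return ()
    Just i  -> do
      p <- readArray arr i
      j <- findSwap p hi
      q <- readArray arr j
      writeArray arr i q
      writeArray arr j p
      reverseRange arr (i + 1) hi   -- the suffix is descending; reverse it

-- Apply nextPerm n times to the permutation given as a list.
permAfter :: Int -> [Int] -> [Int]
permAfter n xs = elems $ runSTUArray $ do
  arr <- newListArray (0, length xs - 1) xs
  replicateM_ n (nextPerm arr)
  return arr
```

Scanning from the back finds the descending suffix directly, so there is no repeated descending test on ever shorter tails, and nothing is allocated per step.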
The clever algorithm alluded to in the question, however, solves the task with far less code in much less time.
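A minimal sketch of that approach, using factorials to pick each element directly instead of stepping through the permutations. The name nthPermutation is hypothetical, and the sketch assumes a sorted list of distinct elements and an index n with 0 <= n < (length xs)!.

```haskell
-- n-th (0-based) permutation in lexicographic order.
-- Each choice of first element accounts for (length xs - 1)! permutations,
-- so dividing n by that block size selects the first element, and the
-- remainder is the index within the remaining elements.
nthPermutation :: Int -> [a] -> [a]
nthPermutation _ [] = []
nthPermutation n xs =
  case splitAt q xs of
    (before, x : after) -> x : nthPermutation r (before ++ after)
    (_, [])             -> error "nthPermutation: index out of range"
  where
    blockSize = product [1 .. length xs - 1]   -- (length xs - 1)!
    (q, r)    = n `divMod` blockSize
```

For the problem at hand, the millionth permutation corresponds to nthPermutation 999999 [0 .. 9], since the index is 0-based.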