Last week user Masse asked a question about recursively listing files in a directory in Haskell. My first thought was to try using monadic lists from the List package to avoid building the entire list in memory before the printing can start. I implemented this as follows:
module Main where
import Prelude hiding (filter)
import Control.Applicative ((<$>))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))
main = execute . mapL putStrLn . listFiles =<< head <$> getArgs
listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
where
valid "." = False
valid ".." = False
valid _ = True
listIfDir False = return path
listIfDir True
= cons path
$ join
$ listFiles
<$> (path </>)
<$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
This works beautifully in that it starts printing immediately and uses very little memory. Unfortunately it’s also dozens of times slower than a comparable FilePath -> IO [FilePath] version.
What am I doing wrong? I’ve never used the List package’s ListT outside of toy examples like this, so I don’t know what kind of performance to expect, but 30 seconds (vs. a fraction of a second) to process a directory with ~40,000 files seems much too slow.
Profiling shows that
join(together withdoesDirectoryExists) accounts for most of the time in your code. Lets see how its definition unfolds:If in the root directory of the search there are
ksubdirectories and their contents are already computed in the lists:d1, d2, ... dk, then after applyingjoinyou’ll get (roughly):(...(([] ++ d1) ++ d2) ... ++ dk). Sincex ++ ytakes timeO(length x)the whole thing will take timeO(d1 + (d1 + d2) + ... + (d1 + ... dk-1)). If we assume that the number of files isnand they are evenly distributed betweend1 ... dkthen the time to computejoinwould beO(n*k)and that is only for the first level oflistFiles.This, I think, is the main performance problem with your solution.