I’ve written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively). Basically a Haskell version of md5deep. All is fine and dandy except if the current directory has a very large number of files, in which case I get an error like:
<program>: <currentFile>: openBinaryFile: resource exhausted (Too many open files)
It seems Haskell’s laziness is causing it not to close the files, even after their corresponding lines of output have been completed.
The relevant code is below. The function of interest is getList.
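import Control.Monad (liftM)
import qualified Data.ByteString.Lazy as BS
import Data.Digest.MD5 (hash) -- assumed: the "Crypto" package's MD5
import Data.Word (Word8)
import Text.Printf (printf)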
main :: IO ()
main = putStr . unlines =<< getList "."
getList :: FilePath -> IO [String]
getList p =
    let getFileLine path = liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
    in mapM getFileLine =<< getRecursiveContents p
hex :: [Word8] -> String
hex = concatMap (\x -> printf "%0.2x" (toInteger x))
getRecursiveContents :: FilePath -> IO [FilePath]
-- ^ Just gets the paths to all the files in the given directory.
Are there any ideas on how I could solve this problem?
The entire program is available here: http://haskell.pastebin.com/PAZm0Dcb
Edit: I have plenty of files that don’t fit into RAM, so I am not looking for a solution that reads the entire file into memory at once.
Lazy IO is very bug-prone.
As dons suggested, you should use strict IO.
You can use a tool such as Iteratee to help you structure strict IO code. My favorite tool for this job is monadic lists.
I used the “pureMD5” package here because “Crypto” doesn’t seem to offer a “streaming” md5 implementation.
Monadic lists/ListT come from the “List” package on Hackage (transformers’ and mtl’s ListT are broken and also don’t come with useful functions like takeWhile).
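For comparison, here is a minimal sketch of the plain strict-IO idea using only base and bytestring, not the monadic-list version. Each file is read in fixed-size strict chunks inside withBinaryFile, so its handle is closed before the next file is opened and memory use stays bounded by the chunk size. The names foldFileChunks, toyChecksum and checksumLines are made up for illustration, and toyChecksum is only a stand-in for a real incremental MD5 update (such as the streaming interface of pureMD5), which I have not wired in here.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import System.IO (IOMode (ReadMode), withBinaryFile)

-- | Fold a function over a file, reading it in strict 64 KiB chunks.
-- 'withBinaryFile' closes the handle as soon as the fold finishes,
-- so at most one file is open at any time.
foldFileChunks :: (acc -> B.ByteString -> acc) -> acc -> FilePath -> IO acc
foldFileChunks step z path = withBinaryFile path ReadMode (loop z)
  where
    chunkSize = 64 * 1024
    -- The bang pattern forces the accumulator on every iteration,
    -- so no chain of thunks referring to old chunks builds up.
    loop !acc h = do
      chunk <- B.hGet h chunkSize
      if B.null chunk
        then return acc                -- empty chunk means end of file
        else loop (step acc chunk) h

-- | Hypothetical placeholder for an incremental hash update.
-- NOT MD5: just a simple rolling checksum to keep the sketch dependency-free.
toyChecksum :: Int -> B.ByteString -> Int
toyChecksum = B.foldl' (\s w -> (s * 31 + fromIntegral w) `mod` 1000000007)

-- | One "checksum  path" line per file, processed strictly one at a time.
checksumLines :: [FilePath] -> IO [String]
checksumLines = mapM fileLine
  where
    fileLine p = do
      s <- foldFileChunks toyChecksum 0 p
      return (show s ++ "  " ++ p)
```

To try it, something like putStr . unlines =<< checksumLines paths in main should do, with paths coming from getRecursiveContents. Swapping toyChecksum and its initial value for a real MD5 context and update function should not require changing foldFileChunks, since it is polymorphic in the accumulator.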