I am trying to come up with equivalent of “wc -l” using Haskell Iteratee library. Below is the code for “wc” (which just counts the words – similar to the code in iteratee example on hackage), and runs very fast:
{-# LANGUAGE BangPatterns #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString
length1 :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
length1 = liftI (step 0)
where
step !i (Chunk xs) = liftI (step $ i + fromIntegral (LL.length xs))
step !i stream = idone i stream
{-# INLINE length1 #-}
main = do
i' <- enumFile 1024 "/usr/share/dict/words" (length1 :: (Monad m) => Iteratee ByteString m Int)
result <- run i'
print result
{- Time measured on a linux x86 box:
$ time ./test ## above haskell compiled code
4950996
real 0m0.013s
user 0m0.004s
sys 0m0.007s
$ time wc -c /usr/share/dict/words
4950996 /usr/share/dict/words
real 0m0.003s
user 0m0.000s
sys 0m0.002s
-}
Now, how do you extend it to count the number of lines that too runs fast? I did a version using Prelude.filter to filter only “\n” to length but it is slower than linux “wc -l” because of too much memory, and gc (lazy evaluation, I guess). So, I wrote another version using Data.ListLike.filter but it won’t compile because it doesn’t type check – help here would be appreciated:
{-# LANGUAGE BangPatterns #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString
import Data.Char
import Data.ByteString.Char8 (pack)
numlines :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
numlines = liftI $ step 0
where
step !i (Chunk xs) = liftI (step $i + fromIntegral (LL.length $ LL.filter (\x -> x == Data.ByteString.Char8.pack "\n") xs))
step !i stream = idone i stream
{-# INLINE numlines #-}
main = do
i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int)
result <- run i'
print result
There are a lot of good answers already; I have very little to offer performance-wise but a few style points.
First, I would write it this way:
The biggest difference is that this uses
Data.Iteratee.ListLike.foldl'. Note that only the individual stream elements matter to the step function, not the stream type. It’s exactly the same function as you would use with e.g.Data.ByteString.Lazy.foldl'.Using
foldl'also means that you don’t need to manually write iteratees withliftI. I would discourage users from doing so unless absolutely necessary. The result is usually longer and harder to maintain with little to no benefit.Finally, I’ve increased the buffer size significantly. On my system this is marginally faster than
enumerators default of 4096, which is again marginally faster (with iteratee) than your choice of 1024. YMMV with this setting of course.