I have a 300MB file (link) with UTF-8 characters in it. I want to write a Haskell program equivalent to:
cat bigfile.txt | grep "^en " | wc -l
This runs in 2.6s on my system.
Right now, I’m reading the file as a normal String (readFile), and have this:
main = do
    contents <- readFile "bigfile.txt"
    putStrLn $ show $ length $ lines contents
After a couple seconds I get this error:
Dictionary.hs: bigfile.txt: hGetContents: invalid argument (Illegal byte sequence)
I assume I need to use something more utf-8 friendly? How can I make it both fast, and utf-8 compatible? I read about Data.ByteString.Lazy for speed, but Real World Haskell says it doesn’t support utf-8.
The utf8-string package provides support for reading and writing UTF-8 Strings. It reuses the ByteString infrastructure, so the interface is likely to be very similar.
Another Unicode strings project, which is likely related to the above and is also inspired by ByteStrings, is discussed in this Master's thesis.
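
For this particular task, decoding may not be needed at all. A minimal sketch, assuming the only goal is to count lines starting with the ASCII prefix "en ": UTF-8 never uses the bytes 0x00-0x7F inside a multi-byte sequence, so splitting on newlines and matching an ASCII prefix at the byte level gives the same count as working on decoded text. The helper name countEnLines is made up for illustration; the file name is the one from the question.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL

-- Count the lines that begin with the ASCII prefix "en ".
-- Works on raw bytes; no UTF-8 decoding happens here.
countEnLines :: BL.ByteString -> Int
countEnLines = length . filter (BL.isPrefixOf prefix) . BL.lines
  where
    prefix = BL.pack "en "

main :: IO ()
main = do
  -- Lazy read: the file is consumed chunk by chunk while counting.
  contents <- BL.readFile "bigfile.txt"
  print (countEnLines contents)
```

Because the bytes are never decoded, this should also sidestep the "Illegal byte sequence" error from readFile. If the matched lines later need to be handled as text, that is where a package such as utf8-string would come in.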