I was going to test naive bayes classification. One part of it was going to be building a histogram of the training data. The problem is, I am using a large training data, the haskell-cafe mailing list since a couple of years back, and there are over 20k files in the folder.
It takes a while over two minutes to create the histogram with python, and a little over 8 minutes with haskell. I’m using Data.Map (insertWith’), enumerators and text. What else can I do to speed up the program?
Haskell:
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import System.Directory
import Control.Applicative
import Control.Monad (filterM, foldM)
import System.FilePath.Posix ((</>))
import qualified Data.Map as M
import Data.Map (Map)
import Data.List (foldl')
import Control.Exception.Base (bracket)
import System.IO (Handle, openFile, hClose, hSetEncoding, IOMode(ReadMode), latin1)
import qualified Data.Enumerator as E
import Data.Enumerator (($$), (>==>), (<==<), (==<<), (>>==), ($=), (=$))
import qualified Data.Enumerator.List as EL
import qualified Data.Enumerator.Text as ET
withFile' :: (Handle -> IO c) -> FilePath -> IO c
withFile' f fp = do
bracket
(do
h ← openFile fp ReadMode
hSetEncoding h latin1
return h)
hClose
(f)
buildClassHistogram c = do
files ← filterM doesFileExist =<< map (c </> ) <$> getDirectoryContents c
foldM fileHistogram M.empty files
fileHistogram m file = withFile' (λh → E.run_ $ enumHist h) file
where
enumHist h = ET.enumHandle h $$ EL.fold (λm' l → foldl' (λm'' w → M.insertWith' (const (+1)) w 1 m'') m' $ T.words l) m
Python:
for filename in listdir(root):
filepath = root + "/" + filename
# print(filepath)
fp = open(filepath, "r", encoding="latin-1")
for word in fp.read().split():
if word in histogram:
histogram[word] = histogram[word]+1
else:
histogram[word] = 1
Edit: Added imports
You could try using imperative hash maps from the hashtables package: http://hackage.haskell.org/package/hashtables
I remember I once got a moderate speedup compared to Data.Map. I wouldn’t expect anything spectacular though.
UPDATE
I simplified your python code so I could test it on a single big file (100 million lines):
Takes 6.06 seconds
Haskell translation using hashtables:
Run with a large allocation area:
histogram +RTS -A500MTakes 9.3 seconds, with 2.4% GC. Still quite a bit slower than Python but not too bad.
According to the GHC user guide, you can change the RTS options while compiling: