I was going to test naive bayes classification. One part of it was going

Asked: May 31, 2026, 15:24 UTC

I was going to test naive Bayes classification. One part of it was going to be building a histogram of the training data. The problem is that I am using a large training set, the haskell-cafe mailing list going back a couple of years, and there are over 20k files in the folder.

It takes a little over two minutes to create the histogram with Python, and a little over 8 minutes with Haskell. I'm using Data.Map (insertWith'), enumerators and text. What else can I do to speed up the program?

Haskell:

import qualified Data.Text as T
import qualified Data.Text.IO as TI
import System.Directory
import Control.Applicative
import Control.Monad (filterM, foldM)
import System.FilePath.Posix ((</>))
import qualified Data.Map as M
import Data.Map (Map)
import Data.List (foldl')
import Control.Exception.Base (bracket)
import System.IO (Handle, openFile, hClose, hSetEncoding, IOMode(ReadMode), latin1)
import qualified Data.Enumerator as E
import Data.Enumerator (($$), (>==>), (<==<), (==<<), (>>==), ($=), (=$))
import qualified Data.Enumerator.List as EL
import qualified Data.Enumerator.Text as ET



withFile' ::  (Handle -> IO c) -> FilePath -> IO c
withFile' f fp = do
  bracket
    (do
      h <- openFile fp ReadMode
      hSetEncoding h latin1
      return h)
    hClose
    f

buildClassHistogram c = do
  files <- filterM doesFileExist =<< map (c </> ) <$> getDirectoryContents c
  foldM fileHistogram M.empty files

fileHistogram m file = withFile' (\h -> E.run_ $ enumHist h) file
  where
    enumHist h = ET.enumHandle h $$ EL.fold (\m' l -> foldl' (\m'' w -> M.insertWith' (const (+1)) w 1 m'') m' $ T.words l) m

Python:

for filename in listdir(root):
    filepath = root + "/" + filename
    # print(filepath)
    fp = open(filepath, "r", encoding="latin-1")
    for word in fp.read().split():
        if word in histogram:
            histogram[word] = histogram[word]+1
        else:
            histogram[word] = 1
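
For reference, the Python loop above can be written as a self-contained function. This is only a sketch of the same counting logic, not the code that was timed: the function name `build_histogram` is made up for illustration, `collections.Counter` stands in for the manual dictionary update, and the check that skips non-files mirrors the `doesFileExist` filter in the Haskell version.

```python
import os
from collections import Counter


def build_histogram(root):
    """Count whitespace-separated words across every regular file in root.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch, not part of the original program.
    """
    histogram = Counter()
    for filename in os.listdir(root):
        filepath = os.path.join(root, filename)
        if not os.path.isfile(filepath):
            continue  # skip subdirectories, like the doesFileExist filter
        # latin-1 maps every byte to a character, so decoding never fails
        with open(filepath, "r", encoding="latin-1") as fp:
            histogram.update(fp.read().split())
    return histogram
```

Calling `build_histogram(root)` returns a `Counter`, so looking up a word that never occurred yields 0 rather than raising a `KeyError`.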

Edit: Added imports


1 Answer

Editorial Team · Answer 1 · 2026-05-31T15:24:02+00:00

You could try using imperative hash maps from the hashtables package: http://hackage.haskell.org/package/hashtables
I remember I once got a moderate speedup compared to Data.Map. I wouldn’t expect anything spectacular though.

UPDATE

I simplified your python code so I could test it on a single big file (100 million lines):

import sys
histogram={}
for line in sys.stdin.readlines():
    word = line.rstrip("\n")  # drop the newline so the "the" lookup can match
    if word in histogram:
        histogram[word] = histogram[word]+1
    else:
        histogram[word] = 1
print(histogram.get("the"))

Takes 6.06 seconds

Haskell translation using hashtables:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as T
import  qualified Data.HashTable.IO as HT
main = do
  ls <- T.lines `fmap` T.getContents
  h <- HT.new :: IO (HT.BasicHashTable T.ByteString Int)
  flip mapM_ ls $ \w -> do
    r <- HT.lookup h w 
    case r of 
      Nothing -> HT.insert h w (1::Int)
      Just c  -> HT.insert h w (c+1)
  HT.lookup h "the" >>= print

Run with a large allocation area: histogram +RTS -A500M
Takes 9.3 seconds, with 2.4% GC. Still quite a bit slower than Python but not too bad.

According to the GHC user guide, you can also set default RTS options at compile time:

GHC lets you change the default RTS options for a program at compile
time, using the -with-rtsopts flag (Section 4.12.6, “Options affecting
linking”). A common use for this is to give your program a default
heap and/or stack size that is greater than the default. For example,
to set -H128m -K64m, link with -with-rtsopts="-H128m -K64m".
