I’m trying to learn Haskell and after an article in reddit about Markov text

Question

0

Asked: May 15, 20262026-05-15T01:44:58+00:00 2026-05-15T01:44:58+00:00

I’m trying to learn Haskell and after an article in reddit about Markov text

0

I’m trying to learn Haskell and after an article in reddit about Markov text chains, I decided to implement Markov text generation first in Python and now in Haskell. However I noticed that my python implementation is way faster than the Haskell version, even Haskell is compiled to native code. I am wondering what I should do to make the Haskell code run faster and for now I believe it’s so much slower because of using Data.Map instead of hashmaps, but I’m not sure

I’ll post the Python code and Haskell as well. With the same data, Python takes around 3 seconds and Haskell is closer to 16 seconds.

It comes without saying that I’ll take any constructive criticism :).

import random
import re
import cPickle
class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
        picklefd.close()

    def train(self, text):
        splitted = re.findall(r"(\w+|[.!?',])", text)
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

    def concat(self, words):
        sentence = ""
        for word in words:
            if word in "'\",?!:;.":
                sentence = sentence[0:-1] + word + " "
            else:
                sentence += word + " "
        return sentence

    def pickword(self, words):
        temp = [(k, words[k]) for k in words]
        results = []
        for (word, n) in temp:
            results.append(word)
            if n > 1:
                for i in xrange(n-1):
                    results.append(word)
        return random.choice(results)

    def gentext(self, words):
        allwords = [k for k in self.cache]
        (first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
        sentence = [first, second]
        while len(sentence) < words or sentence[-1] is not ".":
            current = (sentence[-2], sentence[-1])
            if current in self.cache:
                followup = self.pickword(self.cache[current])
                sentence.append(followup)
            else:
                print "Wasn't able to. Breaking"
                break
        print self.concat(sentence)

Markov(["76.txt"])

—

module Markov
( train
, fox
) where

import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B


type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) = 
     let l = train (y:z:xs)
     in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) `seq` l

main = do
  contents <- B.readFile "76.txt"
  print $ train $ B.words contents

fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-15T01:44:59+00:00

I tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second doesn’t.

import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)

train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
    where go (x:y:[]) m = m
          go (x:y:z:xs) m = let addWord Nothing   = Just $ M.singleton z 1
                                addWord (Just m') = Just $ M.alter inc z m'
                                inc Nothing    = Just 1
                                inc (Just cnt) = Just $ cnt + 1
                            in go (y:z:xs) $ M.alter addWord (x,y) m

train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.alter (addWord z) (x,y) m
          addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
          inc = Just . maybe 1 (+1)

main = do contents <- B.readFile "76.txt"
          let db = train3 $ B.words contents
          print $ "Built a DB of " ++ show (M.size db) ++ " words"

I think they are both faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.

EDIT
As per Travis Brown’s very valid point,

train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
          inc k _ = M.insertWith (+) k 1

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m trying to learn Haskell and after an article in reddit about Markov text

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply