I’m writing a program where a large number of agents listen for events and react on them. Since Control.Concurrent.Chan.dupChan is deprecated I decided to use TChan’s as advertised.
The performance of TChan is much worse than I expected. I have the following program that illustrates the issue:
{-# LANGUAGE BangPatterns #-}
module Main where
import Control.Concurrent.STM
import Control.Concurrent
import System.Random(randomRIO)
import Control.Monad(forever, when)
allCoords :: [(Int,Int)]
allCoords = [(x,y) | x <- [0..99], y <- [0..99]]
randomCoords :: IO (Int,Int)
randomCoords = do
x <- randomRIO (0,99)
y <- randomRIO (0,99)
return (x,y)
main = do
chan <- newTChanIO :: IO (TChan ((Int,Int),Int))
let watcher p = do
chan' <- atomically $ dupTChan chan
forkIO $ forever $ do
r@(p',_counter) <- atomically $ readTChan chan'
when (p == p') (print r)
return ()
mapM_ watcher allCoords
let go !cnt = do
xy <- randomCoords
atomically $ writeTChan chan (xy,cnt)
go (cnt+1)
go 1
When compiled (-O) and run the program first will output something like this:
./tchantest ((0,25),341) ((0,33),523) ((0,33),654) ((0,35),196) ((0,48),181) ((0,48),446) ((1,15),676) ((1,50),260) ((1,78),561) ((2,30),622) ((2,38),383) ((2,41),365) ((2,50),596) ((2,57),194) ((3,19),259) ((3,27),344) ((3,33),65) ((3,37),124) ((3,49),109) ((3,72),91) ((3,87),637) ((3,96),14) ((4,0),34) ((4,17),390) ((4,73),381) ((4,74),217) ((4,78),150) ((5,7),476) ((5,27),207) ((5,47),197) ((5,49),543) ((5,53),641) ((5,58),175) ((5,70),497) ((5,88),421) ((5,89),617) ((6,0),15) ((6,4),322) ((6,16),661) ((6,18),405) ((6,30),526) ((6,50),183) ((6,61),528) ((7,0),74) ((7,28),479) ((7,66),418) ((7,72),318) ((7,79),101) ((7,84),462) ((7,98),669) ((8,5),126) ((8,64),113) ((8,77),154) ((8,83),265) ((9,4),253) ((9,26),220) ((9,41),255) ((9,63),51) ((9,64),229) ((9,73),621) ((9,76),384) ((9,92),569) ...
And then, at some point, will stop writing anything, while still consuming 100% cpu.
((20,56),186) ((20,58),558) ((20,68),277) ((20,76),102) ((21,5),396) ((21,7),84)
With -threaded the lockup is even faster and occurs after only a handful of lines. It will also consume whatever number of cores are made available through RTS’ -N flag.
Additionally the performance seems rather poor – only about 100 events per second are processed.
Is this a bug in STM or am I misunderstanding something about semantics of STM?
The program is going to perform quite badly. You’re spawning off 10,000 threads all of which will queue up waiting for a single TVar to be written to. So once they’re all going, you may well get this happening:
So each item will cause processing O(10,000). If you see 100 events per second, that means that each thread requires about 1 microsecond to wake up, read a couple of TVars, write to one and queue up again. That doesn’t seem so unreasonable. I don’t understand why the program would grind to a complete halt, though.
In general, I would scrap this design and replace it as follows:
Have a single thread reading the event channel, which maintains a map from coordinate to interested-receiver-channel. The single thread can then pick out the receiver(s) from the map in O(log N) time (much better than O(N), and with a much smaller constant factor involved), and send the event to just the interested receiver. So you perform just one or two communications to the interested party, rather than 10,000 communications to everyone. A list-based form of the idea is written in CHP in section 5.4 of this paper: http://chplib.files.wordpress.com/2011/05/chp.pdf