I’m looking at the ‘Monte Carlo Tree Search’ algorithm’s ‘Upper Confidence Bounds’.
C is a weight for exploration over exploitation.
score = wins / played
sum = wins + played
UCB = score + C * sqrt(naturalLog(parent's sum) / sum)
The issue occurs when played is 0, because both the score and the exploration term then divide by zero. I’m considering these possibilities for the score.
score = 0
Because the node has never won, although it's never lost either.
score = 0.5
Because the node's value is completely uncertain and 0.5 is halfway.
Does anyone have an answer?
The first step in UCB1, and in MCTS variants built on it such as UCT, is to pull every arm once, so a node that has never been played is always selected before the formula is evaluated for it. Since doing this at every node would obviously amount to exhaustive search, you instead only use MCTS up to a fixed depth and use a roll-out policy for the rest. You can of course use a prior instead (such as your 0 or 0.5), but then you lose the nice theoretical properties of the UCB algorithm, primarily logarithmic regret.
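
A minimal Python sketch of the "pull every arm once" rule, in case it helps. The function name, the dict layout with 'wins' and 'played', and the default C of sqrt(2) are my own assumptions for illustration. It also assumes the standard UCB1 form, where the denominator is the child's play count and the parent's count is taken as the sum of its children's plays, which is slightly different from the sum defined in the question.

```python
import math

def select_child(children, c=math.sqrt(2)):
    """Pick the child to descend into.

    children: list of dicts with 'wins' and 'played' counts
    (a hypothetical layout, not taken from any particular library).
    """
    # Pull every arm once: an unplayed child is chosen before any
    # UCB value is computed, so played == 0 never reaches the formula.
    for child in children:
        if child["played"] == 0:
            return child

    # Assumption: the parent's play count is the sum of its children's plays.
    parent_played = sum(child["played"] for child in children)

    def ucb1(child):
        score = child["wins"] / child["played"]  # exploitation term
        bonus = c * math.sqrt(math.log(parent_played) / child["played"])  # exploration term
        return score + bonus

    return max(children, key=ucb1)
```

With this rule the question of what score to give an unplayed node never comes up, because such a node is always tried before it is scored.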