I am using Boltzmann exploration in Q-learning, where I have at least 10 actions in each state. I know that with only two actions, Boltzmann exploration can be applied quite simply as follows:
- Calculate pr1 and pr2 for the two actions with the Boltzmann exploration equation.
- Generate a uniform random number r in [0, 1).
- Assuming pr1 > pr2: if r <= pr1, take the action corresponding to pr1; if r > pr1, take the action corresponding to pr2.
However, how can I do this with 10 actions? At each decision step, I update the probabilities of all the actions. This gives me a probability distribution over the actions in which the best action has the highest probability. How do I select an action in this case using Boltzmann exploration?
Here is an excellent discussion of weighted random sampling: Darts, Dice, and Coins.
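Before reaching for anything more elaborate, the two-action procedure from the question generalizes directly to any number of actions: draw one uniform number and walk the cumulative sum of the probabilities until it exceeds that number (often called roulette-wheel selection). The following is only a minimal Python sketch under stated assumptions: the function names `boltzmann_probs` and `sample_action`, the `temperature` parameter, and the example Q-values are hypothetical and not taken from the question.

```python
import math
import random


def boltzmann_probs(q_values, temperature=1.0):
    """Turn Q-values into Boltzmann (softmax) action probabilities.

    Assumes temperature > 0. Subtracting the maximum Q-value first does not
    change the result but keeps exp() from overflowing.
    """
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Pick an action index by walking the cumulative distribution."""
    r = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return action
    # Floating-point rounding can leave the final sum slightly below 1.0,
    # so fall back to the last action instead of returning nothing.
    return len(probs) - 1


# Hypothetical Q-values for a state with 10 actions.
q = [0.1, 0.5, 0.3, 0.9, 0.2, 0.4, 0.0, 0.7, 0.6, 0.8]
probs = boltzmann_probs(q, temperature=0.5)
action = sample_action(probs)
```

This costs O(n) per draw, which is usually fine for around 10 actions, especially since the probabilities change at every decision step anyway. Python's standard library offers the same behavior via `random.choices(range(len(probs)), weights=probs)[0]`.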
And here is my implementation of Vose’s alias method: