I’ve started toying with reinforcement learning (using the Sutton book). I fail to fully


Asked: May 20, 2026 (2026-05-20T04:51:58+00:00)


I’ve started toying with reinforcement learning (using the Sutton book). What I fail to fully understand is the apparent paradox between having to reduce the Markov state space and, on the other hand, not making assumptions about what’s important and what’s not.

Background

E.g. in the checkers example, Sutton says that one should not assign rewards to particular actions in the game, such as taking an opponent’s piece. He claims this may optimize the AI for taking pieces rather than for winning the game. Thus, rewards should only be given for the result you want to achieve (e.g. winning the game).
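To make the point about rewards concrete, here is a minimal sketch. The function names, the 0.1 capture bonus, and the outcome labels are all hypothetical, chosen only for illustration. It compares a reward given only for the final result with one that also pays for captures.

```python
def terminal_reward(outcome):
    """Reward only the final result; `outcome` is None while the game is running."""
    return {"win": 1.0, "loss": -1.0, "draw": 0.0}.get(outcome, 0.0)


def shaped_reward(outcome, pieces_captured):
    """The kind of reward the text warns against: captures pay off on their own.

    The 0.1 bonus per captured piece is an arbitrary, made-up value.
    """
    return terminal_reward(outcome) + 0.1 * pieces_captured


# Hypothetical episode A: capture 12 pieces, then lose.
# Hypothetical episode B: capture nothing, end in a draw.
lose_with_captures = shaped_reward("loss", 12)
draw_without_captures = shaped_reward("draw", 0)

print(lose_with_captures > draw_without_captures)                  # shaped reward ranks the loss higher
print(terminal_reward("loss") < terminal_reward("draw"))           # terminal-only reward does not
```

Under the shaped reward the losing episode scores higher than the draw, which is the failure mode described above. With the terminal-only reward the ordering matches the actual goal.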

Question 1

Assume a (Texas hold’em) poker AI whose Markov state consists only of the player’s hand and the cards on the table. This gives around (52*51*50*49*48*47*46)/(1*2*3*4*5*6*7) states, i.e. roughly 134 million. Now assume we want the AI to take the players’ money pools and their bets into consideration. The Markov state space then approaches an “infinite number of combinations” if we assume 8 players, each having between $1 and $200,000.
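The arithmetic can be checked with a short sketch. It follows the formula in the question, which pools the hand and the table cards into one unordered set of 7 cards. It also assumes, purely for illustration, that each stack is a whole-dollar amount and that bets are ignored.

```python
from math import comb

# Unordered 7-card combinations out of 52, as in the question's formula.
card_states = comb(52, 7)

# Assumption for illustration: 8 players, each with a whole-dollar stack
# between $1 and $200,000 (bets are left out entirely).
players = 8
stack_levels = 200_000
stack_states = stack_levels ** players

total_states = card_states * stack_states

print(f"card states:  {card_states:,}")
print(f"stack states: {stack_states:.3e}")
print(f"joint states: {total_states:.3e}")
```

The space is not literally infinite, but the joint count is far beyond anything a tabular method could enumerate.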

Question 2

One state-reducing strategy could be to divide the players’ cash into poor, medium, or rich. This seriously reduces the state space. However, how do I know a) that 3 groups are sufficient, and b) what the discriminating limits for each group should be?
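A rough sketch of such a bucketing follows. The cut-off values in `LIMITS` are invented placeholders, not recommended limits, and the helper names are hypothetical. Keeping the limits as a parameter means that both the number of groups and their boundaries can be varied and compared empirically.

```python
from bisect import bisect_right

# Hypothetical cut-offs in dollars; the question leaves the real limits open.
LIMITS = (2_000, 20_000)
LABELS = ("poor", "medium", "rich")


def bucket(cash, limits=LIMITS):
    """Return the index of the bucket that `cash` falls into."""
    return bisect_right(limits, cash)


def abstract_state(stacks, limits=LIMITS):
    """Map raw stack sizes to a tuple of bucket indices."""
    return tuple(bucket(c, limits) for c in stacks)


# Made-up stacks for 8 players.
stacks = [150, 2_000, 19_999, 20_000, 75_000, 1, 200_000, 5_000]
state = abstract_state(stacks)

print([LABELS[i] for i in state])
print("stack states with 3 buckets:", len(LABELS) ** len(stacks))
```

With 3 groups, the stack part of the state shrinks to 3^8 combinations. Whether that loses too much information is an empirical question, for example by comparing learned performance across different choices of `LIMITS`.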

cheers,


1 Answer

Editorial Team · Answer 1 · 2026-05-20T04:51:59+00:00

One proposed approach to reducing the state space in RL is to use a state-action hierarchy. Instead of having a single state variable X, you break it up into smaller variables, say x1, x2, x3. You then measure how frequently each one changes and determine the dependencies between them (e.g. x1 usually changes when x2=abc). From that you can form a policy describing how best to drive the faster-changing variables so that the slower-changing variables change in a way that maximizes the reward.
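As a rough illustration of the "measure their transition frequencies" step only, here is a sketch. It is not an implementation of any published algorithm, and the trajectory is toy data invented for the example. It counts how often each variable changes along a trajectory of factored states and orders the variables from fastest- to slowest-changing.

```python
from collections import Counter


def change_counts(trajectory):
    """Count, per variable index, how many steps changed that variable."""
    counts = Counter({i: 0 for i in range(len(trajectory[0]))})
    for prev, curr in zip(trajectory, trajectory[1:]):
        for i, (a, b) in enumerate(zip(prev, curr)):
            if a != b:
                counts[i] += 1
    return counts


# Toy trajectory of (x1, x2, x3) values, invented for illustration.
trajectory = [
    (0, 0, 0), (1, 0, 0), (2, 0, 0), (0, 1, 0),
    (1, 1, 0), (2, 1, 0), (0, 0, 1), (1, 0, 1),
]

counts = change_counts(trajectory)
# Variable indices ordered from most to least frequently changing.
order = sorted(counts, key=lambda i: counts[i], reverse=True)

print(dict(counts))
print(order)
```

In this toy data x1 changes on every step while x3 changes once, so x1 would sit at the bottom of the hierarchy and x3 at the top.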

This approach is still relatively new, and I’m not aware of any public implementations of it. However, there are several papers proposing possible implementations. The MAXQ algorithm assumes a human-defined hierarchy, whereas the HEXQ algorithm describes a method of learning the hierarchy as well as the policies.
