Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at

Asked: May 24, 2026 (2026-05-24T05:28:46+00:00)

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it’s hard (to me) to see any difference between these two algorithms.

According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (for state s and action a at timestep t), i.e. Q(s_t, a_t), can be updated as follows

Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))

On the other hand, the update step for the Q-learning algorithm is the following

Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*max_a Q(s_{t+1}, a) - Q(s_t, a_t))

which can also be written as

Q(s_t, a_t) = (1 - α) * Q(s_t, a_t) + α * (r_t + γ*max_a Q(s_{t+1}, a))

where α (alpha) is the learning rate, γ (gamma) is the discount factor and r_t is the reward received from the environment at timestep t.
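To make the two update rules concrete, here is a minimal sketch of both as Python functions acting on a plain nested-list Q-table. The function names, the table layout (`Q[state][action]`), the default values of `alpha` and `gamma`, and the tiny 2-state table are all assumptions made for illustration; they do not come from the book.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: bootstrap from the action a_next that was actually selected."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q-learning: bootstrap from the best action in s_next,
    regardless of which action is taken next."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical 2-state, 2-action table, only meant to exercise both rules.
Q_sarsa = [[0.0, 0.0], [1.0, 5.0]]
Q_qlearn = [row[:] for row in Q_sarsa]

# Same transition for both: s=0, a=0, r=1.0, s_next=1.
# For SARSA we assume the behaviour policy explored and picked a_next=0,
# which is NOT the greedy action in state 1.
sarsa_update(Q_sarsa, 0, 0, 1.0, 1, 0)
q_learning_update(Q_qlearn, 0, 0, 1.0, 1)

print("SARSA      Q(0,0):", Q_sarsa[0][0])
print("Q-learning Q(0,0):", Q_qlearn[0][0])
```

The only line that differs between the two functions is the one computing `target`. Whenever the selected next action happens to be the greedy one, both rules produce the same update; they diverge only on exploratory steps like the one above.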

Is the difference between these two algorithms the fact that SARSA only looks up the next policy value while Q-learning looks up the next maximum policy value?

TLDR (and my own answer)

Thanks to all those answering this question since I first asked it. I’ve made a GitHub repo playing with Q-Learning and empirically understood what the difference is. It all amounts to how you select your next best action, which from an algorithmic standpoint can be a mean, max or best action depending on how you choose to implement it.

The other main difference is when this selection is happening (e.g., online vs offline) and how/why that affects learning. If you are reading this in 2019 and are more of a hands-on person, playing with an RL toy problem is probably the best way to understand the differences.

One last important note is that both Sutton & Barto as well as Wikipedia often have mixed, confusing or wrong formulaic representations with regard to the next state best/max action and reward:

r(t+1)

is in fact

r(t)


1 Answer

Editorial Team · Answer 1 · 2026-05-24T05:28:47+00:00

Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-Learning learns them relative to the greedy policy. Under some common conditions, they both converge to the real value function, but at different rates. Q-Learning tends to converge a little slower, but has the capability to continue learning while changing policies. Also, Q-Learning is not guaranteed to converge when combined with linear function approximation.

In practical terms, under the ε-greedy policy, Q-Learning computes the difference between Q(s,a) and the maximum action value, while SARSA computes, in expectation over the ε-greedy choice of the next action, the difference between Q(s,a) and the weighted sum of the average action value and the maximum:

Q-Learning: Q(s_{t+1}, a_{t+1}) = max_a Q(s_{t+1}, a)

SARSA (in expectation): E[Q(s_{t+1}, a_{t+1})] = ε·mean_a Q(s_{t+1}, a) + (1-ε)·max_a Q(s_{t+1}, a)
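A small numeric check may help here. The sketch below assumes the usual form of ε-greedy (with probability ε pick an action uniformly at random, greedy action included; otherwise pick the greedy action) and computes the average value SARSA bootstraps from by weighting each next action by its selection probability. The helper names and the example action values are hypothetical.

```python
def epsilon_greedy_probs(q_next, epsilon):
    """Selection probability of each action under epsilon-greedy.

    Assumes the random branch is uniform over ALL actions, so the greedy
    action gets (1 - epsilon) plus its share of epsilon.
    """
    n = len(q_next)
    greedy = max(range(n), key=lambda a: q_next[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1 - epsilon
    return probs


def q_learning_bootstrap(q_next):
    """Value Q-learning bootstraps from: the maximum over next actions."""
    return max(q_next)


def expected_sarsa_bootstrap(q_next, epsilon):
    """Average value SARSA bootstraps from when a_next is epsilon-greedy."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    return sum(p * q for p, q in zip(probs, q_next))


q_next = [1.0, 2.0, 6.0]  # hypothetical Q(s_next, a) for three actions
epsilon = 0.1

mean_q = sum(q_next) / len(q_next)
weighted = epsilon * mean_q + (1 - epsilon) * max(q_next)

print("Q-learning bootstrap:        ", q_learning_bootstrap(q_next))
print("SARSA bootstrap (expected):  ", expected_sarsa_bootstrap(q_next, epsilon))
print("eps*mean + (1 - eps)*max:    ", weighted)
```

The last two printed values agree up to floating-point rounding, which is the weighted-sum formula above. Note that a single SARSA update uses one sampled next action, so the formula describes its behaviour on average rather than each individual step.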
