Temporal Difference Learning

Introduction

Temporal Difference (TD) learning is a class of model-free reinforcement learning methods that learn by bootstrapping from the current estimate of the value function. TD methods update the value function based on other learned estimates, without waiting for a final outcome (as in Monte Carlo methods). TD learning combines the sampling of Monte Carlo with the bootstrapping of dynamic programming.

Key characteristics of TD learning:

Model-free: They learn directly from experience without requiring a model of the environment.
Bootstrapping: They update value function estimates based on other learned estimates.
Online learning: They can learn from incomplete episodes.
Lower variance, potentially higher bias: Compared to Monte Carlo methods, they typically have lower variance but can introduce bias due to bootstrapping.

TD Prediction

TD prediction involves estimating the value function for a given policy. The simplest TD method is TD(0), which updates the value function as follows:

$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$

Where:

$V(S_t)$ is the estimated value of state $S_t$ .
$\alpha$ is the learning rate (0 < $\alpha$ ≤ 1).
$R_{t+1}$ is the reward received after transitioning from $S_t$ to $S_{t+1}$ .
$\gamma$ is the discount factor (0 ≤ $\gamma$ ≤ 1).
$V(S_{t+1})$ is the estimated value of the next state $S_{t+1}$ .

The term $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is known as the TD error, which represents the difference between the actual reward received and the expected reward based on the current value function.

TD Control

TD control aims to find the optimal policy by iteratively improving the policy based on the estimated action values. Two main TD control algorithms are:

SARSA (State-Action-Reward-State-Action): An on-policy TD control algorithm that updates the action-value function $Q(S_t, A_t)$ based on the next state and action $(S_{t+1}, A_{t+1})$ .

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Q-learning: An off-policy TD control algorithm that updates the action-value function $Q(S_t, A_t)$ based on the maximum action-value in the next state.

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$

Eligibility Traces

Eligibility traces provide a mechanism for TD learning to assign credit to past states and actions. They are used in algorithms like TD( $\lambda$ ) and SARSA( $\lambda$ ). The eligibility trace for a state $s$ at time $t$ is denoted by $e_t(s)$ .

Conclusion

Temporal Difference learning offers an efficient way to learn value functions and optimal policies in reinforcement learning. By bootstrapping and updating estimates based on experience, TD methods can effectively solve a wide range of control problems.

References

Reinforcement Learning: An Introduction