Temporal Difference Learning
Introduction
Temporal Difference (TD) learning is a class of model-free reinforcement learning methods that learn by bootstrapping from the current estimate of the value function. TD methods update the value function based on other learned estimates, without waiting for a final outcome (as in Monte Carlo methods). TD learning combines the sampling of Monte Carlo with the bootstrapping of dynamic programming.
Key characteristics of TD learning:
Model-free: They learn directly from experience without requiring a model of the environment.
Bootstrapping: They update value function estimates based on other learned estimates.
Online learning: They can update after every time step, so they can learn from incomplete episodes.
Lower variance, potentially higher bias: Compared to Monte Carlo methods, they typically have lower variance but can introduce bias due to bootstrapping.
TD Prediction
TD prediction involves estimating the value function for a given policy. The simplest TD method is TD(0), which updates the value function as follows:
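$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Where:
$V(S_t)$ is the estimated value of state $S_t$.
$\alpha$ is the learning rate ($0 < \alpha \le 1$).
$R_{t+1}$ is the reward received after transitioning from $S_t$ to $S_{t+1}$.
$\gamma$ is the discount factor ($0 \le \gamma \le 1$).
$V(S_{t+1})$ is the estimated value of the next state $S_{t+1}$.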
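The term $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is known as the TD error. It is the difference between the TD target $R_{t+1} + \gamma V(S_{t+1})$ (the reward received plus the discounted value estimate of the next state) and the current estimate $V(S_t)$.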
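As a minimal sketch of TD(0) prediction, the code below applies the update to a tabular value function. The environment is an assumption chosen for illustration: a 5-state random walk that starts in the middle, steps left or right with equal probability, and pays a reward of 1 only when it exits on the right. The names `td0_update` and `random_walk_td0` and the default hyperparameters are hypothetical, not part of any library.

```python
import random


def td0_update(V, s, r, s_next, alpha, gamma, terminal):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    # The value of a terminal state is taken to be 0.
    v_next = 0.0 if terminal else V[s_next]
    td_error = r + gamma * v_next - V[s]
    V[s] += alpha * td_error
    return td_error


def random_walk_td0(episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate state values of a 5-state random walk (illustrative environment)."""
    rng = random.Random(seed)
    n = 5
    V = [0.5] * n  # arbitrary initial estimates
    for _ in range(episodes):
        s = n // 2  # every episode starts in the middle state
        while True:
            s_next = s + rng.choice((-1, 1))
            terminal = s_next < 0 or s_next >= n
            r = 1.0 if s_next >= n else 0.0  # reward only for exiting on the right
            # Update immediately after each step, before the episode ends.
            td0_update(V, s, r, s_next, alpha, gamma, terminal)
            if terminal:
                break
            s = s_next
    return V


if __name__ == "__main__":
    print([round(v, 2) for v in random_walk_td0()])
```

With $\gamma = 1$ the true values for this walk are 1/6, 2/6, 3/6, 4/6 and 5/6 from left to right, so the estimates should settle near those numbers, with some residual noise because the step size is constant.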
TD Control
TD control aims to find the optimal policy by iteratively improving the policy based on the estimated action values. Two main TD control algorithms are:
SARSA (State-Action-Reward-State-Action): An on-policy TD control algorithm that updates the action-value function $Q(S_t, A_t)$ based on the next state and action $(S_{t+1}, A_{t+1})$.
Q-learning: An off-policy TD control algorithm that updates the action-value function $Q(S_t, A_t)$ based on the maximum action-value in the next state.
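The difference between the two algorithms lies only in the bootstrap target, which the following sketch tries to make concrete. It assumes a tabular action-value function stored in a `defaultdict` keyed by `(state, action)`. The function names, the `terminal` flag, and the epsilon-greedy helper are hypothetical choices for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """On-policy: bootstrap from the action actually selected in the next state."""
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """Off-policy: bootstrap from the maximum action value in the next state."""
    target = r if terminal else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


if __name__ == "__main__":
    Q = defaultdict(float)
    Q[("s1", "left")] = 1.0
    Q[("s1", "right")] = 3.0
    # Same transition, different targets: SARSA uses the chosen action "left",
    # Q-learning uses the best available action "right".
    sarsa_update(Q, "s0", "right", 0.0, "s1", "left", alpha=0.5, gamma=0.9)
    print(Q[("s0", "right")])
    Q[("s0", "right")] = 0.0
    q_learning_update(Q, "s0", "right", 0.0, "s1", ["left", "right"], alpha=0.5, gamma=0.9)
    print(Q[("s0", "right")])
```

In a full control loop, SARSA would pass the action returned by `epsilon_greedy` for the next state as `a_next` and then execute that same action, whereas Q-learning can follow any exploratory behaviour policy while still learning about the greedy one.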
Eligibility Traces
Eligibility traces provide a mechanism for TD learning to assign credit to past states and actions. They are used in algorithms like TD($\lambda$) and SARSA($\lambda$). The eligibility trace for a state $s$ at time $t$ is denoted by $e_t(s)$.
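To illustrate how a trace spreads credit backwards, here is a sketch of tabular TD($\lambda$) for one episode. It assumes the accumulating-trace variant, in which every trace decays by $\gamma\lambda$ each step and the trace of the visited state is incremented by 1. The text above does not say which variant is intended, so treat that choice, the function name `td_lambda_episode`, and the transition tuple layout as assumptions.

```python
def td_lambda_episode(V, transitions, alpha, gamma, lam):
    """Run tabular TD(lambda) with accumulating traces over one episode.

    V: dict mapping each state to its current value estimate (updated in place).
    transitions: list of (s, r, s_next, terminal) tuples in time order.
    """
    e = {s: 0.0 for s in V}  # eligibility traces, reset at the start of the episode
    for s, r, s_next, terminal in transitions:
        v_next = 0.0 if terminal else V[s_next]
        delta = r + gamma * v_next - V[s]  # TD error for this step
        e[s] += 1.0  # accumulating trace: bump the state just visited
        for x in V:
            # Every state is updated in proportion to its current trace,
            # then its trace decays.
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam
    return V


if __name__ == "__main__":
    V = {"A": 0.0, "B": 0.0, "C": 0.0}
    episode = [("A", 0.0, "B", False), ("B", 0.0, "C", False), ("C", 1.0, None, True)]
    print(td_lambda_episode(V, episode, alpha=0.5, gamma=1.0, lam=0.5))
```

In this three-step episode the only nonzero TD error occurs on the final transition, yet states A and B still receive updates of 0.125 and 0.25 because their traces have decayed to 0.25 and 0.5 by then. Setting `lam=0` reduces the procedure to TD(0), where only C would change.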
Conclusion
Temporal Difference learning offers an efficient way to learn value functions and optimal policies in reinforcement learning. By bootstrapping from current estimates and updating after each step of experience, TD methods such as TD(0), SARSA, and Q-learning can address both prediction and control problems without a model of the environment.