Temporal Difference Learning

Introduction

Temporal Difference (TD) learning is a class of model-free reinforcement learning methods that learn by bootstrapping from the current estimate of the value function. TD methods update the value function based on other learned estimates, without waiting for a final outcome (as in Monte Carlo methods). TD learning combines the sampling of Monte Carlo with the bootstrapping of dynamic programming.

Key characteristics of TD learning:

  • Model-free: They learn directly from experience without requiring a model of the environment.

  • Bootstrapping: They update value function estimates based on other learned estimates.

  • Online learning: They can learn from incomplete episodes.

  • Lower variance, potentially higher bias: Compared to Monte Carlo methods, they typically have lower variance but can introduce bias due to bootstrapping.

TD Prediction

TD prediction involves estimating the value function for a given policy. The simplest TD method is TD(0), which updates the value function as follows:

V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]

Where:

  • V(S_t) is the estimated value of state S_t.

  • \alpha is the learning rate (0 < \alpha ≤ 1).

  • R_{t+1} is the reward received after transitioning from S_t to S_{t+1}.

  • \gamma is the discount factor (0 ≤ \gamma ≤ 1).

  • V(S_{t+1}) is the estimated value of the next state S_{t+1}.

The term R_{t+1} + \gamma V(S_{t+1}) - V(S_t) is known as the TD error, which represents the difference between the actual reward received and the expected reward based on the current value function.

TD Control

TD control aims to find the optimal policy by iteratively improving the policy based on the estimated action values. Two main TD control algorithms are:

  1. SARSA (State-Action-Reward-State-Action): An on-policy TD control algorithm that updates the action-value function Q(S_t, A_t) based on the next state and action (S_{t+1}, A_{t+1}).

    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]

  2. Q-learning: An off-policy TD control algorithm that updates the action-value function Q(S_t, A_t) based on the maximum action-value in the next state.

    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]

Eligibility Traces

Eligibility traces provide a mechanism for TD learning to assign credit to past states and actions. They are used in algorithms like TD(\lambda) and SARSA(\lambda). The eligibility trace for a state s at time t is denoted by e_t(s).

Conclusion

Temporal Difference learning offers an efficient way to learn value functions and optimal policies in reinforcement learning. By bootstrapping and updating estimates based on experience, TD methods can effectively solve a wide range of control problems.

References