Temporal Difference Learning ============================= Introduction ------------ Temporal Difference (TD) learning is a class of model-free reinforcement learning methods that learn by bootstrapping from the current estimate of the value function. TD methods update the value function based on other learned estimates, without waiting for a final outcome (as in Monte Carlo methods). TD learning combines the sampling of Monte Carlo with the bootstrapping of dynamic programming. Key characteristics of TD learning: - **Model-free**: They learn directly from experience without requiring a model of the environment. - **Bootstrapping**: They update value function estimates based on other learned estimates. - **Online learning**: They can learn from incomplete episodes. - **Lower variance, potentially higher bias**: Compared to Monte Carlo methods, they typically have lower variance but can introduce bias due to bootstrapping. TD Prediction ------------- TD prediction involves estimating the value function for a given policy. The simplest TD method is TD(0), which updates the value function as follows: .. math:: V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] Where: - :math:`V(S_t)` is the estimated value of state :math:`S_t`. - :math:`\alpha` is the learning rate (0 < :math:`\alpha` ≤ 1). - :math:`R_{t+1}` is the reward received after transitioning from :math:`S_t` to :math:`S_{t+1}`. - :math:`\gamma` is the discount factor (0 ≤ :math:`\gamma` ≤ 1). - :math:`V(S_{t+1})` is the estimated value of the next state :math:`S_{t+1}`. The term :math:`R_{t+1} + \gamma V(S_{t+1}) - V(S_t)` is known as the TD error, which represents the difference between the actual reward received and the expected reward based on the current value function. TD Control ---------- TD control aims to find the optimal policy by iteratively improving the policy based on the estimated action values. Two main TD control algorithms are: 1. **SARSA (State-Action-Reward-State-Action)**: An on-policy TD control algorithm that updates the action-value function :math:`Q(S_t, A_t)` based on the next state and action :math:`(S_{t+1}, A_{t+1})`. .. math:: Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)] 2. **Q-learning**: An off-policy TD control algorithm that updates the action-value function :math:`Q(S_t, A_t)` based on the maximum action-value in the next state. .. math:: Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)] Eligibility Traces ------------------ Eligibility traces provide a mechanism for TD learning to assign credit to past states and actions. They are used in algorithms like TD(:math:`\lambda`) and SARSA(:math:`\lambda`). The eligibility trace for a state :math:`s` at time :math:`t` is denoted by :math:`e_t(s)`. Conclusion ---------- Temporal Difference learning offers an efficient way to learn value functions and optimal policies in reinforcement learning. By bootstrapping and updating estimates based on experience, TD methods can effectively solve a wide range of control problems. References ---------- - `Reinforcement Learning: An Introduction `_