Notations ========== This section provides the notations and definitions commonly used in reinforcement learning. The following table outlines the symbols and their meanings. .. list-table:: :widths: 20 80 :header-rows: 1 * - Symbol - Meaning * - :math:`s \in \mathcal{S}` - State space. * - :math:`a \in \mathcal{A}` - Action space. * - :math:`r \in \mathcal{R}` - Reward space, being equal to the space of values of the reward function. * - :math:`\mathcal{R}^a_s` - Reward funciton, :math:`\mathcal{R}_s^a=\mathbb{E}\left[R_{t+1} \mid S_t=s, A_t=a\right]`. * - :math:`\mathcal{R}_s` - Reward funciton, :math:`\mathcal{R}_s=\mathbb{E}\left[R_{t+1} \mid S_t=s\right]`. * - :math:`\mathcal{H}(\cdot)` - Entropy of the source, :math:`\mathcal{H}(X):=-\sum_{x \in \mathcal{X}} p(x) \log p(x)`. * - :math:`\mathcal{D}` - Replay buffer. * - :math:`S_t, A_t, R_t` - State, action, and reward at time step :math:`t` of one trajectory. * - :math:`\gamma` - Discount factor (:math:`0 < \gamma \leq 1`). * - :math:`G_t` - Return (:math:`G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}`). * - :math:`P(s' | s, a)` - Transition probability of getting to the next state :math:`s'` from the current state :math:`s` with action :math:`a`. * - :math:`\pi(a|s)` - Stochastic policy (agent behavior strategy), :math:`\pi_\theta(.)` is a policy parameterized by :math:`\theta`. * - :math:`\mu(s)` - Deterministic policy. * - :math:`V(s)` - State-value function of a given state :math:`s`, :math:`V_w(.)` is parameterized by :math:`w`. * - :math:`V^\pi(s)` - The value of state :math:`s` when we follow a policy :math:`\pi`, :math:`V^\pi(s) = \mathbb{E}_{\pi}[G_t | S_t = s]`. * - :math:`Q(s, a)` - Action-value function of the given state and action :math:`(s, a)`, :math:`Q_w(.)` is parameterized by :math:`w`. * - :math:`Q^\pi(s, a)` - Action-value function when we follow a policy :math:`\pi`, :math:`Q^\pi(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]`. * - :math:`A(s, a)` - Advantage function, :math:`A(s, a) = Q(s, a) - V(s)`. .. note:: In this document, we adopt the following conventions: - **Uppercase letters** represent **random variables** or **functions**, such as :math:`S, A, R`, etc. - **Calligraphic uppercase letters** represent **sets**, such as :math:`\mathcal{S}, \mathcal{A}, \mathcal{R}`, etc. - **Lowercase letters** represent **deterministic values**, such as :math:`s, a , r`, etc. Bellman Expectation Equation ------------------------------ .. important:: .. math:: \begin{aligned} &V^{\pi}(s)=\sum_{a\in \mathcal{A}}{\pi}(a\mid s)Q^{\pi}(s,a)\\[5pt] &Q^{\pi}(s,a)=\mathcal{R} _{s}^{a}+\gamma \sum_{s^{\prime}\in \mathcal{S}}{P}\left( s^{\prime}\mid s,a \right) V^{\pi}\left( s^{\prime} \right)\\[5pt] &V^{\pi}(s)=\sum_{a\in \mathcal{A}}{\pi}(a\mid s)\left( \mathcal{R} _{s}^{a}+\gamma \sum_{s^{\prime}\in \mathcal{S}}{P}\left( s^{\prime}\mid s,a \right) V^{\pi}\left( s^{\prime} \right) \right)\\[5pt] &Q^{\pi}(s,a)=\mathcal{R} _{s}^{a}+\gamma \sum_{s^{\prime}\in \mathcal{S}}{P}\left( s^{\prime}\mid s,a \right) \sum_{a^{\prime}\in \mathcal{A}}{\pi}\left( a^{\prime}\mid s^{\prime} \right) Q^{\pi}\left( s^{\prime},a^{\prime} \right)\\ \end{aligned} References ---------------- - https://www.davidsilver.uk/teaching/