Notations

This section introduces the notation and definitions commonly used in reinforcement learning. The following table lists the symbols and their meanings.

Symbol	Meaning
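$s \in \mathcal{S}$	State $s$, an element of the state space $\mathcal{S}$.
$a \in \mathcal{A}$	Action $a$, an element of the action space $\mathcal{A}$.
$r \in \mathcal{R}$	Reward $r$, an element of the reward space $\mathcal{R}$ (the set of values the reward function can take).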
$\mathcal{R}^a_s$	Reward function for a state-action pair, $\mathcal{R}_s^a=\mathbb{E}\left[R_{t+1} \mid S_t=s, A_t=a\right]$.
$\mathcal{R}_s$	Reward function for a state, $\mathcal{R}_s=\mathbb{E}\left[R_{t+1} \mid S_t=s\right]$.
$\mathcal{H}(\cdot)$	Entropy of a discrete random variable $X$ with distribution $p$, $\mathcal{H}(X):=-\sum_{x \in \mathcal{X}} p(x) \log p(x)$.
$\mathcal{D}$	Replay buffer.
$S_t, A_t, R_t$	State, action, and reward at time step $t$ of one trajectory.
$\gamma$	Discount factor ( $0 < \gamma \leq 1$ ).
$G_t$	Return ( $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ ).
$P(s' \mid s, a)$	Transition probability of reaching the next state $s'$ from the current state $s$ with action $a$.
$\pi(a \mid s)$	Stochastic policy (agent behavior strategy); $\pi_\theta(\cdot)$ is a policy parameterized by $\theta$.
$\mu(s)$	Deterministic policy.
$V(s)$	State-value function of a given state $s$ , $V_w(.)$ is parameterized by $w$ .
$V^\pi(s)$	The value of state $s$ when we follow a policy $\pi$, $V^\pi(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$.
$Q(s, a)$	Action-value function of the given state and action $(s, a)$ , $Q_w(.)$ is parameterized by $w$ .
$Q^\pi(s, a)$	Action-value function when we follow a policy $\pi$, $Q^\pi(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$.
$A(s, a)$	Advantage function, $A(s, a) = Q(s, a) - V(s)$ .
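
To make two of the definitions in the table concrete, the following is a minimal Python sketch of the return $G_t$ and the entropy $\mathcal{H}(X)$. The function names `discounted_returns` and `entropy`, the sample rewards, and the choice of natural logarithm are assumptions made for illustration; the episode is assumed to be finite, so the infinite sum in $G_t$ is truncated at the end of the episode.

```python
import math


def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for a finite episode.

    rewards[t] is taken to be R_{t+1}, the reward received after step t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    # Terms with p = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Made-up example values, for illustration only.
returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
uniform_entropy = entropy([0.25, 0.25, 0.25, 0.25])
```

Computing the returns backwards costs one pass over the episode, whereas evaluating the sum separately for every $t$ would be quadratic in the episode length.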

Note

In this document, we adopt the following conventions:

Uppercase letters represent random variables or functions, such as $S, A, R$ , etc.
Calligraphic uppercase letters represent sets, such as $\mathcal{S}, \mathcal{A}, \mathcal{R}$ , etc.
Lowercase letters represent deterministic values, such as $s, a , r$ , etc.

Bellman Expectation Equation

Important

$\begin{aligned} &V^{\pi}(s)=\sum_{a\in \mathcal{A}}{\pi}(a\mid s)Q^{\pi}(s,a)\\[5pt] &Q^{\pi}(s,a)=\mathcal{R} _{s}^{a}+\gamma \sum_{s^{\prime}\in \mathcal{S}}{P}\left( s^{\prime}\mid s,a \right) V^{\pi}\left( s^{\prime} \right)\\[5pt] &V^{\pi}(s)=\sum_{a\in \mathcal{A}}{\pi}(a\mid s)\left( \mathcal{R} _{s}^{a}+\gamma \sum_{s^{\prime}\in \mathcal{S}}{P}\left( s^{\prime}\mid s,a \right) V^{\pi}\left( s^{\prime} \right) \right)\\[5pt] &Q^{\pi}(s,a)=\mathcal{R} _{s}^{a}+\gamma \sum_{s^{\prime}\in \mathcal{S}}{P}\left( s^{\prime}\mid s,a \right) \sum_{a^{\prime}\in \mathcal{A}}{\pi}\left( a^{\prime}\mid s^{\prime} \right) Q^{\pi}\left( s^{\prime},a^{\prime} \right)\\ \end{aligned}$
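
As a rough illustration of how these equations can be used, the sketch below applies the third equation repeatedly as an update rule (iterative policy evaluation) on a tiny tabular MDP, then recovers $Q^\pi$ from $V^\pi$ with the second equation. The MDP itself (two states, two actions, and every number in `R`, `P`, and `PI`) is hypothetical and chosen only for this example, as are the helper names `q_from_v` and `evaluate_policy`.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# R[s][a] is the expected immediate reward R_s^a.
R = [[1.0, 0.0],
     [0.0, 2.0]]

# P[s][a][s2] is P(s2 | s, a); each innermost list sums to 1.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]

# PI[s][a] is pi(a | s), a fixed stochastic policy.
PI = [[0.5, 0.5],
      [0.3, 0.7]]


def q_from_v(v, s, a):
    """Q(s, a) = R_s^a + gamma * sum over s' of P(s' | s, a) * V(s')."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)


def evaluate_policy(tol=1e-12, max_iters=10_000):
    """Repeat the Bellman expectation backup until V stops changing."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iters):
        # V(s) = sum over a of pi(a | s) * Q(s, a), with Q computed from the old V.
        new_v = [sum(PI[s][a] * q_from_v(v, s, a) for a in ACTIONS)
                 for s in STATES]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return v


V = evaluate_policy()
Q = [[q_from_v(V, s, a) for a in ACTIONS] for s in STATES]
```

With $\gamma < 1$ the backup is a contraction, so the iteration settles on a fixed point; at that point `V[s]` should match `sum(PI[s][a] * Q[s][a] for a in ACTIONS)` up to the tolerance, which is the first equation above.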

References

https://www.davidsilver.uk/teaching/