Notations

This section lists the notation and definitions commonly used in reinforcement learning. The following table gives each symbol and its meaning.

Symbol

Meaning

s \in \mathcal{S}

State s, an element of the state space \mathcal{S}.

a \in \mathcal{A}

Action a, an element of the action space \mathcal{A}.

r \in \mathcal{R}

Reward r, an element of the reward space \mathcal{R}, which is the set of values the reward function can take.

\mathcal{R}^a_s

Reward function: the expected immediate reward for taking action a in state s, \mathcal{R}_s^a=\mathbb{E}\left[R_{t+1} \mid S_t=s, A_t=a\right].

\mathcal{R}_s

Reward function: the expected immediate reward in state s, \mathcal{R}_s=\mathbb{E}\left[R_{t+1} \mid S_t=s\right].

\mathcal{H}(\cdot)

Entropy of a discrete random variable X with distribution p over \mathcal{X}, \mathcal{H}(X):=-\sum_{x \in \mathcal{X}} p(x) \log p(x).

\mathcal{D}

Replay buffer.

S_t, A_t, R_t

State, action, and reward at time step t of one trajectory.

\gamma

Discount factor (0 < \gamma \leq 1).

G_t

Return, the discounted sum of future rewards: G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.

P(s' | s, a)

Transition probability of getting to the next state s' from the current state s with action a.

\pi(a|s)

Stochastic policy (agent behavior strategy); \pi_\theta(\cdot) is a policy parameterized by \theta.

\mu(s)

Deterministic policy.

V(s)

State-value function of a given state s; V_w(\cdot) is a value function parameterized by w.

V^\pi(s)

The value of state s when we follow a policy \pi, V^\pi(s) = \mathbb{E}_{\pi}[G_t | S_t = s].

Q(s, a)

Action-value function of a given state-action pair (s, a); Q_w(\cdot) is an action-value function parameterized by w.

Q^\pi(s, a)

Action-value function when we follow a policy \pi, Q^\pi(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a].

A(s, a)

Advantage function, A(s, a) = Q(s, a) - V(s).
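Several of the definitions in the table can be evaluated directly on a finite trajectory. The following is a minimal Python sketch, not part of the original notation: the function names `discounted_return`, `entropy`, and `advantage` are hypothetical, the trajectory is assumed to be finite (so the infinite sum in G_t is truncated at the end of the episode), and the entropy uses the natural logarithm.

```python
import math


def discounted_return(rewards, gamma):
    """Return G_t for a finite reward sequence [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Accumulate backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def entropy(probs):
    """Return H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def advantage(q_value, v_value):
    """Return A(s, a) = Q(s, a) - V(s) for scalar estimates of Q and V."""
    return q_value - v_value
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` evaluates 1 + 0.5 * 1 + 0.25 * 1, and `entropy([0.5, 0.5])` equals log 2.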

Note

In this document, we adopt the following conventions:

  • Uppercase letters represent random variables or functions, such as S, A, and R.

  • Calligraphic uppercase letters represent sets, such as \mathcal{S}, \mathcal{A}, and \mathcal{R}.

  • Lowercase letters represent deterministic values, such as s, a, and r.

Bellman Expectation Equation

Important

\begin{aligned}
  &V^{\pi}(s)=\sum_{a\in \mathcal{A}}{\pi}(a\mid s)Q^{\pi}(s,a)\\[5pt]
  &Q^{\pi}(s,a)=\mathcal{R} _{s}^{a}+\gamma \sum_{s^{\prime}\in \mathcal{S}}{P}\left( s^{\prime}\mid s,a \right) V^{\pi}\left( s^{\prime} \right)\\[5pt]
  &V^{\pi}(s)=\sum_{a\in \mathcal{A}}{\pi}(a\mid s)\left( \mathcal{R} _{s}^{a}+\gamma \sum_{s^{\prime}\in \mathcal{S}}{P}\left( s^{\prime}\mid s,a \right) V^{\pi}\left( s^{\prime} \right) \right)\\[5pt]
  &Q^{\pi}(s,a)=\mathcal{R} _{s}^{a}+\gamma \sum_{s^{\prime}\in \mathcal{S}}{P}\left( s^{\prime}\mid s,a \right) \sum_{a^{\prime}\in \mathcal{A}}{\pi}\left( a^{\prime}\mid s^{\prime} \right) Q^{\pi}\left( s^{\prime},a^{\prime} \right)\\
\end{aligned}
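The first two equations express V^\pi and Q^\pi in terms of each other; substituting one into the other gives the last two. As an illustration, the sketch below applies these two updates repeatedly until V stops changing (iterative policy evaluation), which is one possible way to compute V^\pi for a small tabular problem. The two-state, two-action MDP, its numbers, and the names `q_from_v`, `v_from_q`, and `evaluate_policy` are all hypothetical and chosen only for the example.

```python
# Hypothetical 2-state, 2-action MDP; every number here is illustrative.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# R[s][a] is the expected reward R_s^a.
R = [[1.0, 0.0], [0.0, 2.0]]

# P[s][a][s2] is the transition probability P(s2 | s, a).
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]

# PI[s][a] is the stochastic policy pi(a | s).
PI = [[0.5, 0.5], [0.3, 0.7]]


def q_from_v(v):
    """Q(s, a) = R_s^a + gamma * sum_{s'} P(s' | s, a) V(s')."""
    return [
        [
            R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)
            for a in ACTIONS
        ]
        for s in STATES
    ]


def v_from_q(q):
    """V(s) = sum_a pi(a | s) Q(s, a)."""
    return [sum(PI[s][a] * q[s][a] for a in ACTIONS) for s in STATES]


def evaluate_policy(tol=1e-10, max_iters=10_000):
    """Apply the two updates above until V changes by less than tol."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iters):
        new_v = v_from_q(q_from_v(v))
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


V = evaluate_policy()
Q = q_from_v(V)
```

At convergence, `V` and `Q` satisfy all four equations up to the tolerance, because each equation is a rearrangement of the same fixed-point condition. With \gamma < 1 the combined update is a contraction, so the loop converges from any starting V.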

References