Markov Decision Processes

Introduction

Markov Decision Processes (MDPs) provide a mathematical framework for modeling decision-making in environments where outcomes are partly random and partly under the control of a decision maker. This framework is fundamental in fields such as reinforcement learning, operations research, and economics.

This document explores key concepts of MDPs, starting with Markov Processes (MP), then Markov Reward Processes (MRP), and finally MDPs. Each section provides definitions and the corresponding Bellman equations.

Markov Processes

A Markov Process (MP) is a stochastic process characterized by the Markov property, which states that the future state depends only on the current state and not on the sequence of states that preceded it. A state S_t is Markov (or with Markov property) if and only if

P\left[S_{t+1} \mid S_t\right]=P\left[S_{t+1} \mid S_1, \ldots, S_t\right].

Definition

Note

A Markov Process (or Markov Chain) is a tuple \langle\mathcal{S}, P\rangle

  • \mathcal{S} is a (finite) set of states.

  • P is a state transition probability matrix, P=P\left[S_{t+1}=s^{\prime} \mid S_t=s\right]

Bellman Equation

Tip

For a Markov Process, the Bellman equation describes the recursive relationship of state transition probabilities,

P(s^{\prime}) = \sum_{s} P(s^{\prime} \mid s) P(s).

Markov Reward Processes

A Markov Reward Process (MRP) extends an MP by associating rewards (values) with state transitions.

Definition

Note

An MRP is a tuple \langle\mathcal{S}, P,\mathcal{R},\gamma\rangle

  • \mathcal{S} is a (finite) set of states.

  • P is a state transition probability matrix, P=P\left[S_{t+1}=s^{\prime} \mid S_t=s\right]

  • \mathcal{R}_s is a reward function.

  • \gamma \in [0, 1] is a discount factor.

Bellman Equation

The state value function V(s) of an MRP is the expected return starting from state s,

V(s)=\mathbb{E}\left[G_t \mid S_t=s\right].

The Bellman equation for the value function V(s) is,

\begin{aligned}
V(s) & =\mathbb{E}\left[G_t \mid S_t=s\right] \\
& =\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots \mid S_t=s\right] \\
& =\mathbb{E}\left[R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\ldots\right) \mid S_t=s\right] \\
& =\mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_t=s\right] \\
& =\mathbb{E}\left[R_{t+1}+\gamma V\left(S_{t+1}\right) \mid S_t=s\right].
\end{aligned}

Tip

V(s)=\mathcal{R}_s+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s\right) V\left(s^{\prime}\right).

Markov Decision Processes

Definition

A Markov Decision Process (MDP) introduces decision-making into an MRP. It is defined by the tuple \langle\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma\rangle where,

  • \mathcal{S} is a finite set of states.

  • \mathcal{A} is a finite set of actions, \pi(a \mid s)=P\left[A_t=a \mid S_t=s\right]

  • P(s' \mid s, a) is the transition probability given action a, P=P\left[S_{t+1}=s^{\prime} \mid S_t=s, A_t=a\right]

  • \mathcal{R}^a_s is the expected reward for taking action a in state s.

  • \gamma \in [0, 1] is the discount factor.

Bellman Expectation Equation

The Bellman expectation equation for the state value function V^{\pi}(s) and action value function Q^{\pi}(s,a) are,

Tip

\begin{aligned}
V^\pi(s) & =\mathbb{E}_\pi\left[R_{t+1}+\gamma V^\pi\left(S_{t+1}\right) \mid S_t=s\right] \\
& =\sum_{a \in A} \pi(a \mid s)\left(\mathcal{R}^a_s+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V^\pi\left(s^{\prime}\right)\right), \\[5pt]
Q^\pi(s, a) & =\mathbb{E}_\pi\left[R_{t+1}+\gamma Q^\pi\left(S_{t+1}, A_{t+1}\right) \mid S_t=s, A_t=a\right] \\
& =\mathcal{R}^a_s+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) \sum_{a^{\prime} \in A} \pi\left(a^{\prime} \mid s^{\prime}\right) Q^\pi\left(s^{\prime}, a^{\prime}\right).
\end{aligned}

Bellman Optimality Equation

For the optimal state value function V^*(s) and action value function Q^*(s,a) are,

V^*(s)=\max_\pi V^\pi(s), \text{ and }Q^*(s, a)=\max_\pi Q^\pi(s, a),

and we have,

Tip

\begin{aligned}
V^*(s) & =\max_{a \in \mathcal{A}}\left\{\mathcal{R}^a_s+\gamma \sum_{s^{\prime} \in \mathcal{S}} P\left(s^{\prime} \mid s, a\right) V^*\left(s^{\prime}\right)\right\}, \\[5pt]
Q^*(s, a) & =\mathcal{R}^a_s+\gamma \sum_{s^{\prime} \in \mathcal{S}} P\left(s^{\prime} \mid s, a\right) \max _{a^{\prime} \in \mathcal{A}} Q^*\left(s^{\prime}, a^{\prime}\right).
\end{aligned}

Important

In sum, the Bellman equation recursively defines a value function as the sum of the immediate reward and the discounted expected value of future states.

References