Soft Actor-Critic (SAC) Algorithm

Introduction

Soft Actor-Critic (SAC) is an advanced reinforcement learning algorithm that integrates the principle of maximum entropy into continuous control tasks. It is designed to balance the exploitation of high-reward actions with the exploration of diverse strategies by maximizing both the cumulative reward and the entropy of the policy.

Key features of SAC include:

Maximum entropy framework: Encourages diverse action selection.
Stochastic policies: Enables better exploration compared to deterministic approaches.
Off-policy learning: Utilizes a replay buffer for data efficiency.
Twin soft Q-network: Uses two separate Q-networks to mitigate overestimation bias.
Automatic temperature adjustment: Adaptively tunes the temperature parameter.
Continuous action space via reparameterization: Employs the reparameterization trick to optimize stochastic policies in continuous action spaces.

Theoretical Derivation

For ease of proof, consider a finite-horizon undiscounted return MDP, ignoring $\gamma$ , the objective function of SAC is as follows,

Important

$J(\theta)=\sum_{t=0}^T \mathbb{E}_{\pi_\theta}\left[\mathcal{R}\left(\mathbf{s}_t, \mathbf{a}_t\right)+\alpha \mathcal{H}\left(\pi_\theta\left(\cdot \mid \mathbf{s}_t\right)\right)\right],$

where $\alpha$ controls how important the entropy term is, known as temperature parameter. The inclusion of the entropy term allows SAC to balance exploitation and exploration effectively.

Soft Q-Function and Soft Value Function

The soft Q-function and soft value function are defined as follows:

Soft Q-Function:

$Q\left(s_t, a_t\right)=\mathcal{R}\left(s_t, a_t\right)+\gamma \mathbb{E}_{\pi_\theta}\left[V\left(s_{t+1}\right)\right],$

where the soft value function is given by:

$V\left(s_t\right)=\mathbb{E}_{\pi_\theta}\left[Q\left(s_t, a_t\right)-\alpha \log \pi_\theta\left(a_t \mid s_t\right)\right].$

Thus,

$Q\left(s_t, a_t\right)=\mathcal{R}\left(s_t, a_t\right)+\gamma \mathbb{E}_{\pi_\theta}\left[Q\left(s_{t+1}, a_{t+1}\right)-\alpha \log \pi_\theta\left(a_{t+1} \mid s_{t+1}\right)\right]$
Soft Q-function Update: The soft Q-function parameters can be trained to minimize the soft Bellman residual,

Important

$J_Q(w)=\mathbb{E}_{\mathcal{D}}\left[\frac{1}{2}\left(Q_w\left(s_t, a_t\right)-\left(\mathcal{R}\left(s_t, a_t\right)+\gamma \mathbb{E}_{\pi_\theta}\left[V_{\bar{\psi}}\left(s_{t+1}\right)\right]\right)\right)^2\right],$

with gradient,

$\begin{aligned} \nabla _wJ_Q(w)=&\nabla _wQ_w\left( s_t,a_t \right)\\ &\left( Q_w\left( s_t,a_t \right) -\left( \mathcal{R} \left( s_t,a_t \right) +\gamma \left( Q_{\bar{w}}\left( s_{t+1},a_{t+1} \right) -\alpha \log \left( \pi _{\theta}\left( a_{t+1}\mid s_{t+1} \right) \right) \right) \right) \right) , \end{aligned}$

where $\bar{\psi}$ and $\bar{w}$ are the target state-value function and action-value function which are the exponential moving average.

Policy Objective: The policy is updated to minimize the KL-divergence between the policy distribution and the exponentiated soft Q-function:

Important

$\begin{aligned} \pi_{\text {new }} & =\arg \min _{\pi^{\prime} \in \Pi} D_{\mathrm{KL}}\left(\pi^{\prime}\left(. \mid s_t\right) \| \frac{\exp \left(Q^{\pi_{\text {old }}}\left(s_t, .\right)\right)}{Z^{\pi_{\text {old }}}\left(s_t\right)}\right) \\ & =\arg \min _{\pi^{\prime} \in \Pi} D_{\mathrm{KL}}\left(\pi^{\prime}\left(. \mid s_t\right) \| \exp \left(Q^{\pi_{\text {old }}}\left(s_t, .\right)-\log Z^{\pi_{\text {old }}}\left(s_t\right)\right)\right) \end{aligned}$

with gradient,

$\begin{aligned} J_\pi(\theta) & =\nabla_\theta D_{\mathrm{KL}}\left(\pi_\theta\left(\cdot \mid s_t\right) \| \exp \left(Q_w\left(s_t, .\right)-\log Z_w\left(s_t\right)\right)\right) \\ & =\mathbb{E}_{\pi_\theta}\left[-\log \left(\frac{\exp \left(Q_w\left(s_t, a_t\right)-\log Z_w\left(s_t\right)\right)}{\pi_\theta\left(a_t \mid s_t\right)}\right)\right] \\ & =\mathbb{E}_{\pi_\theta}\left[\log \pi_\theta\left(a_t \mid s_t\right)-Q_w\left(s_t, a_t\right)+\log Z_w\left(s_t\right)\right], \end{aligned}$

where $Z^{\pi_{\text {old }}}$ is the partition function to normalize the distribution and $\Pi$ denotes a set of policies that can be readily tractable.

Automating Entropy Adjustment

Consider the following constrained optimization problem,

Important

$\max _{\pi_0, \ldots, \pi_T} \mathbb{E}\left[\sum_{t=0}^T \mathcal{R}\left(s_t, a_t\right)\right] \text { s.t. } \; \mathcal{H}\left(\pi_t\right) \geq \mathcal{H}_0,\quad \forall t,$

where $\mathcal{H}_0$ is a predefined minimum policy entropy threshold.

Since the policy at time $t$ can only affect the future objective value, i.e.,

$\max_{\pi_0}\left(\mathbb{E}\left[\mathcal{R}\left(s_0, a_0\right)\right]+\max_{\pi_1}\left(\mathbb{E}[\ldots]+\max_{\pi_T} \mathbb{E}\left[\mathcal{R}\left(s_T, a_T\right)\right]\right)\right).$

Starting from the last time step, we want to maximize rewards and encourage exploration, but at the same time we want to get close to the target entropy,

$\max_{\pi_T} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)\right]=\min_{\alpha_T \geq 0} \max_{\pi_T} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)-\alpha_T \log \pi\left(a_T \mid s_T\right)\right]-\alpha_T \mathcal{H}_0.$

Based on the above equation, we can solve for the optimal dual variable $\alpha_T^*$ as

Attention

$\alpha_T^*=\arg \min_{\alpha_T} \mathbb{E}_{\pi_t^*}\left[-\alpha_T \log \pi_T^*\left(a_T \mid s_T\right)-\alpha_T \mathcal{H}_0\right].$

Go back to the soft Q value function with optimal $\pi^*_T$ ,

$Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)=\mathcal{R}\left(s_{T-1}, a_{T-1}\right)+\max_{\pi_T} \mathbb{E}\left[\mathcal{R}\left(s_T, a_T\right)\right]+\alpha_T \mathcal{H}\left(\pi_T^*\right),$

then we can get,

$\begin{aligned} & \max_{\pi_{T-1}}\left(\mathbb{E}\left[\mathcal{R}\left(s_{T-1}, a_{T-1}\right)\right]+\max_{\pi_T} \mathbb{E}\left[\mathcal{R}\left(s_T, a_T\right]\right)\right) \\ & =\max_{\pi_{T-1}}\left(Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)-\alpha_T^* \mathcal{H}\left(\pi_T^*\right)\right) \\ & =\min_{\alpha_{T-1}} \max_{\pi_{T-1}}\left(\mathbb{E}\left[Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)\right]-\alpha_T^* \mathcal{H}\left(\pi_T^*\right)+\alpha_{T-1}\left(\mathcal{H}\left(\pi_{T-1}\right)-\mathcal{H}_0\right)\right) \\ & =\min_{\alpha_{T-1}} \max_{\pi_{T-1}}\left(\mathbb{E}\left[Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)\right]-\mathbb{E}\left[\alpha_{T-1} \log\pi_{T-1}^*\left(a_{T-1} \mid s_{T-1}\right)\right]-\alpha_{T-1} \mathcal{H}_0\right)-\alpha_T^* \mathcal{H}\left(\pi_T^*\right) \end{aligned}$

Similarly, we can solve for the optimal dual variable $\alpha_{T-1}^*$ as

Attention

$\alpha_{T-1}^*=\arg \min _{\alpha_{T-1}} \mathbb{E}_{\pi_{t-1}^*}\left[-\alpha_{T-1}\log\pi_{T-1}^*\left(a_{T-1} \mid s_{T-1}\right)-\alpha_{T-1} \mathcal{H}_0\right]$

By repeating this process, we can learn the optimal temperature parameter in every step by minimizing the same objective function,

Important

$J(\alpha)=\mathbb{E}_{\pi^*_t}\left[-\alpha_t \log \pi^*_t\left(a_t \mid s_t\right)-\alpha_t \mathcal{H}_0\right]$

Algorithmic flow

$\begin{algorithm}[H] \caption{Soft Actor-Critic} \label{alg1} \begin{algorithmic}[1] \STATE Input: initial policy parameters $\theta$, Q-function parameters $w_1$, $w_2$, empty replay buffer $\mathcal{D}$ \STATE Set target parameters equal to main parameters $\bar{w}_1 \leftarrow w_1$, $\bar{w}_2 \leftarrow w_2$ \REPEAT \STATE Observe state $s$ and select action $a \sim \pi_{\theta}(\cdot|s)$ \STATE Execute $a$ in the environment \STATE Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal \STATE Store $(s,a,r,s',d)$ in replay buffer $\mathcal{D}$ \STATE If $s'$ is terminal, reset environment state. \IF{it's time to update} \FOR{$j$ in range(however many updates)} \STATE Randomly sample a batch of transitions, $B = \{ (s,a,r,s',d) \}$ from $\mathcal{D}$ \STATE Compute targets for the Q functions: \begin{align*} y (r,s',d) &= r + \gamma (1-d) \left(\min_{i=1,2} Q_{\bar{w}_i} (s', \tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}'|s')\right), && \tilde{a}' \sim \pi_{\theta}(\cdot|s') \end{align*} \STATE Update Q-functions by one step of gradient descent using \begin{align*} & \nabla_{w_i} \frac{1}{|B|}\sum_{(s,a,r,s',d) \in B} \left( Q_{w_i}(s,a) - y(r,s',d) \right)^2 && \text{for } i=1,2 \end{align*} \STATE Update policy by one step of gradient descent using \begin{equation*} \nabla_{\theta} \frac{1}{|B|}\sum_{s \in B} \Big(\alpha \log \pi_{\theta} \left(\left. \tilde{a}_{\theta}(s) \right| s\right)-\min_{i=1,2} Q_{w_i}(s, \tilde{a}_{\theta}(s)) \Big), \end{equation*} where $\tilde{a}_{\theta}(s)$ is a sample from $\pi_{\theta}(\cdot|s)$ which is differentiable wrt $\theta$ via the reparametrization trick. \STATE Update the coefficients of the entropy regular term $\alpha$ \STATE Soft update target networks with \begin{align*} \bar{w}_i &\leftarrow \rho w_i + (1-\rho) w_i && \text{for } i=1,2 \end{align*} \ENDFOR \ENDIF \UNTIL{convergence} \end{algorithmic} \end{algorithm}$

Soft Actor-Critic (SAC) Algorithm

Introduction

Theoretical Derivation

Soft Q-Function and Soft Value Function

Automating Entropy Adjustment

Algorithmic flow

References