Soft Actor-Critic (SAC) Algorithm

Introduction

Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm for continuous control that is built on the maximum entropy principle. It balances the exploitation of high-reward actions against the exploration of diverse strategies by maximizing both the cumulative reward and the entropy of the policy.

Key features of SAC include:

  • Maximum entropy framework: Encourages diverse action selection.

  • Stochastic policies: Enables better exploration compared to deterministic approaches.

  • Off-policy learning: Utilizes a replay buffer for data efficiency.

  • Twin soft Q-networks: Uses two separate Q-networks and takes the minimum of their estimates to mitigate overestimation bias.

  • Automatic temperature adjustment: Adaptively tunes the temperature parameter.

  • Continuous action space via reparameterization: Employs the reparameterization trick to optimize stochastic policies in continuous action spaces.
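The reparameterization trick named in the last item can be illustrated with a minimal sketch. It assumes a diagonal Gaussian policy with mean `mu` and log standard deviation `log_std`; the function name and the use of NumPy are illustrative choices, not part of the original text. The sample is written as a deterministic function of the policy parameters and an independent noise variable, which is what makes it differentiable with respect to those parameters.

```python
import numpy as np

def reparameterized_sample(mu, log_std, rng):
    """Draw a = mu + std * eps with eps ~ N(0, I); return (a, log pi(a)).

    Assumption: a diagonal Gaussian policy (hypothetical helper for illustration).
    """
    mu = np.asarray(mu, dtype=float)
    std = np.exp(np.asarray(log_std, dtype=float))
    eps = rng.standard_normal(mu.shape)   # noise independent of the parameters
    action = mu + std * eps               # deterministic in mu and std given eps
    # Diagonal Gaussian log-density evaluated at the sampled action.
    log_prob = np.sum(-0.5 * eps**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi))
    return action, float(log_prob)
```

With the noise held fixed, shifting `mu` shifts the action by the same amount, so gradients of any downstream quantity can flow back into the policy parameters.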

Theoretical Derivation

For ease of proof, consider a finite-horizon MDP with undiscounted return (that is, ignoring the discount factor \gamma). The objective function of SAC is then

Important

J(\theta)=\sum_{t=0}^T \mathbb{E}_{\pi_\theta}\left[\mathcal{R}\left(\mathbf{s}_t, \mathbf{a}_t\right)+\alpha \mathcal{H}\left(\pi_\theta\left(\cdot \mid \mathbf{s}_t\right)\right)\right],

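where \alpha, known as the temperature parameter, controls the relative importance of the entropy term. Including the entropy term allows SAC to balance exploitation and exploration: a larger \alpha favors more random policies, while \alpha = 0 recovers the standard expected-return objective.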
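To make the effect of the entropy term concrete, the following sketch evaluates the objective for a toy problem. It assumes a discrete action set and a single fixed state per time step, so the expectation can be computed exactly; the function names and the toy data are hypothetical and serve only as illustration.

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    nz = probs > 0  # treat 0 * log 0 as 0
    return float(-np.sum(probs[nz] * np.log(probs[nz])))

def soft_objective(rewards, action_probs, alpha):
    """Sum over time steps of expected reward plus alpha times policy entropy.

    rewards[t][a]      : reward for action a at step t (toy data, an assumption)
    action_probs[t][a] : probability pi(a | s_t) at step t
    """
    total = 0.0
    for r_t, p_t in zip(rewards, action_probs):
        expected_reward = float(np.dot(p_t, r_t))
        total += expected_reward + alpha * policy_entropy(p_t)
    return total
```

For one step with rewards `[1.0, 0.0]`, a greedy policy `[1.0, 0.0]` scores higher when `alpha = 0`, whereas the uniform policy `[0.5, 0.5]` scores higher when `alpha = 1`, because its entropy bonus of log 2 outweighs the lost reward.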

Soft Q-Function and Soft Value Function

The soft Q-function and soft value function are defined as follows:

  1. Soft Q-Function:

    Q\left(s_t, a_t\right)=\mathcal{R}\left(s_t, a_t\right)+\gamma \mathbb{E}_{\pi_\theta}\left[V\left(s_{t+1}\right)\right],

    where the soft value function is given by:

    V\left(s_t\right)=\mathbb{E}_{\pi_\theta}\left[Q\left(s_t, a_t\right)-\alpha \log \pi_\theta\left(a_t \mid s_t\right)\right].

    Thus,

    Q\left(s_t, a_t\right)=\mathcal{R}\left(s_t, a_t\right)+\gamma \mathbb{E}_{\pi_\theta}\left[Q\left(s_{t+1}, a_{t+1}\right)-\alpha \log \pi_\theta\left(a_{t+1} \mid s_{t+1}\right)\right]

  2. Soft Q-function Update: The soft Q-function parameters can be trained to minimize the soft Bellman residual,

    Important

    J_Q(w)=\mathbb{E}_{\mathcal{D}}\left[\frac{1}{2}\left(Q_w\left(s_t, a_t\right)-\left(\mathcal{R}\left(s_t, a_t\right)+\gamma \mathbb{E}_{\pi_\theta}\left[V_{\bar{\psi}}\left(s_{t+1}\right)\right]\right)\right)^2\right],

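Replacing the target value V_{\bar{\psi}}\left(s_{t+1}\right) with its single-sample estimate Q_{\bar{w}}\left(s_{t+1}, a_{t+1}\right)-\alpha \log \pi_\theta\left(a_{t+1} \mid s_{t+1}\right), where a_{t+1} \sim \pi_\theta\left(\cdot \mid s_{t+1}\right), gives the stochastic gradient,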

\begin{aligned}
     \nabla _wJ_Q(w)=&\nabla _wQ_w\left( s_t,a_t \right)\\
     &\left( Q_w\left( s_t,a_t \right) -\left( \mathcal{R} \left( s_t,a_t \right) +\gamma \left( Q_{\bar{w}}\left( s_{t+1},a_{t+1} \right) -\alpha \log \left( \pi _{\theta}\left( a_{t+1}\mid s_{t+1} \right) \right) \right) \right) \right) ,
     \end{aligned}

where \bar{\psi} and \bar{w} are the parameters of the target state-value and action-value networks, each maintained as an exponential moving average of the corresponding online parameters.
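The soft value function and the soft Bellman residual above can be sketched numerically as follows. This is an illustration under stated assumptions: a small discrete action set is used so that the expectation over actions is exact (SAC itself estimates it by sampling in continuous action spaces), and the function names are hypothetical.

```python
import numpy as np

def soft_value(q_values, probs, alpha):
    """V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] for a discrete action set."""
    q_values = np.asarray(q_values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    nz = probs > 0  # actions with zero probability contribute nothing
    return float(np.sum(probs[nz] * (q_values[nz] - alpha * np.log(probs[nz]))))

def soft_bellman_residual(q_sa, reward, next_v, gamma):
    """0.5 * (Q(s,a) - (r + gamma * V(s')))^2 for a single transition."""
    target = reward + gamma * next_v
    return 0.5 * (q_sa - target) ** 2
```

For example, with equal Q-values of 1 for two actions and a uniform policy, the soft value is 1 + \alpha \log 2: the value of the state is raised by the entropy of the policy.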

  3. Policy Objective: The policy is updated to minimize the KL-divergence between the policy distribution and the normalized exponentiated soft Q-function:

    Important

    \begin{aligned}
\pi_{\text {new }} & =\arg \min _{\pi^{\prime} \in \Pi} D_{\mathrm{KL}}\left(\pi^{\prime}\left(\cdot \mid s_t\right) \| \frac{\exp \left(Q^{\pi_{\text {old }}}\left(s_t, \cdot\right)\right)}{Z^{\pi_{\text {old }}}\left(s_t\right)}\right) \\
& =\arg \min _{\pi^{\prime} \in \Pi} D_{\mathrm{KL}}\left(\pi^{\prime}\left(\cdot \mid s_t\right) \| \exp \left(Q^{\pi_{\text {old }}}\left(s_t, \cdot\right)-\log Z^{\pi_{\text {old }}}\left(s_t\right)\right)\right)
\end{aligned}

Substituting the parameterized policy \pi_\theta and Q-function Q_w gives the policy objective,

\begin{aligned}
J_\pi(\theta) & =D_{\mathrm{KL}}\left(\pi_\theta\left(\cdot \mid s_t\right) \| \exp \left(Q_w\left(s_t, \cdot\right)-\log Z_w\left(s_t\right)\right)\right) \\
& =\mathbb{E}_{\pi_\theta}\left[-\log \left(\frac{\exp \left(Q_w\left(s_t, a_t\right)-\log Z_w\left(s_t\right)\right)}{\pi_\theta\left(a_t \mid s_t\right)}\right)\right] \\
& =\mathbb{E}_{\pi_\theta}\left[\log \pi_\theta\left(a_t \mid s_t\right)-Q_w\left(s_t, a_t\right)+\log Z_w\left(s_t\right)\right],
\end{aligned}

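where Z^{\pi_{\text {old }}} is the partition function that normalizes the distribution, and \Pi denotes a tractable family of policies (for example, Gaussian policies). Since \log Z_w\left(s_t\right) does not depend on \theta, it does not contribute to the gradient with respect to \theta and can be dropped.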
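The policy objective can be checked numerically in a setting where the minimizer is known in closed form. The sketch below assumes a discrete action set and an unrestricted policy family, in which case the KL-divergence is minimized by the softmax of the Q-values; the Q-values and function names are made up for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: exp(x) / sum(exp(x))."""
    x = np.asarray(x, dtype=float)
    z = np.exp(x - np.max(x))
    return z / z.sum()

def policy_objective(probs, q_values):
    """E_{a~pi}[log pi(a|s) - Q(s,a)], i.e. J_pi without the constant log Z term."""
    probs = np.asarray(probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    nz = probs > 0
    return float(np.sum(probs[nz] * (np.log(probs[nz]) - q_values[nz])))

q = np.array([1.0, 2.0, 0.5])             # toy Q-values for one state (assumption)
pi_star = softmax(q)                      # exp(Q) / Z
log_z = float(np.log(np.sum(np.exp(q))))  # log partition function
```

At `pi_star` the objective equals `-log_z`, so adding back the `log Z` term gives a KL-divergence of zero; any other distribution over the three actions yields a larger value.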

Automating Entropy Adjustment

Consider the following constrained optimization problem,

Important

\max _{\pi_0, \ldots, \pi_T} \mathbb{E}\left[\sum_{t=0}^T \mathcal{R}\left(s_t, a_t\right)\right] \text { s.t. } \; \mathcal{H}\left(\pi_t\right) \geq \mathcal{H}_0,\quad \forall t,

where \mathcal{H}_0 is a predefined minimum policy entropy threshold.

Since the policy at time t can only affect future objective values, the maximization can be decomposed recursively, as in dynamic programming:

\max_{\pi_0}\left(\mathbb{E}\left[\mathcal{R}\left(s_0, a_0\right)\right]+\max_{\pi_1}\left(\mathbb{E}[\ldots]+\max_{\pi_T} \mathbb{E}\left[\mathcal{R}\left(s_T, a_T\right)\right]\right)\right).

Starting from the last time step, we maximize the expected reward subject to the entropy constraint \mathcal{H}\left(\pi_T\right) \geq \mathcal{H}_0. Introducing a dual variable \alpha_T \geq 0 for this constraint gives the dual problem,

\max_{\pi_T} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)\right]=\min_{\alpha_T \geq 0} \max_{\pi_T} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)-\alpha_T \log \pi\left(a_T \mid s_T\right)\right]-\alpha_T \mathcal{H}_0.
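The intermediate steps behind this identity can be written out as follows. This is a sketch that assumes strong duality holds, so that the maximization and minimization can be exchanged, and that uses \mathcal{H}\left(\pi_T\right)=\mathbb{E}_{\pi}\left[-\log \pi\left(a_T \mid s_T\right)\right]:

```latex
\begin{aligned}
\max_{\pi_T} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)\right] \text{ s.t. } \mathcal{H}\left(\pi_T\right) \geq \mathcal{H}_0
& =\max_{\pi_T} \min_{\alpha_T \geq 0} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)\right]+\alpha_T\left(\mathcal{H}\left(\pi_T\right)-\mathcal{H}_0\right) \\
& =\min_{\alpha_T \geq 0} \max_{\pi_T} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)\right]+\alpha_T \mathcal{H}\left(\pi_T\right)-\alpha_T \mathcal{H}_0 \\
& =\min_{\alpha_T \geq 0} \max_{\pi_T} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)-\alpha_T \log \pi\left(a_T \mid s_T\right)\right]-\alpha_T \mathcal{H}_0 .
\end{aligned}
```

In the first line the inner minimization over \alpha_T drives the objective to -\infty whenever the constraint is violated, which is how the Lagrangian encodes the constraint.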

Based on the above equation, we can solve for the optimal dual variable \alpha_T^* as

Attention

\alpha_T^*=\arg \min_{\alpha_T} \mathbb{E}_{\pi_T^*}\left[-\alpha_T \log \pi_T^*\left(a_T \mid s_T\right)-\alpha_T \mathcal{H}_0\right].

Returning to the soft Q-function with the optimal policy \pi^*_T,

Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)=\mathcal{R}\left(s_{T-1}, a_{T-1}\right)+\max_{\pi_T} \mathbb{E}\left[\mathcal{R}\left(s_T, a_T\right)\right]+\alpha_T^* \mathcal{H}\left(\pi_T^*\right),

then we can get,

\begin{aligned}
& \max_{\pi_{T-1}}\left(\mathbb{E}\left[\mathcal{R}\left(s_{T-1}, a_{T-1}\right)\right]+\max_{\pi_T} \mathbb{E}\left[\mathcal{R}\left(s_T, a_T\right)\right]\right) \\
& =\max_{\pi_{T-1}}\left(Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)-\alpha_T^* \mathcal{H}\left(\pi_T^*\right)\right) \\
& =\min_{\alpha_{T-1}} \max_{\pi_{T-1}}\left(\mathbb{E}\left[Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)\right]-\alpha_T^* \mathcal{H}\left(\pi_T^*\right)+\alpha_{T-1}\left(\mathcal{H}\left(\pi_{T-1}\right)-\mathcal{H}_0\right)\right) \\
& =\min_{\alpha_{T-1}} \max_{\pi_{T-1}}\left(\mathbb{E}\left[Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)\right]-\mathbb{E}\left[\alpha_{T-1} \log\pi_{T-1}\left(a_{T-1} \mid s_{T-1}\right)\right]-\alpha_{T-1} \mathcal{H}_0\right)-\alpha_T^* \mathcal{H}\left(\pi_T^*\right)
\end{aligned}

Similarly, we can solve for the optimal dual variable \alpha_{T-1}^* as

Attention

\alpha_{T-1}^*=\arg \min _{\alpha_{T-1}} \mathbb{E}_{\pi_{T-1}^*}\left[-\alpha_{T-1}\log\pi_{T-1}^*\left(a_{T-1} \mid s_{T-1}\right)-\alpha_{T-1} \mathcal{H}_0\right]

By repeating this process, we can learn the optimal temperature parameter in every step by minimizing the same objective function,

Important

J(\alpha)=\mathbb{E}_{\pi^*_t}\left[-\alpha_t \log \pi^*_t\left(a_t \mid s_t\right)-\alpha_t \mathcal{H}_0\right]
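A gradient-descent step on this objective can be sketched as follows. The sketch makes two assumptions that are not stated above: the expectation is replaced by a Monte Carlo average over sampled log-probabilities, and the temperature is parameterized as \alpha = \exp(\texttt{log\_alpha}) so that it stays positive. Function and argument names are hypothetical.

```python
import numpy as np

def alpha_loss(log_alpha, log_probs, target_entropy):
    """Monte Carlo estimate of J(alpha) = E[-alpha * log pi(a|s) - alpha * H_0]."""
    alpha = np.exp(log_alpha)
    log_probs = np.asarray(log_probs, dtype=float)
    return float(np.mean(-alpha * log_probs - alpha * target_entropy))

def alpha_step(log_alpha, log_probs, target_entropy, lr=1e-2):
    """One gradient-descent step on J(alpha), taken with respect to log(alpha).

    dJ/dlog_alpha = alpha * dJ/dalpha = alpha * mean(-log_probs - target_entropy)
    """
    alpha = np.exp(log_alpha)
    log_probs = np.asarray(log_probs, dtype=float)
    grad = alpha * float(np.mean(-log_probs - target_entropy))
    return log_alpha - lr * grad
```

Because the mean of `-log_probs` estimates the policy entropy, the step lowers the temperature when the entropy exceeds \mathcal{H}_0 and raises it when the entropy falls below \mathcal{H}_0.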

Algorithmic Flow

\begin{algorithm}[H]
    \caption{Soft Actor-Critic}
    \label{alg1}
\begin{algorithmic}[1]
    \STATE Input: initial policy parameters $\theta$, Q-function parameters $w_1$, $w_2$, empty replay buffer $\mathcal{D}$
    \STATE Set target parameters equal to main parameters $\bar{w}_1 \leftarrow w_1$, $\bar{w}_2 \leftarrow w_2$
    \REPEAT
        \STATE Observe state $s$ and select action $a \sim \pi_{\theta}(\cdot|s)$
        \STATE Execute $a$ in the environment
        \STATE Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal
        \STATE Store $(s,a,r,s',d)$ in replay buffer $\mathcal{D}$
        \STATE If $s'$ is terminal, reset environment state.
        \IF{it's time to update}
            \FOR{$j$ in range(however many updates)}
                \STATE Randomly sample a batch of transitions, $B = \{ (s,a,r,s',d) \}$ from $\mathcal{D}$
                \STATE Compute targets for the Q functions:
                \begin{align*}
                    y (r,s',d) &= r + \gamma (1-d) \left(\min_{i=1,2} Q_{\bar{w}_i} (s', \tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}'|s')\right), && \tilde{a}' \sim \pi_{\theta}(\cdot|s')
                \end{align*}
                \STATE Update Q-functions by one step of gradient descent using
                \begin{align*}
                    & \nabla_{w_i} \frac{1}{|B|}\sum_{(s,a,r,s',d) \in B} \left( Q_{w_i}(s,a) - y(r,s',d) \right)^2 && \text{for } i=1,2
                \end{align*}
                \STATE Update policy by one step of gradient descent using
                \begin{equation*}
                    \nabla_{\theta} \frac{1}{|B|}\sum_{s \in B} \Big(\alpha \log \pi_{\theta} \left(\left. \tilde{a}_{\theta}(s) \right| s\right)-\min_{i=1,2} Q_{w_i}(s, \tilde{a}_{\theta}(s)) \Big),
                \end{equation*}
                where $\tilde{a}_{\theta}(s)$ is a sample from $\pi_{\theta}(\cdot|s)$ that is differentiable with respect to $\theta$ via the reparameterization trick.
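                \STATE Update the temperature $\alpha$ (the entropy regularization coefficient) by one step of gradient descent using
                \begin{equation*}
                    \nabla_{\alpha} \frac{1}{|B|}\sum_{s \in B} \Big(-\alpha \log \pi_{\theta} \left(\left. \tilde{a}_{\theta}(s) \right| s\right)-\alpha \mathcal{H}_0 \Big)
                \end{equation*}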
                \STATE Soft update target networks with
                \begin{align*}
                    \bar{w}_i &\leftarrow \rho \bar{w}_i + (1-\rho) w_i && \text{for } i=1,2
                \end{align*}
            \ENDFOR
        \ENDIF
    \UNTIL{convergence}
\end{algorithmic}
\end{algorithm}
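Two steps of the algorithm, the Q-function target and the soft target update, can be sketched in NumPy as follows. This is a minimal illustration rather than a full implementation: the networks are replaced by precomputed arrays of next-state Q-values and log-probabilities, parameters are plain arrays, and the function names and default values of `gamma` and `rho` are assumptions.

```python
import numpy as np

def q_target(r, d, q1_next, q2_next, logp_next, alpha, gamma=0.99):
    """y = r + gamma * (1 - d) * (min(Q1', Q2') - alpha * log pi(a'|s')).

    q1_next, q2_next : target-network Q-values at (s', a'), with a' ~ pi(.|s')
    logp_next        : log pi(a'|s') for the same sampled actions
    """
    r = np.asarray(r, dtype=float)
    d = np.asarray(d, dtype=float)
    min_q = np.minimum(q1_next, q2_next)  # clipped double-Q estimate
    return r + gamma * (1.0 - d) * (min_q - alpha * np.asarray(logp_next, dtype=float))

def polyak_update(target_params, online_params, rho=0.995):
    """Return rho * target + (1 - rho) * online for each parameter array."""
    return [rho * t + (1.0 - rho) * w for t, w in zip(target_params, online_params)]
```

The `(1 - d)` factor removes the bootstrap term for terminal transitions, and a value of `rho` close to 1 makes the target parameters track the online parameters slowly.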

References