Soft Actor-Critic (SAC) Algorithm =================================== Introduction ------------ Soft Actor-Critic (SAC) is an advanced reinforcement learning algorithm that integrates the principle of maximum entropy into continuous control tasks. It is designed to balance the **exploitation of high-reward actions** with the **exploration of diverse strategies** by maximizing both the cumulative reward and the entropy of the policy. Key features of SAC include: - **Maximum entropy framework**: Encourages diverse action selection. - **Stochastic policies**: Enables better exploration compared to deterministic approaches. - **Off-policy learning**: Utilizes a replay buffer for data efficiency. - **Twin soft Q-network**: Uses two separate Q-networks to mitigate overestimation bias. - **Automatic temperature adjustment**: Adaptively tunes the temperature parameter. - **Continuous action space via reparameterization**: Employs the reparameterization trick to optimize stochastic policies in continuous action spaces. Theoretical Derivation ----------------------- For ease of proof, consider a finite-horizon undiscounted return MDP, ignoring :math:`\gamma`, the objective function of SAC is as follows, .. important:: .. math:: J(\theta)=\sum_{t=0}^T \mathbb{E}_{\pi_\theta}\left[\mathcal{R}\left(\mathbf{s}_t, \mathbf{a}_t\right)+\alpha \mathcal{H}\left(\pi_\theta\left(\cdot \mid \mathbf{s}_t\right)\right)\right], where :math:`\alpha` controls how important the entropy term is, known as temperature parameter. The inclusion of the entropy term allows SAC to balance exploitation and exploration effectively. Soft Q-Function and Soft Value Function ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The soft Q-function and soft value function are defined as follows: 1. **Soft Q-Function**: .. math:: Q\left(s_t, a_t\right)=\mathcal{R}\left(s_t, a_t\right)+\gamma \mathbb{E}_{\pi_\theta}\left[V\left(s_{t+1}\right)\right], where the soft value function is given by: .. math:: V\left(s_t\right)=\mathbb{E}_{\pi_\theta}\left[Q\left(s_t, a_t\right)-\alpha \log \pi_\theta\left(a_t \mid s_t\right)\right]. Thus, .. math:: Q\left(s_t, a_t\right)=\mathcal{R}\left(s_t, a_t\right)+\gamma \mathbb{E}_{\pi_\theta}\left[Q\left(s_{t+1}, a_{t+1}\right)-\alpha \log \pi_\theta\left(a_{t+1} \mid s_{t+1}\right)\right] 2. **Soft Q-function Update**: The soft Q-function parameters can be trained to minimize the soft Bellman residual, .. important:: .. math:: J_Q(w)=\mathbb{E}_{\mathcal{D}}\left[\frac{1}{2}\left(Q_w\left(s_t, a_t\right)-\left(\mathcal{R}\left(s_t, a_t\right)+\gamma \mathbb{E}_{\pi_\theta}\left[V_{\bar{\psi}}\left(s_{t+1}\right)\right]\right)\right)^2\right], with gradient, .. math:: \begin{aligned} \nabla _wJ_Q(w)=&\nabla _wQ_w\left( s_t,a_t \right)\\ &\left( Q_w\left( s_t,a_t \right) -\left( \mathcal{R} \left( s_t,a_t \right) +\gamma \left( Q_{\bar{w}}\left( s_{t+1},a_{t+1} \right) -\alpha \log \left( \pi _{\theta}\left( a_{t+1}\mid s_{t+1} \right) \right) \right) \right) \right) , \end{aligned} where :math:`\bar{\psi}` and :math:`\bar{w}` are the target state-value function and action-value function which are the exponential moving average. 3. **Policy Objective**: The policy is updated to minimize the KL-divergence between the policy distribution and the exponentiated soft Q-function: .. important:: .. math:: \begin{aligned} \pi_{\text {new }} & =\arg \min _{\pi^{\prime} \in \Pi} D_{\mathrm{KL}}\left(\pi^{\prime}\left(. \mid s_t\right) \| \frac{\exp \left(Q^{\pi_{\text {old }}}\left(s_t, .\right)\right)}{Z^{\pi_{\text {old }}}\left(s_t\right)}\right) \\ & =\arg \min _{\pi^{\prime} \in \Pi} D_{\mathrm{KL}}\left(\pi^{\prime}\left(. \mid s_t\right) \| \exp \left(Q^{\pi_{\text {old }}}\left(s_t, .\right)-\log Z^{\pi_{\text {old }}}\left(s_t\right)\right)\right) \end{aligned} with gradient, .. math:: \begin{aligned} J_\pi(\theta) & =\nabla_\theta D_{\mathrm{KL}}\left(\pi_\theta\left(\cdot \mid s_t\right) \| \exp \left(Q_w\left(s_t, .\right)-\log Z_w\left(s_t\right)\right)\right) \\ & =\mathbb{E}_{\pi_\theta}\left[-\log \left(\frac{\exp \left(Q_w\left(s_t, a_t\right)-\log Z_w\left(s_t\right)\right)}{\pi_\theta\left(a_t \mid s_t\right)}\right)\right] \\ & =\mathbb{E}_{\pi_\theta}\left[\log \pi_\theta\left(a_t \mid s_t\right)-Q_w\left(s_t, a_t\right)+\log Z_w\left(s_t\right)\right], \end{aligned} where :math:`Z^{\pi_{\text {old }}}` is the partition function to normalize the distribution and :math:`\Pi` denotes a set of policies that can be readily tractable. Automating Entropy Adjustment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Consider the following constrained optimization problem, .. important:: .. math:: \max _{\pi_0, \ldots, \pi_T} \mathbb{E}\left[\sum_{t=0}^T \mathcal{R}\left(s_t, a_t\right)\right] \text { s.t. } \; \mathcal{H}\left(\pi_t\right) \geq \mathcal{H}_0,\quad \forall t, where :math:`\mathcal{H}_0` is a predefined minimum policy entropy threshold. Since the policy at time :math:`t` can only affect the future objective value, i.e., .. math:: \max_{\pi_0}\left(\mathbb{E}\left[\mathcal{R}\left(s_0, a_0\right)\right]+\max_{\pi_1}\left(\mathbb{E}[\ldots]+\max_{\pi_T} \mathbb{E}\left[\mathcal{R}\left(s_T, a_T\right)\right]\right)\right). Starting from the last time step, we want to maximize rewards and encourage exploration, but at the same time we want to get close to the target entropy, .. math:: \max_{\pi_T} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)\right]=\min_{\alpha_T \geq 0} \max_{\pi_T} \mathbb{E}_{\pi}\left[\mathcal{R}\left(s_T, a_T\right)-\alpha_T \log \pi\left(a_T \mid s_T\right)\right]-\alpha_T \mathcal{H}_0. Based on the above equation, we can solve for the optimal dual variable :math:`\alpha_T^*` as .. attention:: .. math:: \alpha_T^*=\arg \min_{\alpha_T} \mathbb{E}_{\pi_t^*}\left[-\alpha_T \log \pi_T^*\left(a_T \mid s_T\right)-\alpha_T \mathcal{H}_0\right]. Go back to the soft Q value function with optimal :math:`\pi^*_T`, .. math:: Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)=\mathcal{R}\left(s_{T-1}, a_{T-1}\right)+\max_{\pi_T} \mathbb{E}\left[\mathcal{R}\left(s_T, a_T\right)\right]+\alpha_T \mathcal{H}\left(\pi_T^*\right), then we can get, .. math:: \begin{aligned} & \max_{\pi_{T-1}}\left(\mathbb{E}\left[\mathcal{R}\left(s_{T-1}, a_{T-1}\right)\right]+\max_{\pi_T} \mathbb{E}\left[\mathcal{R}\left(s_T, a_T\right]\right)\right) \\ & =\max_{\pi_{T-1}}\left(Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)-\alpha_T^* \mathcal{H}\left(\pi_T^*\right)\right) \\ & =\min_{\alpha_{T-1}} \max_{\pi_{T-1}}\left(\mathbb{E}\left[Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)\right]-\alpha_T^* \mathcal{H}\left(\pi_T^*\right)+\alpha_{T-1}\left(\mathcal{H}\left(\pi_{T-1}\right)-\mathcal{H}_0\right)\right) \\ & =\min_{\alpha_{T-1}} \max_{\pi_{T-1}}\left(\mathbb{E}\left[Q_{T-1}^*\left(s_{T-1}, a_{T-1}\right)\right]-\mathbb{E}\left[\alpha_{T-1} \log\pi_{T-1}^*\left(a_{T-1} \mid s_{T-1}\right)\right]-\alpha_{T-1} \mathcal{H}_0\right)-\alpha_T^* \mathcal{H}\left(\pi_T^*\right) \end{aligned} Similarly, we can solve for the optimal dual variable :math:`\alpha_{T-1}^*` as .. attention:: .. math:: \alpha_{T-1}^*=\arg \min _{\alpha_{T-1}} \mathbb{E}_{\pi_{t-1}^*}\left[-\alpha_{T-1}\log\pi_{T-1}^*\left(a_{T-1} \mid s_{T-1}\right)-\alpha_{T-1} \mathcal{H}_0\right] By repeating this process, we can learn the optimal temperature parameter in every step by minimizing the same objective function, .. important:: .. math:: J(\alpha)=\mathbb{E}_{\pi^*_t}\left[-\alpha_t \log \pi^*_t\left(a_t \mid s_t\right)-\alpha_t \mathcal{H}_0\right] Algorithmic flow ------------------ .. math:: :nowrap: \begin{algorithm}[H] \caption{Soft Actor-Critic} \label{alg1} \begin{algorithmic}[1] \STATE Input: initial policy parameters $\theta$, Q-function parameters $w_1$, $w_2$, empty replay buffer $\mathcal{D}$ \STATE Set target parameters equal to main parameters $\bar{w}_1 \leftarrow w_1$, $\bar{w}_2 \leftarrow w_2$ \REPEAT \STATE Observe state $s$ and select action $a \sim \pi_{\theta}(\cdot|s)$ \STATE Execute $a$ in the environment \STATE Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal \STATE Store $(s,a,r,s',d)$ in replay buffer $\mathcal{D}$ \STATE If $s'$ is terminal, reset environment state. \IF{it's time to update} \FOR{$j$ in range(however many updates)} \STATE Randomly sample a batch of transitions, $B = \{ (s,a,r,s',d) \}$ from $\mathcal{D}$ \STATE Compute targets for the Q functions: \begin{align*} y (r,s',d) &= r + \gamma (1-d) \left(\min_{i=1,2} Q_{\bar{w}_i} (s', \tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}'|s')\right), && \tilde{a}' \sim \pi_{\theta}(\cdot|s') \end{align*} \STATE Update Q-functions by one step of gradient descent using \begin{align*} & \nabla_{w_i} \frac{1}{|B|}\sum_{(s,a,r,s',d) \in B} \left( Q_{w_i}(s,a) - y(r,s',d) \right)^2 && \text{for } i=1,2 \end{align*} \STATE Update policy by one step of gradient descent using \begin{equation*} \nabla_{\theta} \frac{1}{|B|}\sum_{s \in B} \Big(\alpha \log \pi_{\theta} \left(\left. \tilde{a}_{\theta}(s) \right| s\right)-\min_{i=1,2} Q_{w_i}(s, \tilde{a}_{\theta}(s)) \Big), \end{equation*} where $\tilde{a}_{\theta}(s)$ is a sample from $\pi_{\theta}(\cdot|s)$ which is differentiable wrt $\theta$ via the reparametrization trick. \STATE Update the coefficients of the entropy regular term $\alpha$ \STATE Soft update target networks with \begin{align*} \bar{w}_i &\leftarrow \rho w_i + (1-\rho) w_i && \text{for } i=1,2 \end{align*} \ENDFOR \ENDIF \UNTIL{convergence} \end{algorithmic} \end{algorithm} References ----------- - `Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor `_ - `Soft Actor-Critic Algorithms and Applications `_ - https://docs.cleanrl.dev/rl-algorithms/sac/ - https://hrl.boyuai.com/chapter/2/sac%E7%AE%97%E6%B3%95 - https://spinningup.openai.com/en/latest/algorithms/sac.html