Soft Actor-Critic (SAC) Algorithm
Introduction
Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm for continuous control, built on the maximum entropy principle. It balances exploiting high-reward actions against exploring diverse strategies by maximizing both the expected cumulative reward and the entropy of the policy.
Key features of SAC include:
Maximum entropy framework: Encourages diverse action selection.
Stochastic policies: Enables better exploration compared to deterministic approaches.
Off-policy learning: Utilizes a replay buffer for data efficiency.
Twin soft Q-networks: Uses two separate Q-networks and takes the smaller of their two estimates to mitigate overestimation bias.
Automatic temperature adjustment: Adaptively tunes the temperature parameter.
Continuous action space via reparameterization: Employs the reparameterization trick to optimize stochastic policies in continuous action spaces.
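The reparameterization trick listed above can be illustrated with a minimal sketch in plain Python. It assumes a one-dimensional Gaussian policy whose samples are squashed by tanh, a common choice in SAC implementations but an assumption here; the function name `sample_squashed_gaussian` and the small constant `1e-6` are hypothetical.

```python
import math
import random

def sample_squashed_gaussian(mean, log_std, rng):
    """Draw a = tanh(mean + std * eps) with eps ~ N(0, 1); return (a, log_prob)."""
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)   # noise does not depend on the policy parameters
    u = mean + std * eps        # pre-squash sample, a smooth function of mean and std
    a = math.tanh(u)            # assumption: actions are bounded to (-1, 1)
    # Log-density of u under N(mean, std^2), using (u - mean) / std == eps.
    log_prob_u = -0.5 * eps * eps - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables for tanh: subtract log(1 - tanh(u)^2).
    log_prob = log_prob_u - math.log(1.0 - a * a + 1e-6)
    return a, log_prob

rng = random.Random(0)
action, log_prob = sample_squashed_gaussian(mean=0.3, log_std=-0.5, rng=rng)
```

Because the randomness enters only through `eps`, the action is a deterministic function of `mean` and `log_std`, which is what lets an automatic differentiation library propagate gradients from the Q-function back to the policy parameters.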
Theoretical Derivation
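For ease of proof, consider a finite-horizon MDP with undiscounted return (that is, ignoring the discount factor $\gamma$). The objective function of SAC is

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right],$$

where $\mathcal{H}(\pi(\cdot \mid s_t)) = \mathbb{E}_{a_t \sim \pi}\left[-\log \pi(a_t \mid s_t)\right]$ is the policy entropy and $\alpha$, known as the temperature parameter, controls how important the entropy term is. The inclusion of the entropy term allows SAC to balance exploitation and exploration effectively.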
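To make the role of the temperature concrete, the sketch below evaluates an entropy-augmented return along a single toy trajectory. It assumes a discrete action set so that the entropy has a closed form, and it uses one sampled trajectory in place of the expectation; the helper names and the numbers are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def max_entropy_return(rewards, policies, alpha):
    """Sum over time steps of reward plus alpha times the policy entropy."""
    return sum(r + alpha * entropy(p) for r, p in zip(rewards, policies))

rewards = [1.0, 0.0, 2.0]
policies = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]  # action distribution at each step
plain = max_entropy_return(rewards, policies, alpha=0.0)
soft = max_entropy_return(rewards, policies, alpha=0.2)
```

With a temperature of zero the value reduces to the ordinary return. A positive temperature adds a bonus at every step where the policy is still stochastic, and none at the last step, where the policy is deterministic.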
Soft Q-Function and Soft Value Function
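The soft Q-function and soft value function are defined as follows.
Soft Q-function:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \left[ V(s_{t+1}) \right],$$

where the soft value function is given by:

$$V(s_t) = \mathbb{E}_{a_t \sim \pi} \left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right].$$

Thus,

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p,\, a_{t+1} \sim \pi} \left[ Q(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1}) \right].$$

Here $\gamma$ is the discount factor; setting $\gamma = 1$ recovers the undiscounted setting.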
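The soft value function can be checked numerically on a small example. The sketch below assumes a discrete action set (the document targets continuous actions, where the expectation is estimated from samples instead) and hypothetical helper names. It also verifies a standard identity: for the Boltzmann policy proportional to exp(Q / alpha), the soft value equals alpha times the log-sum-exp of Q / alpha.

```python
import math

def soft_value(q_values, probs, alpha):
    """V(s) = sum_a pi(a|s) * (Q(s, a) - alpha * log pi(a|s))."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, probs) if p > 0.0)

def boltzmann_policy(q_values, alpha):
    """pi(a|s) proportional to exp(Q(s, a) / alpha), computed stably."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

def soft_max_value(q_values, alpha):
    """alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

q = [1.0, 2.0, 0.5]
alpha = 0.5
pi_star = boltzmann_policy(q, alpha)
v_star = soft_value(q, pi_star, alpha)          # value under the Boltzmann policy
v_uniform = soft_value(q, [1.0 / 3.0] * 3, alpha)  # value under a uniform policy
```

The Boltzmann policy attains the largest soft value among all distributions over the three actions, so `v_star` is at least `v_uniform`.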
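Soft Q-function Update: The soft Q-function parameters $\theta$ can be trained to minimize the soft Bellman residual,

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \frac{1}{2} \left( Q_\theta(s_t, a_t) - \left( r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \left[ V_{\bar{\psi}}(s_{t+1}) \right] \right) \right)^2 \right],$$

with gradient,

$$\hat{\nabla}_\theta J_Q(\theta) = \nabla_\theta Q_\theta(s_t, a_t) \left( Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\psi}}(s_{t+1}) \right),$$

where $\mathcal{D}$ is the replay buffer, and $V_{\bar{\psi}}$ and $Q_{\bar{\theta}}$ are the target state-value function and target action-value function, whose parameters $\bar{\psi}$ and $\bar{\theta}$ are exponential moving averages of the online parameters. When no separate value network is used, $V_{\bar{\psi}}(s_{t+1})$ can be estimated as $Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1})$ with $a_{t+1} \sim \pi_\phi(\cdot \mid s_{t+1})$.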
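The critic update can be sketched for a single transition as follows. This is an illustration under stated assumptions rather than a reference implementation: the target value is estimated from two target Q-values (the twin-network minimum from the feature list), and a `done` flag masks the bootstrap term. The scalar inputs, function names, and default values of `gamma` and `tau` are hypothetical.

```python
def soft_td_target(reward, done, next_q1, next_q2, next_log_prob, alpha, gamma=0.99):
    """r + gamma * (min of the twin target Q-values - alpha * log pi), zeroed at episode end."""
    next_v = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * next_v

def soft_q_loss(q_pred, target):
    """Squared soft Bellman residual for one transition."""
    return 0.5 * (q_pred - target) ** 2

def polyak_update(target_params, online_params, tau=0.005):
    """Exponential moving average of the online parameters."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

target = soft_td_target(reward=1.0, done=0.0, next_q1=2.0, next_q2=1.5,
                        next_log_prob=-1.0, alpha=0.2)
loss = soft_q_loss(q_pred=2.0, target=target)
new_target_params = polyak_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

In a real implementation the target is treated as a constant when differentiating the loss, so gradients flow only through `q_pred`.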
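Policy Objective: The policy is updated to minimize the KL divergence between the policy distribution and the exponentiated soft Q-function:

$$\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}} \left( \pi'(\cdot \mid s_t) \,\middle\|\, \frac{\exp\left( \frac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot) \right)}{Z^{\pi_{\text{old}}}(s_t)} \right).$$

For a parameterized policy $\pi_\phi$ with reparameterized actions $a_t = f_\phi(\epsilon_t; s_t)$, dropping the partition function (it does not depend on $\phi$) and multiplying by $\alpha$ gives the objective

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}} \left[ \alpha \log \pi_\phi\left( f_\phi(\epsilon_t; s_t) \mid s_t \right) - Q_\theta\left( s_t, f_\phi(\epsilon_t; s_t) \right) \right],$$

with gradient,

$$\hat{\nabla}_\phi J_\pi(\phi) = \nabla_\phi \alpha \log \pi_\phi(a_t \mid s_t) + \left( \nabla_{a_t} \alpha \log \pi_\phi(a_t \mid s_t) - \nabla_{a_t} Q_\theta(s_t, a_t) \right) \nabla_\phi f_\phi(\epsilon_t; s_t),$$

where $Z^{\pi_{\text{old}}}(s_t)$ is the partition function that normalizes the distribution and $\Pi$ denotes a set of tractable policies.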
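The relationship between the KL divergence and the loss that is minimized in practice can be verified on a discrete toy example. The sketch below assumes three actions and hypothetical helper names. It checks that alpha times the KL divergence equals the expected value of alpha * log pi - Q plus a term that depends only on the partition function, so both have the same minimizer.

```python
import math

def log_partition(q_values, alpha):
    """log Z = log sum_a exp(Q(s, a) / alpha), computed stably."""
    m = max(q_values)
    return m / alpha + math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def kl_to_boltzmann(probs, q_values, alpha):
    """KL(pi || exp(Q / alpha) / Z) for a categorical policy."""
    log_z = log_partition(q_values, alpha)
    return sum(p * (math.log(p) - (q / alpha - log_z))
               for p, q in zip(probs, q_values) if p > 0.0)

def policy_loss(probs, q_values, alpha):
    """E_{a ~ pi}[alpha * log pi(a|s) - Q(s, a)]."""
    return sum(p * (alpha * math.log(p) - q)
               for p, q in zip(probs, q_values) if p > 0.0)

q = [1.0, 2.0, 0.5]
alpha = 0.5
probs = [0.2, 0.7, 0.1]
kl = kl_to_boltzmann(probs, q, alpha)
loss = policy_loss(probs, q, alpha)
```

Since the partition term is constant with respect to the policy, a gradient step on `policy_loss` is also a descent step on the KL divergence.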
Automating Entropy Adjustment
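Consider the following constrained optimization problem,

$$\max_{\pi_0, \dots, \pi_T} \mathbb{E}_{\rho_\pi} \left[ \sum_{t=0}^{T} r(s_t, a_t) \right] \quad \text{s.t.} \quad \mathcal{H}(\pi_t) \ge \mathcal{H}_0 \quad \forall t,$$

where $\mathcal{H}(\pi_t) = \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ -\log \pi_t(a_t \mid s_t) \right]$ is the policy entropy at time $t$ and $\mathcal{H}_0$ is a predefined minimum policy entropy threshold.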
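Since the policy at time $t$ can only affect future objective values, the maximization can be decomposed recursively, i.e.,

$$\max_{\pi_0} \left( \mathbb{E}\left[ r(s_0, a_0) \right] + \max_{\pi_1} \left( \mathbb{E}\left[ r(s_1, a_1) \right] + \dots + \max_{\pi_T} \mathbb{E}\left[ r(s_T, a_T) \right] \right) \right).$$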
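Starting from the last time step, we want to maximize the reward and encourage exploration while keeping the policy entropy close to the target entropy. Introducing a dual variable $\alpha_T \ge 0$ for the entropy constraint $\mathcal{H}(\pi_T) \ge \mathcal{H}_0$ gives

$$\max_{\pi_T} \mathbb{E}\left[ r(s_T, a_T) \right] = \min_{\alpha_T \ge 0} \max_{\pi_T} \mathbb{E}\left[ r(s_T, a_T) - \alpha_T \log \pi_T(a_T \mid s_T) \right] - \alpha_T \mathcal{H}_0,$$

where the maximization on the left-hand side is subject to the entropy constraint.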
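Based on the above equation, we can solve for the optimal dual variable $\alpha_T^*$ as

$$\alpha_T^* = \arg\min_{\alpha_T \ge 0} \mathbb{E}_{(s_T, a_T) \sim \rho_{\pi^*}} \left[ -\alpha_T \log \pi_T^*(a_T \mid s_T) - \alpha_T \mathcal{H}_0 \right],$$

where $\pi_T^*$ is the policy that maximizes $\mathbb{E}\left[ r(s_T, a_T) - \alpha_T \log \pi_T(a_T \mid s_T) \right]$ for the given $\alpha_T$.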
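Go back to the soft Q-function with the optimal $\pi_T^*$ and $\alpha_T^*$. Using $Q(s_T, a_T) = r(s_T, a_T)$ at the final step,

$$Q_{T-1}^*(s_{T-1}, a_{T-1}) = r(s_{T-1}, a_{T-1}) + \mathbb{E}\left[ Q(s_T, a_T) - \alpha_T^* \log \pi_T^*(a_T \mid s_T) \right] = r(s_{T-1}, a_{T-1}) + \max_{\pi_T} \mathbb{E}\left[ r(s_T, a_T) \right] + \alpha_T^* \mathcal{H}(\pi_T^*),$$

then we can get,

$$\max_{\pi_{T-1}} \left( \mathbb{E}\left[ r(s_{T-1}, a_{T-1}) \right] + \max_{\pi_T} \mathbb{E}\left[ r(s_T, a_T) \right] \right) = \max_{\pi_{T-1}} \left( \mathbb{E}\left[ Q_{T-1}^*(s_{T-1}, a_{T-1}) \right] - \alpha_T^* \mathcal{H}(\pi_T^*) \right)$$

$$= \min_{\alpha_{T-1} \ge 0} \max_{\pi_{T-1}} \left( \mathbb{E}\left[ Q_{T-1}^*(s_{T-1}, a_{T-1}) \right] + \alpha_{T-1} \mathcal{H}(\pi_{T-1}) - \alpha_{T-1} \mathcal{H}_0 \right) - \alpha_T^* \mathcal{H}(\pi_T^*),$$

where the maximization over $\pi_{T-1}$ in the first line is subject to $\mathcal{H}(\pi_{T-1}) \ge \mathcal{H}_0$.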
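Similarly, we can solve for the optimal dual variable $\alpha_{T-1}^*$ as

$$\alpha_{T-1}^* = \arg\min_{\alpha_{T-1} \ge 0} \mathbb{E}_{(s_{T-1}, a_{T-1}) \sim \rho_{\pi^*}} \left[ -\alpha_{T-1} \log \pi_{T-1}^*(a_{T-1} \mid s_{T-1}) - \alpha_{T-1} \mathcal{H}_0 \right].$$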
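By repeating this process, we can learn the optimal temperature parameter at every step by minimizing the same objective function,

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t} \left[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \mathcal{H}_0 \right].$$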
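A gradient step on the temperature objective can be sketched as follows. The derivative of the objective with respect to the temperature is the estimated policy entropy minus the target entropy, so the temperature grows when the policy is less random than the target and shrinks otherwise. Optimizing the logarithm of the temperature to keep it positive is an assumption of this sketch (a common implementation choice), and the function names and learning rate are hypothetical.

```python
import math

def temperature_loss(alpha, log_probs, target_entropy):
    """Sample estimate of J(alpha) = E[-alpha * log pi - alpha * H0]."""
    return sum(-alpha * lp - alpha * target_entropy for lp in log_probs) / len(log_probs)

def temperature_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient descent step on J with respect to log(alpha)."""
    alpha = math.exp(log_alpha)
    entropy_estimate = -sum(log_probs) / len(log_probs)
    grad_alpha = entropy_estimate - target_entropy  # dJ / d alpha
    grad_log_alpha = grad_alpha * alpha             # chain rule through alpha = exp(log_alpha)
    return log_alpha - lr * grad_log_alpha

log_alpha = 0.0
# Entropy estimate 0.3 is below the target 1.0, so alpha should increase.
raised = temperature_step(log_alpha, [-0.2, -0.4], target_entropy=1.0)
# Entropy estimate 2.5 is above the target 1.0, so alpha should decrease.
lowered = temperature_step(log_alpha, [-2.0, -3.0], target_entropy=1.0)
```

In practice the log-probabilities come from actions sampled by the current policy on a minibatch of replayed states, and the policy is held fixed while the temperature is updated.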