Introduction
Classic deep RL algorithms like the Deep Q-Network (DQN) and Vanilla Policy Gradient (VPG) were impractical for many of the problems we hope to apply deep RL to. Value-based methods like DQN have low variance, but cannot handle continuous action spaces. Policy-based methods like VPG, on the other hand, can handle continuous action spaces, but suffer from high variance. These issues were largely addressed when the actor-critic method was introduced, which marked a major advancement in deep RL.
I will first go over some famous deep RL algorithms that use the actor-critic method, starting with the classic Q Actor-Critic and Advantage Actor-Critic (A2C), then moving on to more advanced methods like Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), and Soft Actor-Critic (SAC).
Actor-Critic Approach
Actor-Critic is a deep RL method where an “actor” controls the agent and a “critic” evaluates its performance.
- It’s like playing a game with a coach who checks your performance
- It’s important to note that the coach (critic) also learns along the way
Setup
First of all, we prepare two networks:
- The Actor - the “agent” that tries to play the game
    - Policy network (maps the input state to action probabilities)
    - Updated based on the Q-value estimate from the critic
- The Critic - the “coach” that evaluates how well the actor is performing
    - Q-value network (maps the input state and action to a Q-value estimate)
    - Updated using the state, the action chosen by the actor, and its corresponding reward
    - Updated with SGD, with the gradient step scaled by the learning rate and the TD error
> 🖊️ Different learning rates are used for each network (see the sketch below).
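As a rough illustration of this setup, here is a minimal PyTorch sketch of the two networks and their optimizers. The dimensions, architectures, and learning rates are placeholder assumptions for a small discrete-action problem, not values from the text.

```python
import torch
import torch.nn as nn

# Minimal sketch of the actor-critic setup (PyTorch assumed).
# Dimensions, hidden sizes, and learning rates are illustrative placeholders.
state_dim, action_dim, hidden_dim = 4, 2, 64

# Actor: policy network mapping a state to action probabilities
actor = nn.Sequential(
    nn.Linear(state_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, action_dim), nn.Softmax(dim=-1),
)

# Critic: Q-value network mapping a (state, action) pair to a Q-value estimate
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, 1),
)

# Separate SGD optimizers, each with its own learning rate
actor_opt = torch.optim.SGD(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.SGD(critic.parameters(), lr=1e-3)
```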
Algorithm (Concept)
The following is the conceptual algorithm of the actor-critic method (a single training step is sketched in code after the list):
- Initialize the parameters of both the actor and critic networks
- Act on the current state using the actor policy network
- Feed the state and action to the critic network to estimate the Q-value
- Get the corresponding reward
- Update the actor policy network using the estimated Q-value
- Act on the next time step with the updated actor network
- Update the critic Q-value network using the reward and the state and action of the next time step
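Below is a minimal sketch of one step of this loop, continuing from the setup sketch above (it reuses the assumed `actor`, `critic`, `actor_opt`, `critic_opt`, and `action_dim`). The `reward_fn` and `next_state_fn` helpers are hypothetical stand-ins for the environment, and the discrete-action, one-hot encoding is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def one_hot(action, num_actions):
    """Encode a discrete action as a one-hot vector so it can be concatenated with the state."""
    return F.one_hot(torch.tensor(action), num_actions).float()

def actor_critic_step(state, next_state_fn, reward_fn, gamma=0.99):
    # 1. Act on the current state using the actor policy network
    probs = actor(state)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()

    # 2. Feed the state and action to the critic to estimate Q(s, a)
    q_sa = critic(torch.cat([state, one_hot(action.item(), action_dim)]))

    # 3. Get the corresponding reward and next state
    #    (next_state_fn / reward_fn are hypothetical stand-ins for the environment)
    reward = reward_fn(state, action)
    next_state = next_state_fn(state, action)

    # 4. Update the actor using the critic's Q-value estimate
    #    (policy-gradient step: -log pi(a|s) * Q(s, a))
    actor_loss = -dist.log_prob(action) * q_sa.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # 5. Act on the next time step with the updated actor network
    next_probs = actor(next_state)
    next_action = torch.distributions.Categorical(next_probs).sample()
    q_next = critic(torch.cat([next_state, one_hot(next_action.item(), action_dim)]))

    # 6. Update the critic from the TD error: r + gamma * Q(s', a') - Q(s, a)
    #    (the MSE gradient is the TD error times the gradient of Q(s, a))
    td_target = reward + gamma * q_next.detach()
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    return next_state, next_action
```

For example, starting from a random state, `actor_critic_step(torch.randn(state_dim), ...)` would run one interaction-and-update cycle; in practice this step is looped over many time steps and episodes.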