Introduction
Classic deep RL algorithms like the Deep Q-Network (DQN) and Vanilla Policy Gradient (VPG) were impractical for many of the problems we hope to apply deep RL to. Value-based methods like DQN have low variance, but cannot handle continuous action spaces. Policy-based methods like VPG, on the other hand, can handle continuous action spaces, but suffer from high variance. These issues were largely addressed when the actor-critic method was introduced, which marked a major advancement in deep RL.
I will first go over some famous deep RL algorithms that use the actor-critic method, starting with the classic Q Actor-Critic and Advantage Actor-Critic (A2C), then moving on to more advanced methods like Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), and Soft Actor-Critic (SAC).
Actor-Critic Approach
Actor-Critic is a deep RL method where an “actor” controls the agent and a “critic” evaluates its performance.
- It’s like playing a game with a coach who checks your performance
- It’s important to note that the coach (critic) also learns along the way
Setup
First of all, we prepare two networks:
- The Actor - the “agent” that tries to play the game
    - Policy network (maps the input state to action probabilities)
    - Updated based on the Q-value estimate from the critic
- The Critic - the “coach” that evaluates how well the actor is performing
    - Q-value network (maps the input state and action to a Q-value estimate)
    - Updated using the state, the action chosen by the actor, and its corresponding reward
    - Updated with SGD, with the gradient step scaled by the learning rate and the TD error
> 🖊️ Different learning rates are used for each network (see the sketch below).
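As a rough illustration of this setup, here is a minimal PyTorch sketch of the two networks and their optimizers. The dimensions, architectures, and learning rates are placeholder assumptions for a small discrete-action problem, not values from the text.

```python
import torch
import torch.nn as nn

# Minimal sketch of the actor-critic setup (PyTorch assumed).
# Dimensions, hidden sizes, and learning rates are illustrative placeholders.
state_dim, action_dim, hidden_dim = 4, 2, 64

# Actor: policy network mapping a state to action probabilities
actor = nn.Sequential(
    nn.Linear(state_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, action_dim), nn.Softmax(dim=-1),
)

# Critic: Q-value network mapping a (state, action) pair to a Q-value estimate
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, 1),
)

# Separate SGD optimizers, each with its own learning rate
actor_opt = torch.optim.SGD(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.SGD(critic.parameters(), lr=1e-3)
```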
Algorithm (Concept)
The following is the conceptual algorithm of the actor-critic method (a single training step is sketched in code after the list):
- Initialize the parameters of both the actor and critic networks
- Act on the current state using the actor policy network
- Feed the state and action to the critic network to estimate the Q-value
- Get the corresponding reward
- Update the actor policy network using the estimated Q-value
- Act on the next time step with the updated actor network
- Update the critic Q-value network using the reward and the state and action of the next time step
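Below is a minimal sketch of one step of this loop, continuing from the setup sketch above (it reuses the assumed `actor`, `critic`, `actor_opt`, `critic_opt`, and `action_dim`). The `reward_fn` and `next_state_fn` helpers are hypothetical stand-ins for the environment, and the discrete-action, one-hot encoding is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def one_hot(action, num_actions):
    """Encode a discrete action as a one-hot vector so it can be concatenated with the state."""
    return F.one_hot(torch.tensor(action), num_actions).float()

def actor_critic_step(state, next_state_fn, reward_fn, gamma=0.99):
    # 1. Act on the current state using the actor policy network
    probs = actor(state)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()

    # 2. Feed the state and action to the critic to estimate Q(s, a)
    q_sa = critic(torch.cat([state, one_hot(action.item(), action_dim)]))

    # 3. Get the corresponding reward and next state
    #    (next_state_fn / reward_fn are hypothetical stand-ins for the environment)
    reward = reward_fn(state, action)
    next_state = next_state_fn(state, action)

    # 4. Update the actor using the critic's Q-value estimate
    #    (policy-gradient step: -log pi(a|s) * Q(s, a))
    actor_loss = -dist.log_prob(action) * q_sa.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # 5. Act on the next time step with the updated actor network
    next_probs = actor(next_state)
    next_action = torch.distributions.Categorical(next_probs).sample()
    q_next = critic(torch.cat([next_state, one_hot(next_action.item(), action_dim)]))

    # 6. Update the critic from the TD error: r + gamma * Q(s', a') - Q(s, a)
    #    (the MSE gradient is the TD error times the gradient of Q(s, a))
    td_target = reward + gamma * q_next.detach()
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    return next_state, next_action
```

For example, starting from a random state, `actor_critic_step(torch.randn(state_dim), ...)` would run one interaction-and-update cycle; in practice this step is looped over many time steps and episodes.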