Introduction

Classic deep RL algorithms like Deep Q-Network (DQN) and Vanilla Policy Gradient (VPG) fall short on many of the problems we would like to apply deep RL to. Value-based methods like DQN have low variance, but cannot handle continuous action spaces. Policy-based methods like VPG, on the other hand, handle continuous action spaces naturally, but suffer from high variance. The introduction of the actor-critic method was a major advancement for deep RL because it addresses both of these issues.

I will go over some well-known deep RL algorithms that use the actor-critic method, starting with the classic Q Actor-Critic and Advantage Actor-Critic (A2C), then moving on to more advanced methods like Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), and Soft Actor-Critic (SAC).

Actor-Critic Approach

Actor-Critic is a deep RL method where an “actor” network selects the agent’s actions and a “critic” network evaluates how good those actions are.

Setup

First of all, we prepare two networks:

  1. An actor network that outputs the policy, i.e., which action to take in the current state
  2. A critic network that estimates the value (for example, the Q-value) of the chosen state-action pair

🖊️ Note: different learning rates are used for each network.
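Here is a minimal PyTorch sketch of this setup, assuming small MLPs and a discrete action space; the state/action dimensions and learning rates are placeholder values chosen for illustration.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2  # placeholder dimensions for illustration

# Actor: maps a state to a probability distribution over discrete actions.
actor = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, action_dim),
    nn.Softmax(dim=-1),
)

# Critic: maps a state to a Q-value estimate for each discrete action.
critic = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, action_dim),
)

# Each network gets its own optimizer with its own learning rate.
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3)
```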

Algorithm (Concept)

The following is the conceptual algorithm of the actor-critic method (a code sketch follows the list):

  1. Initialize the parameters of both the actor and the critic network
  2. Act on the current state according to the actor (policy) network
  3. Feed the state and action into the critic network to estimate the Q-value
  4. Receive the corresponding reward from the environment
  5. Update the actor (policy) network using the estimated Q-value
  6. Act on the next time step with the updated actor network
  7. Update the critic (Q-value) network using the reward and the state & action of the next time step
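Below is a rough single-step sketch of this loop in PyTorch, assuming the `actor`, `critic`, and optimizers from the Setup snippet and a classic Gym-style environment whose `step` returns `(next_state, reward, done, info)`. `gamma` is the discount factor; this is an illustration of the concept under those assumptions, not a definitive implementation.

```python
import torch

def actor_critic_step(env, state, actor, critic, actor_optim, critic_optim, gamma=0.99):
    # 2. Act on the current state according to the actor policy network.
    state_t = torch.as_tensor(state, dtype=torch.float32)
    probs = actor(state_t)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()

    # 3. Estimate the Q-value of (state, action) with the critic network.
    q_value = critic(state_t)[action]

    # 4. Step the environment to receive the reward and next state.
    next_state, reward, done, _ = env.step(action.item())

    # 5. Update the actor: increase the log-probability of actions the critic rates highly.
    actor_loss = -dist.log_prob(action) * q_value.detach()
    actor_optim.zero_grad()
    actor_loss.backward()
    actor_optim.step()

    # 6. Choose the action for the next time step with the updated actor.
    next_state_t = torch.as_tensor(next_state, dtype=torch.float32)
    with torch.no_grad():
        next_probs = actor(next_state_t)
        next_action = torch.distributions.Categorical(next_probs).sample()
        # 7. TD target built from the reward and the next state-action Q-value.
        next_q = critic(next_state_t)[next_action]
        target = reward + gamma * next_q * (1.0 - float(done))

    # 7. Update the critic toward the TD target.
    critic_loss = (q_value - target) ** 2
    critic_optim.zero_grad()
    critic_loss.backward()
    critic_optim.step()

    return next_state, done
```

In a full training loop, this function would be called repeatedly, resetting the environment whenever `done` is True.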