Value-based methods like Q-learning learn a value function as an intermediate step and then derive a policy from it. With policy-based methods, we can skip this step and optimize the policy directly.
If you know how neural networks learn, the idea behind policy gradients is straightforward. As the name suggests, we use an approach similar to stochastic gradient descent (technically gradient ascent, since we are maximizing) to search for the optimal policy.
Just as we define a loss function for SGD, we need a function that measures the performance of a policy. We call this the objective function: it gives the expected cumulative reward over trajectories (sequences of states and actions) generated by the policy.
This can be calculated quite intuitively:
$$ \boxed{J(\theta) = \sum_{\tau} P(\tau ; \theta)\, R(\tau)} $$
where $R(\tau) = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
The objective simply takes each trajectory's cumulative reward $R(\tau)$, weights it by the trajectory's probability $P(\tau;\theta)$ under the policy parameterized by $\theta$, and sums over all possible trajectories.
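In practice we can't enumerate every trajectory, so $J(\theta)$ is usually estimated by sampling trajectories from the current policy and averaging their returns. Here is a minimal sketch in Python; `sample_trajectory` is a hypothetical function that rolls out the current policy in the environment and returns the list of rewards it collected.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # R(tau) = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_objective(sample_trajectory, num_samples=1000):
    # Monte Carlo estimate of J(theta): average the return over trajectories
    # sampled from the current policy. Sampling replaces the explicit sum
    # over all trajectories weighted by P(tau; theta).
    returns = [discounted_return(sample_trajectory()) for _ in range(num_samples)]
    return np.mean(returns)
```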
Now, we want to get the gradient of the objective function. The policy gradient can be calculated with the following equation:
$$ \nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau) \right] $$
where $\pi_{\theta}(a_t \mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the current policy.
The detailed derivation can be found here:
(Optional) the Policy Gradient Theorem - Hugging Face Deep RL Course
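To make the formula concrete, here is a minimal REINFORCE-style sketch in PyTorch (the framework choice is my assumption; the formula itself is framework-agnostic). It builds the loss whose gradient matches the expression above for a single sampled trajectory; `log_probs` is assumed to hold the $\log \pi_{\theta}(a_t \mid s_t)$ values collected during the rollout, e.g. via `torch.distributions.Categorical(...).log_prob(action)`.

```python
import torch

def policy_gradient_loss(log_probs, rewards, gamma=0.99):
    # log_probs: list of log pi_theta(a_t | s_t) tensors from one episode
    # rewards:   list of scalar rewards r_{t+1} from the same episode
    # R(tau): discounted cumulative reward of the whole trajectory
    R = sum(gamma**t * r for t, r in enumerate(rewards))
    # Negate because optimizers minimize, but we want to maximize J(theta):
    # minimizing -sum_t log pi(a_t | s_t) * R(tau) is gradient ascent on J.
    return -torch.stack(log_probs).sum() * R
```

Calling `loss.backward()` then lets autograd compute the $\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau)$ terms, and an optimizer step updates $\theta$. In practice this single-trajectory estimate is averaged over a batch of episodes to reduce variance.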
The policy gradient expression looks right, but let's think about it. We update the policy parameters with this gradient, so each update is proportional to the trajectory's cumulative reward $R(\tau)$. In other words, the same cumulative reward is used to adjust the probability of every action taken along the trajectory. But does the return of the whole trajectory really tell us whether an individual action was good or not?