I originally thought RL was only for robots and agents, but I recently found out that RL can also be used for language models, which is quite interesting. This blog will cover two major papers that introduced the use of RL for language models: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO).

Reinforcement Learning from Human Feedback (RLHF)

Training language models to follow instructions with human feedback

RLHF was proposed by OpenAI as a way to “align language models to human preferences”, and it does so through the following steps:

  1. Fine-tune a language model for a specific task
  2. Output several predictions and rank them (human feedback)
  3. Use the ranked outputs to learn a reward function for the task
  4. Use the learned reward function to fine-tune the language model with RL

Formulation

From step 2, we will have a dataset of ranked outputs, which the authors call “preference data”:

$$ \mathcal{D} = \{x^{(i)}, y_w^{(i)}, y_l^{(i)}\} $$

where $x$ is the prompt and $y_w$ and $y_l$ are the winner and loser predictions respectively.
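To make this concrete, here is a toy example of what one entry of such preference data could look like; the field names and contents are made up for illustration and are not the actual dataset schema.

```python
# Toy preference data D = {(x, y_w, y_l)}; everything here is illustrative.
preference_data = [
    {
        "prompt":   "Explain photosynthesis to a five-year-old.",      # x
        "chosen":   "Plants use sunlight to cook their own food ...",  # y_w (ranked higher by the human)
        "rejected": "Photosynthesis is a biochemical process ...",     # y_l (ranked lower)
    },
    # ... more (prompt, winner, loser) triples collected from human rankings
]
```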

Now the loss function is defined as:

$$ \mathcal{L}(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}[\log\sigma(r_\phi(x, y_w) - r_\phi(x, y_l))] $$

where $r_\phi$ is the reward function and $\sigma()$ is the sigmoid function.
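As a minimal sketch (not the paper’s actual code), this loss can be written in PyTorch as below, assuming a hypothetical `reward_model` that maps a batch of prompts and responses to scalar rewards. The sigmoid $\sigma$ is explained right after.

```python
import torch.nn.functional as F

def reward_loss(reward_model, prompts, chosen, rejected):
    """Pairwise reward-model loss: -E[log sigma(r(x, y_w) - r(x, y_l))]."""
    r_w = reward_model(prompts, chosen)    # r_phi(x, y_w), shape (batch,)
    r_l = reward_model(prompts, rejected)  # r_phi(x, y_l), shape (batch,)
    # log(sigmoid(.)) computed in one numerically stable operation
    return -F.logsigmoid(r_w - r_l).mean()
```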

In case you don’t know what a sigmoid function is, it is defined as

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

and produces an S-shaped curve squashed between 0 and 1.

Simply put, the larger the input, the closer the output is to 1, and the smaller the input, the closer the output is to 0.
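A quick sanity check of that behavior in plain Python:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))  # sigma(z) = 1 / (1 + e^{-z})

print(sigmoid(-5.0))  # ~0.0067: large negative input maps close to 0
print(sigmoid(0.0))   #  0.5   : zero maps to the middle
print(sigmoid(5.0))   # ~0.9933: large positive input maps close to 1
```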

The loss function is quite intuitive:

The difference $r_\phi(x, y_w) - r_\phi(x, y_l)$ is large and positive when the reward model scores the pair correctly, i.e. when $r_\phi(x, y_w) > r_\phi(x, y_l)$. Passing that difference through the sigmoid pushes the output toward 1, and taking the logarithm pushes it toward 0, so the loss is small when the winner gets a higher reward than the loser. Conversely, if the loser is scored higher, the sigmoid output approaches 0 and the negative log becomes large, giving a large loss.
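To put some illustrative numbers on that, here is how the loss behaves for a few values of the reward margin $r_\phi(x, y_w) - r_\phi(x, y_l)$ (these margins are made up, just to show the trend):

```python
import torch
import torch.nn.functional as F

for margin in (-2.0, 0.0, 2.0):  # margin = r_phi(x, y_w) - r_phi(x, y_l)
    loss = -F.logsigmoid(torch.tensor(margin))
    print(f"margin = {margin:+.1f}  ->  loss = {loss.item():.3f}")

# margin = -2.0  ->  loss = 2.127   (loser rewarded more: large loss)
# margin = +0.0  ->  loss = 0.693   (no preference captured yet)
# margin = +2.0  ->  loss = 0.127   (winner rewarded more: small loss)
```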