Tuan Anh Le

Bird’s eye view of Large Language Models

15 May 2023

This is a rough note on the bird’s eye view of large language models (LLMs) to help me get started on understanding them, and hopefully making them safer and more helpful to us. Many details are omitted in service of having a clear and concise high level picture.

In this note, I draw on several sources, including Andrej Karpathy’s “Let’s build GPT” video, Ari Seff’s video tutorial on RLHF, and Hugging Face’s blog post on RLHF.

There are three steps for training a ChatGPT-like system:

  1. Unsupervised pre-training,
  2. Supervised fine-tuning (SFT),
  3. Reinforcement learning from human feedback (RLHF).

Unsupervised pre-training

The dataset \(D_{\text{pretrain}}\) used in this step is unlabeled text collected from the Internet.

The training objective is \begin{align} \theta_{\text{pretrain}} = \argmax_\theta \E_{x \sim D_{\text{pretrain}}}[\log p_\theta(x)], \end{align} where \(p_\theta(x)\) is our auto-regressive language model, parameterized by \(\theta\). These models usually have roughly tens to hundreds of billions of parameters. All parameters are trained from scratch.
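The pre-training objective above can be sketched concretely. Below is a minimal toy version: the “model” is a bigram table rather than a Transformer, and the parameters \(\theta\) are a small random matrix, but the quantity computed is exactly the auto-regressive log-likelihood \(\log p_\theta(x) = \sum_t \log p_\theta(x_t \mid x_{<t})\) that pre-training ascends.

```python
import numpy as np

# A minimal sketch of the pre-training objective. The "model" here is
# a toy bigram table over a 5-token vocabulary; a real LLM replaces it
# with a Transformer over a ~50k-token vocabulary. All names below are
# illustrative.

rng = np.random.default_rng(0)
vocab_size = 5

# Hypothetical parameters theta: unnormalized next-token logits
# conditioned on the previous token (a bigram model).
theta = rng.normal(size=(vocab_size, vocab_size))

def log_p_next(prev_token: int) -> np.ndarray:
    """log p_theta(. | prev_token) via a log-softmax over the vocabulary."""
    logits = theta[prev_token]
    return logits - np.log(np.sum(np.exp(logits)))

def log_p(x: list) -> float:
    """Auto-regressive log-likelihood: sum_t log p_theta(x_t | x_{<t})."""
    return float(sum(log_p_next(x[t - 1])[x[t]] for t in range(1, len(x))))

x = [0, 3, 1, 4]  # a toy "document" of token ids
print(log_p(x))   # pre-training maximizes this over the corpus
```

In practice this log-likelihood is maximized by stochastic gradient ascent over minibatches of text, with the expectation over \(D_{\text{pretrain}}\) approximated by sampling documents.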

I found Andrej Karpathy’s “Let’s build GPT” the most helpful for understanding this step in more detail.

Supervised fine-tuning

This step adapts the language model to be useful for dialogue and to avoid producing undesirable outputs, rather than merely predicting the next token.

We use a dataset \(D_{\text{SFT}}\) of prompt–response pairs \((x, y)\) in this step, typically written or curated by human labelers.

The training objective is \begin{align} \theta_{\text{SFT}} = \argmax_\theta \E_{x, y \sim D_{\text{SFT}}}[\log p_\theta(y \given x)], \end{align} where the parameters are initialized at \(\theta_{\text{pretrain}}\).

I’m not sure but we’re likely not updating all of the parameters, and probably not training for many epochs since that could lead to catastrophic forgetting. Updating all parameters is also expensive.
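The SFT objective is the same next-token log-likelihood as in pre-training, except conditioned on the prompt \(x\) and summed only over the response tokens \(y\). A minimal sketch, with a hypothetical placeholder model standing in for the pre-trained LLM:

```python
import numpy as np

# A minimal sketch of the SFT loss, -log p_theta(y | x): the prompt
# tokens condition the model but do not contribute to the loss. The
# model below is a hypothetical stand-in returning a log-softmax over
# random logits; a real system reuses the pre-trained Transformer.

rng = np.random.default_rng(0)
vocab_size = 5

def log_probs(context: list) -> np.ndarray:
    """Hypothetical model: log p_theta(. | context) over the vocabulary."""
    logits = rng.normal(size=vocab_size)  # placeholder for a Transformer
    return logits - np.log(np.sum(np.exp(logits)))

def sft_loss(x: list, y: list) -> float:
    """Negative log p_theta(y | x): only response tokens contribute."""
    tokens = x + y
    nll = 0.0
    for t in range(len(x), len(tokens)):         # skip prompt positions
        nll -= log_probs(tokens[:t])[tokens[t]]  # teacher forcing
    return nll

prompt, response = [0, 1], [3, 2, 4]
print(sft_loss(prompt, response))
```

Minimizing this loss over \(D_{\text{SFT}}\), starting from \(\theta_{\text{pretrain}}\), is the whole of the SFT stage.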

Reinforcement learning from human feedback

This step fine-tunes the model parameters further to make the model more aligned with human preferences. Why is SFT not enough? One explanation (from Ari Seff’s video tutorial) is that at test time, since the model might act (i.e. choose responses) slightly differently from how a human would in the SFT dataset, we slowly but surely drift out of the training distribution, where the model fails. If we train the model on data that it generates itself, it is less likely to fail this way.

First, we train a reward model to approximate the latent reward function that reflects human preferences. Then, we use this reward model to fine-tune the language model parameters.

Training a reward model based on preferences

We use a dataset \(D_{\text{reward}}\) in this step, consisting of prompts \(x\) paired with two model responses: a winning response \(y_W\) and a losing response \(y_L\), as ranked by human labelers.

The training objective is \begin{align} \varphi = \argmax_\varphi \E_{x, y_W, y_L \sim D_{\text{reward}}}\left[\log \sigma(R_\varphi(x, y_W) - R_\varphi(x, y_L))\right], \end{align} where \(R_\varphi(x, y)\) is the reward model, parameterized by \(\varphi\), which outputs a scalar. I’m not sure about the reward model’s architecture. Hugging Face’s blog seems to suggest it is similar to the base language model, with a similar parameter count.

The objective can be interpreted as maximizing the log probability of the model ranking \(y_W\) higher than \(y_L\), where the probability is given by the sigmoid. This is the Bradley–Terry likelihood model of pairwise preferences, based on \(R_\varphi(x, y_W)\) and \(R_\varphi(x, y_L)\); other likelihood models could have been chosen.

Another way of seeing the objective is writing the reward difference as \(d = R_\varphi(x, y_W) - R_\varphi(x, y_L)\) and rewriting the log sigmoid term as \(-\log(1 + \exp(-d))\). Without the “\(1 +\)” term, the objective becomes \(d\). This makes it clear that the objective can be seen as pushing the reward of the winning response up and the reward of the losing response down. Notably, the reward model is invariant to a constant additive factor, which cancels out in the reward difference \(d\).
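A minimal sketch of this pairwise loss, using a hypothetical linear reward model over feature vectors (a real reward model is an LLM with a scalar head); it also checks numerically that a constant shift of all rewards cancels in \(d\):

```python
import numpy as np

# A minimal sketch of the reward-model objective: minimize
# -log sigma(d) = log(1 + exp(-d)) for the reward gap
# d = R(x, y_W) - R(x, y_L). The reward model here is a hypothetical
# linear score over (prompt, response) features, purely for illustration.

rng = np.random.default_rng(0)
dim = 8
phi = rng.normal(size=dim)  # hypothetical reward-model parameters

def reward(features: np.ndarray, shift: float = 0.0) -> float:
    """R_phi(x, y) as a linear score, optionally shifted by a constant."""
    return float(phi @ features) + shift

def pairwise_loss(f_win: np.ndarray, f_lose: np.ndarray,
                  shift: float = 0.0) -> float:
    """-log sigma(d), written as log(1 + exp(-d))."""
    d = reward(f_win, shift) - reward(f_lose, shift)
    return float(np.log1p(np.exp(-d)))

f_w, f_l = rng.normal(size=dim), rng.normal(size=dim)
print(pairwise_loss(f_w, f_l))
# The constant additive factor cancels in d, so the loss is unchanged:
print(np.isclose(pairwise_loss(f_w, f_l), pairwise_loss(f_w, f_l, shift=3.0)))
```

The second print confirms the additive-invariance observation: shifting every reward by the same constant leaves the training signal untouched.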

Fine-tuning based on the reward model

Here we fine-tune the LLM parameters based on the reward model \(R_\varphi(x, y)\).

In addition to \(D_{\text{pretrain}}\), we use a dataset \(D_{\text{RL}}\).

The training objective is \begin{align} \theta_{\text{RL}} = \argmax_\theta \left\{ \E_{x \sim D_{\text{RL}}, y \sim p_\theta(y \given x)} \left[R_\varphi(x, y)\right] - \beta \E_{x \sim D_{\text{RL}}}\left[\KL{p_\theta(y \given x)}{p_{\theta_{\text{SFT}}}(y \given x)}\right] + \gamma \E_{x \sim D_{\text{pretrain}}}[\log p_\theta(x)] \right\}, \end{align} where the parameters \(\theta\) are initialized at where they ended up after the SFT stage, \(\theta_{\text{SFT}}\). Again, I’m not sure which subset of \(\theta\) is updated and for how many epochs.

The first part of the objective maximizes the expected reward. The parameters to be optimized appear in the distribution of the expectation, so we resort to a score function estimator like \begin{align} \nabla_\theta \E_{x \sim D_{\text{RL}}, y \sim p_\theta(y \given x)} \left[R_\varphi(x, y)\right] = \E_{x \sim D_{\text{RL}}, y \sim p_\theta(y \given x)} \left[R_\varphi(x, y) \nabla_\theta \log p_\theta(y \given x)\right]. \end{align} OpenAI uses proximal policy optimization. The second part is a KL divergence that makes sure the final model doesn’t drift too far from the SFT model. The last part keeps the model performing well on the original language-modeling objective over \(D_{\text{pretrain}}\). Both could be seen as ways to mitigate catastrophic forgetting (?).

The reward model parameters \(\varphi\) and the SFT parameters \(\theta_{\text{SFT}}\) are fixed during this stage; \(\beta\) and \(\gamma\) are hyperparameters.
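The score function estimator above can be checked on a toy problem. In the sketch below, the “policy” \(p_\theta(y)\) is a single categorical distribution over three candidate responses (rather than a token-level LLM), and the Monte Carlo estimate \(\E[R(y)\,\nabla_\theta \log p_\theta(y)]\) is compared against the exact gradient of \(\E[R(y)]\). All names are illustrative; this is plain REINFORCE, not OpenAI’s PPO, and it omits the KL and pre-training terms.

```python
import numpy as np

# A minimal sketch of the score-function (REINFORCE) estimator on a
# toy policy: a single categorical over 3 candidate responses with
# logits theta, and hypothetical reward-model scores R.

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.5])  # logits for 3 candidate responses
R = np.array([1.0, 0.0, 2.0])       # hypothetical reward-model scores

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(theta)

# Exact gradient of E_{y~p}[R(y)] w.r.t. the logits:
# d/dtheta_k sum_y p_y R_y = p_k (R_k - E[R]).
exact = p * (R - p @ R)

# Score-function estimate: average R(y) * grad_theta log p_theta(y)
# over samples, where grad_theta log p_theta(y) = onehot(y) - p.
samples = rng.choice(3, size=200_000, p=p)
grads = np.eye(3)[samples] - p      # one score-function term per sample
estimate = (R[samples, None] * grads).mean(axis=0)

print(exact, estimate)  # should agree up to Monte Carlo noise
```

In the full RLHF objective, gradients of the KL and pre-training terms would be added to this estimate, and each “sample” would be a whole generated response rather than a single categorical draw.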