*15 May 2023*

This is a rough note on the bird’s eye view of large language models (LLMs) to help me get started on understanding them, and hopefully making them safer and more helpful to us. Many details are omitted in service of having a clear and concise high level picture.

In this note, I draw on

- Chip Huyen’s RLHF blog post. This is a great overview of the whole training pipeline. I especially liked the description of datasets used, including their rough sizes, how they were collected, and the illustrative examples of datapoints. I also liked the mathematical formulations. I’m partly summarizing Chip’s blog post for myself.
- Ari Seff’s video tutorial. A 3-Blue-1-Brown style explanation of ChatGPT. I liked the explanations of motivations behind the individual steps in the ChatGPT training pipeline.

There are three steps for training a ChatGPT-like system:

- Unsupervised pre-training,
- Supervised fine-tuning (SFT),
- Reinforcement learning from human feedback (RLHF).

The dataset \(D_{\text{pretrain}}\) used in the unsupervised pre-training step is unlabeled text collected from the Internet:

- There are roughly a trillion (\(10^{12}\)) tokens.
- At each training iteration we subsample a token sequence \(x\) up to some maximum length.
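
The subsampling in the last bullet can be sketched as below. This is only my guess at one simple scheme (a uniformly random contiguous window); `sample_sequence` and the stand-in corpus are hypothetical, and real pipelines likely pack and shuffle documents in more involved ways.

```python
import random

def sample_sequence(tokens, max_len, rng):
    """Return a contiguous window of at most max_len tokens (assumed scheme)."""
    if len(tokens) <= max_len:
        return list(tokens)
    # Pick a start index so that the window fits inside the document.
    start = rng.randrange(len(tokens) - max_len + 1)
    return tokens[start:start + max_len]

rng = random.Random(0)
corpus = list(range(100))  # stand-in for the token ids of one document
window = sample_sequence(corpus, max_len=8, rng=rng)
```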

The training objective is \begin{align} \theta_{\text{pretrain}} = \argmax_\theta \E_{x \sim D_{\text{pretrain}}}[\log p_\theta(x)], \end{align} where \(p_\theta(x)\) is our auto-regressive language model, parameterized by \(\theta\). These models usually have roughly tens to hundreds of billions of parameters. All parameters are trained from scratch.
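
To make the objective concrete for myself: an auto-regressive model factorizes \(\log p_\theta(x)\) into a sum of next-token log probabilities. The sketch below uses a made-up bigram table (`BIGRAM`, `BOS` and the numbers are all hypothetical) in place of a real network, so each token is conditioned only on the previous token rather than on the whole prefix.

```python
import math

BOS = "<s>"  # hypothetical start-of-sequence marker
# Made-up bigram probabilities p(next | prev); each row sums to 1.
BIGRAM = {
    BOS: {"a": 0.5, "b": 0.3, "c": 0.2},
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.4, "b": 0.2, "c": 0.4},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}

def log_prob(x):
    """log p(x) as a sum of next-token log probabilities."""
    total = 0.0
    prev = BOS
    for tok in x:
        total += math.log(BIGRAM[prev][tok])
        prev = tok
    return total

def avg_log_likelihood(batch):
    """Monte Carlo estimate of E_x[log p(x)] over a batch of sequences."""
    return sum(log_prob(x) for x in batch) / len(batch)
```

Pre-training would adjust the table entries (in reality, the network parameters \(\theta\)) to push `avg_log_likelihood` up on sequences drawn from \(D_{\text{pretrain}}\).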

I found Andrej Karpathy’s “Let’s build GPT” the most helpful for understanding this step in more detail.

The supervised fine-tuning step makes the language model useful for dialog and steers it away from undesirable outputs, instead of leaving it to just predict the next token.

We use a dataset \(D_{\text{SFT}}\) in this step.

- This dataset comprises 10k-100k datapoints.
- Each datapoint is a prompt-response pair \((x, y)\).
- A prompt-response pair is collected by having two people talk to each other in a dialog-like setting. One response is chosen as \(y\), with the whole conversation history up to that point being \(x\).

The training objective is \begin{align} \theta_{\text{SFT}} = \argmax_\theta \E_{x, y \sim D_{\text{SFT}}}[\log p_\theta(y \given x)], \end{align} where the parameters are initialized at \(\theta_{\text{pretrain}}\).
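
One way I read \(\log p_\theta(y \given x)\): the model sees the prompt and the response concatenated, but only the response tokens contribute to the objective. A minimal sketch, assuming we already have per-token probabilities from some model (the numbers and the helper `response_log_prob` are made up for illustration):

```python
import math

def response_log_prob(token_probs, prompt_len):
    """Return log p(y | x) from per-token probabilities of the sequence x + y.

    token_probs[t] is the (hypothetical) probability the model assigns to
    token t given all earlier tokens; the first prompt_len entries belong to
    the prompt x and are skipped.
    """
    return sum(math.log(p) for p in token_probs[prompt_len:])

# Made-up probabilities: 3 prompt tokens followed by 2 response tokens.
token_probs = [0.2, 0.1, 0.4, 0.5, 0.25]
sft_log_prob = response_log_prob(token_probs, prompt_len=3)
```

The prompt still matters, since the response-token probabilities are conditioned on it, but the prompt tokens themselves are not scored.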

I’m not sure, but we are probably not updating all of the parameters and not training for many epochs, since that could lead to catastrophic forgetting. Updating all parameters is also expensive.

The RLHF step fine-tunes the model parameters further to align the model more closely with human preferences. Why is SFT not enough? One explanation (from Ari Seff’s video tutorial) is that at test time the model might act (i.e. choose responses) slightly differently from how a human did in the SFT dataset, so we slowly but surely drift out of the training distribution, where the model will fail. If we train a model using data that it generates itself, it is less likely to fail this way.

First, we train a reward model to approximate the latent reward function that reflects human preferences. Then, we use this reward model to fine-tune the language model parameters.

We use a dataset \(D_{\text{reward}}\) to train the reward model.

- This dataset comprises roughly 100k-1M datapoints.
- Each datapoint is a tuple \((\text{prompt } x, \text{winning response } y_W, \text{losing response } y_L)\).
- Such a tuple is derived from humans ranking \(N\) responses \(y_1, y_2, \dotsc, y_N\) to a prompt \(x\), where we take all the \(N\)-choose-2 response pairs \((y_i, y_j)\) and consider the better-ranked response to be the winning one, and the worse-ranked response to be the losing one.
- I’m not sure how the prompt \(x\) and the responses \((y_1, y_2, \dotsc, y_N)\) are generated. The prompt could come from a dataset similar to \(D_{\text{SFT}}\) and the responses could be generated from the model and/or from humans.
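
The \(N\)-choose-2 construction above is easy to sketch. I assume here that the ranking arrives as a list ordered best first; `ranking_to_pairs` is a hypothetical helper, not something from the sources.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Turn a best-first ranking into (prompt, winner, loser) tuples.

    combinations() keeps the input order, so in every pair the first
    element is ranked better than the second.
    """
    return [(prompt, w, l) for w, l in combinations(ranked_responses, 2)]

pairs = ranking_to_pairs("x", ["y1", "y2", "y3", "y4"])  # y1 is ranked best
```

With \(N = 4\) responses this gives \(4 \cdot 3 / 2 = 6\) datapoints from a single ranking.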

The training objective is \begin{align} \varphi = \argmax_\varphi \E_{x, y_W, y_L \sim D_{\text{reward}}}\left[\log \sigma(R_\varphi(x, y_W) - R_\varphi(x, y_L))\right], \end{align} where \(R_\varphi(x, y)\) is the reward model, parameterized by \(\varphi\), which outputs a scalar. I’m not sure about the reward model’s architecture. Hugging Face’s blog seems to suggest it is similar to the base language model, with a similar number of parameters.

The objective could be interpreted as maximizing the log probability of the model ranking \(y_W\) higher than \(y_L\), where the probability is given by the sigmoid. This could be seen as a likelihood model based on \(R_\varphi(x, y_W)\) and \(R_\varphi(x, y_L)\). We could have chosen something else.

Another way of seeing the objective is to write the reward difference as \(d = R_\varphi(x, y_W) - R_\varphi(x, y_L)\) and rewrite the log sigmoid term as \(-\log (1 + \exp(-d))\). Without the “1 +” term, the objective becomes \(d\). This makes it clear that the objective can be seen as pushing the reward of the winning response up, and the reward of the losing response down. Notably, the objective is invariant to adding a constant to the reward, since the constant cancels out in the reward difference \(d\).
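
A small numerical sketch of the pairwise term, working directly with scalar rewards instead of a real reward model (the function names are mine). It uses a numerically stable form of \(\log \sigma(d) = -\log(1 + \exp(-d))\).

```python
import math

def log_sigmoid(d):
    """Numerically stable log(sigmoid(d)) = -log(1 + exp(-d))."""
    if d >= 0:
        return -math.log1p(math.exp(-d))
    # For negative d, use log(sigmoid(d)) = d - log(1 + exp(d)).
    return d - math.log1p(math.exp(d))

def pairwise_objective(r_win, r_lose):
    """Per-datapoint term of the reward model objective (to be maximized)."""
    return log_sigmoid(r_win - r_lose)

tie = pairwise_objective(0.0, 0.0)       # equals log(1/2)
base = pairwise_objective(1.0, 0.0)
shifted = pairwise_objective(6.0, 5.0)   # same difference, rewards shifted by 5
```

Here `base` and `shifted` agree because only the difference \(d\) enters, and for very negative \(d\) the value is approximately \(d\) itself, matching the “drop the 1 +” reading.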

Here we fine-tune the LLM parameters based on the reward model \(R_\varphi(x, y)\).

In addition to \(D_{\text{pretrain}}\), we use a dataset \(D_{\text{RL}}\).

- This dataset comprises roughly 10k-100k prompts \(x\) (which could be the full conversation history up until the response).
- This could be from the same dataset as \(D_{\text{SFT}}\).

The training objective is \begin{align} \theta_{\text{RL}} = \argmax_\theta \left\{ \E_{x \sim D_{\text{RL}}, y \sim p_\theta(y \given x)} \left[R_\varphi(x, y)\right] - \beta \E_{x \sim D_{\text{RL}}}\left[\KL{p_\theta(y \given x)}{p_{\theta_{\text{SFT}}}(y \given x)}\right] + \gamma \E_{x \sim D_{\text{pretrain}}}[\log p_\theta(x)] \right\}, \end{align} where the parameters \(\theta\) are initialized at where they ended up after the SFT stage, \(\theta_{\text{SFT}}\). Again, I’m not sure which subset of \(\theta\) is updated and for how many epochs.

The first part of the objective maximizes the expected reward. The parameters to be optimized are in the distribution of the expectation so we resort to a score function estimator like \begin{align} \nabla_\theta \E_{x \sim D_{\text{RL}}, y \sim p_\theta(y \given x)} \left[R_\varphi(x, y)\right] = \E_{x \sim D_{\text{RL}}, y \sim p_\theta(y \given x)} \left[R_\varphi(x, y) \nabla_\theta \log p_\theta(y \given x)\right]. \end{align} OpenAI uses proximal policy optimization. The second part is a KL divergence that makes sure the final model isn’t too far from the SFT model. The last part makes sure that the final model isn’t too far from the model from the unsupervised pre-training stage. These could be seen as ways to mitigate catastrophic forgetting (?).
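
To convince myself of the score function identity, here is a toy check on a softmax policy over three possible responses, with no prompt and made-up rewards. Everything here is an illustrative assumption. The expectation is computed exactly by enumeration, whereas in practice it is estimated from sampled responses, and this is plain score-function (REINFORCE-style) differentiation, not PPO.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_gradient(logits, rewards):
    """E_y[R(y) * grad log p(y)], computed exactly by enumerating y."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p_y, r_y) in enumerate(zip(probs, rewards)):
        for i in range(len(logits)):
            # d/d logit_i of log softmax(logits)[y] is 1[i == y] - p_i.
            dlogp = (1.0 if i == y else 0.0) - probs[i]
            grad[i] += p_y * r_y * dlogp
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central-difference gradient of the expected reward, for comparison."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

logits = [0.2, -0.5, 1.0]   # toy policy parameters
rewards = [1.0, 0.0, 2.0]   # made-up rewards for the three responses
g_score = score_function_gradient(logits, rewards)
g_fd = finite_difference_gradient(logits, rewards)
```

The two gradients agree up to finite-difference error, which is the content of the identity above.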

Reward model parameters \(\varphi\) and the SFT parameters \(\theta_{\text{SFT}}\) are fixed, \(\beta\) and \(\gamma\) are hyperparameters.
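
To see how \(\beta\) trades reward against staying close to the SFT model, here is a toy version of the first two terms of the objective for a single prompt with three possible responses. The distributions and rewards are made up, and I leave out the \(\gamma\) pre-training term to keep the sketch short.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rl_objective(policy, sft_policy, rewards, beta):
    """Expected reward minus beta times the KL penalty to the SFT policy."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, sft_policy)

sft_policy = [0.5, 0.3, 0.2]   # fixed reference distribution over responses
peaked = [0.05, 0.05, 0.9]     # policy that mostly picks the best response
rewards = [0.0, 1.0, 2.0]      # made-up reward model outputs

small_beta = rl_objective(peaked, sft_policy, rewards, beta=0.1)
large_beta = rl_objective(peaked, sft_policy, rewards, beta=2.0)
stay_put = rl_objective(sft_policy, sft_policy, rewards, beta=2.0)
```

With a small \(\beta\) the peaked policy scores better than staying at the SFT policy; with a large \(\beta\) the KL penalty outweighs the extra reward and staying put wins.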
