How are large language models (LLMs) trained?

Large language models (LLMs) are neural networks, so they learn by finding patterns in data. The training of an LLM can be understood as proceeding in three phases:

  1. The core network is trained on large amounts of internet text with the task of predicting the next word.

  2. Once the network has achieved a base level of capability in generating many different types of text, it is made better at generating specific types of text using imitation learning. In this phase, the network is given supervised demonstrations by experts and learns to imitate the kind of text they produce.

  3. Eventually, the network performs the task on its own, and the experts only provide feedback on its outputs.

What follows is a slightly more technical explanation of these phases.

  • Step 0: Tokenization + Vectorization: Training data is gathered from many different sources - Wikipedia, Stack Overflow, Reddit, etc. Then every word is tokenized, i.e. broken up into constituent parts. For example, "differing" might be split into two tokens, differ + ing. The tokens are then vectorized – turned into vectors, collections of numbers that a neural network can process. The resulting vectors are called embeddings (sketched below).
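
To make this concrete, here is a minimal sketch in Python (PyTorch). The tiny vocabulary, the greedy splitting rule, and the embedding size are made up for illustration; real systems learn subword tokenizers such as byte-pair encoding over vocabularies of tens of thousands of tokens.

```python
import torch
import torch.nn as nn

# Toy subword vocabulary (illustrative only; real tokenizers learn ~50k subwords).
vocab = {"differ": 0, "ing": 1, "the": 2, "cat": 3, "s": 4}

def tokenize(word):
    """Greedy longest-prefix split into known subwords (a stand-in for BPE)."""
    tokens = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in vocab:
                tokens.append(word[:end])
                word = word[end:]
                break
        else:
            raise ValueError("unknown fragment: " + word)
    return tokens

tokens = tokenize("differing")                   # ['differ', 'ing']
ids = torch.tensor([vocab[t] for t in tokens])   # tensor([0, 1])

# Vectorization: an embedding table maps each token id to a vector.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embedding(ids)                         # shape (2, 8): one 8-dim vector per token
print(tokens, vectors.shape)
```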

Once we have the embeddings, we can begin training the network. The training process updates both the token embeddings (so that semantically similar tokens cluster together) and the parameters of the neural network itself, as the sketch below illustrates.
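
Here is a hypothetical sketch of a single training step on a toy model, showing that the embedding table is just another set of trainable parameters: the same optimizer step that updates the transformer's weights also nudges the token vectors.

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 32   # toy sizes, for illustration only

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # the token vectors
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)       # scores for the next token

    def forward(self, ids):
        return self.head(self.block(self.embed(ids)))

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # includes embed.weight

ids = torch.randint(0, vocab_size, (1, 16))      # a fake token sequence
logits = model(ids[:, :-1])                      # predict each following token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), ids[:, 1:].reshape(-1))
loss.backward()
optimizer.step()   # this one step adjusts both the embeddings and the network weights
```

Roughly speaking, tokens that appear in similar contexts receive similar updates over many such steps, which is what produces the clustering described above.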

Although the exact training sequence differs somewhat between organizations, most labs follow the same general outline: pre-training followed by some form of fine-tuning.

One concrete example of this path is InstructGPT. Here is a short outline of the steps that were used to train it:

  • Step 1: Self-Supervised Generative Pre-training (Create the shoggoth): The LLM is initially trained on a large amount of internet text with the objective of predicting the next token (a loss sketch follows this list).

  • Step 2: Supervised Fine-tuning (Mold it to be human-like): A fine-tuning dataset is created by giving a human a prompt and asking them to write an output for that prompt. This gives us a dataset of (prompt, output) pairs. We then use this dataset with supervised learning (behavioral cloning) to fine-tune the LLM (see the sketch after this list).

  • Step 3: Reinforcement Learning from Human Feedback, or RLHF (Put a smiley face on it):

    • Step 3a: Reward Model: We train an additional reward model. We first prompt the fine-tuned LLM and collect several output samples for the same prompt. A human ranks those samples from best to worst, and we use this ranking to train the reward model to predict which outputs a human would rank higher (see the ranking-loss sketch after this list).

    • Step 3b: Reinforcement learning: Once we have both a fine-tuned LLM and a reward model, we can use reinforcement learning based on Proximal Policy Optimization (PPO) to train the fine-tuned model to maximize the reward assigned by the reward model, which stands in for human rankings (see the PPO sketch after this list).
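
To make Steps 1 and 2 concrete: both can be implemented with the same next-token prediction loss sketched earlier; what changes is the data, which is scraped internet text for pre-training and human-written (prompt, output) demonstrations for supervised fine-tuning. The helper and dataset names below are placeholders, not OpenAI's actual pipeline.

```python
import torch.nn as nn

def next_token_loss(model, ids):
    """Language-modeling loss: predict token t+1 from tokens up to t."""
    logits = model(ids[:, :-1])               # (batch, seq-1, vocab_size)
    targets = ids[:, 1:]                      # the "next" token at each position
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Step 1 (pre-training): batches are token ids from scraped internet text.
#   loss = next_token_loss(model, tokenize_batch(web_documents))        # hypothetical
# Step 2 (supervised fine-tuning): batches are human demonstrations,
# i.e. a prompt with the human-written output appended to it.
#   loss = next_token_loss(model, tokenize_batch(prompts_plus_outputs)) # hypothetical
```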
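
For Step 3a, the reward model is typically a language-model backbone with the output layer replaced by a single scalar score. The InstructGPT paper trains it with a pairwise comparison loss, -log σ(r(better) - r(worse)), applied to every pair of outputs implied by the human ranking. Below is a minimal sketch; the backbone interface is an assumption for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Illustrative: a language-model backbone plus a scalar 'how good is this?' head."""
    def __init__(self, backbone, dim):
        super().__init__()
        self.backbone = backbone          # assumed to return (batch, seq, dim) hidden states
        self.score = nn.Linear(dim, 1)

    def forward(self, ids):
        hidden = self.backbone(ids)
        return self.score(hidden[:, -1]).squeeze(-1)   # one scalar reward per sequence

def ranking_loss(reward_model, better_ids, worse_ids):
    """Pairwise loss: push the score of the human-preferred output above the other's.
    A ranking of K samples is decomposed into all of its (better, worse) pairs."""
    r_better = reward_model(better_ids)
    r_worse = reward_model(worse_ids)
    return -nn.functional.logsigmoid(r_better - r_worse).mean()
```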
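
For Step 3b, here is a heavily simplified sketch of the core of a PPO update in this setting: sample responses from the current policy, score them with the reward model, subtract a penalty for drifting too far from the supervised fine-tuned model (InstructGPT uses a KL penalty like this to keep outputs coherent), and maximize PPO's clipped surrogate objective. The sample, logprob, and score helpers are hypothetical, and real implementations add a learned value function, per-token advantage estimation, and multiple optimization epochs per batch of rollouts.

```python
import torch

def rlhf_ppo_step(policy, sft_model, reward_model, optimizer, prompts,
                  kl_coef=0.1, clip_eps=0.2):
    """One simplified PPO-style update (illustrative, not production RLHF code)."""
    with torch.no_grad():
        # Roll out: sample a response per prompt; record log-probs under the
        # sampling policy and under the frozen supervised fine-tuned model.
        responses = policy.sample(prompts)                  # hypothetical helper
        old_logprobs = policy.logprob(prompts, responses)   # hypothetical helper
        sft_logprobs = sft_model.logprob(prompts, responses)
        scores = reward_model.score(prompts, responses)     # hypothetical helper

        # Reward = reward-model score minus a KL penalty toward the SFT model.
        rewards = scores - kl_coef * (old_logprobs - sft_logprobs)
        # Crude baseline: normalize rewards to act as advantages.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO's clipped surrogate objective.
    new_logprobs = policy.logprob(prompts, responses)
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```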

Here is a short podcast clip that talks about the RLHF process and how ChatGPT was trained: