What are scaling laws?



"Scaling laws", in the context of training an AI model, express the relationship between the model’s performance, the model’s size (number of parameters), the length of the training run, and the amount of data it was trained on. These last three quantities determine how much compute is used in the training process, and scaling laws are used to allocate a fixed amount of compute between them so as to produce the most capable model.

Scaling laws are used to decide on trade-offs like: Should I pay Stack Overflow to train on their data? Or should I buy more GPUs? Or should I pay the higher electricity bills I would get by training my model longer? If my compute goes up by 10×, how many parameters should I add to my model to make the best possible use of my GPUs?

In the case of very large language models like GPT-4, these trade-offs look more like training a 20-billion parameter model on 40% of an archive of the Internet vs. training a 200-billion parameter model on 4% of an archive of the Internet, or any of an infinite number of points along the same boundary.
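To make this boundary concrete, here is a minimal sketch. It assumes the common rule of thumb that training compute is roughly 6 × parameters × tokens, which is an approximation not stated in this article, and it uses a hypothetical archive size of 5 trillion tokens chosen purely for illustration. The function names are also hypothetical.

```python
# Assumption: training compute ~ 6 * parameters * tokens. This is a widely
# used rule of thumb, not a figure taken from this article.
FLOPS_PER_PARAM_PER_TOKEN = 6

# Hypothetical archive size in tokens, chosen only for illustration.
ARCHIVE_TOKENS = 5e12


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens


def tokens_for_budget(params: float, budget_flops: float) -> float:
    """Tokens a model of the given size can be trained on within a budget."""
    return budget_flops / (FLOPS_PER_PARAM_PER_TOKEN * params)


# The two options from the text: 20B parameters on 40% of the archive
# versus 200B parameters on 4% of the archive.
small = training_flops(20e9, 0.40 * ARCHIVE_TOKENS)
large = training_flops(200e9, 0.04 * ARCHIVE_TOKENS)

if __name__ == "__main__":
    print(f"20B params on 40% of archive: {small:.2e} FLOPs")
    print(f"200B params on 4% of archive: {large:.2e} FLOPs")
    # Other points on the same boundary, found by fixing the budget
    # and solving for the share of the archive that can be used.
    for params in (50e9, 100e9):
        share = tokens_for_budget(params, small) / ARCHIVE_TOKENS
        print(f"{params / 1e9:.0f}B params: {share:.0%} of archive")
```

Under this approximation both options cost the same compute, because the tenfold increase in parameters is exactly offset by the tenfold decrease in data. A scaling law is what tells you which point on that boundary gives the best model.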

In 2020, OpenAI published the first generation of scaling laws and found that, given how models were being trained at the time, increasing model size was more effective than using more data. Subsequent researchers took this idea to heart: model sizes grew at an accelerating pace, with parameter counts reaching half a trillion, while the amount of training data stayed roughly constant.


DeepMind researchers proposed new scaling laws in 2022. They found that to make the most effective use of an increase in compute, instead of mostly increasing model size, you should increase the size of the model and the dataset by roughly the same factor. To test the new scaling law, DeepMind trained a 70-billion parameter model called "Chinchilla" using the same amount of compute as the 280-billion parameter Gopher. Because of Chinchilla’s smaller size, they were able to reallocate this compute to train on 1.4 trillion tokens compared to Gopher’s 300 billion. As the new scaling laws predicted, Chinchilla performed significantly better than Gopher.
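As a rough check on these figures, the sketch below compares the two training budgets and illustrates the equal-scaling rule. It assumes the rule of thumb that training compute is roughly 6 × parameters × tokens, which is an approximation and not a figure from this article, and the function names are hypothetical.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * params * tokens


def scale_equally(params: float, tokens: float, compute_multiplier: float):
    """Grow parameters and tokens by the same factor.

    Because compute is taken to be proportional to params * tokens,
    multiplying both by sqrt(compute_multiplier) multiplies compute
    by compute_multiplier.
    """
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


# Parameter and token counts as quoted in the text.
gopher = training_flops(280e9, 300e9)
chinchilla = training_flops(70e9, 1.4e12)

if __name__ == "__main__":
    print(f"Gopher:     {gopher:.2e} FLOPs")
    print(f"Chinchilla: {chinchilla:.2e} FLOPs")
    print(f"Ratio:      {chinchilla / gopher:.2f}")

    # What a 10x larger budget would look like under equal scaling,
    # starting from Chinchilla's configuration.
    new_params, new_tokens = scale_equally(70e9, 1.4e12, 10)
    print(f"10x compute: {new_params:.2e} params, {new_tokens:.2e} tokens")
```

Under this approximation the two budgets come out within about 20% of each other, which is consistent with the "same amount of compute" description. It also suggests an answer to the earlier question about a 10× increase in compute: with equal scaling, both parameters and tokens grow by a factor of about 3.2, the square root of 10.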