What is the "Bitter Lesson"?

The Bitter Lesson is a thesis introduced by Rich Sutton in a 2019 essay. It states that, for improving AI capabilities, “general methods that leverage computation are ultimately the most effective, and by a large margin”, compared to approaches that build in human domain knowledge.

Historically, AI researchers have mostly designed systems around a fixed amount of computing power, improving performance by building in domain-specific human knowledge. In principle, this approach is compatible with also improving performance by scaling up compute, but in practice, the complications introduced by human-engineered knowledge make it harder to also leverage computation. Available computing power has grown steadily in accordance with Moore’s law, and past trends suggest that exploiting this growth is what improves performance in the long run.

Historical examples of the Bitter Lesson include:

  • Games: Deep Blue beat chess world champion Garry Kasparov by leveraging massive, deep search. Similarly, AlphaGo beat Go world champion Lee Sedol by combining deep learning with Monte Carlo tree search to find its moves, rather than relying on human-engineered Go heuristics. A year and a half later, AlphaGo Zero dispensed with human-generated Go data entirely, learning purely through self-play, and decisively beat AlphaGo (AlphaZero then generalized the approach to chess and shogi). None of these successive improvements in game-playing capability relied on any fundamental breakthrough in human knowledge of the games; a minimal sketch of the underlying search idea appears after this list.

  • Vision: Early computer vision methods relied on human-engineered features and convolution kernels to perform image recognition, but over the years it has been found that spending more compute and letting convolutional neural nets (CNNs) learn their own features yields much better performance; see the second sketch after this list.
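
To make “leveraging computation via search” concrete, here is a minimal Python sketch of exhaustive game-tree search (negamax) on a toy take-away game. The game, names, and numbers are illustrative assumptions, not anything from Sutton’s essay; real engines like Deep Blue add alpha-beta pruning and an evaluation function on top of this core idea.

```python
from functools import lru_cache

# Toy "subtraction game": players alternately take 1-3 stones;
# whoever takes the last stone wins. The rules are all the search
# is given -- no strategic knowledge is built in.

def moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

@lru_cache(maxsize=None)
def negamax(stones):
    """Value of the position for the player to move: +1 win, -1 loss."""
    if stones == 0:
        return -1  # the previous player took the last stone; we have lost
    # Our best outcome is the worst outcome we can force on the opponent.
    return max(-negamax(stones - m) for m in moves(stones))

def best_move(stones):
    return max(moves(stones), key=lambda m: -negamax(stones - m))

print(best_move(10))  # -> 2 (leaving a multiple of 4), found by brute search alone
```

Throwing more compute at the search (deeper trees, bigger games) improves play without anyone encoding what a “good move” looks like, which is exactly the dynamic the Bitter Lesson describes.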
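
The vision story can be sketched the same way. The NumPy example below (a toy setup assumed for illustration, not from any cited work) first filters an image with a hand-designed Sobel edge kernel, then recovers a similar kernel from data alone by gradient descent, which is, at much larger scale, what a CNN layer does.

```python
import numpy as np

# Hand-engineered feature: a fixed Sobel kernel encodes human knowledge
# about what a vertical edge looks like.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, as used in CNN layers."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Learned feature: start from a random kernel and fit it to data.
rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))
target = conv2d(image, SOBEL_X)          # stand-in "ground truth" signal
kernel = 0.1 * rng.standard_normal((3, 3))

for _ in range(3000):
    err = conv2d(image, kernel) - target
    # Gradient of mean squared error with respect to each kernel entry.
    grad = np.zeros_like(kernel)
    for i in range(3):
        for j in range(3):
            grad[i, j] = 2 * np.mean(err * image[i:i + err.shape[0], j:j + err.shape[1]])
    kernel -= 0.05 * grad

print(np.round(kernel, 2))  # close to the Sobel kernel it was never shown
```

The learned kernel matches the hand-designed one here only because the toy target was generated from it; the general point is that, given data and compute, the features need not be specified by hand.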

Modern AI research has come to favor general-purpose methods of search and learning, which continue to scale with increasing compute. Over the last few generations of transformer models, simply scaling language models up has been so effective that it led OpenAI researchers to propose empirical scaling laws for language models in 2020 (Kaplan et al.), which DeepMind revised in 2022 with the “Chinchilla” scaling laws (Hoffmann et al.); a sketch of how such a law is used appears below. However, while scaling up models has often increased their capabilities, that does not necessarily imply that scaling alone will be sufficient to reach AGI.
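
As one concrete illustration, the parametric scaling law fitted in the Chinchilla paper has the form L(N, D) = E + A/N^α + B/D^β, where N is the parameter count and D the number of training tokens. The Python sketch below plugs in the paper’s (approximate) fitted constants and derives a compute-optimal split of a FLOP budget; the constants are quoted from memory and the budget value is illustrative.

```python
# Parametric loss from Hoffmann et al. 2022 ("Chinchilla"):
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants below are the paper's fitted values (approximate).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Predicted pretraining loss for N parameters and D tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal(C):
    """Split a FLOP budget C ~ 6*N*D to minimize predicted loss.

    Substituting D = C / (6 * N) and setting the derivative to zero gives
    N_opt = G * (C / 6)**(beta / (alpha + beta)),
    with G = (alpha * A / (beta * B))**(1 / (alpha + beta)).
    """
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N = G * (C / 6) ** (beta / (alpha + beta))
    return N, C / (6 * N)

N, D = compute_optimal(5.8e23)  # a budget on the order of Chinchilla's
print(f"params ~ {N:.2e}, tokens ~ {D:.2e}, predicted loss ~ {loss(N, D):.2f}")
```

Note that the paper’s several fitting approaches give somewhat different optima, so the numbers printed here should be read as a sketch of the method rather than as authoritative recommendations.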