What are foundation models?

The field of machine learning (ML) has recently seen a trend toward large research labs training a single “base model” that acts as a jack of all trades, but master of none. These models are called foundation models. They form the base layer to which fine-tuning can be applied in order to produce models for a wide range of downstream tasks.

The main reason for this trend is the rising cost and compute required to train ML models from scratch. Achieving competitive, state-of-the-art performance on many tasks requires systems that have learned from thousands, or even millions, of examples. Each time we want to create and deploy such a system, we need a large, well-labeled dataset for the specific task, as well as access to sufficient compute, which costs both money and time. Fine-tuning a model that already has broad, useful knowledge is often much cheaper and faster.
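To make the fine-tuning step concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries. The model checkpoint, dataset, and hyperparameters are illustrative placeholders rather than anything prescribed by the text; the point is simply that, once a pretrained base model exists, a small labeled dataset and modest compute are enough to adapt it to a specific task.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative choices: any compatible pretrained checkpoint and labeled dataset work.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")  # small sentiment-classification dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="finetuned-sentiment",
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    # A few thousand labeled examples are often enough when starting from a base model.
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].select(range(500)),
)

trainer.train()
```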

Concerns regarding foundation models

Foundation models could give everyone access to state-of-the-art capabilities, as well as the ability to train their own models, using very little data, to perform highly specialized tasks.

However, the use of foundation models risks homogenization. Since most models are now fine-tuned versions of a handful of foundation models, downstream AI systems may inherit the biases of those few base models. Failure modes present in a base model can spread to every model trained on top of it. In many domains, this centralization brings greater efficiency while also creating single points of failure.

Reinforcement Learning (RL) foundation models

While most current foundation models are large language models (LLMs), DeepMind has also recently trained a reinforcement learning (RL) foundation model called an “adaptive agent” (AdA). If language-based foundation models are general-purpose text generators, then this model is a general-purpose task follower. It is trained in a rich, open-ended 3D environment using a combination of techniques, including distillation (as in IDA). The resulting model can then be fine-tuned, like other foundation models, to complete more specific tasks.
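As an illustration of the distillation idea mentioned above, the sketch below shows generic knowledge distillation in PyTorch: a smaller “student” network is trained to match the softened output distribution of a frozen “teacher”. The network shapes, temperature, and random inputs are hypothetical stand-ins; AdA's actual training pipeline (an RL agent in an open-ended 3D environment) is considerably more involved than this.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher and student networks; shapes are stand-ins, not AdA's architecture.
teacher = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
student = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(), torch.nn.Linear(16, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

temperature = 2.0  # softens the teacher's output distribution

def distillation_step(x):
    with torch.no_grad():
        teacher_logits = teacher(x)  # frozen teacher provides soft targets
    student_logits = student(x)

    # KL divergence between softened distributions; the temperature**2 factor
    # keeps gradient magnitudes comparable across temperature settings.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage: random inputs stand in for real observations.
for _ in range(100):
    distillation_step(torch.randn(64, 32))
```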