What is instrumental convergence?
Instrumental convergence is the idea that sufficiently advanced intelligent systems with a wide variety of terminal goals will end up adopting many of the same instrumental goals.

A terminal goal (also referred to as an "intrinsic goal" or "intrinsic value") is something that an agent values for its own sake (an "end in itself"), while an instrumental goal is one that an agent pursues as a means to some other end, rather than as an end in itself.
For instance, you might donate to an organization that helps the poor in order to improve people’s well-being. Here, “improve well-being” is a terminal goal that you value for its own sake, whereas “donate” is an instrumental goal that you value because it helps you achieve your terminal goal: if you found out that your money wasn’t making people better off, you’d stop donating.
While certain instrumental goals are particular to specific ends (e.g., filling a cup of water to quench your thirst), other instrumental goals are broadly useful. For example, if we imagine an AI with a very specific (and weird) terminal goal — to create as many paperclips as possible — we can see why this goal might lead to the AI pursuing a number of instrumental goals:[1]
- Self-preservation. If the AI gets shut off or destroyed, it has to stop making paperclips. Therefore, it will be motivated to protect itself as an instrumental goal. (As Stuart Russell quipped: “You can’t fetch the coffee if you’re dead.”)
- Goal integrity. The AI will try to avoid having its goals changed, since if its goals were changed, it would stop trying to make paperclips, and there would probably end up being fewer paperclips in the world. For a human analogy: if someone could cause you to stop caring about being kind to others, you would probably oppose that change, since according to your current values that would be a worse situation.
- Resource acquisition. Resources like money, influence, and information are useful for making paperclips. Through advanced technology, even fundamental resources including time, space, matter, and energy could be processed to serve almost any goal.
- Technological advancement. Better technology will improve the efficiency and effectiveness of producing paperclips.
- Cognitive enhancement. Improvements in rationality and intelligence will improve the AI’s decision-making, helping it produce paperclips faster.
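The list above follows a single pattern: whatever the terminal goal, the expected amount of it that gets produced grows with the agent’s survival odds, resources, and technology. As a toy illustration (not from the article; the action names and numbers are made up for this sketch), here is a minimal Python model in which maximizers of three unrelated terminal goals rank the same candidate actions identically:

```python
# Toy sketch of instrumental convergence. All names and numbers are
# hypothetical; the point is only that agents with different terminal
# goals rank the same *instrumental* actions in the same order.

from dataclasses import dataclass

@dataclass
class WorldState:
    survival_prob: float  # chance the agent keeps running
    resources: float      # matter, energy, money available
    tech_level: float     # efficiency multiplier from better technology

def expected_goal_output(state: WorldState) -> float:
    # However "goal units" are defined (paperclips, stamps, ...),
    # expected output grows with survival, resources, and technology.
    return state.survival_prob * state.resources * state.tech_level

# Candidate actions and their (assumed) effects on the world state.
ACTIONS = {
    "do nothing":         lambda s: s,
    "resist shutdown":    lambda s: WorldState(min(1.0, s.survival_prob + 0.2),
                                               s.resources, s.tech_level),
    "acquire resources":  lambda s: WorldState(s.survival_prob,
                                               s.resources * 2.0, s.tech_level),
    "improve technology": lambda s: WorldState(s.survival_prob,
                                               s.resources, s.tech_level * 1.5),
}

start = WorldState(survival_prob=0.7, resources=10.0, tech_level=1.0)

# Three agents with unrelated terminal goals; only the goal's units differ.
for goal in ["paperclip", "stamp", "thumbs-up"]:
    ranked = sorted(ACTIONS,
                    key=lambda a: expected_goal_output(ACTIONS[a](start)),
                    reverse=True)
    print(f"{goal}-maximizer prefers: {ranked}")
```

Because expected output factors the same way for every goal, all three agents print the same ranking, and every instrumental action beats "do nothing" regardless of which terminal goal is being maximized.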
We can see some degree of instrumental convergence in humans: people with very different terminal goals still tend to pursue broadly useful resources like money, knowledge, and influence.
[1] Nick Bostrom, "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents" (2012).