What is embedded agency?

An embedded agent is an agent that is part of its environment.

Standard decision theory models agents as separate from the environment they act upon. For example, when you play a video game, you affect the world of the game but are not yourself part of it. You interact with the game only through well-defined channels: input devices (e.g. a mouse and keyboard) that carry your actions in, and output devices (e.g. images on the screen) that carry observations back. In such a situation, you don’t need to model yourself in order to understand and play the game.
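The separation described above can be sketched as a small loop. This is only an illustrative toy, not taken from any decision-theory text: `GameWorld`, `player_policy`, and `play` are hypothetical names, and the "game" is just a counter that the player nudges toward a target.

```python
class GameWorld:
    """Hypothetical toy environment: a counter the player can nudge up or down."""

    def __init__(self) -> None:
        self.position = 0

    def step(self, action: int) -> int:
        # Input channel: the action is the only way the player affects the world.
        self.position += action
        # Output channel: the observation is the only thing the player sees.
        return self.position


def player_policy(observation: int, target: int = 3) -> int:
    # The player sits outside GameWorld, so it never needs a model of itself:
    # nothing in the world can change how this function behaves.
    if observation < target:
        return 1
    if observation > target:
        return -1
    return 0


def play(steps: int = 5) -> int:
    world = GameWorld()
    observation = world.position
    for _ in range(steps):
        observation = world.step(player_policy(observation))
    return observation
```

Note that the world's state contains nothing about the player. That omission is exactly what the standard model assumes.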

In real life, on the other hand, you are part of the universe and are made of atoms just like everything else. When you act, your actions can affect you as well, so you cannot restrict your analysis to your impact on the world outside you. In this sense, you are an “embedded agent” in the world.

This has several implications. First, since you are part of the world, you can change your own abilities. For example, you can improve your skills to be better able to solve future problems. Second, you need to model how your plans will affect you. For example, if you accidentally cause an explosion, you could be injured. Third, an embedded agent cannot contain a complete description of its environment, since it is itself contained by that environment.
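As a rough illustration of the first two implications, here is a toy sketch in which the agent's abilities and health are simply fields of the world state. Everything here is an assumption made for illustration: `World`, `act`, `run`, and the action names are hypothetical and do not come from the Embedded Agency work.

```python
from dataclasses import dataclass


@dataclass
class World:
    """Hypothetical toy world whose state includes the agent itself."""
    resources: int = 0
    agent_skill: int = 1   # the agent's abilities are part of the world state
    agent_health: int = 3  # so is its capacity to keep acting


def act(world: World, action: str) -> None:
    # Every action is a change to the world, and the agent is in the world.
    if action == "work":
        world.resources += world.agent_skill
    elif action == "practice":
        world.agent_skill += 1    # the agent improves its own abilities
    elif action == "risky_experiment":
        world.resources += 5
        world.agent_health -= 2   # the plan also harms the agent itself
    else:
        raise ValueError(f"unknown action: {action}")


def run(actions: list[str]) -> World:
    world = World()
    for action in actions:
        if world.agent_health <= 0:
            break  # an agent that has destroyed itself can no longer act
        act(world, action)
    return world
```

For example, `run(["practice", "work"])` ends with more resources than `run(["work"])` because the agent first changed itself, while repeating `"risky_experiment"` stops the run early. A planner for this world has to predict its own future state, not just the state of the things around it.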

Many areas of alignment research can be understood through the lens of embedded agency. For example, an embedded agent can change its own internal structure, which could give rise to subagents or mesa-optimizers that do not share the goals of the original system.

The following figure (from Embedded Agency) illustrates how many alignment challenges relate to embedded agency: