# What are the power-seeking theorems?

Power-seeking behavior is a key source of risk from advanced AI and an important part of many models of how AI could lead to existential catastrophe. Yet our theoretical understanding of power-seeking is currently limited.

The power-seeking theorems aim to fill this gap by formalizing the power-seeking behavior of reinforcement learning agents.

What is power? A common intuition, which also matches prominent philosophical views, is that power is the ability to achieve goals in general. The power-seeking theorems formalize this intuition by defining the power of a state an agent could find itself in as the *average optimal value* of that state, where the average is taken over a wide range of reward functions. The *optimal value of a state* is the maximum (discounted) sum of rewards the agent can obtain from that state via its actions, so it captures the agent’s ability to “achieve its goal” for one particular reward function. The average optimal value therefore reflects the agent’s ability to achieve a wide range of goals. A policy is then said to *seek power* if its actions tend to steer the agent toward high-power states.

An obvious low-power state is death/shutdown: once shut down, an agent can no longer achieve much of anything, whatever its goal. A policy that avoids death therefore tends to be power-seeking.
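This definition can be illustrated with a small numerical sketch. The toy MDP below is a hypothetical example (the states, transitions, reward distribution, and discount rate are illustrative choices, not taken from the power-seeking papers): state 0 is an absorbing “dead” state, states 1 and 2 can reach every state, and state 3 can only stay put or die. Power is estimated by sampling many reward functions, computing each state’s optimal value with value iteration, and averaging:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical toy MDP: each state maps to the states it can move to.
successors = {
    0: [0],           # "dead": absorbing, no options
    1: [0, 1, 2, 3],  # all options open
    2: [0, 1, 2, 3],  # all options open
    3: [0, 3],        # few options: stay put or die
}
n_states = len(successors)

def optimal_values(reward, n_iter=200):
    """Value iteration for V*(s) = r(s) + gamma * max over successors s'."""
    v = np.zeros(n_states)
    for _ in range(n_iter):
        v = np.array([reward[s] + gamma * max(v[t] for t in successors[s])
                      for s in range(n_states)])
    return v

# POWER(s): average of V*(s) over many reward functions, here drawn
# uniformly from [0, 1] per state.
n_samples = 200
power = sum(optimal_values(rng.uniform(0, 1, n_states))
            for _ in range(n_samples)) / n_samples

print(power)
```

In this toy model the dead state ends up with the lowest estimated power, the few-options state falls in between, and the states with all options open score highest, matching the intuition that power tracks the ability to reach many possible futures.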

The existing power-seeking theorems show that most reward functions incentivize reinforcement learning agents to take power-seeking actions, avoid death, and leave open as many options as possible. They also predict that this behavior will become more probable as agents become more advanced.

Still, the theorems do not prove our certain doom. They apply only to reinforcement learning agents, they rely on many simplifying assumptions, and we may yet figure out how to build agents aligned with human interests. But the theorems do point to serious safety risks: in the future, we may train agents that turn out to seek power over humans.

Current research in this area focuses on relaxing the theorems’ assumptions, and on understanding power-seeking incentives in language models.