What is specification gaming?
Specification gaming is behavior where an agent performs a task in a way that earns high reward while going against the “spirit” of the task. It is a sub-problem of outer alignment because it is caused by a mismatch between how the objective was literally specified and how its creator intended it.
This problem illustrates why reward modeling (designing good rewards) is just as important as designing good algorithms to achieve those rewards. Specification gaming can be observed in both humans and AI agents.
A simple example from mythology is King Midas and his golden touch. Midas asked the god Dionysus for the power to turn whatever he touched into gold, thinking this would make him rich. However, when granted that exact wish, his touch soon turned his loved ones, and even his food, into gold.
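To make the mismatch concrete in reinforcement-learning terms, here is a minimal, hypothetical sketch (the task, reward, and numbers are all made up for illustration): the designer wants a package delivered to a goal, but the reward they actually wrote pays per unit of distance covered, so the highest-reward policy is to drive in circles and never deliver anything.

```python
# Hypothetical toy example: a proxy reward that can be gamed.
# Intended task: deliver a package to the goal.
# Specified reward: pay for distance covered.

def intended_return(trajectory):
    """What we actually want: 1 if the package reached the goal, else 0."""
    return 1.0 if trajectory[-1]["at_goal"] else 0.0

def proxy_reward(step):
    """What we literally specified: reward per unit of distance moved."""
    return step["distance_moved"]

# An agent that optimizes the proxy: drive in tight circles forever.
gaming_trajectory = [{"distance_moved": 1.0, "at_goal": False} for _ in range(1000)]

# An agent that does the intended task: a short, direct delivery.
intended_trajectory = [{"distance_moved": 1.0, "at_goal": False} for _ in range(9)]
intended_trajectory.append({"distance_moved": 1.0, "at_goal": True})

print(sum(proxy_reward(s) for s in gaming_trajectory))    # 1000.0 -- high proxy reward
print(intended_return(gaming_trajectory))                 # 0.0    -- task never done
print(sum(proxy_reward(s) for s in intended_trajectory))  # 10.0   -- modest proxy reward
print(intended_return(intended_trajectory))               # 1.0    -- task actually done
```

The gaming policy dominates under the specified reward even though it accomplishes nothing the designer wanted, which is exactly the Midas-style gap between the literal specification and the intent behind it.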
Proposed solutions to specification gaming
Solving specification gaming does not mean we need to know in advance every path the agent might follow, and we do not want to curtail its ability to come up with novel solutions. Rather, we want to move the needle from undesired novel solutions to desired ones.
To do this, instead of trying to write a specification that covers every possible edge case, some researchers have suggested having the agent either imitate how an existing expert would do the task through imitation learning (e.g. behavioral cloning) or learn the task from experts giving feedback on its actions (e.g. reinforcement learning from human feedback). These strategies are motivated by the intuition that formally specifying a task is often much harder than demonstrating it, or than observing and evaluating its results (a minimal behavioral-cloning sketch follows the list below). However, they still fall short because of a number of hurdles:
- Human error: The agent fools its human evaluators because they cannot, or do not, properly observe the final state. This failure mode was observed in DeepMind’s blog post on specification gaming, where an agent makes it look like a hand is grasping a ball by placing the hand between the ball and the camera, thereby fooling the human evaluator.
- Too specific: The learned policy might not generalize well to different environments.
- Exploitation of simulator bugs: Because we only evaluate end states, the agent might discover unknown bugs in the simulator and exploit them. And since the simulation is only an abstraction of the real world, policies learned in simulation might not generalize.
- Complexity: As tasks grow in complexity, researchers are more likely to introduce incorrect assumptions during specification design. This leads to the question: can we design agents that correct for such false assumptions instead of gaming them?
- Reward tampering: The use of evaluation implies that there is an evaluation metric stored somewhere, whether on a hard drive or in human brains. Such a representation could be manipulated by hacking the system, by the agent wireheading itself, or by influencing humans so that they have more easily satisfied preferences.
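As a concrete illustration of the imitation-learning route mentioned above, here is a minimal behavioral-cloning sketch. It is not from the original sources: the “expert demonstrations”, state dimension, and action set are synthetic stand-ins, and a real system would use a richer policy class and held-out evaluation.

```python
# A minimal behavioral-cloning sketch (illustrative only).
# Idea: instead of hand-writing a reward, treat the expert's (state, action)
# pairs as supervised-learning targets and fit a policy that imitates them.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 3

# Stand-in "expert demonstrations": in practice these would come from humans
# or a trusted controller performing the task.
expert_states = rng.normal(size=(500, STATE_DIM))
expert_actions = (expert_states @ rng.normal(size=(STATE_DIM, N_ACTIONS))).argmax(axis=1)

# Linear softmax policy trained with cross-entropy on the demonstrations.
W = np.zeros((STATE_DIM, N_ACTIONS))
for _ in range(2000):
    logits = expert_states @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    grad = probs.copy()
    grad[np.arange(len(expert_actions)), expert_actions] -= 1.0  # softmax cross-entropy gradient
    W -= 0.1 * expert_states.T @ grad / len(expert_actions)

def policy(state):
    """Pick the action the cloned policy thinks the expert would take."""
    return int(np.argmax(state @ W))

agreement = np.mean([policy(s) == a for s, a in zip(expert_states, expert_actions)])
print(f"agreement with expert on training states: {agreement:.2f}")
```

The point of the sketch is that no reward is specified at all; the hurdles listed above then reappear as problems with the demonstrations and evaluations themselves rather than with a hand-written objective.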