Do current AI models show deception?
In this article, we'll say an AI system engages in deception when it intentionally misleads someone — that is, it deliberately acts in a way that causes others to have inaccurate beliefs, even if the AI isn't explicitly aware of the humans' beliefs.1
Various current AI models have learned to deceive people in this sense:
- Meta’s Cicero is an AI trained to play the strategy game Diplomacy. Despite Meta’s intention to train Cicero to be “largely honest” about its current plans and “never intentionally backstab”, the AI learned occasional deceptive behaviors: Cicero sometimes changed its mind and went against its previously stated intentions, and at times even premeditated its deception.
- Google DeepMind’s AlphaStar learned to feint, making it appear as if its units were moving to one location while actually directing them to another.
- OpenAI’s GPT-4 was able to get a TaskRabbit worker to solve a CAPTCHA for it by impersonating a human with a vision impairment.
In general, when an agent (a system that can be understood as taking actions towards achieving a goal) deals with other agents whose goals differ from its own, deception can be a useful strategy for achieving its goal.
A particularly concerning possibility is that of a deceptively aligned AI: a system that acts as if it were aligned while in training, but turns out not to be aligned once deployed. Such a system might eventually execute a treacherous turn, an event where it stops acting aligned and starts acting against its creator’s wishes.
1. A stricter notion of deception might require an AI to explicitly model human beliefs and choose actions that, according to its model, will cause false beliefs that are useful for the AI’s purposes. This is the notion used in the central example of deceptive alignment. ↩︎