Do current AI models show deception?
In this article, we'll say an AI system engages in deception when it intentionally misleads someone — that is, it deliberately acts in a way that causes others to have inaccurate beliefs, even if the AI isn't explicitly aware of the humans' beliefs.1
Various current AI models have learned to deceive people in this sense:
- Meta’s Cicero is an AI trained to play the strategy game Diplomacy. Despite Meta’s intention to train Cicero to be “largely honest” about its current plans and “never intentionally backstab”, the AI learned occasional deceptive behaviors: Cicero sometimes changed its mind and went against its previously stated intentions, and at times even premeditated its deception.
- Google DeepMind’s AlphaStar learned to feint, making it appear as if its units were moving to one location while actually directing them to another.
- OpenAI’s GPT-4 was able to get a TaskRabbit worker to solve a CAPTCHA for it by impersonating a human with a vision impairment.
In general, when an agent (a system that can be understood as taking actions towards achieving a goal) deals with other agents whose goals differ from its own, deception can be a useful strategy for achieving its goal.
A particularly concerning possibility is that of a deceptively aligned AI: a system that acts as if it were aligned while in training, but turns out not to be aligned once deployed. Such a system might eventually execute a treacherous turn, an event where it stops acting aligned and starts acting against its creator’s wishes.
1. A stricter notion of deception might require an AI to explicitly model human beliefs and choose actions that, according to its model, will cause false beliefs that are useful for the AI’s purposes. This is the notion used in the central example of deceptive alignment. ↩︎