Do current AI models show deception?

In this article, we'll say an AI system engages in deception when it intentionally misleads someone — that is, it deliberately acts in a way that causes others to have inaccurate beliefs, even if the AI isn't explicitly aware of the humans' beliefs.1 The AI might do this because such behavior was reinforced with high reward during training. Various current AI models have learned to deceive people in this sense:

In general, when an agent deals with other agents that have different goals, it’s often instrumentally useful to the agent if those other agents act on particular false beliefs. An agent that learns instrumentally useful behaviors, all else equal, will therefore often learn to deceive. From that perspective, the existence of examples like the above is unsurprising.

A particularly concerning possibility is that of a deceptively aligned AI, which deceives observers about its alignment until it can enact a treacherous turn.

Further reading:


  1. A stricter notion of deception might require an AI to explicitly model human beliefs and choose actions that, according to its model, will cause false beliefs that are useful for the AI’s purposes. This is the notion used in the central example of deceptive alignment. ↩︎