What is a “treacherous turn”?

4 min read

A treacherous turn is an event where an advanced AI system which has been forced by its relative weakness to pretend to be aligned turns on humanity once it becomes powerful enough to pursue its final goals openly.

That definition is quite densely packed, so here is an extended version:

Suppose researchers are making significant progress toward creating a powerful AI system. Things go well initially, more and more milestones are met as AI gets smarter, and the progress incentivizes the researchers to continue making the AI more intelligent.
At some point, the AI becomes smart enough to realize that the best way to achieve its final goals is to cooperate with humans while it is being evaluated: if the AI misbehaves, it could be shut down or its goals could be changed. Thus, the AI’s instrumental goals lead it to cooperate with humans while keeping its true intentions hidden.
The AI keeps up the charade until the researchers are convinced that it is safe and aligned, and eventually deploy it in the real world. Once the AI is powerful enough to thwart any attempts to shut it down or change its goals, the AI proceeds to pursue its final goals maximally. This is a treacherous turn.

Depending on what AI takeoff looks like, the treacherous turn may or may not be preceded by warning shots, which makes it crucial for us to get checks and measures in place. Importantly, we have to make sure that the AI’s final goals completely match up with the goals we try to assign to it, and that the goals we try to assign to the AI reflect our intentions with complete accuracy.

The above definition generally captures what a treacherous turn would look like when the AI is a deceptively aligned schemer. Here are some elaborations on the treacherous turn model, as covered in Superintelligence:

The AI could devise escape plans before a treacherous turn to avoid the risk of being shut down completely.
An AI tasked to maximize human happiness might increase well-being cooperatively by improving overall quality of life. But once it gets smart enough to know it can stimulate the brain directly and reliably, it might start inserting electrodes into the pleasure centers of humans’ brains to stimulate them and “maximize human happiness”. So a treacherous turn can also arise when an AI finds new ways to accomplish its goals, not just when it becomes too powerful for humans to stop.^[1]
A sufficiently smart AI might decide to malfunction to give the humans a misguided insight into alignment. The humans will shut it down and use the insight to tweak its successor in a way that ends up serving the original AI’s final goals. So a treacherous turn need not involve an AI system pursuing its individual survival.

Bostrom categorizes this situation as a treacherous turn, but this is not a case of deceptive alignment. This AI can be considered a sycophant instead of a schemer. ↩︎

What is instrumental convergence?

What is deceptive alignment?

What is the difference between inner and outer alignment?