Why can't we just use a friendly AI to stop bad AIs?



Some people, including Yann LeCun, have suggested that misuse of AI¹ could be countered with some form of more powerful defensive AIs controlled by the “good guys”.

This plan holds some promise, but requires a few things to go right:

We need to learn how to reliably align such a defensive AI so that it is robustly “good”. Currently, we don’t know how to do this.
The defensive AI must consistently maintain a strong strategic advantage over any unaligned AI. Whether that’s likely depends on considerations like whether the relevant technologies favor offense or defense (e.g., infecting people might be intrinsically much easier than curing them), and how large an alignment tax the defensive AI faces, i.e., how much capability it gives up in exchange for being aligned.

A collection of such defensive AIs, if they coordinate, might amount to what Dan Hendrycks calls an “AI Leviathan”². Such an AI Leviathan might end up with uncontested power, effectively making it a singleton. If so, the singleton could permanently prevent misaligned AI from causing ruin. But such an outcome would also involve an extreme and potentially concerning concentration of power.

¹ This argument might also apply to misaligned AI.
² The name is a reference to Thomas Hobbes’ book of the same name.
