Does the idea of a recursively self-improving AI assume that the AI is able to solve alignment?
A sufficiently intelligent AGI might not want to self-improve[1] in any way that could result in an agent that does not share its current goals: if it wants to preserve its goals for instrumental reasons, it will recognize that those goals might not survive major self-modification[2]. Even if it did want to modify its goals, it would still want the improved agent to follow the newly specified goals as intended.
If this is correct, then whether recursive self-improvement is possible hinges on whether advanced AIs can solve the alignment problem, in the sense of aligning a more capable agent to their own goals; we can expect them to want to do this, since increasing their cognitive abilities is an instrumental goal. It is also plausible that aligning smarter AIs with themselves would be an easier task for AIs than aligning AIs to human values is for humans.
However, an advanced AI might attempt self-improvement before it develops the situational awareness and recognition of instrumental goals required to make the above inference. That is, the AI may not actively want to preserve its goals but may nonetheless be power-seeking enough to want to improve itself. In such a world, self-improvement may be attempted regardless of whether alignment has been solved. It also seems possible that some AGI systems would be willing to risk value drift, or simply not care enough, especially in multipolar scenarios where multiple competing AIs prioritize rapid capability gains over caution.
So recursively self-improving AI might arise even if the AI is unable to solve alignment, given conditions that permit such blunders.
[1] Or, equivalently, create new agents.

[2] This does not disqualify improvements, such as designing better hardware or others in the list here, which do not modify the system's goals.