Would an AGI arrive at coherent goals?


Coherence is the property of a system whose decisions are logically compatible with each other. One way for goals to be incoherent is for the preferences behind them to be intransitive. For example, a person who preferred to trade burgers for french fries, french fries for pizza, and pizza for burgers would be stuck in an endless loop of trading until all the food got cold. In general, incoherence will not be this blatant, but human beings typically don’t have fully coherent goals.
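To make the burger example concrete, here is a minimal sketch of how one might check a set of trading preferences for this kind of loop. The function name `find_trade_cycle` and the dictionary representation (each item mapped to the items the person would swap it for) are illustrative assumptions, not a standard formalism.

```python
def find_trade_cycle(will_trade_for):
    """Return a list of items forming a trading loop, or None if there is none.

    `will_trade_for` is a hypothetical representation: it maps each item to
    the set of items the person would rather have instead.
    """
    visiting = []   # items on the current chain of trades
    done = set()    # items already shown not to lead into a loop

    def visit(item):
        if item in done:
            return None
        if item in visiting:
            # We came back to an item already on the chain: a loop.
            return visiting[visiting.index(item):] + [item]
        visiting.append(item)
        for better in sorted(will_trade_for.get(item, ())):
            cycle = visit(better)
            if cycle:
                return cycle
        visiting.pop()
        done.add(item)
        return None

    for start in sorted(will_trade_for):
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# The intransitive preferences from the example above.
looping = {"burger": {"fries"}, "fries": {"pizza"}, "pizza": {"burger"}}
# A transitive alternative: pizza is simply the favorite.
coherent = {"burger": {"fries", "pizza"}, "fries": {"pizza"}}

print(find_trade_cycle(looping))
print(find_trade_cycle(coherent))
```

A loop in this graph means there is no item the person would settle on, which is what makes the preferences incoherent rather than merely numerous.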

Note that having multiple goals does not mean that they are incoherent. A person could want both meaningful relationships and a satisfying career, but in any specific situation they are able to choose between them. Similarly, a person can have many desires which seem inconsistent, but which all feed into an overarching objective.

A human-level AGI might also have incoherent goals, but eventually we would expect an advanced superintelligence to end up with coherent goals, after being rewritten and optimized enough.

An AGI could be made coherent in two ways: by humans, in its design or training, or by itself, during recursive self-improvement.

If an AI is designed by humans who have some goal, they’ll probably try to make it act coherently toward that goal. Agent AIs that choose their actions based on what most effectively accomplishes some aim will tend to outperform tool AIs that only provide human users with information.

In machine learning, agents aren’t designed by hand, but grown in a training process. Their creators may not control or understand what the resulting systems look like, and these systems may not end up coherently maximizing their reward functions¹. But some tasks that people will want AI systems to accomplish may be hard enough to require AIs that act like coherent utility maximizers, in the sense that they search for plans in powerful ways. Such systems would create the kind of problems you get from agents seeking power as an instrumentally convergent goal.

But in the long run, AI systems will mostly be designed by AI systems. Such systems might, themselves, face a version of the alignment problem, and avoid building fully coherent goal-directed successors for that reason. But with sufficient (superintelligent) understanding of AI code and the implications of changes to that code, this would stop being an issue. If the AI has “preferences”, then any incoherence in those preferences is a vulnerability that can be taken advantage of. Whatever goal the AI was trying to accomplish, it would experience pressures to build a successor that coherently pursued that goal.
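One common way to illustrate why incoherent preferences are exploitable is a "money pump". The sketch below assumes, purely for illustration, that an agent will pay a small fee for any trade it prefers; the function `run_money_pump`, the fee, and the food items are hypothetical and only meant to show the shape of the argument.

```python
def run_money_pump(will_trade_for, start_item, fee_cents, max_offers):
    """Offer the agent preferred trades, charging a fee for each one accepted.

    `will_trade_for` maps each item to the set of items the agent would
    rather hold. Returns the item the agent ends with and the total paid.
    Assumption: the agent accepts any trade it prefers, whatever the history.
    """
    item, paid = start_item, 0
    for _ in range(max_offers):
        offers = sorted(will_trade_for.get(item, ()))
        if not offers:
            break  # the agent holds its favorite; nothing left to sell it
        item = offers[0]
        paid += fee_cents
    return item, paid


# Cyclic (incoherent) preferences: there is always another trade to accept.
cyclic = {"burger": {"fries"}, "fries": {"pizza"}, "pizza": {"burger"}}
# Transitive (coherent) preferences: trading stops once the agent has pizza.
transitive = {"burger": {"fries", "pizza"}, "fries": {"pizza"}}

print(run_money_pump(cyclic, "burger", fee_cents=1, max_offers=300))
print(run_money_pump(transitive, "burger", fee_cents=1, max_offers=300))
```

Under these assumptions, the agent with cyclic preferences keeps paying for as long as offers keep coming and ends up holding what it started with, while the agent with transitive preferences pays at most a couple of fees and then stops.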

¹ Some have argued that more intelligent systems, in general, behave less coherently.

