If utility maximizers are so dangerous, is there some other kind of system we could build?

A powerful utility-maximizing agent would have fundamental problems, such as vulnerability to Goodhart's Law. A number of systems have been proposed as alternatives to utility maximization. These include:

  • Satisficers — agents with goals which are ‘bounded’ (have a limit beyond which no more reward is gained)
  • Human imitators — agents which pick from actions they expect humans would take
  • Quantilizers — agents which select randomly from the top few % of the most effective actions (see the sketch after this list)
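
To make the quantilizer idea concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the candidate actions, the utility estimate, and the simplification of sampling uniformly from the top slice (a full quantilizer samples from a specified base distribution restricted to its top q-quantile).

    import random

    def quantilize(actions, estimated_utility, q=0.05, rng=random):
        """Pick an action at random from the top q fraction of candidates,
        ranked by estimated utility, instead of always taking the single best."""
        ranked = sorted(actions, key=estimated_utility, reverse=True)
        top_k = max(1, int(len(ranked) * q))   # size of the top-q slice
        return rng.choice(ranked[:top_k])      # random pick, not the argmax

    def toy_utility(action):
        """Stand-in for a learned utility estimate (illustration only)."""
        return (action * 37) % 1000            # arbitrary deterministic toy scores

    candidate_actions = list(range(1000))
    print(quantilize(candidate_actions, toy_utility, q=0.05))

The intended effect is that randomness over a broad set of good-enough actions dilutes the chance of landing on an extreme plan that happens to score highest under an imperfect utility estimate.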

Unfortunately, these systems have problems of their own. Satisficers and quantilizers may still take catastrophic actions, either at random or because of their uncertainty about the world, while human imitators cannot perform at a superhuman level.

To illustrate this, imagine we train two different models to accumulate stamps:

  • An unbounded model, which gets more reward for each additional stamp with no upper limit, will (due to instrumental convergence) be incentivised to destroy human civilisation in order to access resources that can be used to make stamps.

  • A satisficer, which gets more reward for each stamp it collects up to 100 stamps but no additional reward for any stamp beyond that, will select randomly from the actions that gain at least 100 stamps. This is an improvement, but some of those actions will still be catastrophic for humans, and we don’t currently know how likely the satisficer is to pick one of them. Worse, because the agent is uncertain about the world, maximising its expected (bounded) reward means maximising the probability that it ends up with at least 100 stamps. It is therefore pushed towards actions that make it more and more certain of reaching 100 stamps, and in practice this can favour the same catastrophic behaviour: if all it cares about is being as sure as possible that it has at least 100 stamps, it may gain as many stamps as possible, just like the utility maximiser (see the toy calculation below).
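
The pressure towards over-collecting can be seen in a toy calculation. This is a sketch under made-up assumptions: each plan aims for some target number of stamps, its actual yield is uniformly noisy around that target, and reward is capped at 100 stamps.

    import random

    def bounded_reward(stamps, cap=100):
        """Satisficer-style reward: extra stamps help only up to the 100-stamp cap."""
        return min(stamps, cap)

    def expected_reward(target, noise=30, trials=10_000):
        """Monte-Carlo estimate of the expected capped reward for a plan that
        aims for `target` stamps but actually yields target +/- `noise` stamps."""
        rng = random.Random(0)
        total = 0
        for _ in range(trials):
            outcome = max(0, target + rng.randint(-noise, noise))
            total += bounded_reward(outcome)
        return total / trials

    for target in (100, 110, 1000):
        print("aim for", target, "stamps -> expected reward",
              round(expected_reward(target), 1))
    # Aiming well above the cap yields a higher expected reward than aiming
    # for exactly 100, so maximising expected bounded reward still over-collects.

In other words, bounding the reward does not bound the behaviour: the agent can always buy a little more certainty of hitting the bound by grabbing more resources.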

At least one AI safety organization (GaNe) lists producing and testing non-optimising agents (such as satisficers) as a major focus of its research. In the announcement of its AI safety agenda, it describes one possible approach to building safer non-optimising agents:

“the client would for example specify a feasibility interval for the expected value of the return (= long-term discounted sum of rewards according to some reward function that we explicitly do not assume to be a proper measure of utility), and the learning algorithm would seek a policy that makes the expected return fall anywhere into this interval.”
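
As an illustration only (this is not the organization's actual learning algorithm, just a toy rendering of the quoted idea), the selection rule could look something like this: instead of searching for the policy with the highest expected return, accept any candidate policy whose estimated expected return falls inside the client's feasibility interval.

    import random

    def pick_policy_in_interval(policies, estimated_return, interval, rng=random):
        """Accept any policy whose estimated expected return lies inside the
        client's feasibility interval, then choose among the acceptable
        policies at random, rather than maximising the return."""
        low, high = interval
        feasible = [p for p in policies if low <= estimated_return(p) <= high]
        if not feasible:
            raise ValueError("no candidate policy meets the aspiration interval")
        return rng.choice(feasible)

    def toy_return(policy_id):
        """Stand-in for a learned estimate of a policy's expected return."""
        return (policy_id * 7) % 20   # arbitrary deterministic toy values

    candidate_policies = list(range(50))
    print(pick_policy_in_interval(candidate_policies, toy_return, interval=(8, 12)))

Because any return inside the interval counts as success, there is no pressure to push the world into extreme states just to squeeze out a marginally higher score.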