Is there a danger in anthropomorphizing AIs?

Using some human-related metaphors (e.g. what an AGI ‘wants’ or ‘believes’) is almost unavoidable[1], since our language is built around our experience with other humans, but we should be aware that these metaphors may lead us astray.

Many paths to AGI would result in a mind very different from a human or animal mind, and it would be hard to predict in detail how such a system would act. We should not trust intuitions trained on humans to predict what an AGI or superintelligence would do. High-fidelity whole brain emulations are one exception: we would expect such a system to be fairly human-like, at least initially, though it may diverge depending on its environment and the modifications applied to it.

There has been some discussion of whether language models trained on large amounts of human-written text are likely to pick up human concepts and think in a somewhat human-like way, and whether this could be used to improve alignment.


  1. Gurnee et al. note: “We do not, of course, think that these small models are capable of having wants or intentionality in the sense of a human! In fact, it would be more accurate to say that stochastic gradient descent selects for models with low expected loss, and that model weights that minimise interference seem likely to have lower expected loss. But we believe that notions like "want" can be valuable intuition pumps to reason about what algorithms may be expressed in a model’s weights, so long as the reader is careful to keep in mind the limits and potential illusions of anthropomorphism”. ↩︎