What are "true names" in the context of AI alignment?

A “true name” is a precise mathematical formulation of an intuitive concept, one that captures all the properties of that concept that we care about. The term was introduced by alignment researcher John Wentworth, possibly inspired by the idea from folklore that knowing a thing’s “true name” grants you power over it.

Wentworth gives many examples of true names. Concepts like "force", "pressure", "charge" and "current" were all once poorly understood, based on vague intuitions about the physical world, but have now been robustly formalized mathematically.
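
As one concrete illustration (ours, not from the article): the modern “true name” of force is Newton’s second law in momentum form, which turned a vague push-or-pull intuition into a definition precise enough to calculate with:

```latex
% "Force" formalized: Newton's second law in momentum form.
\[
  \vec{F} \;=\; \frac{d\vec{p}}{dt},
  \qquad \vec{p} = m\vec{v}
\]
```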

To put it another way, a “true name” can be thought of as a mathematical formulation that robustly generalizes as intended. An important property of true names is that they do not fail via Goodhart’s law, even when subjected to the immense optimization pressure of a future superintelligence. Since alignment researchers are interested in finding mathematical measures that are “non-Goodhartable”, they also care about finding true names. However, non-Goodhartability is just one property of true names: robustness to optimization may be a necessary condition for something to be a true name, but it is not the definition.
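
To make the failure mode concrete, here is a toy sketch (ours, not Wentworth’s; every function and number in it is hypothetical) in which a proxy metric tracks the true objective under mild optimization but comes apart when the proxy is maximized hard:

```python
import numpy as np

def true_value(x):
    # What we actually care about: peaks at x = 3, falls off elsewhere.
    return -(x - 3.0) ** 2

def proxy(x):
    # A measurable stand-in that correlates with true_value for x < 3.
    return x

xs = np.linspace(0.0, 10.0, 1001)

# Weak optimization: search only a modest region, keep the best by proxy.
weak = xs[np.argmax(proxy(xs[:400]))]   # restricted to x in [0, 4)
# Strong optimization: search the whole space, maximize the proxy hard.
strong = xs[np.argmax(proxy(xs))]

print(f"weak optimizer:   x = {weak:.2f}, true value = {true_value(weak):.2f}")
print(f"strong optimizer: x = {strong:.2f}, true value = {true_value(strong):.2f}")
```

The weak optimizer lands near the true optimum; the strong one drives the proxy to its maximum and the true value far below it. A “true name” would be a formulation for which this divergence never opens up.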

Many alignment researchers care about human values. It would be a huge boon for AI alignment efforts if we could discover a robust formulation, a “true name”, of human values. Currently, we measure how well a given model performs using proxies for what humans truly care about. Optimizing these proxies often produces unintended side effects through reward misspecification or specification gaming. However, if we had a “true name” for human values to optimize for, then we would not need to worry about undesired side effects or unforeseen consequences.
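
As a minimal sketch of specification gaming (our illustration, with hypothetical reward and policies): suppose the intended task is “reach the goal”, but the reward actually written down pays +1 for any step that reduces distance to the goal and never penalizes moving away. An agent that oscillates in place farms far more reward than one that completes the task:

```python
GOAL = 10      # goal cell on a line; the agent starts at position 0
HORIZON = 100  # episode length in steps

def misspecified_reward(prev_dist, new_dist):
    # Intended meaning: "reward progress toward the goal".
    # Actual meaning: +1 for shrinking the distance, no cost for regress.
    return 1.0 if new_dist < prev_dist else 0.0

def run(policy):
    pos, total = 0, 0.0
    for t in range(HORIZON):
        prev_dist = abs(GOAL - pos)
        pos = policy(pos, t)
        total += misspecified_reward(prev_dist, abs(GOAL - pos))
    return total, pos

# Intended behavior: walk straight to the goal, then stop.
walk = lambda pos, t: min(pos + 1, GOAL)
# Gamed behavior: oscillate forever, farming the progress reward.
oscillate = lambda pos, t: pos + (1 if t % 2 == 0 else -1)

print("walk:      reward, final pos =", run(walk))       # (10.0, 10)
print("oscillate: reward, final pos =", run(oscillate))  # (50.0, 0)
```

The oscillating policy collects five times the reward while never finishing the task, which is exactly the gap between the proxy we wrote down and the value we meant.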

In addition to human values, alignment researchers also seek true names for components of agency such as optimization, goals, world models, abstraction, counterfactuals, and embeddedness.
