What are "true names" in the context of AI alignment?
True names are precise mathematical formulations of intuitive concepts that capture all the properties that we care about for those concepts. “True names” is a term introduced by alignment researcher John Wentworth, possibly inspired by the idea from folklore that knowing a thing's “true name” grants you power over it.
Wentworth gives many examples of true names. Concepts like "force", "pressure", "charge" and "current" were all once poorly understood, based on vague intuitions about the physical world, but have now been robustly formalized mathematically.
To put it another way, a “true name” can be thought of as a mathematical formulation that robustly generalizes as intended: it keeps capturing the concept even in environments substantially different from those in which it was developed. An important property of true names is that they are not susceptible to failing via Goodhart's law when faced with the immense optimization pressure of a future superintelligence (an AI with cognitive abilities far greater than those of humans in a wide range of important domains).
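The Goodhart failure mode can be sketched in a few lines of code. This is a toy illustration, not anything from Wentworth's writing: `true_value` stands in for what we actually care about, and `proxy` is a hypothetical metric that agrees with it over ordinary actions but diverges at the extremes that heavy optimization reaches.

```python
import random

random.seed(0)

def true_value(x):
    # What we actually care about: peaks at x = 5 with value 2.5,
    # then declines as x grows.
    return x - 0.1 * x ** 2

def proxy(x):
    # A proxy that matches the true value for small x (more really
    # is better at first) but keeps rewarding larger x forever.
    return x

# An optimizer under immense pressure: search a huge number of
# candidate actions and pick whichever one the proxy scores highest.
candidates = [random.uniform(0, 20) for _ in range(100_000)]
best = max(candidates, key=proxy)

print(f"proxy score of chosen action: {proxy(best):.2f}")   # looks excellent
print(f"true value of chosen action:  {true_value(best):.2f}")  # disastrous
print(f"best achievable true value:   {true_value(5):.2f}")
```

Under weak optimization the proxy would have served fine, but the stronger the search, the more reliably it lands in the region where proxy and true value have come apart. A “true name” for the objective would, by definition, not come apart in this way.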
Many alignment researchers care about human values. Finding a true name for them would be a huge boon for AI alignment: it would address outer alignment (the problem of making sure that the precise formulation of what we train the AI to do matches what we intend it to do) and guard against specification gaming (behavior where an AI performs a task in a way that scores highly according to the objective that was specified, while going against the task’s intended “spirit”).
In addition to human values, alignment researchers also seek true names for components of agency such as optimization, goals, world models, abstraction, counterfactuals, and embeddedness.