What is shard theory?
Shard theory is a research program built on the idea that in order to solve AI alignment, we should understand how values form in human beings and then use similar mechanisms to instill values in AI systems. Accordingly, shard theory includes two related ideas:
- The “shard theory of human values”, a theory of how human values emerge.
- The “shard paradigm of AI alignment”, which investigates the properties of value formation in deep learning systems, with the goal of learning how to shape specific values through reinforcement.
Shard theory of human values
The main idea is that human values and desires are not programmed into the human genome, but are instead acquired, starting in infancy, through a process of reinforcement learning (in machine learning terms, a method in which an agent receives rewards based on its actions and is adjusted to be more likely to take the actions that lead to high reward).
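For readers unfamiliar with reinforcement learning, the sketch below is a minimal toy example (the two-action environment, learning rate, and exploration rate are invented for illustration and are not part of shard theory): reward gradually makes the agent more likely to repeat whichever actions produced it.

```python
import random

# Toy sketch only (invented for illustration, not part of shard theory):
# a minimal reinforcement-learning loop in which actions that happen to
# yield reward get higher estimated value and are chosen more often.

actions = ["A", "B"]
value = {a: 0.0 for a in actions}  # learned estimate of each action's payoff
learning_rate = 0.1
epsilon = 0.1                      # fraction of the time we explore at random

def reward(action):
    # Hypothetical environment: action "A" pays off more often than "B".
    return 1.0 if action == "A" and random.random() < 0.8 else 0.0

for _ in range(1000):
    if random.random() < epsilon:
        a = random.choice(actions)        # explore a random action
    else:
        a = max(actions, key=value.get)   # exploit the best-looking action
    value[a] += learning_rate * (reward(a) - value[a])  # nudge toward outcome

print(value)  # "A" ends up with the higher estimate and gets picked more often
```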
According to shard theory, this reinforcement gradually builds up “shards”: context-dependent influences on decision-making, each made up of “subshards” that encode specific behaviors appropriate to particular situations. For example, reaching for something shiny, moving an impediment out of the way, or turning around to look when you expect that something is behind you could all be subshards of a larger shard corresponding to a complete value, such as wanting to drink water. Each shard acts somewhat like a subagent with its own goals, so a person's values and actions emerge from the negotiation among whichever shards are activated in a given situation.
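To make the “negotiation between shards” picture concrete, here is a deliberately cartoonish sketch (the Shard class, the contexts, and the numeric weights are all invented for illustration; shard theory does not specify any particular data structure): each shard activates only in the situations that trigger it, and the behavior that wins is whatever the currently active shards jointly favor most.

```python
from dataclasses import dataclass

# Toy illustration only: shards as context-activated influences that
# jointly determine behavior.

@dataclass
class Shard:
    name: str
    triggers: set       # situations in which this shard activates
    preferences: dict   # action -> how strongly the shard favors it

shards = [
    Shard("drink-water", {"thirsty"},
          {"reach_for_glass": 2.0, "keep_working": -0.5}),
    Shard("finish-task", {"deadline"},
          {"keep_working": 1.5, "reach_for_glass": -0.2}),
]

def choose_action(situation: set) -> str:
    # Only shards whose triggers match the current situation are active;
    # their preferences are summed, and the highest-scoring action "wins".
    scores = {}
    for shard in shards:
        if shard.triggers & situation:
            for action, weight in shard.preferences.items():
                scores[action] = scores.get(action, 0.0) + weight
    return max(scores, key=scores.get)

print(choose_action({"thirsty"}))              # only the drink-water shard is active
print(choose_action({"thirsty", "deadline"}))  # both shards negotiate over the outcome
```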
The hope is that it will eventually be possible to identify which values emerge from any given reinforcement schedule, so that we can instill the values we want by providing the proper sequence of experiences.
Shard theory as an alignment strategy
Shard theory’s approach to alignment is based on the assumption that we can understand how human beings acquire values and then use similar mechanisms to align an AI. Specifically, it proposes applying insights about how human values develop to the design of reinforcement training schedules for AIs, so that training implants in them the values we want. Doing this requires progress in interpretability (a research area that aims to make machine learning systems easier for humans to understand), so that we can check which values a model has actually acquired.
The current focus is on using interpretability tools to learn how to train a powerful language model (an AI model that takes in some text and predicts how the text is most likely to continue) so that it acquires the values we intend.