What is shard theory?
Shard theory is a research program built on the idea that in order to solve AI alignment, we should study the only known case of (at least somewhat) “aligned” entities — human beings.
Therefore, shard theory includes two related ideas:
- The “shard theory of human values”, a theory of how human values emerge.
- The “shard paradigm of AI alignment”, which investigates the properties of value formation in deep learning systems, with the goal of learning how to shape specific values through reinforcement.
Shard theory of human values
The main idea is that human values and desires are not programmed into the human genome, but are instead acquired, starting in infancy, through a process of reinforcement learning. Values are implemented as circuits in the brain. These circuits are not activated equally in all situations; rather, a circuit is activated more strongly in situations similar to those in which it was reinforced in the past. A “shard of value” refers to these contextually activated computations, which are downstream of similar historical reinforcement events. For example, a baby might have a shard which causes her to drink juice when juice is in front of her, because drinking juice was reinforced in the past by sugar boosts. Or a person may have a shard which leads him to spend time caring for his siblings, and which is activated more strongly when he is around them, since that is the environment in which it was reinforced.
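As a rough illustration (not a claim about actual neural circuitry), the mechanism of contextual activation can be sketched in a few lines of Python. The Shard class, the context vectors, and the cosine-similarity activation rule below are all simplifying assumptions invented for this example.

```python
import numpy as np

class Shard:
    """Toy model of a shard: a value circuit whose activation depends on
    how similar the current context is to past reinforcement contexts."""

    def __init__(self, name):
        self.name = name
        self.reinforced_contexts = []  # contexts in which this shard was rewarded
        self.strength = 0.0            # cumulative reinforcement

    def activation(self, context):
        # Contextual activation: stronger in situations resembling
        # those in which the shard was reinforced in the past.
        if not self.reinforced_contexts:
            return 0.0
        similarities = [float(np.dot(context, c) /
                              (np.linalg.norm(context) * np.linalg.norm(c)))
                        for c in self.reinforced_contexts]
        return self.strength * max(similarities)

    def reinforce(self, context, reward):
        # A reinforcement event makes the shard stronger and more
        # strongly keyed to contexts like this one.
        self.reinforced_contexts.append(context)
        self.strength += reward

# The "juice shard" from the example above: reinforced whenever the baby
# drinks juice, so it activates most strongly when juice is in view.
juice_shard = Shard("drink-juice")
juice_in_view = np.array([1.0, 0.0, 0.2])  # hypothetical context vector
juice_shard.reinforce(juice_in_view, reward=1.0)

print(juice_shard.activation(juice_in_view))              # high: matching context
print(juice_shard.activation(np.array([0.0, 1.0, 0.1])))  # low: dissimilar context
```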
These shards are made up of “subshards” which encode specific behaviors appropriate to particular situations. For example, reaching for something shiny, moving an impediment out of the way, and turning around to look when you expect something to be behind you could all be subshards of a larger shard corresponding to a complete value, such as wanting to drink water. Each shard acts as a subagent with its own goals, so a person's values and actions emerge from the negotiation between the different shards as they are activated.
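This negotiation can likewise be sketched as a toy bidding process, where each shard supports actions in proportion to its current activation. The shard names, activation levels, and action preferences here are invented for illustration.

```python
# Toy negotiation between shards: each active shard "bids" for the
# actions it favors, and the action with the highest total bid wins.
shards = {
    # shard name: (current contextual activation, action preferences)
    "drink-water":      (0.9, {"reach_for_cup": 1.0, "move_obstacle": 0.5}),
    "care-for-sibling": (0.3, {"check_on_sibling": 1.0}),
    "avoid-spills":     (0.4, {"move_obstacle": 0.3, "reach_for_cup": -0.5}),
}

def choose_action(shards):
    bids = {}
    for activation, preferences in shards.values():
        for action, weight in preferences.items():
            bids[action] = bids.get(action, 0.0) + activation * weight
    return max(bids, key=bids.get)

print(choose_action(shards))  # "reach_for_cup": the water shard dominates here
```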
The hope is that it will become possible to predict which values emerge from any given reinforcement learning schedule, so that desired values can be instilled by providing the right sequence of experiences.
Shard theory as an alignment strategy
Shard theory’s approach to alignment rests on the assumption that we can understand how human beings acquire values and then use similar mechanisms to align an AI. Specifically, it proposes applying insights about how human values develop to the design of reinforcement training schedules for AIs that instill the values we want. Doing so requires interpretability tools that can identify which shards of value are present in a model and when they are being reinforced. With that understanding, we could selectively reinforce the shards that are in line with our values and fine-tune them with an appropriate reinforcement schedule.
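Schematically, the proposed loop might look something like the sketch below. Everything in it is hypothetical: no shard_probe interpretability tool currently exists, and the reward rule is a placeholder for the kind of shard-targeted reinforcement that shard theory hopes to make possible.

```python
# Schematic sketch of the shard-theoretic alignment loop described above.
# `shard_probe`, `desired_shards`, and the reward rule are all hypothetical
# placeholders for tools the research program aims to develop.

def alignment_training_step(model, environment, shard_probe, desired_shards):
    observation = environment.observe()
    action = model.act(observation)

    # Hypothetical interpretability step: identify which shards of value
    # were active in producing this action.
    active_shards = shard_probe(model, observation, action)

    # Reinforce only when the behavior was driven by desired shards,
    # strengthening those circuits relative to the others.
    reward = sum(1.0 for shard in active_shards if shard in desired_shards)
    model.update(observation, action, reward)
```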
The current focus is on using interpretability tools to learn how to train a powerful language model to have certain important values, such as corrigibility, and then to use that model to help solve the rest of the alignment problem. However, the basic methods could help align systems other than language models as well.