How might Shard Theory help with alignment?

Shard theory could help with alignment in at least two ways:

  1. By discovering ways to predict which goal will form under a given reinforcement schedule.

  2. By using the emergence of human values as a model for instilling those values in an artificial system.

For example, since we know some humans value diamonds, there must be some sequence of events that led them to learn to value diamonds. If so, we could give an AI system a similar sequence of experiences to train it to value diamonds as well.

Beyond serving as a general paradigm for alignment research, shard theory currently has a more specific focus: discovering and using interpretability tools to learn how to train a powerful language model to have certain important values, such as corrigibility, and then using that model to help solve the rest of the alignment problem. Thus punting many of the specific