What is Aligned AI's research agenda?

One of the key problems in AI safety is that there are many ways for an AI to generalize off-distribution, so it is very likely that an arbitrary generalization will be unaligned. See the model splintering post for more detail. Aligned AI's plan to solve this problem is as follows:

  1. Maintain a set of all possible extrapolations of reward data that are consistent with the training process.

  2. Pick a safe reward extrapolation from this set.

They are currently working on algorithms to accomplish step 1: see Value Extrapolation.
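To make step 1 concrete, here is a minimal toy sketch (not Aligned AI's actual algorithm; the reward data and candidate models are made up) of keeping every extrapolation that fits the observed reward data, even though the candidates diverge badly off-distribution:

```python
# Toy sketch of step 1: keep a *set* of reward extrapolations, each of which
# matches the observed reward data, instead of committing to a single one.
# (Hypothetical illustration only, not Aligned AI's actual algorithm.)
import numpy as np

# Observed reward data: states in [0, 1] where the reward simply equals the state.
train_states = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
train_rewards = train_states.copy()

def make_extrapolation(off_dist_slope):
    """Reward model that agrees with the data on [0, 1] but extrapolates
    beyond 1 with a chosen slope."""
    def reward(x):
        x = np.asarray(x, dtype=float)
        return np.where(x <= 1.0, x, 1.0 + off_dist_slope * (x - 1.0))
    return reward

candidates = [make_extrapolation(slope) for slope in (-1.0, 0.0, 1.0, 5.0)]

# Step 1: retain every candidate consistent with the training data ...
tolerance = 1e-6
consistent = [r for r in candidates
              if np.max(np.abs(r(train_states) - train_rewards)) < tolerance]

# ... even though they disagree wildly off-distribution (e.g. at state 3.0).
print([float(r(3.0)) for r in consistent])   # -> [-1.0, 1.0, 3.0, 11.0]
```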

Their initial operationalization of this problem is the lion and husky problem. Suppose you train an image classifier on a dataset of lions and huskies in which every lion appears in the desert and every husky appears in the snow. The learning problem is then under-determined: should the classifier classify images based on the background environment (e.g. snow vs. sand), or based on the animal in the image?

On this problem, a good extrapolation algorithm would generate classifiers covering all the different ways of extrapolating, so that the 'correct' extrapolation is guaranteed to be somewhere in the generated set. They have also introduced a new dataset built on a similar idea: Happy Faces.
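Here is a minimal sketch of that under-determination, assuming made-up numeric features for "animal" and "background" (an illustration only, not the actual lion/husky or Happy Faces pipeline): two classifiers that fit the training data equally well can disagree on an off-distribution image such as a lion standing in the snow.

```python
# Toy sketch of the under-determination behind the lion-and-husky problem:
# two classifiers that are indistinguishable on the training data but
# generalize differently. (Illustrative only; the features are made up.)
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [animal_feature, background_feature]; 1 = lion-like / sandy,
# 0 = husky-like / snowy. In training, animal and background are perfectly
# correlated, so the label can be predicted from either feature alone.
X_train = np.array([[1, 1], [1, 1], [0, 0], [0, 0]], dtype=float)
y_train = np.array([1, 1, 0, 0])   # 1 = "lion", 0 = "husky"

# One candidate extrapolation uses only the animal feature, the other only
# the background feature; both fit the training data perfectly.
clf_animal = LogisticRegression().fit(X_train[:, [0]], y_train)
clf_background = LogisticRegression().fit(X_train[:, [1]], y_train)

# Off-distribution image: a lion standing in the snow.
lion_in_snow = np.array([[1, 0]], dtype=float)
print("animal-based classifier:    ", clf_animal.predict(lion_in_snow[:, [0]]))      # -> [1]
print("background-based classifier:", clf_background.predict(lion_in_snow[:, [1]]))  # -> [0]
```

A good extrapolation algorithm would surface both of these classifiers rather than silently committing to one of them.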

Step 2 could be done in several ways. Possibilities include conservatism, generalized deference to humans, or an automated process for removing dangerous goals such as wireheading, deception, or killing everyone.
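As a rough illustration of the conservatism option (a hypothetical worked example, not Aligned AI's method), one could score each candidate plan by the worst-case reward it receives across the surviving extrapolations and pick the plan that maximizes that worst case:

```python
# Toy sketch of one possible step-2 rule, conservatism: evaluate each plan
# by its *worst-case* reward over the surviving reward extrapolations.
# (Illustrative only; the extrapolations and plans are made-up stand-ins.)
import numpy as np

# reward_estimates[i][j] = reward that extrapolation i assigns to plan j.
reward_estimates = np.array([
    [1.0, 0.9, -10.0],   # extrapolation 1: plan C looks catastrophic
    [1.0, 0.8,  12.0],   # extrapolation 2: plan C looks great
    [0.9, 0.9,   0.0],   # extrapolation 3: plan C looks neutral
])

worst_case = reward_estimates.min(axis=0)             # worst reward per plan
best_plan = int(np.argmax(worst_case))
print("worst-case rewards:", worst_case)              # -> [ 0.9  0.8 -10. ]
print("conservative choice: plan", "ABC"[best_plan])  # -> plan A
```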

Aligned AI is mainly working on concept extrapolation from familiar to novel contexts, something its co-founder Stuart Armstrong believes is “necessary and almost sufficient” for AI alignment.