What is Anthropic's alignment research agenda?

Anthropic is a major AI lab that "develop[s] large-scale AI systems so that we can study their safety properties [and] use these insights to create safer, steerable, and more reliable models". Its current research focus includes scalable oversight, mechanistic interpretability, process-oriented learning, and understanding and evaluating how AI systems learn and generalize.

Anthropic has worked on a number of approaches to alignment, including reinforcement learning from human feedback (RLHF), Constitutional AI, and mechanistic interpretability.

Anthropic has published its overall perspective on AI risk and safety research in a post titled "Core Views on AI Safety".