What is Anthropic's alignment research agenda?

Anthropic is a major AI lab that "develop[s] large-scale AI systems so that we can study their safety properties [and] use these insights to create safer, steerable, and more reliable models". Its current research focus includes scalable oversight, mechanistic interpretability, process-oriented learning, and understanding and evaluating how AI systems learn and generalize.

Anthropic has worked on a number of approaches to alignment, including reinforcement learning from human feedback (RLHF), Constitutional AI, and mechanistic interpretability.

Anthropic has published its overall perspective on AI risk and safety research in a post titled "Core Views on AI Safety".