What is Anthropic's alignment research agenda?
Anthropic is a major AI lab that "develop[s] large-scale AI systems so that we can study their safety properties [and] use these insights to create safer, steerable, and more reliable models". They are currently focused on scaling supervision, mechanistic interpretability, process-oriented learning, and understanding and evaluating how AI systems learn and generalize.
Anthropic has worked on a number of approaches to alignment, including:
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) applies "preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants." (A toy sketch of the preference-modeling objective follows this list.)
- A Mathematical Framework for Transformer Circuits (2021) – The idea of circuits was first applied to CNNs in the case of vision; this paper begins applying it to the transformer architecture behind recent large language models. A follow-up paper, In-context Learning and Induction Heads (2022), has some more significant results: "induction heads", attention heads that support in-context learning by finding earlier occurrences of the current token and copying what came next. (A toy illustration of this pattern follows the list.)
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023) - Uses a sparse autoencoder to decompose the activations of a small one-layer transformer into features that are more interpretable than individual neurons. (A minimal sketch follows the list.)
- Language Models (Mostly) Know What They Know (Kadavath et al., 2022) - Tasks language models with predicting which questions they will answer correctly and whether their own claims are valid. Preliminary results are encouraging: after proposing an answer, models are generally well calibrated on the probability that the answer is correct. Calibration is worse on the question "Do you know the answer to x?", but improves when the model is given extra source material to work with. (A sketch of the calibration check follows the list.)
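
To make the preference-modeling step of RLHF concrete, here is a minimal sketch of the standard Bradley-Terry-style pairwise objective used to train reward models from human comparisons. The `RewardModel` class, the embedding dimension, and the random inputs are illustrative assumptions, not Anthropic's implementation.

```python
# Minimal preference-modeling sketch (hypothetical shapes and model,
# not Anthropic's actual implementation).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response representation to a single scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the reward of the human-preferred
    # response above the reward of the rejected one.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random embeddings stand in for the LM's hidden states.
model = RewardModel()
chosen = torch.randn(4, 768)    # embeddings of preferred responses
rejected = torch.randn(4, 768)  # embeddings of rejected responses
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
```

In the full RLHF pipeline, a reward model trained this way typically supplies the reward signal that reinforcement learning then optimizes the assistant against.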
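The induction-head pattern from the circuits work can be illustrated without any neural network: an idealized induction head looks back for an earlier occurrence of the current token and predicts a copy of whatever followed it. The function below is a hypothetical toy, not code from the paper.

```python
# Toy illustration of the "induction head" pattern: find earlier occurrences
# of the current token and predict the token that followed them.
def induction_predict(tokens):
    """For each position, guess the next token by prefix matching."""
    predictions = []
    for t, current in enumerate(tokens):
        guess = None
        for j in range(t):
            # A previous occurrence of `current`: copy what came after it.
            if tokens[j] == current and j + 1 < len(tokens):
                guess = tokens[j + 1]  # later matches overwrite earlier ones
        predictions.append(guess)
    return predictions

print(induction_predict(list("abcab")))
# [None, None, None, 'b', 'c'] -- at the second 'a', the head "remembers"
# that 'a' was followed by 'b' earlier in the context.
```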
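A minimal sketch of the dictionary-learning idea in Towards Monosemanticity, assuming the standard recipe of a sparse autoencoder with a ReLU encoder, an overcomplete dictionary, and an L1 sparsity penalty; the dimensions and class below are illustrative, not the paper's exact setup.

```python
# Sparse-autoencoder sketch: encode activations into an overcomplete,
# mostly-zero feature basis, then reconstruct them (illustrative sizes).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim: int = 128, dict_size: int = 1024):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse coefficients
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    mse = (reconstruction - activations).pow(2).mean()
    return mse + l1_coeff * features.abs().mean()

sae = SparseAutoencoder()
acts = torch.randn(32, 128)  # stand-in for MLP activations from a small transformer
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```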
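The calibration claim in the self-knowledge paper can be checked with a simple binning procedure: group the model's stated probabilities of being correct into bins and compare each bin's mean confidence against its actual accuracy; a well-calibrated model's bins lie near the diagonal. The data below is made up for illustration.

```python
# Calibration check: bin stated P(correct) and compare confidence to accuracy.
import numpy as np

def calibration_table(confidences, correct, n_bins: int = 10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            rows.append((lo, hi, confidences[mask].mean(), correct[mask].mean()))
    return rows  # (bin lo, bin hi, mean stated confidence, actual accuracy)

# Hypothetical self-evaluations: model's stated P(correct) and whether it was right.
confs = [0.9, 0.8, 0.95, 0.3, 0.6, 0.55, 0.2, 0.85]
right = [1, 1, 1, 0, 1, 0, 0, 1]
for lo, hi, conf, acc in calibration_table(confs, right):
    print(f"[{lo:.1f}, {hi:.1f}): confidence={conf:.2f}, accuracy={acc:.2f}")
```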
Anthropic has published their core views on AI safety.