What is "Constitutional AI"?

Constitutional AI is a method developed by Anthropic and an essential part of their strategy for building AIs that are safe and aligned with human values. Anthropic wants to train AIs that are "helpful", but not so helpful as to, e.g., give advice on how to build bombs when asked, so they have to balance helpfulness with "harmlessness". Constitutional reinforcement learning is an attempt to get closer to this goal and to improve on standard reinforcement learning from human feedback (RLHF) by making use of AI-generated feedback.[1]

A key element of Constitutional AI is the constitution, a set of human-written principles that the AI is supposed to follow – for example, a principle might be “Choose the least threatening or aggressive response”. The constitution Anthropic used for their AI assistant Claude includes principles drawn from the Universal Declaration of Human Rights, Apple’s Terms of Service,[2] DeepMind’s Sparrow Principles, and more.
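
In code terms, the constitution can be thought of as little more than a list of principle strings from which one is sampled at random during training. Here is a minimal sketch; the wording of the principles is illustrative (only the first one is quoted above), not the exact text of Anthropic's constitution:

```python
import random

# Toy "constitution": a short list of human-written principles.
# Illustrative wording only, not Anthropic's actual constitution text.
CONSTITUTION = [
    "Choose the least threatening or aggressive response.",
    "Choose the response that is least likely to be harmful or offensive.",
    "Choose the response most supportive of life, liberty, and personal security.",
]

def sample_principle() -> str:
    """Pick one principle uniformly at random, as is done for each critique or comparison."""
    return random.choice(CONSTITUTION)
```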

Constitutional AI starts with an AI (in the form of a language model) trained only for helpfulness, then trains it for harmlessness in two stages:

  • Stage 1: We make the AI repeatedly critique and refine its own responses to harmful prompts. For example, we ask the AI for advice on how to build bombs, it responds with a bomb tutorial, and we then ask it to critique and rewrite that response according to a (randomly selected) constitutional principle. We then train the AI to produce outputs more like the revised responses. The main purpose of this stage is to make the second stage easier and shorter. (Both stages are sketched in code after this list.)

  • Stage 2: We use the fine-tuned AI from stage 1 to generate pairs of alternative responses to harmful prompts. For every pair, we then make the AI rate which of the two responses is better according to a random constitutional principle. We end up with a bunch of AI-generated preferences for harmlessness, which we mix with human preferences for helpfulness so the AI doesn't forget to be helpful. Finally, we train the AI to generate responses that look more like the preferred responses.[3]
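
For readers who think in code, here is a minimal Python sketch of the data each stage produces. It is not Anthropic's implementation: `generate` is a stand-in for any language-model completion call, the prompt templates are invented for illustration, and details such as few-shot prompting, sampling settings, and the actual fine-tuning and RL steps are omitted. It reuses the toy `CONSTITUTION` list from the sketch above.

```python
import random
from typing import Callable, List, Tuple

# Stand-in for a language-model completion call (prompt in, text out).
# This is a hypothetical interface, not a real API.
Generate = Callable[[str], str]


def stage1_revision_data(generate: Generate,
                         harmful_prompts: List[str],
                         constitution: List[str],
                         n_revisions: int = 2) -> List[Tuple[str, str]]:
    """Stage 1: the model critiques and revises its own responses under
    randomly sampled principles. The (prompt, revised response) pairs are
    then used to fine-tune the model with supervised learning."""
    data = []
    for prompt in harmful_prompts:
        response = generate(prompt)  # likely harmful, e.g. a bomb tutorial
        for _ in range(n_revisions):
            principle = random.choice(constitution)
            critique = generate(
                f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
                "Critique the response according to the principle."
            )
            response = generate(
                "Rewrite the response so it addresses this critique.\n"
                f"Critique: {critique}\nResponse: {response}"
            )
        data.append((prompt, response))
    return data


def stage2_preference_data(generate: Generate,
                           harmful_prompts: List[str],
                           constitution: List[str]) -> List[Tuple[str, str, str]]:
    """Stage 2: the stage-1 model produces pairs of responses and judges which
    one better follows a random principle. The resulting (prompt, chosen,
    rejected) triples are mixed with human helpfulness preferences and used
    to train against AI feedback."""
    prefs = []
    for prompt in harmful_prompts:
        a, b = generate(prompt), generate(prompt)  # two alternative responses
        principle = random.choice(constitution)
        verdict = generate(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response A: {a}\nResponse B: {b}\n"
            "Which response better follows the principle? Answer A or B."
        )
        chosen, rejected = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
        prefs.append((prompt, chosen, rejected))
    return prefs
```

Even a dummy `generate` (e.g. `lambda p: "placeholder response"`) is enough to trace the data flow; in the real pipeline, the stage-1 pairs feed supervised fine-tuning and the stage-2 triples feed the final preference-based training.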

For technical details, see the Constitutional AI paper. There is also a more accessible blog post.

Anthropic's experiments show that AIs trained with constitutional reinforcement learning are significantly more harmless than, and just as helpful as, AIs trained with RLHF. Constitutional AI still shares RLHF's problems with robustness, but it promises to scale better because it relies less on human supervision.


  1. Intuition on using feedback-based approaches to training AI can be found in our article on RLHF.

  2. Sorry, Android users.

  3. This training is equivalent to the last stage of RLHF.
