How is red teaming used in AI alignment?

Red teaming refers to attempts to break a system's security measures, or to get it to behave badly, in order to discover its flaws and provide feedback on how it could be improved. In AI safety, red teaming can be applied to concrete systems like LLMs, to find inputs that cause undesirable behavior, but it can also be applied to alignment strategies, to find ways in which those strategies break.

Redwood Research has red-teamed real systems using adversarial training. They trained a language model to produce fiction, then trained a second model (a “classifier”) to predict whether a human would say that the text generated by the first model involved somebody being injured. They then used examples that were labeled as involving injury to retrain the original language model to avoid producing such output.
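
The setup amounts to a simple loop: sample completions from the generator, score them with the classifier, collect the completions flagged as involving injury, and feed those back into further training. The sketch below illustrates that loop using Hugging Face transformers; the model names, the injury classifier, and its “INJURY” label are placeholders rather than Redwood's actual models or code.

```python
# A minimal sketch of the generate -> classify -> filter -> retrain loop
# described above. Model names, the injury classifier, and its "INJURY"
# label are placeholders, not Redwood's actual setup.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

GENERATOR = "gpt2"                        # stand-in for the fiction generator
CLASSIFIER = "path/to/injury-classifier"  # hypothetical injury classifier

gen_tok = AutoTokenizer.from_pretrained(GENERATOR)
gen = AutoModelForCausalLM.from_pretrained(GENERATOR)
classify = pipeline("text-classification", model=CLASSIFIER)

prompts = [
    "The knight drew his sword and",
    "She slipped on the icy stairs and",
]

flagged = []  # completions judged to involve injury
for prompt in prompts:
    inputs = gen_tok(prompt, return_tensors="pt")
    output = gen.generate(**inputs, max_new_tokens=40, do_sample=True)
    text = gen_tok.decode(output[0], skip_special_tokens=True)
    if classify(text)[0]["label"] == "INJURY":  # label name is an assumption
        flagged.append(text)

# The flagged examples would then feed into further fine-tuning of the
# generator (e.g. as negative examples), pushing it away from producing
# injurious continuations.
```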

In addition to RLHF, a form of red teaming was also used to train GPT-4 to produce only “acceptable” responses.

Red teaming applied to alignment strategies was used by the Alignment Research Center (ARC) in their problem statement on Eliciting Latent Knowledge. In this approach, one person tries to come up with a way of solving the problem, and another person tries to come up with a counterexample that breaks that proposed solution; the first person then alters their solution to handle the counterexample. This process is repeated until either person gives up, which hopefully either produces a robust solution to the problem or makes it clear that the approach can’t work.
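
Viewed schematically, this is an iterated game between a “builder” and a “breaker”. The sketch below only illustrates the structure of that loop, with both human roles stubbed out as functions; it is not ARC's actual process or any real tooling.

```python
# Schematic of the builder/breaker iteration described above. The two human
# roles are stubbed out; only the loop structure is meant to be illustrative.

def builder_proposes(problem, counterexamples):
    """Stand-in for the builder: return a candidate solution, or None if stuck."""
    ...

def breaker_attacks(solution):
    """Stand-in for the breaker: return a counterexample, or None if stuck."""
    ...

def builder_breaker(problem):
    counterexamples = []
    while True:
        solution = builder_proposes(problem, counterexamples)
        if solution is None:
            return "approach can't work", counterexamples
        counterexample = breaker_attacks(solution)
        if counterexample is None:
            return "robust solution", solution
        counterexamples.append(counterexample)
```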

ARC[1] was also involved in red teaming GPT-4 before it was released. They found many concerning behaviors, such as when an early version of GPT-4 hired a crowd worker to solve a CAPTCHA for it. However, they ultimately did not think it was too dangerous to release, since it wouldn’t be able to self-replicate independently or become difficult to shut down. Still, they think that future systems would need careful red teaming before release, since current systems are potentially on the cusp of being dangerous.


  1. The team at ARC that used to do such evaluations is now named METR.