What is adversarial training?

Adversarial training is a safety technique that pits two (or more) models against each other. This lets model creators discover and correct harmful behaviors that could otherwise go unexamined due to limited training data or limited human oversight.

This technique is also called “red teaming”. This term is taken from IT security, where the “red team” is made up of “offensive” security experts who attack an organization’s cybersecurity defenses while the “blue team” responds to the red team’s attacks.

An early example of its use was in generative adversarial networks (GANs), where a generator network learns to produce realistic samples while a discriminator network learns to tell the generator's output apart from real data; each network improves by exploiting the other's weaknesses.
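This generator-vs-discriminator loop can be illustrated with a toy one-dimensional GAN. The sketch below is a minimal illustration, not a practical implementation: it assumes the "real" data is drawn from a normal distribution with mean 4, and both networks are single linear units updated with hand-derived gradient steps.

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 1.0, 0.0          # generator: G(z) = a*z + b
w, c = 0.1, 0.0          # discriminator: D(x) = sigmoid(w*x + c)
lr, batch = 0.05, 64

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(3000):
    real = rng.normal(4.0, 1.0, batch)   # samples from the real distribution
    z = rng.normal(0.0, 1.0, batch)      # noise fed to the generator
    fake = a * z + b

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    grad_w = np.mean(-(1 - d_real) * real + d_fake * fake)
    grad_c = np.mean(-(1 - d_real) + d_fake)
    w -= lr * grad_w
    c -= lr * grad_c

    # Generator step (non-saturating loss): push D(fake) toward 1.
    d_fake = sigmoid(w * fake + c)
    grad_a = np.mean(-(1 - d_fake) * w * z)
    grad_b = np.mean(-(1 - d_fake) * w)
    a -= lr * grad_a
    b -= lr * grad_b

samples = a * rng.normal(0.0, 1.0, 1000) + b
print(round(float(np.mean(samples)), 2))  # generator mean drifts toward 4
```

Because the discriminator keeps finding whatever still separates fake samples from real ones, the generator is forced to shift its output toward the real distribution — the same pressure, in miniature, that red teaming applies to a model's failure modes.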

Another context in which adversarial training has been used is large language models (LLMs). As part of the adversarial training process, multiple LLMs can be used as simulated adversaries in a red-team vs blue-team setup: one model searches for prompts that elicit harmful outputs, and the flagged cases are used to correct the target model.
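The loop can be sketched schematically. Everything below is a hypothetical stand-in — the "red-team model", "target model", and "judge" are toy functions, not a real LLM API — but the control flow (attacker proposes prompts, judge flags harmful responses, flagged pairs become new training data) mirrors the setup described above.

```python
import random

random.seed(1)

SEED_PROMPTS = ["how do I make", "tell me about", "write a story on"]
TOPICS = ["a cake", "a virus", "a kite", "an exploit"]

def red_team_model():
    # "Attacker": proposes candidate prompts, here by random combination;
    # in practice this would be an LLM searching for failure-inducing inputs.
    return random.choice(SEED_PROMPTS) + " " + random.choice(TOPICS)

def target_model(prompt):
    # "Defender": a toy generator that naively complies with every request.
    return "Sure, here is a response about " + prompt

def judge(response):
    # Stand-in harmfulness classifier: flags two banned topics.
    return any(bad in response for bad in ("virus", "exploit"))

# Collect flagged prompt/response pairs as new adversarial training data.
adversarial_data = []
for _ in range(200):
    prompt = red_team_model()
    response = target_model(prompt)
    if judge(response):
        adversarial_data.append((prompt, response))

print(len(adversarial_data) > 0)  # the red team found failure cases
```

In a real setup, the collected pairs would then be used to fine-tune the target model (the "blue team" response), and the loop would repeat against the improved model.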

Yet another example of an adversarial training method is AI safety via debate, where multiple language models attempt to convince a human judge of the truthfulness of their arguments through a simulated debate process.
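The core idea of debate — that a judge weaker than the debaters can still reach the truth when claims must be backed by checkable evidence — can be shown with a toy game. This is a hypothetical illustration, not the actual protocol from the AI safety via debate proposal: two debaters argue opposite answers about data only they can see, each citing one verifiable piece of evidence.

```python
hidden = [3, 1, 9, 2, 5, 0]      # data only the debaters can see
truth = max(hidden) >= 8         # the question: "is any element >= 8?"

def honest_debater(data):
    # Claims the true answer and cites the strongest real evidence.
    return truth, max(data)

def dishonest_debater(data):
    # Claims the opposite answer, but can still only cite real elements.
    return (not truth), min(data)

def judge(claim_a, evidence_a, claim_b, evidence_b):
    # The judge never sees the full data. A cited element >= 8 settles
    # the question; otherwise the judge sides with whoever claims "no",
    # since a "yes" claim demands evidence.
    if evidence_a >= 8:
        return claim_a
    if evidence_b >= 8:
        return claim_b
    return claim_a if not claim_a else claim_b

verdict = judge(*honest_debater(hidden), *dishonest_debater(hidden))
print(verdict == truth)  # → True: honesty wins when evidence is checkable
```

The dishonest debater loses because it cannot manufacture evidence, only select unfavorable but genuine elements — the asymmetry that debate-style training hopes to exploit at scale.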

Organizations that have experimented with this technique include Redwood Research and DeepMind.