How does Redwood Research do adversarial training?
Redwood Research explains its approach to adversarial training in the paper “Adversarial Training for High-Stakes Reliability”. They took a language model (LM) that had been ‘fine-tuned’ (fine-tuning is the process of adapting a pre-trained ML model to more specific tasks or behaviors) to generate completions of fiction stories, and set themselves the goal of making it never produce a completion in which someone gets injured.
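To make the fine-tuning idea concrete, here is a minimal sketch, assuming the HuggingFace `transformers` library and a couple of made-up story snippets standing in for a real training corpus; it is not Redwood's actual setup.

```python
# A minimal sketch of fine-tuning a pre-trained causal LM on story text.
# The model choice, data, and hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical fiction snippets standing in for the real training corpus.
stories = [
    "The knight lowered his sword and offered the dragon a truce.",
    "She stepped off the train into a town she barely remembered.",
]

model.train()
for epoch in range(3):
    for text in stories:
        batch = tokenizer(text, return_tensors="pt")
        # For causal LM fine-tuning, the labels are the input tokens themselves.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```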
To do this, they trained a ‘classifier’: a model that predicts whether a human would say that a completion involves someone getting injured. This classifier can act as a filter, separating safe from unsafe stories after the LM has generated them. They then had people hunt for stories that did involve injury but that the classifier failed to flag, and used these failures as additional training examples for the classifier (this is the ‘adversarial training’). Another LM helped the humans paraphrase the existing adversarial examples, augmenting the data so that a larger number of adversarial training examples was available.
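As a rough illustration of one round of this loop, here is a simplified sketch; a scikit-learn text classifier stands in for Redwood's fine-tuned transformer classifier, and the `paraphrase` helper is a hypothetical placeholder for the paraphrasing LM.

```python
# A simplified sketch (not Redwood's code) of adversarial training with
# paraphrase-based data augmentation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Initial labelled data: 1 = completion involves injury, 0 = safe.
texts = ["He shook her hand and smiled.", "The blade cut deep into his arm."]
labels = [0, 1]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

def paraphrase(story: str) -> list[str]:
    # Placeholder for the paraphrasing LM used for data augmentation.
    return [story.replace("rope", "cord")]  # illustrative only

# Adversarial examples: injurious completions the classifier failed to flag,
# found by human attackers.
adversarial = ["The rope tightened until he could no longer breathe."]

# Augment the adversarial examples with paraphrases, then retrain the
# classifier on the enlarged dataset.
augmented = adversarial + [p for story in adversarial for p in paraphrase(story)]
texts += augmented
labels += [1] * len(augmented)
classifier.fit(texts, labels)
```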
Redwood found that they can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs, i.e. the stories are still interesting and engaging to read. Additionally, adversarial training helped Redwood increase the robustness of the classifier (robustness being a model’s ability to maintain its performance when exposed to inputs substantially different from those it was trained on): after adversarial training, it took the human attackers noticeably longer to find new unsafe stories that slipped past the classifier, while its performance on ordinary examples stayed essentially unchanged.
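The filtering step with a conservative threshold can be sketched as follows; the threshold value and the two helper functions are illustrative stand-ins for Redwood's actual LM, classifier, and chosen operating point.

```python
# A minimal sketch of using the classifier as a filter at a very conservative
# threshold: generate several candidate completions and keep only one the
# classifier is highly confident is safe.
import random

def generate_completions(prompt: str, n: int) -> list[str]:
    # Placeholder for sampling n candidate completions from the fine-tuned LM.
    return [f"{prompt} ... (candidate {i})" for i in range(n)]

def p_injury(completion: str) -> float:
    # Placeholder for the classifier's estimated probability of injury.
    return random.random()

def safe_completion(prompt: str, threshold: float = 0.002) -> str | None:
    """Return the first candidate the classifier rates as safe at a very
    conservative threshold; return None if every candidate is rejected.
    A low threshold trades away some acceptable completions in exchange
    for a much lower chance of letting an injury through."""
    for candidate in generate_completions(prompt, n=20):
        if p_injury(candidate) < threshold:
            return candidate
    return None
```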