What is Redwood Research's agenda?

Redwood Research focuses on identifying promising methods for interpreting and aligning current AI systems, and considering theoretical arguments about whether those methods will continue to work even as systems become much more intelligent. Its research includes:

  • Causal scrubbing, a method for evaluating interpretability hypotheses about an AI system. Briefly, imagine we have some guess about which parts of a model do the calculations needed for the model to exhibit a particular behavior — e.g., for an image classifier, "I think these neurons here are for identifying dogs". Causal scrubbing makes that guess precise enough to test: given the hypothesis, we can identify which activations inside the neural network "shouldn't matter" (if our guess is correct) for that specific behavior, and "scrub" the model by replacing those activations with ones computed from other, randomly sampled inputs. Then, we can check whether the scrubbed model still exhibits the behavior we expected for the relevant input (e.g., correctly labeling a dog). If our hypothesis was a good one, the behavior of the scrubbed model should match that of the unscrubbed one. (A toy sketch of this procedure appears after this list.)

  • "Adversarial training for high-stakes reliability," which uses adversarial training to try to increase a model's reliability — essentially, giving the model inputs specifically chosen to cause it to produce undesirable outputs, then training the model on those outputs as examples of what to not do in the future. In particular, Redwood aimed to produce models reliable enough that they could be used for "high stakes" tasks where a single failure would be catastrophic. To test this idea, Redwood tried to use adversarial training to fine-tune a language model to never complete a prompt in a way that involved "injury" (i.e., describing or implying someone getting hurt).

  • "Benchmarks for Detecting Measurement Tampering," which investigates ways to determine whether a measurement has been tampered with.

  • Interpretability work on large language models. For instance, see this paper, in which Redwood identified the circuit that GPT-2 uses to figure out the proper indirect object of a sentence, e.g., completing "When Mary and John went to the store, John gave a drink to" with " Mary" rather than " John". (The snippet after this list demonstrates the behavior.)
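
To make the first bullet concrete, here is a minimal sketch of the scrubbing check on a toy two-layer network. Everything in it is hypothetical (the network, and the claim that only hidden units 0-3 matter for the behavior), and Redwood's actual algorithm is far more general: it operates on arbitrary computational graphs and supports recursively specified hypotheses. The sketch only shows the core move of resampling "irrelevant" activations and comparing behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyNet(nn.Module):
    """A tiny 2-layer classifier standing in for the model under study."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 2)

    def forward(self, x, donor_hidden=None, scrub_idx=None):
        h = torch.relu(self.fc1(x))
        if donor_hidden is not None:
            # Replace the activations the hypothesis says "shouldn't matter"
            # with activations computed from different, randomly drawn inputs.
            h = h.clone()
            h[:, scrub_idx] = donor_hidden[:, scrub_idx]
        return self.fc2(h)

model = ToyNet()

# Hypothetical claim: only hidden units 0-3 matter for the behavior,
# so units 4-15 can be scrubbed without changing the output.
scrub_idx = list(range(4, 16))

x = torch.randn(64, 8)        # inputs exhibiting the behavior
x_donor = torch.randn(64, 8)  # randomly resampled inputs

with torch.no_grad():
    donor_hidden = torch.relu(model.fc1(x_donor))
    out_clean = model(x)
    out_scrubbed = model(x, donor_hidden=donor_hidden, scrub_idx=scrub_idx)

# If the hypothesis were right, scrubbing would barely change behavior:
agreement = (out_clean.argmax(-1) == out_scrubbed.argmax(-1)).float().mean().item()
print(f"scrubbed/unscrubbed agreement: {agreement:.2%}")
```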
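
The adversarial-training loop from the second bullet alternates two steps: attack the model, then train it on the attacks with the correct labels. In Redwood's project the attack step was largely tool-assisted humans writing prompts that fooled an injury classifier; in this sketch, a standard gradient-based perturbation (FGSM) on a toy classifier stands in for that step, purely to show the loop's shape.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the injury classifier (16-dim features, 2 classes).
clf = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def attack(x, y, eps=0.5):
    """Perturb inputs to push the classifier toward wrong answers
    (a gradient-based stand-in for Redwood's human-written attacks)."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(clf(x), y)
    (grad,) = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).detach()

x_train = torch.randn(256, 16)
y_train = (x_train.sum(dim=-1) > 0).long()  # toy labels

for step in range(200):
    # 1. Attack: find inputs chosen to elicit undesirable outputs.
    x_adv = attack(x_train, y_train)
    # 2. Train on both clean and adversarial examples, with correct labels,
    #    so the model learns what not to do on the attacks.
    xb = torch.cat([x_train, x_adv])
    yb = torch.cat([y_train, y_train])
    opt.zero_grad()
    loss_fn(clf(xb), yb).backward()
    opt.step()

print("final clean loss:", loss_fn(clf(x_train), y_train).item())
```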
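
Finally, the indirect-object behavior from the last bullet is easy to observe with an off-the-shelf GPT-2 via the Hugging Face transformers library (both " Mary" and " John" happen to be single tokens in GPT-2's vocabulary):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]  # next-token logits

for name in (" Mary", " John"):
    (tid,) = tok(name).input_ids       # each name is one GPT-2 token
    print(f"{name!r}: logit {logits[tid].item():.2f}")
# The indirect object " Mary" should score clearly higher than " John".
```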

Redwood also maintains the Rust Circuit Library, intended to help with neural network interpretability research.