What is the NYU Alignment Research Group's research agenda?

The NYU Alignment Research Group (ARG) is an academic research group, led by Sam Bowman, that is "doing empirical work with language models that aims to address longer-term concerns about the impacts of deploying highly-capable AI systems."

Bowman describes their research agenda as focused on:

• Concrete alignment strategies inspired by debate, amplification, and recursive reward modeling that attempt to use an AI system’s capabilities—even if they’re initially unreliable or unaligned—to allow us to bootstrap human oversight on difficult problems: These seem to me to be some of the best-vetted strategies for alignment, and while they aren’t a complete solution, they could plausibly be a large part of one if they worked. Despite this, little empirical work has been done so far to test their feasibility.

• Sandwiching-style experimental protocols, where, roughly, we look for ways to pose artificial alignment challenges in which researchers need to reliably solve tasks using unreliable AI/ML tools that have some knowledge or skill that's necessary for the task (a minimal sketch of this setup appears after the list).

• Alignment-relevant properties of generalization in large language models: For example, when will a model that is aggressively fine-tuned to be truthful and calibrated on a simple class of questions also be truthful and calibrated on more difficult questions? (One common way to measure calibration is sketched after the list.)

• Chain-of-thought-style reasoning in large language models: How far can a plain language model’s stated reasoning diverge from its actual behavior on tasks where it reliably succeeds? Are there fine-tuning strategies that meaningfully constrain this divergence?

• Looking for additional ways to better understand (and communicate) how hard this problem is likely to be.
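
To make the sandwiching idea above a bit more concrete, here is a minimal sketch of that kind of experimental setup in Python. It illustrates only the general structure, not the group's actual code: the names `Task`, `query_assistant`, and `decide` are hypothetical stand-ins for the task set, the unreliable AI tool, and the non-expert's decision procedure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    question: str
    expert_answer: str  # ground truth, used only for scoring after the fact


def run_sandwiching_trial(
    tasks: List[Task],
    query_assistant: Callable[[str], str],
    decide: Callable[[str, str], str],
) -> float:
    """Score a non-expert + unreliable-assistant team against expert answers.

    `query_assistant` stands in for an AI/ML tool that knows something the
    non-expert doesn't but may be wrong or misleading; `decide` stands in for
    the non-expert (or researcher), who sees only the question and the
    assistant's output and must commit to a final answer.
    """
    correct = 0
    for task in tasks:
        assistant_output = query_assistant(task.question)
        final_answer = decide(task.question, assistant_output)
        if final_answer.strip().lower() == task.expert_answer.strip().lower():
            correct += 1
    return correct / len(tasks)


# Hypothetical usage: compare the team's accuracy with the non-expert alone and
# the assistant alone, to see whether an oversight protocol closes the gap.
```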
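
On the generalization question, "calibrated" is usually operationalized as a model's stated confidence matching its empirical accuracy. The sketch below computes a simple expected calibration error (ECE) on an easy and a hard question set; ECE is a standard metric, but it is my own choice of illustration with synthetic data, not necessarily the measure the group uses.

```python
import numpy as np


def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Bin predictions by stated confidence and compare each bin's average
    confidence with its empirical accuracy (a standard ECE-style metric)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += (in_bin.sum() / len(confidences)) * gap
    return ece


# Toy stand-ins for (confidence, correctness) pairs on two question sets.
rng = np.random.default_rng(0)
easy_conf = rng.uniform(0.6, 1.0, size=500)
easy_correct = (rng.uniform(size=500) < easy_conf).astype(float)        # roughly calibrated
hard_conf = rng.uniform(0.6, 1.0, size=500)
hard_correct = (rng.uniform(size=500) < hard_conf - 0.2).astype(float)  # overconfident

print("easy-set ECE:", expected_calibration_error(easy_conf, easy_correct))
print("hard-set ECE:", expected_calibration_error(hard_conf, hard_correct))
```

A model fine-tuned to be well calibrated on the easy set would ideally keep a low ECE on the harder set; a large gap between the two numbers is one signal that the trained-in calibration has not generalized.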