What is superalignment?

The superalignment team was a division within OpenAI with the goal of figuring out how to "steer and control AI systems much smarter than us." It was led by Jan Leike and Ilya Sutskever until it was disbanded in May 2024, after Sutskever, Leike, and other safety researchers left the company.

The team's primary strategy was to develop tools for training and aligning AIs that would themselves be powerful enough to help with alignment research. They planned to use these AIs to take over a growing share of alignment research tasks from human researchers while always keeping humans "in the loop." They expected other labs to develop AIs that could produce research on AI capabilities, and they wanted to make sure that when this happened, a framework would be available to repurpose these AIs towards alignment.

To ensure that other labs would use any successful techniques they discovered, they intended to focus much of their attention on publicly testing their solutions.