What is scalable oversight?
Current techniques for aligning AI, such as reinforcement learning from human feedback, rely on humans’ ability to supervise AI. But humans won’t be able to reliably supervise AI systems that are much smarter than they are (superintelligence). For example, human feedback runs into problems when an AI system has key information that its human supervisors lack, or when the AI is deceptive.
The basic idea behind most approaches to scalable oversight is to use AI systems to assist supervision of other AI systems. Concrete approaches to scalable oversight include:
Progress in scalable oversight is central to the research agendas of many leading AI companies, for example OpenAI’s superalignment project.
One problem with research on scalable oversight is that the methods can be hard to evaluate, since we do not yet have AI systems whose abilities generally exceed ours. Scalable oversight research has also been criticized as being primarily about improving the capabilities of AI rather than making AI safer.