What is Eliciting Latent Knowledge (ELK)?

Eliciting Latent Knowledge (ELK) is an open problem in AI safety. It asks how we can access information that an AI model "knows", but which its output doesn't include or make clear by default.

Here is the problem statement from the original report by the Alignment Research Center (ARC):

“Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.

In these cases, the prediction model "knows" facts (like "the camera was tampered with") that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?”
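
To make the setup concrete, here is a minimal, purely illustrative PyTorch sketch of the pieces the quoted problem statement describes: a predictor with an internal latent state, a planner that picks actions by how good the *predicted camera frames* look, and a "reporter" head that we would like to answer questions from that latent state. All names, shapes, and architectures below are hypothetical stand-ins, not anything from ARC's report; ELK is precisely the question of how the reporter could be trained so that its answers reflect what the predictor actually "knows".

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Maps (encoded camera frames, a candidate action sequence) to an
    internal latent state and predicted future camera frames."""
    def __init__(self, obs_dim=64, act_dim=8, latent_dim=128):
        super().__init__()
        self.encode = nn.Linear(obs_dim + act_dim, latent_dim)
        self.decode = nn.Linear(latent_dim, obs_dim)

    def forward(self, obs, actions):
        latent = torch.relu(self.encode(torch.cat([obs, actions], dim=-1)))
        predicted_frames = self.decode(latent)
        return latent, predicted_frames

class Reporter(nn.Module):
    """What ELK asks for: a head that answers questions such as
    "was the camera tampered with?" from the predictor's latent state,
    not from the predicted frames a human would watch."""
    def __init__(self, latent_dim=128, question_dim=16):
        super().__init__()
        self.answer = nn.Linear(latent_dim + question_dim, 1)

    def forward(self, latent, question):
        return torch.sigmoid(self.answer(torch.cat([latent, question], dim=-1)))

def looks_good_on_camera(predicted_frames):
    """Stand-in for human evaluation of the predicted frames.
    Camera tampering can make this score high while reality is bad."""
    return predicted_frames.mean(dim=-1)

def plan(predictor, obs, candidate_actions):
    """Pick the candidate action sequence whose predicted frames look best.
    Without a trustworthy reporter, this can favour tampering actions."""
    latent, predicted_frames = predictor(obs, candidate_actions)
    scores = looks_good_on_camera(predicted_frames)
    best = scores.argmax()
    return candidate_actions[best], latent[best]

predictor, reporter = Predictor(), Reporter()
obs = torch.randn(5, 64)     # encoded initial frames for 5 rollouts (random stand-ins)
actions = torch.randn(5, 8)  # 5 candidate action sequences
best_action, best_latent = plan(predictor, obs, actions)
question = torch.randn(16)   # an encoding of "was the camera tampered with?"
print(reporter(best_latent, question))  # untrained; training it to answer honestly is the open problem
```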

The hope is that a successful method for ELK would make it easier to tell if an AI is safe, or in what ways it might be unsafe.

ARC's original report describes its first-pass research strategy for solving ELK, which uses a form of red teaming: each proposed training strategy is pitted against a search for counterexample scenarios in which it would fail. The report goes on to describe a number of candidate approaches to ELK and the difficulties ARC found with each of them.

A summary of both ARC's original ELK report and the results of the 2022 ELK proposal contest is available here.