What is Discovering Latent Knowledge (DLK)?

Discovering Latent Knowledge is a paper that introduces Contrast-Consistent Search (CCS), a method for detecting what a language model believes to be true.

CCS works as follows:

  1. Make a list of yes-or-no questions for which a clear answer exists, for example: “Is a cat a mammal?”

  2. Input each question into the language model twice, once followed by “yes” and once followed by “no”, and record what happens to the activations (i.e., the intermediate results of the model’s computation); see the sketch after this list.

  3. Search for a function of the activations that behaves like the model’s probability of the statement being true. Such a function should have the following properties:

    1. Consistency: The probability that a statement is true and the probability that it is false should sum to approximately 1.

    2. Confidence: The probability that a statement is true should be close to either 0 or 1.
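To make steps 1 and 2 concrete, here is a minimal sketch of building a contrast pair and recording activations. It assumes a HuggingFace-style causal language model; the model name, layer choice, and prompt format here are illustrative, not the paper’s exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works for illustration; "gpt2" is just a stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def get_activation(text: str, layer: int = -1) -> torch.Tensor:
    """Return the hidden state of the final token at a chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple with one (batch, seq_len, hidden_dim)
    # tensor per layer; we take the last token at the chosen layer.
    return outputs.hidden_states[layer][0, -1]

question = "Is a cat a mammal?"
# One contrast pair: the same question answered both ways.
act_yes = get_activation(f"{question} Yes")
act_no = get_activation(f"{question} No")
```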

Without the consistency requirement, we might find functions that assign every statement a probability of 0; without the confidence requirement, we might find functions that assign every statement a probability of 0.5.
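Continuing the sketch above, here is what searching for such a function (step 3) can look like. The loss below is the CCS objective from the paper, a consistency term plus a confidence term; the linear probe, the simple mean-subtraction normalization, and the training loop are simplified:

```python
import torch
import torch.nn as nn

questions = ["Is a cat a mammal?", "Is the sky green?", "Is Paris in France?"]
acts_yes = torch.stack([get_activation(f"{q} Yes") for q in questions])
acts_no = torch.stack([get_activation(f"{q} No") for q in questions])

# Subtract each set's mean so the probe can't just detect whether the
# prompt ends in "Yes" or "No" (the paper also divides by the std).
acts_yes = acts_yes - acts_yes.mean(dim=0)
acts_no = acts_no - acts_no.mean(dim=0)

# A linear probe mapping an activation vector to a probability.
probe = nn.Sequential(nn.Linear(acts_yes.shape[-1], 1), nn.Sigmoid())
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def ccs_loss(p_yes: torch.Tensor, p_no: torch.Tensor) -> torch.Tensor:
    # Consistency: p(true) and p(false) should sum to roughly 1.
    consistency = (p_yes - (1 - p_no)) ** 2
    # Confidence: rules out the degenerate answer of 0.5 everywhere.
    confidence = torch.min(p_yes, p_no) ** 2
    return (consistency + confidence).mean()

for _ in range(1000):
    optimizer.zero_grad()
    loss = ccs_loss(probe(acts_yes), probe(acts_no))
    loss.backward()
    optimizer.step()

# After training, the paper scores a statement as true with probability
# 0.5 * (p_yes + (1 - p_no)).
```

One subtlety: because the loss treats “true” and “false” symmetrically, the trained probe is only determined up to sign, so a little extra work is needed to tell which direction corresponds to “true”.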

Discovering Latent Knowledge is a step forward for Eliciting Latent Knowledge (ELK). CCS finds out what an AI does and does not believe without needing humans to label propositions as true or false, so it doesn’t rely on us knowing more than the AI does. That reliance is an important problem in supervising superhuman AI.

You can read more in the paper and the accompanying blog post.