
What is the difference between verifiability, interpretability, transparency, and explainability?

Verifiability, interpretability, transparency, and explainability all pertain to how an AI model might interface with humans - in particular, to how humans understand AI systems. These terms have imprecise and overlapping meanings, and often appear in similar contexts.

  • Verifiability refers to the ability to check whether a claim someone makes about a model is true.

    • Work on verifiability can include making it easier to check whether the outputs of a system are correct, promoting third-party auditing of developers’ claims, working on more secure hardware, and so on.

    • It can also include work on formal verification, which aims to mathematically specify requirements and design AI systems that can be proved to meet those requirements; verifiability, however, is not limited to formal proofs. A toy illustration appears in the sketch below.
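
As a highly simplified illustration of the formal-verification idea (not any particular tool’s method), the sketch below uses interval arithmetic to prove that a tiny ReLU network’s output stays within a bound for every input in a given box. The network, its weights, and the claimed bound are all made up for illustration; real verifiers use far more sophisticated techniques at much larger scale.

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate an input box [lo, hi] through x -> W @ x + b."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    """ReLU is monotone, so bounds pass through elementwise."""
    return np.maximum(lo, 0), np.maximum(hi, 0)

# Tiny 2-layer ReLU network with made-up weights.
W1, b1 = np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, -0.2])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.1])

# Claim to verify: for every input in the box [0, 0.1]^2,
# the network's output stays below 0.5.
lo, hi = np.zeros(2), np.full(2, 0.1)
lo, hi = interval_relu(*interval_affine(lo, hi, W1, b1))
lo, hi = interval_affine(lo, hi, W2, b2)

print(f"output provably lies in [{lo[0]:.3f}, {hi[0]:.3f}]")
assert hi[0] < 0.5, "could not prove the claimed bound"
```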

Verifiability relates more closely to what humans say to other humans, while “interpretability”, “transparency”, and “explainability” refer to humans understanding AI models. These three terms are closely linked — sometimes even used interchangeably.

  • Interpretability tends to be the most general term. It refers to 1) trying to understand how a system makes decisions, and 2) trying to design systems that are easy to understand. Work on interpretability can be divided into two broad approaches:

    • Black-box interpretability (also called post-hoc interpretability) studies a model’s inputs and outputs to understand it, without looking inside it. It might be considered analogous to observing an animal’s behavior when it is exposed to various stimuli, in order to reason about what the animal might be thinking.

    • White-box interpretability (also called inner interpretability or intrinsic interpretability) involves looking inside a model and trying to understand what its internal components represent, or building models whose internal structure is easy to understand. It might be considered analogous to an MRI study in which someone is presented with certain stimuli and we observe which parts of their brain activate. (A toy contrast of the two approaches appears in the first sketch after this list.)

  • Transparency is closely related to white-box interpretability: it refers to the ability to “look inside” a model to understand how it works and why it produces a specific output. It is sometimes used interchangeably with interpretability, but it can also be seen as just one part of interpretability.

    • For example, Andreas et al. trained a neural network to be modular. This counts as work on transparency even under the narrower definition, because it creates a model with a different, more comprehensible internal structure, rather than adding extra elements to explain a model whose internal structure is left unchanged.
  • Explainability (also “explainable AI”) is generally used when researchers focus on how humans understand models rather than on the properties of the models themselves. Those working on explainability aim to help people understand how models make decisions. The term is sometimes used interchangeably with interpretability, but it is more closely associated with studying the inputs and outputs of models than transparency is; in other words, explainability is more closely associated with black-box interpretability, while transparency is more closely associated with white-box interpretability. It can also refer to making AI easier to understand or interact with.

    • For example, visualizations such as this one can make it easier for people to understand what a model is doing. This kind of work is less closely associated with interpretability, because it focuses on communicating with the public or non-technical stakeholders rather than with researchers or engineers.

    • Gale et al. (2018) trained a model to produce sentences explaining why a classifier detects hip fractures, and to highlight what the classifier is paying attention to, so that doctors can more easily understand what it is doing. (A toy version of this kind of highlighting appears in the second sketch below.)
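
As a minimal sketch of the black-box/white-box distinction described above, the following toy example first probes a small network purely through its inputs and outputs, then “opens it up” to read its internal activations. The model, its random weights, and the probe input are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer weights
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer weights

def forward(x, return_hidden=False):
    h = np.maximum(W1 @ x + b1, 0)              # ReLU hidden activations
    y = (W2 @ h + b2)[0]
    return (y, h) if return_hidden else y

x = np.array([0.5, -1.0, 2.0])

# Black-box view: treat the model as opaque, perturb each input
# feature, and watch how the output moves (crude sensitivity analysis).
y0 = forward(x)
for i in range(len(x)):
    x_pert = x.copy()
    x_pert[i] += 0.1
    print(f"feature {i}: output change {forward(x_pert) - y0:+.4f}")

# White-box view: open the model up and inspect internal activations,
# e.g. which hidden units fire and how strongly each drives the output.
y0, h = forward(x, return_hidden=True)
for j, (act, w) in enumerate(zip(h, W2[0])):
    print(f"hidden unit {j}: activation {act:.4f}, "
          f"contribution to output {act * w:+.4f}")
```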
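
And as a minimal sketch of explainability in the spirit of highlighting what a model attends to, the example below uses gradient-times-input saliency (one common attribution method, not the specific technique of Gale et al.) on a toy linear model with hypothetical feature names, and renders the result as a human-readable sentence:

```python
import numpy as np

# Toy linear "classifier"; the weights and feature names are made up.
W = np.array([0.8, -0.3, 1.5])
feature_names = ["age", "bone density", "edge sharpness"]

def score(x):
    return float(W @ x)          # model output (a logit-like score)

x = np.array([0.2, 0.9, 1.1])

# For a linear model, the gradient of the score w.r.t. the input is W,
# so gradient-times-input saliency reduces to the elementwise product.
saliency = W * x
top = int(np.argmax(np.abs(saliency)))

print(f"prediction score: {score(x):+.2f}")
print(f"the output is driven mostly by '{feature_names[top]}' "
      f"(contribution {saliency[top]:+.2f})")
```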
