Interpretability
17 pages tagged "Interpretability"
How is the Alignment Research Center (ARC) trying to solve Eliciting Latent Knowledge (ELK)?
What is neural network modularity?
What is interpretability and what approaches are there?
What is John Wentworth's research agenda?
What is "externalized reasoning oversight"?
What is Conjecture's research agenda?
What is Anthropic's alignment research agenda?
How might interpretability be helpful?
How does "chain-of-thought" prompting work?
What is shard theory?
What is feature visualization?
What are polysemantic neurons?
What is Eliciting Latent Knowledge (ELK)?
What is the difference between verifiability, interpretability, transparency, and explainability?
Alignment research
What is a "polytope" in a neural network?
What is Discovering Latent Knowledge (DLK)?