Interpretability
17 pages tagged "Interpretability"
How is the Alignment Research Center (ARC) trying to solve Eliciting Latent Knowledge (ELK)?
What is neural network modularity?
What is interpretability and what approaches are there?
What is John Wentworth's research agenda?
What is "externalized reasoning oversight"?
What is Conjecture's research agenda?
What is Anthropic's alignment research agenda?
How might interpretability be helpful?
How does "chain-of-thought" prompting work?
What is shard theory?
What is feature visualization?
What are polysemantic neurons?
What is Eliciting Latent Knowledge (ELK)?
What is the difference between verifiability, interpretability, transparency, and explainability?
Alignment research
What is a "polytope" in a neural network?
What is Discovering Latent Knowledge (DLK)?