What are some good books about AI safety?

Light introductions to AI safety

There are a number of excellent introductions to AI safety that assume no prior knowledge:

Uncontrollable (2023) by Darren McKee is the latest layman’s introduction to AI x-risk. The book covers how powerful AI systems might become, why superhuman AI might be dangerous, and what we can do to prevent AI from killing billions.

Human Compatible (2019) by Stuart Russell explains the problem of making powerful AI systems that are compatible with humans. The book discusses potential solutions, with an emphasis on approaches from Professor Russell’s lab, the Center for Human-Compatible AI. One such approach is cooperative inverse reinforcement learning, which places human-machine interaction at the heart of training AI to be safe. Don’t let the name scare you! The book keeps things simple and to the point, without oversimplifying the subject.

The Alignment Problem (2020) by Brian Christian is a comprehensive overview, from the perspective of a machine learning researcher, of the challenges that come with aligning AI systems. If you have some interest in machine learning, this book is for you.

The AI Does Not Hate You (2019) by Tom Chivers is an entertaining and accessible outline of the core ideas around AI existential risk, along with an exploration of the community and culture of AI safety researchers.

Other accessible introductions include: Toby Ord’s The Precipice (2020), Max Tegmark’s Life 3.0 (2017), Yuval Noah Harari’s Homo Deus (2016), Stuart Armstrong’s Smarter Than Us (2014), and Luke Muehlhauser’s Facing the Intelligence Explosion (2013).

More involved reads

For those who want to get into the weeds of AI safety, here are some books that demand more time and effort.

The book that first made the case to the public is Nick Bostrom’s Superintelligence (2014). It gives an excellent overview of the state of the field as of 2014 and makes a strong case for why AI safety is important. Its arguments for AI posing an existential risk remain influential, to the point that they could be viewed as the “classical” arguments for concern over AI. However, the book was written before the dominance of deep learning, so it doesn’t cover newer developments such as large language models.

Rationality: From AI to Zombies (2015) is a compendium of essays by Eliezer Yudkowsky, an early researcher on preventing AI x-risk. Book 3, The Machine in the Ghost, covers what an optimizer is, why powerful optimizers won’t care about fragile human values by default, and why we should expect AI systems to be powerful optimizers. Book 5, Mere Goodness, expands on the complexity of human values and the difficulty of getting an AI to care about them. The content on AI is somewhat dated, as it focuses on general AI designs rather than specifically on deep learning, yet much of it remains valuable for anyone learning about AI safety.

Introduction to AI Safety, Ethics and Society (2024) is a textbook written by Dan Hendrycks, director of the Center for AI Safety. It approaches AI safety as a societal challenge and covers the basics of modern AI, the technical challenges of AI safety, collective action problems, and the challenges of governing AI.

Novels

There are many works of fiction that illustrate AI misalignment. Most are short stories, some quite detailed, but a few novels place AI existential risk front and center.

The Crystal Trilogy (2019), written by AI safety researcher Max Harms, is set in 2039 and takes the perspective of a collective AI housed in a single body. The story focuses on the conflicts between the AI and humanity, and then among the AIs themselves.

The Number (2022) takes the perspective of an AI whose sole goal is to make a number go up. Naturally, this involves taking over the world. The novel illustrates how competitive pressures can lead to the creation of a deceptively aligned AI.

A Fire Upon the Deep (1992), by Vernor Vinge, greatly influenced the pioneers of AI safety. It depicts a galactic conflict between two superintelligences, played out through biological proxies. Perhaps most influential is its depiction of superintelligences as something truly beyond human capacity to outwit or outmaneuver.[1] Together with Vinge’s essays, this depiction of superintelligence, and the power and perils that come with it, convinced some readers to start working on AI safety, contributing to the birth of the field.


  1. Which is why “A Fire Upon the Deep” has a plot device that forces superintelligences to stay at the edges of the galaxy. ↩︎