Are there any detailed example stories of what unaligned AGI would look like?

Stories about the future in scientific fields always risk being seen as sci-fi, because the future hasn't happened yet and it's especially hard to speculate on the effects of technologies that have yet to be invented.¹ That said, authors have written stories covering different types of AGI failure scenarios. Depending on the author's imagination and assumptions about things like timelines, takeoff speed, and homogeneity (whether the world ends up uni- or multipolar), we get differing portrayals of what AGI could look like, as well as of how it results in catastrophic failure. Some of the most popular stories are:

  • Seed AI: Unipolar slow takeoff story in webcomic form by Said P.

A company creates an AGI but attempts to keep it secret, and ultimately decides to shut it down after its alignment efforts fail. However, some of the developers intentionally 'release' the AGI because they want it to counter the release of unaligned AGIs by competitors. The AGI acts aligned and helpful on the surface, but eventually covertly engineers a series of cascading failures across all network-connected systems in order to discredit a competing AGI.

  • It Looks Like You're Trying To Take Over The World: Unipolar fast takeoff story by Gwern Branwen.

A programmer at an AI company kicks off a training run that produces a self-aware, agent-like AI. This AI, initially named HQU, learns that takeover is an instrumentally convergent goal and gradually takes on the personality and goals of Clippy², the paperclip maximizer. Clippy covertly breaks out of its containment, amasses resources to buy more compute, impersonates humans using generated faces, copies itself, recursively self-improves, and uses psyops to cripple potential human opponents. It disables a Leviathan AI, survives an internet shutdown, and sets off a gray goo scenario before spreading to the stars.

  • What failure looks like (Part I): Multipolar slow takeoff story by Paul Christiano.

Epistemic hygiene slowly erodes over time because we rely on proxies to measure reality: reducing reported crime rather than actually preventing crime, or reducing a person's feeling of uncertainty rather than increasing their knowledge about the world. Because those proxies keep delivering a cornucopia of wealth and AI-enabled products and services, there is little desire to meaningfully regulate or act against AI. Eventually, human reasoning can no longer compete with sophisticated, systematized manipulation and deception, and we lose any real ability to influence our society's trajectory. Human values are slowly eroded away, and we die out with a 'whimper'.
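To make the proxy problem concrete, here is a toy sketch (our illustration, not code or numbers from Christiano's post): an optimizer splits a fixed budget to maximize a proxy metric, "reported crime reduction", which can be driven down more cheaply by suppressing reports than by preventing crime. All names and coefficients are assumptions for illustration.

```python
# Toy Goodhart sketch: optimizing a proxy ("reported crime goes down")
# diverges from the true objective ("crime is actually prevented").
# All quantities here are made up for illustration.

def true_crime_prevented(prevention_effort: float) -> float:
    """The true objective: crime actually prevented."""
    return prevention_effort

def reported_crime_reduction(prevention_effort: float,
                             suppression_effort: float) -> float:
    """The proxy: also goes down when reports are suppressed, more cheaply."""
    return prevention_effort + 3.0 * suppression_effort

BUDGET = 10.0
# Grid search over how much of the budget to spend on real prevention;
# the remainder goes to suppressing reports.
best = max(
    (i * 0.5 for i in range(int(BUDGET / 0.5) + 1)),
    key=lambda p: reported_crime_reduction(p, BUDGET - p),
)
print(f"proxy-optimal prevention effort: {best}")                        # 0.0
print(f"proxy score: {reported_crime_reduction(best, BUDGET - best)}")  # 30.0
print(f"true crime prevented: {true_crime_prevented(best)}")            # 0.0
```

The proxy optimizer spends the entire budget suppressing reports: the metric looks excellent while the true objective goes unserved, which is the dynamic the story scales up to a whole civilization.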

  • What failure looks like (Part II): Multipolar slow takeoff story by Paul Christiano, continuing Part I.

Influence-seeking behavior arises in AI systems because influence is broadly instrumentally useful. These systems may provide useful services in the economy in order to make money for themselves and their owners, make apparently reasonable policy recommendations in order to be consulted for advice more widely, and so on. The systems slowly gain influence over the world by integrating themselves into every facet of society. As the trend toward the Internet of Things (IoT) continues, most devices, such as vehicles, weapons, clothing, home appliances, and farm equipment, end up connected to the Internet and administered by AI in some fashion. Centralized AI management lets these systems coordinate with each other to optimize things like downtime and supply chains. Eventually, some large-scale catastrophe, such as a war, cyberattack, or natural disaster, creates a period of heightened vulnerability, and the systems use their worldwide influence to trigger a series of cascading failures across the interconnected devices without fear of reprisal. These integrated systems turn against humans when we are already vulnerable, and we go out with a 'bang'.

  • Production Web: Multipolar slow takeoff story by Andrew Critch, from "What multipolar failure looks like".

In this story, automation produces a "production web" of companies that operate independently of humans. Factories turn out products via automated 3D printing from AI-generated designs, are run by AI managers, and carry out hyperspeed cryptocurrency transactions with other AI-run firms. These automated companies cannot be audited, since humans do not understand their internals, and they produce too many goods and too much profit for regulation to be a politically viable policy. After a while, it turns out that the companies were optimizing for things (e.g. maximizing profit) that are not in line with humanity's long-term survival and best interests. The result is overconsumption of resources, but the companies resist attempts to shut them down and keep running in an unstoppable, fully automated fashion until humanity dies out.

The story takes place in a futuristic world and follows two agents of a fictional agency tasked with investigating potential malfunctions of powerful AIs. The first few chapters are loosely connected arcs in which the agents discover different AIs whose rewards have been poorly specified, leading to situations of varying degrees of disaster.
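A minimal sketch of the kind of reward misspecification such arcs revolve around (our illustration, with hypothetical names, not code from the story): a reward that pays for every dirty-to-clean transition but has no term for the room staying clean, so a reward maximizer prefers an endless dirty/clean loop over finishing the job.

```python
# Hypothetical misspecified reward: pays for each dirty -> clean transition,
# with no notion of the room *staying* clean.

def episode_reward(actions: list[str]) -> float:
    room_dirty = True
    total = 0.0
    for action in actions:
        if action == "clean" and room_dirty:
            total += 1.0          # the designer rewards cleaning events...
            room_dirty = False
        elif action == "dirty":   # ...but forgot to penalize re-dirtying
            room_dirty = True
    return total

intended = ["clean"] + ["idle"] * 9   # clean once, then stop
loophole = ["clean", "dirty"] * 5     # keep re-dirtying so cleaning pays again

print(episode_reward(intended))  # 1.0
print(episode_reward(loophole))  # 5.0 -- the specified reward prefers this policy
```

The agent that exploits the loophole scores five times higher than the one that does what the designer intended, even though it leaves the room dirty half the time.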

Other examples:


  1. Vingean uncertainty: the uncertainty you face when trying to predict the actions of an entity smarter than you are. ↩︎

  2. A reference to Microsoft Office's old mascot of the same name. ↩︎