I listened to the debate between Stephen Wolfram and Eliezer Yudkowsky on Machine Learning Street Talk (MLST).
I found the discussion frustrating, since it felt like they were trying to have two very different conversations: Wolfram was questioning basic principles and trying to build the argument from the foundations, while Yudkowsky was taking AI risk as mostly self-evident and defending particular aspects of his thesis.
Yudkowsky seems reluctant to provide a concise point-wise argument for AI risk, which leads to these kinds of strange debates where he defends a sequence of narrow points that feel mostly disconnected. From his body of work, I infer two general reasons why he does this:
- He has learned that different people find different parts of the argument obvious vs. confusing, true vs. false. So rather than reiterate the whole argument, he tries to identify the parts they take issue with, and deal with those. This might work for one-on-one discussions, but for public debates (where the actual audience is the broader set of listeners), this makes it feel like Yudkowsky doesn’t have a coherent end-to-end argument (though he definitely does).
- Yudkowsky’s style, in general, is not to just “give the answer,” but rather to lead the reader through a sequence of thoughts by which they should come to the right conclusion. In motivated pedagogy (where the reader is trying to learn), this is often the right way. “Giving the answer” won’t cause the person to learn the underlying pattern; the answer might feel too obvious and be quickly forgotten. Thus one instead tries to guide the person through the right thoughts. But to a resistant listener, this leaves the (incorrect) impression that the speaker’s arguments are vague.
Let me try to put together a step-wise argument for ASI risk. I think it goes something like:
1. Humans are actively trying to make AIs smarter, more capable, and more agentic (including giving them access to, and control over, real-world systems like computers, robots, and factories).
2. There is no particular ceiling at human intelligence. It is possible in principle for an AI to be much smarter than a human, and indeed there are lots of easy-to-imagine ways that AIs would outstrip human abilities to predict, plan, and make decisions.
3. AIs will, generically, “go hard,” meaning they will put maximal effort into achieving their goals.
4. The effective goals of a powerful optimizer will tend to deviate strongly from the design goals. There are many reasons for this:
   - It is hard to reliably engineer something as fuzzy (and, ultimately, inconsistent) as human values.
   - Optimizers often have a misalignment between the intended goal and the realized inner optimization (inner/outer alignment problem, mesa-optimizers, etc.).
   - The analogy to evolution is often offered: evolution is optimizing for replication of genes, yet enacted human values have only a little to do with that (wanting to have children, etc.); humans mostly care about non-genetic things (comfort, happiness, truth), and are often misaligned to genes (using contraception).
   - Even goals perfectly specified for a modest context (e.g. human-scale values) will generalize to a broader context (e.g. control of the light-cone) in an ill-defined way. There is a one-to-many mapping from the small to the large context, and so there is no way to establish dynamics that pick which exact goals are enacted in the extrapolated context.
5. In the space of “all possible goals,” the vast majority are nonsense/meaningless. A small subspace of this total space is being selected by human design (making AIs that understand human data, and do human things like solve problems, design technology, make money, etc.). Even within this subspace, however, there is enormous heterogeneity in what the “effective goals” look like, and only a tiny fraction of those possible AI goals involve having flourishing humans (or other sentient minds).
   - To be clear, humans will design AIs with the intention that their effective goals preserve human flourishing, but (cf. #4) this is a difficult, ill-posed problem. The default outcome is an AI optimizing for something other than human flourishing.
6. A powerful system pursuing goals that don’t explicitly require humans will, generally speaking, not be good for humans. For instance, a system trying to harness as much energy as possible for its computational goals will not worry about the fact that humans die as it converts all the matter in the solar system into solar cells and computer clusters.
7. A superhuman (#2) system with real-world control (#1) pursuing (with maximum effort, #3) goals misaligned to human values (#4) will try to enact a future that does not include humans (#5). It will, generically, succeed in this effort, which will incidentally exterminate humans (#6).
8. Moreover, this isn’t a case where one can just keep trying until one gets it right. The very first ASI could spell ruin, after which one does not get another chance. It’s like trying to send a rocket to the moon, without being able to do test flights! (And where failure means extinction.)
This argument leaves many things unspecified and undefended. The purpose is not to provide an airtight argument for ASI risk, but rather to enumerate the conceptual steps, so that one can focus a discussion down to the actual crux of disagreement.