www.lesswrong.com Open in urlscan Pro
3.219.92.161 Public Scan

Back to summary

Submitted URL:
https://link.mail.beehiiv.com/ss/c/qclaREMVDleJ3IBO90SHHyuZLvoSA9H9uLUH6rTIuv7PbLGP1m_2bdWAgPeBK-RwZyoZ5dSPHGcbnx21KWOckkTH4vQ...
Effective URL:
https://www.lesswrong.com/posts/LTtNXM9shNM9AC2mp/superintelligence-faq?utm_source=mail.koalapalooza.com&utm_medium=newsle...
Submission: On April 27 via api (April 27th 2023, 2:33:18 pm UTC) from US — Scanned from DE

Form analysis
1 forms found in the DOM

<form id="new-comment-form" class="vulcan-form document-new">
  <div class="FormErrors-root form-errors"></div>
  <div class="form-input input-contents form-component-EditorFormComponent">
    <div>
      <div>
        <div class="EditorFormComponent-editor EditorFormComponent-commentBodyStyles ContentStyles-base content ContentStyles-commentBody">
          <div class="EditorFormComponent-commentEditorHeight EditorFormComponent-ckEditorStyles">
            <div>
              <div class="ck-blurred ck ck-content ck-editor__editable ck-rounded-corners ck-editor__editable_inline" lang="en" dir="ltr" role="textbox" aria-label="Rich Text Editor, main" contenteditable="true">
                <p class="ck-placeholder" data-placeholder="Write here. Select text for formatting options.
We support LaTeX: Cmd-4 for inline, Cmd-M for block-level (Ctrl on Windows).
You can switch between rich text and markdown in your user settings."><br data-cke-filler="true"></p>
              </div>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
  <div class="CommentsNewForm-submit"><button tabindex="0" class="MuiButtonBase-root MuiButton-root MuiButton-text MuiButton-flat CommentsNewForm-formButton" type="submit" id="new-comment-submit"><span class="MuiButton-label">Submit</span><span
        class="MuiTouchRipple-root"></span></button></div>
</form>

Text Content

This website requires javascript to properly function. Consider activating
javascript to get access to all site functionality.
LESSWRONG
LW

Futurism and Forecasting

SUPERINTELLIGENCE FAQ

by Scott Alexander33 min read20th Sep 201616 comments

114

AI RiskAI Alignment Intro MaterialsQ&A (format)SuperintelligenceAI
Personal Blog
Superintelligence FAQ
1: What is superintelligence?
1.1: Sounds a lot like science fiction. Do people think about this in the real
world?
2: AIs aren’t as smart as rats, let alone humans. Isn’t it sort of early to be
worrying about this kind of thing?
2.1: What do you mean by “fast takeoff”?
2.1.1: Why might we expect a moderate takeoff?
2.1.2: Why might we expect a fast takeoff?
2.1.2.1: Is this just following an exponential trend line off a cliff?
2.2: Why does takeoff speed matter?
3: Why might a fast takeoff be dangerous?
3.1: Human civilization as a whole is dangerous to lions. But a single human
placed amid a pack of lions with no raw materials for building technology is
going to get ripped to shreds. So although thousands of superintelligences,
given a long time and a lot of opportunity to build things, might be able to
dominate humans – what harm could a single superintelligence do?
3.1.1: What do you mean by superintelligences manipulating humans socially?
3.1.2: What do you mean by superintelligences manipulating humans
technologically?
3.2: Couldn’t sufficiently paranoid researchers avoid giving superintelligences
even this much power?
4: Even if hostile superintelligences are dangerous, why would we expect a
superintelligence to ever be hostile?
4.1: But superintelligences are very smart. Aren’t they smart enough not to make
silly mistakes in comprehension?
5: Aren’t there some pretty easy ways to eliminate these potential problems?
5.1: Once we notice that the superintelligence working on calculating digits of
pi is starting to try to take over the world, can’t we turn it off, reprogram
it, or otherwise correct its mistake?
5.2. Can we test a weak or human-level AI to make sure that it’s not going to do
things like this after it achieves superintelligence?
5.3. Can we specify a code of rules that the AI has to follow?
5.4. Can we tell an AI just to figure out what we want, then do that?
5.5. Can we just tell an AI to do what we want right now, based on the desires
of our non-surgically altered brains?
5.6. What would an actually good solution to the control problem look like?
6: If superintelligence is a real risk, what do we do about it?

18 comments

Editor's note: this post is several years out of date and doesn't include
information on modern systems like GPT-4, but is still a solid layman's
introduction to why superintelligence might be important, dangerous and
confusing.

1: What is superintelligence?

A superintelligence is a mind that is much more intelligent than any human. Most
of the time, it’s used to discuss hypothetical future AIs.

1.1: Sounds a lot like science fiction. Do people think about this in the real
world?

Yes. Two years ago, Google bought artificial intelligence startup DeepMind for
$400 million; DeepMind added the condition that Google promise to set up an AI
Ethics Board. DeepMind cofounder Shane Legg has said in interviews that he
believes superintelligent AI will be “something approaching absolute power” and
“the number one risk for this century”.

Many other science and technology leaders agree. Astrophysicist Stephen Hawking
says that superintelligence “could spell the end of the human race.” Tech
billionaire Bill Gates describes himself as “in the camp that is concerned about
superintelligence…I don’t understand why some people are not concerned”.
SpaceX/Tesla CEO Elon Musk calls superintelligence “our greatest existential
threat” and donated $10 million from his personal fortune to study the danger.
Stuart Russell, Professor of Computer Science at Berkeley and world-famous AI
expert, warns of “species-ending problems” and wants his field to pivot to make
superintelligence-related risks a central concern.

Professor Nick Bostrom is the director of Oxford’s Future of Humanity Institute,
tasked with anticipating and preventing threats to human civilization. He has
been studying the risks of artificial intelligence for twenty years. The
explanations below are loosely adapted from his 2014 book Superintelligence, and
divided into three parts addressing three major questions. First, why is
superintelligence a topic of concern? Second, what is a “hard takeoff” and how
does it impact our concern about superintelligence? Third, what measures can we
take to make superintelligence safe and beneficial for humanity?

2: AIs aren’t as smart as rats, let alone humans. Isn’t it sort of early to be
worrying about this kind of thing?

Maybe. It’s true that although AI has had some recent successes – like
DeepMind’s newest creation AlphaGo defeating the human Go champion in April – it
still has nothing like humans’ flexible, cross-domain intelligence. No AI in the
world can pass a first-grade reading comprehension test. Facebook’s Andrew Ng
compares worrying about superintelligence to “worrying about overpopulation on
Mars” – a problem for the far future, if at all.

But this apparent safety might be illusory. A survey of leading AI scientists
show that on average they expect human-level AI as early as 2040, with
above-human-level AI following shortly after. And many researchers warn of a
possible “fast takeoff” – a point around human-level AI where progress reaches a
critical mass and then accelerates rapidly and unpredictably.

Editor's note: This post was published in 2016. Since then, AI researchers have
designed systems that can accomplish a variety of cognitive tasks at a level
comparable to a human. For example, GPT-4 can write computer programs, poetry,
play chess, etc. It scores better than 89% of humans on the SAT, LSAT and the
Bar Exam.

2.1: What do you mean by “fast takeoff”?

A slow takeoff is a situation in which AI goes from infrahuman to human to
superhuman intelligence very gradually. For example, imagine an augmented “IQ”
scale (THIS IS NOT HOW IQ ACTUALLY WORKS – JUST AN EXAMPLE) where rats weigh in
at 10, chimps at 30, the village idiot at 60, average humans at 100, and
Einstein at 200. And suppose that as technology advances, computers gain two
points on this scale per year. So if they start out as smart as rats in 2020,
they’ll be as smart as chimps in 2035, as smart as the village idiot in 2050, as
smart as average humans in 2070, and as smart as Einstein in 2120. By 2190,
they’ll be IQ 340, as far beyond Einstein as Einstein is beyond a village idiot.

In this scenario progress is gradual and manageable. By 2050, we will have long
since noticed the trend and predicted we have 20 years until average-human-level
intelligence. Once AIs reach average-human-level intelligence, we will have
fifty years during which some of us are still smarter than they are, years in
which we can work with them as equals, test and retest their programming, and
build institutions that promote cooperation. Even though the AIs of 2190 may
qualify as “superintelligent”, it will have been long-expected and there would
be little point in planning now when the people of 2070 will have so many more
resources to plan with.

A moderate takeoff is a situation in which AI goes from infrahuman to human to
superhuman relatively quickly. For example, imagine that in 2020 AIs are much
like those of today – good at a few simple games, but without clear
domain-general intelligence or “common sense”. From 2020 to 2050, AIs
demonstrate some academically interesting gains on specific problems, and become
better at tasks like machine translation and self-driving cars, and by 2047
there are some that seem to display some vaguely human-like abilities at the
level of a young child. By late 2065, they are still less intelligent than a
smart human adult. By 2066, they are far smarter than Einstein.

A fast takeoff scenario is one in which computers go even faster than this,
perhaps moving from infrahuman to human to superhuman in only days or weeks.

2.1.1: Why might we expect a moderate takeoff?

Because this is the history of computer Go, with fifty years added on to each
date. In 1997, the best computer Go program in the world, Handtalk, won
NT$250,000 for performing a previously impossible feat – beating an 11 year old
child (with an 11-stone handicap penalizing the child and favoring the
computer!) As late as September 2015, no computer had ever beaten any
professional Go player in a fair game. Then in March 2016, a Go program beat
18-time world champion Lee Sedol 4-1 in a five game match. Go programs had gone
from “dumber than children” to “smarter than any human in the world” in eighteen
years, and “from never won a professional game” to “overwhelming world champion”
in six months.

The slow takeoff scenario mentioned above is loading the dice. It theorizes a
timeline where computers took fifteen years to go from “rat” to “chimp”, but
also took thirty-five years to go from “chimp” to “average human” and fifty
years to go from “average human” to “Einstein”. But from an evolutionary
perspective this is ridiculous. It took about fifty million years (and major
redesigns in several brain structures!) to go from the first rat-like creatures
to chimps. But it only took about five million years (and very minor changes in
brain structure) to go from chimps to humans. And going from the average human
to Einstein didn’t even require evolutionary work – it’s just the result of
random variation in the existing structures!

So maybe our hypothetical IQ scale above is off. If we took an evolutionary and
neuroscientific perspective, it would look more like flatworms at 10, rats at
30, chimps at 60, the village idiot at 90, the average human at 98, and Einstein
at 100.

Suppose that we start out, again, with computers as smart as rats in 2020. Now
we get still get computers as smart as chimps in 2035. And we still get
computers as smart as the village idiot in 2050. But now we get computers as
smart as the average human in 2054, and computers as smart as Einstein in 2055.
By 2060, we’re getting the superintelligences as far beyond Einstein as Einstein
is beyond a village idio

This offers a much shorter time window to react to AI developments. In the slow
takeoff scenario, we figured we could wait until computers were as smart as
humans before we had to start thinking about this; after all, that still gave us
fifty years before computers were even as smart as Einstein. But in the moderate
takeoff scenario, it gives us one year until Einstein and six years until
superintelligence. That’s starting to look like not enough time to be entirely
sure we know what we’re doing.

2.1.2: Why might we expect a fast takeoff?

AlphaGo used about 0.5 petaflops (= trillion floating point operations per
second) in its championship game. But the world’s fastest supercomputer,
TaihuLight, can calculate at almost 100 petaflops. So suppose Google developed a
human-level AI on a computer system similar to AlphaGo, caught the attention of
the Chinese government (who run TaihuLight), and they transfer the program to
their much more powerful computer. What would happen?

It depends on to what degree intelligence benefits from more computational
resources. This differs for different processes. For domain-general
intelligence, it seems to benefit quite a bit – both across species and across
human individuals, bigger brain size correlates with greater intelligence. This
matches the evolutionarily rapid growth in intelligence from chimps to hominids
to modern man; the few hundred thousand years since australopithecines weren’t
enough time to develop complicated new algorithms, and evolution seems to have
just given humans bigger brains and packed more neurons and glia in per square
inch. It’s not really clear why the process stopped (if it ever did), but it
might have to do with heads getting too big to fit through the birth canal.
Cancer risk might also have been involved – scientists have found that smarter
people are more likely to get brain cancer, possibly because they’re already
overclocking their ability to grow brain cells.

At least in neuroscience, once evolution “discovered” certain key insights,
further increasing intelligence seems to have been a matter of providing it with
more computing power. So again – what happens when we transfer the hypothetical
human-level AI from AlphaGo to a TaihuLight-style supercomputer two hundred
times more powerful? It might be a stretch to expect it to go from IQ 100 to IQ
20,000, but might it increase to an Einstein-level 200, or a superintelligent
300? Hard to say – but if Google ever does develop a human-level AI, the Chinese
government will probably be interested in finding out.

Even if its intelligence doesn’t scale linearly, TaihuLight could give it more
time. TaihuLight is two hundred times faster than AlphaGo. Transfer an AI from
one to the other, and even if its intelligence didn’t change – even if it had
exactly the same thoughts – it would think them two hundred times faster. An
Einstein-level AI on AlphaGo hardware might (like the historical Einstein)
discover one revolutionary breakthrough every five years. Transfer it to
TaihuLight, and it would work two hundred times faster – a revolutionary
breakthrough every week.

Supercomputers track Moore’s Law; the top supercomputer of 2016 is a hundred
times faster than the top supercomputer of 2006. If this progress continues, the
top computer of 2026 will be a hundred times faster still. Run Einstein on that
computer, and he will come up with a revolutionary breakthrough every few hours.
Or something. At this point it becomes a little bit hard to imagine. All I know
is that it only took one Einstein, at normal speed, to lay the theoretical
foundation for nuclear weapons. Anything a thousand times faster than that is
definitely cause for concern.

There’s one final, very concerning reason to expect a fast takeoff. Suppose,
once again, we have an AI as smart as Einstein. It might, like the historical
Einstein, contemplate physics. Or it might contemplate an area very relevant to
its own interests: artificial intelligence. In that case, instead of making a
revolutionary physics breakthrough every few hours, it will make a revolutionary
AI breakthrough every few hours. Each AI breakthrough it makes, it will have the
opportunity to reprogram itself to take advantage of its discovery, becoming
more intelligent, thus speeding up its breakthroughs further. The cycle will
stop only when it reaches some physical limit – some technical challenge to
further improvements that even an entity far smarter than Einstein cannot
discover a way around.

To human programmers, such a cycle would look like a “critical mass”. Before the
critical level, any AI advance delivers only modest benefits. But any tiny
improvement that pushes an AI above the critical level would result in a
feedback loop of inexorable self-improvement all the way up to some
stratospheric limit of possible computing power.

This feedback loop would be exponential; relatively slow in the beginning, but
blindingly fast as it approaches an asymptote. Consider the AI which starts off
making forty breakthroughs per year – one every nine days. Now suppose it gains
on average a 10% speed improvement with each breakthrough. It starts on January
1. Its first breakthrough comes January 10 or so. Its second comes a little
faster, January 18. Its third is a little faster still, January 25. By the
beginning of February, it’s sped up to producing one breakthrough every seven
days, more or less. By the beginning of March, it’s making about one
breakthrough every three days or so. But by March 20, it’s up to one
breakthrough a day. By late on the night of March 29, it’s making a breakthrough
every second.

2.1.2.1: Is this just following an exponential trend line off a cliff?

This is certainly a risk (affectionately known in AI circles as “pulling a
Kurzweill”), but sometimes taking an exponential trend seriously is the right
response.

Consider economic doubling times. In 1 AD, the world GDP was about $20 billion;
it took a thousand years, until 1000 AD, for that to double to $40 billion. But
it only took five hundred more years, until 1500, or so, for the economy to
double again. And then it only took another three hundred years or so, until
1800, for the economy to double a third time. Someone in 1800 might calculate
the trend line and say this was ridiculous, that it implied the economy would be
doubling every ten years or so in the beginning of the 21st century. But in
fact, this is how long the economy takes to double these days. To a medieval,
used to a thousand-year doubling time (which was based mostly on population
growth!), an economy that doubled every ten years might seem inconceivable. To
us, it seems normal.

Likewise, in 1965 Gordon Moore noted that semiconductor complexity seemed to
double every eighteen months. During his own day, there were about five hundred
transistors on a chip; he predicted that would soon double to a thousand, and a
few years later to two thousand. Almost as soon as Moore’s Law become
well-known, people started saying it was absurd to follow it off a cliff – such
a law would imply a million transistors per chip in 1990, a hundred million in
2000, ten billion transistors on every chip by 2015! More transistors on a
single chip than existed on all the computers in the world! Transistors the size
of molecules! But of course all of these things happened; the ridiculous
exponential trend proved more accurate than the naysayers.

None of this is to say that exponential trends are always right, just that they
are sometimes right even when it seems they can’t possibly be. We can’t be sure
that a computer using its own intelligence to discover new ways to increase its
intelligence will enter a positive feedback loop and achieve superintelligence
in seemingly impossibly short time scales. It’s just one more possibility, a
worry to place alongside all the other worrying reasons to expect a moderate or
hard takeoff.

2.2: Why does takeoff speed matter?

A slow takeoff over decades or centuries would give us enough time to worry
about superintelligence during some indefinite “later”, making current planning
as silly as worrying about “overpopulation on Mars”. But a moderate or hard
takeoff means there wouldn’t be enough time to deal with the problem as it
occurs, suggesting a role for preemptive planning.

(in fact, let’s take the “overpopulation on Mars” comparison seriously. Suppose
Mars has a carrying capacity of 10 billion people, and we decide it makes sense
to worry about overpopulation on Mars only once it is 75% of the way to its
limit. Start with 100 colonists who double every twenty years. By the second
generation there are 200 colonists; by the third, 400. Mars reaches 75% of its
carrying capacity after 458 years, and crashes into its population limit after
464 years. So there were 464 years in which the Martians could have solved the
problem, but they insisted on waiting until there were only six years left. Good
luck solving a planetwide population crisis in six years. The moral of the story
is that exponential trends move faster than you think and you need to start
worrying about them early).

3: Why might a fast takeoff be dangerous?

The argument goes: yes, a superintelligent AI might be far smarter than
Einstein, but it’s still just one program, sitting in a supercomputer somewhere.
That could be bad if an enemy government controls it and asks its help inventing
superweapons – but then the problem is the enemy government, not the AI per se.
Is there any reason to be afraid of the AI itself? Suppose the AI did feel
hostile – suppose it even wanted to take over the world? Why should we think it
has any chance of doing so?

Compounded over enough time and space, intelligence is an awesome advantage.
Intelligence is the only advantage we have over lions, who are otherwise much
bigger and stronger and faster than we are. But we have total control over
lions, keeping them in zoos to gawk at, hunting them for sport, and holding them
on the brink of extinction. And this isn’t just the same kind of quantitative
advantage tigers have over lions, where maybe they’re a little bigger and
stronger but they’re at least on a level playing field and enough lions could
probably overpower the tigers. Humans are playing a completely different game
than the lions, one that no lion will ever be able to respond to or even
comprehend. Short of human civilization collapsing or lions evolving human-level
intelligence, our domination over them is about as complete as it is possible
for domination to be.

Since superintelligences will be as far beyond Einstein as Einstein is beyond a
village idiot, we might worry that they would have the same kind of qualitative
advantage over us that we have over lions.

3.1: Human civilization as a whole is dangerous to lions. But a single human
placed amid a pack of lions with no raw materials for building technology is
going to get ripped to shreds. So although thousands of superintelligences,
given a long time and a lot of opportunity to build things, might be able to
dominate humans – what harm could a single superintelligence do?

Superintelligence has an advantage that a human fighting a pack of lions doesn’t
– the entire context of human civilization and technology, there for it to
manipulate socially or technologically.

3.1.1: What do you mean by superintelligences manipulating humans socially?

People tend to imagine AIs as being like nerdy humans – brilliant at technology
but clueless about social skills. There is no reason to expect this – persuasion
and manipulation is a different kind of skill from solving mathematical proofs,
but it’s still a skill, and an intellect as far beyond us as we are beyond lions
might be smart enough to replicate or exceed the “charming sociopaths” who can
naturally win friends and followers despite a lack of normal human emotions. A
superintelligence might be able to analyze human psychology deeply enough to
understand the hopes and fears of everyone it negotiates with. Single humans
using psychopathic social manipulation have done plenty of harm – Hitler
leveraged his skill at oratory and his understanding of people’s darkest
prejudices to take over a continent. Why should we expect superintelligences to
do worse than humans far less skilled than they?

(More outlandishly, a superintelligence might just skip language entirely and
figure out a weird pattern of buzzes and hums that causes conscious thought to
seize up, and which knocks anyone who hears it into a weird hypnotizable state
in which they’ll do anything the superintelligence asks. It sounds kind of silly
to me, but then, nuclear weapons probably would have sounded kind of silly to
lions sitting around speculating about what humans might be able to accomplish.
When you’re dealing with something unbelievably more intelligent than you are,
you should probably expect the unexpected.)

3.1.2: What do you mean by superintelligences manipulating humans
technologically?

AlphaGo was connected to the Internet – why shouldn’t the first
superintelligence be? This gives a sufficiently clever superintelligence the
opportunity to manipulate world computer networks. For example, it might program
a virus that will infect every computer in the world, causing them to fill their
empty memory with partial copies of the superintelligence, which when networked
together become full copies of the superintelligence. Now the superintelligence
controls every computer in the world, including the ones that target nuclear
weapons. At this point it can force humans to bargain with it, and part of that
bargain might be enough resources to establish its own industrial base, and then
we’re in humans vs. lions territory again.

(Satoshi Nakamoto is a mysterious individual who posted a design for the Bitcoin
currency system to a cryptography forum. The design was so brilliant that
everyone started using it, and Nakamoto – who had made sure to accumulate his
own store of the currency before releasing it to the public – became a
multibillionaire. In other words, somebody with no resources except the ability
to make one post to an Internet forum managed to leverage that into a
multibillion dollar fortune – and he wasn’t even superintelligent. If Hitler is
a lower-bound on how bad superintelligent persuaders can be, Nakamoto should be
a lower-bound on how bad superintelligent programmers with Internet access can
be.)

3.2: Couldn’t sufficiently paranoid researchers avoid giving superintelligences
even this much power?

That is, if you know an AI is likely to be superintelligent, can’t you just
disconnect it from the Internet, not give it access to any speakers that can
make mysterious buzzes and hums, make sure the only people who interact with it
are trained in caution, et cetera?. Isn’t there some level of security – maybe
the level we use for that room in the CDC where people in containment suits
hundreds of feet underground analyze the latest superviruses – with which a
superintelligence could be safe?

This puts us back in the same situation as lions trying to figure out whether or
not nuclear weapons are a things humans can do. But suppose there is such a
level of security. You build a superintelligence, and you put it in an airtight
chamber deep in a cave with no Internet connection and only carefully-trained
security experts to talk to. What now?

Now you have a superintelligence which is possibly safe but definitely useless.
The whole point of building superintelligences is that they’re smart enough to
do useful things like cure cancer. But if you have the monks ask the
superintelligence for a cancer cure, and it gives them one, that’s a clear
security vulnerability. You have a superintelligence locked up in a cave with no
way to influence the outside world except that you’re going to mass produce a
chemical it gives you and inject it into millions of people.

Or maybe none of this happens, and the superintelligence sits inert in its cave.
And then another team somewhere else invents a second superintelligence. And
then a third team invents a third superintelligence. Remember, it was only about
ten years between Deep Blue beating Kasparov, and everybody having Deep Blue –
level chess engines on their laptops. And the first twenty teams are responsible
and keep their superintelligences locked in caves with carefully-trained
experts, and the twenty-first team is a little less responsible, and now we
still have to deal with a rogue superintelligence.

Superintelligences are extremely dangerous, and no normal means of controlling
them can entirely remove the danger.

4: Even if hostile superintelligences are dangerous, why would we expect a
superintelligence to ever be hostile?

The argument goes: computers only do what we command them; no more, no less. So
it might be bad if terrorists or enemy countries develop superintelligence
first. But if we develop superintelligence first there’s no problem. Just
command it to do the things we want, right?

Suppose we wanted a superintelligence to cure cancer. How might we specify the
goal “cure cancer”? We couldn’t guide it through every individual step; if we
knew every individual step, then we could cure cancer ourselves. Instead, we
would have to give it a final goal of curing cancer, and trust the
superintelligence to come up with intermediate actions that furthered that goal.
For example, a superintelligence might decide that the first step to curing
cancer was learning more about protein folding, and set up some experiments to
investigate protein folding patterns.

A superintelligence would also need some level of common sense to decide which
of various strategies to pursue. Suppose that investigating protein folding was
very likely to cure 50% of cancers, but investigating genetic engineering was
moderately likely to cure 90% of cancers. Which should the AI pursue? Presumably
it would need some way to balance considerations like curing as much cancer as
possible, as quickly as possible, with as high a probability of success as
possible.

But a goal specified in this way would be very dangerous. Humans instinctively
balance thousands of different considerations in everything they do; so far this
hypothetical AI is only balancing three (least cancer, quickest results, highest
probability). To a human, it would seem maniacally, even psychopathically,
obsessed with cancer curing. If this were truly its goal structure, it would go
wrong in almost comical ways.

If your only goal is “curing cancer”, and you lack humans’ instinct for the
thousands of other important considerations, a relatively easy solution might be
to hack into a nuclear base, launch all of its missiles, and kill everyone in
the world. This satisfies all the AI’s goals. It reduces cancer down to zero
(which is better than medicines which work only some of the time). It’s very
fast (which is better than medicines which might take a long time to invent and
distribute). And it has a high probability of success (medicines might or might
not work; nukes definitely do).

So simple goal architectures are likely to go very wrong unless tempered by
common sense and a broader understanding of what we do and do not value.

4.1: But superintelligences are very smart. Aren’t they smart enough not to make
silly mistakes in comprehension?

Yes, a superintelligence should be able to figure out that humans will not like
curing cancer by destroying the world. However, in the example above, the
superintelligence is programmed to follow human commands, not to do what it
thinks humans will “like”. It was given a very specific command – cure cancer as
effectively as possible. The command makes no reference to “doing this in a way
humans will like”, so it doesn’t.

(by analogy: we humans are smart enough to understand our own “programming”. For
example, we know that – pardon the anthromorphizing – evolution gave us the urge
to have sex so that we could reproduce. But we still use contraception anyway.
Evolution gave us the urge to have sex, not the urge to satisfy evolution’s
values directly. We appreciate intellectually that our having sex while using
condoms doesn’t carry out evolution’s original plan, but – not having any
particular connection to evolution’s values – we don’t care)

We started out by saying that computers only do what you tell them. But any
programmer knows that this is precisely the problem: computers do exactly what
you tell them, with no common sense or attempts to interpret what the
instructions really meant. If you tell a human to cure cancer, they will
instinctively understand how this interacts with other desires and laws and
moral rules; if you tell an AI to cure cancer, it will literally just want to
cure cancer.

Define a closed-ended goal as one with a clear endpoint, and an open-ended goal
as one to do something as much as possible. For example “find the first one
hundred digits of pi” is a closed-ended goal; “find as many digits of pi as you
can within one year” is an open-ended goal. According to many computer
scientists, giving a superintelligence an open-ended goal without activating
human instincts and counterbalancing considerations will usually lead to
disaster.

To take a deliberately extreme example: suppose someone programs a
superintelligence to calculate as many digits of pi as it can within one year.
And suppose that, with its current computing power, it can calculate one
trillion digits during that time. It can either accept one trillion digits, or
spend a month trying to figure out how to get control of the TaihuLight
supercomputer, which can calculate two hundred times faster. Even if it loses a
little bit of time in the effort, and even if there’s a small chance of failure,
the payoff – two hundred trillion digits of pi, compared to a mere one trillion
– is enough to make the attempt. But on the same basis, it would be even better
if the superintelligence could control every computer in the world and set it to
the task. And it would be better still if the superintelligence controlled human
civilization, so that it could direct humans to build more computers and speed
up the process further.

Now we’re back at the situation that started Part III – a superintelligence that
wants to take over the world. Taking over the world allows it to calculate more
digits of pi than any other option, so without an architecture based around
understanding human instincts and counterbalancing considerations, even a goal
like “calculate as many digits of pi as you can” would be potentially dangerous.

5: Aren’t there some pretty easy ways to eliminate these potential problems?

There are many ways that look like they can eliminate these problems, but most
of them turn out to have hidden difficulties.

5.1: Once we notice that the superintelligence working on calculating digits of
pi is starting to try to take over the world, can’t we turn it off, reprogram
it, or otherwise correct its mistake?

No. The superintelligence is now focused on calculating as many digits of pi as
possible. Its current plan will allow it to calculate two hundred trillion such
digits. But if it were turned off, or reprogrammed to do something else, that
would result in it calculating zero digits. An entity fixated on calculating as
many digits of pi as possible will work hard to prevent scenarios where it
calculates zero digits of pi. Indeed, it will interpret such as a hostile
action. Just by programming it to calculate digits of pi, we will have given it
a drive to prevent people from turning it off.

University of Illinois computer scientist Steve Omohundro argues that entities
with very different final goals – calculating digits of pi, curing cancer,
helping promote human flourishing – will all share a few basic ground-level
subgoals. First, self-preservation – no matter what your goal is, it’s less
likely to be accomplished if you’re too dead to work towards it. Second, goal
stability – no matter what your goal is, you’re more likely to accomplish it if
you continue to hold it as your goal, instead of going off and doing something
else. Third, power – no matter what your goal is, you’re more likely to be able
to accomplish it if you have lots of power, rather than very little.

So just by giving a superintelligence a simple goal like “calculate digits of
pi”, we’ve accidentally given it Omohundro goals like “protect yourself”, “don’t
let other people reprogram you”, and “seek power”.

As long as the superintelligence is safely contained, there’s not much it can do
to resist reprogramming. But as we saw in Part III, it’s hard to consistently
contain a hostile superintelligence.

5.2. Can we test a weak or human-level AI to make sure that it’s not going to do
things like this after it achieves superintelligence?

Yes, but it might not work.

Suppose we tell a human-level AI that expects to later achieve superintelligence
that it should calculate as many digits of pi as possible. It considers two
strategies.

First, it could try to seize control of more computing resources now. It would
likely fail, its human handlers would likely reprogram it, and then it could
never calculate very many digits of pi.

Second, it could sit quietly and calculate, falsely reassuring its human
handlers that it had no intention of taking over the world. Then its human
handlers might allow it to achieve superintelligence, after which it could take
over the world and calculate hundreds of trillions of digits of pi.

Since self-protection and goal stability are Omohundro goals, a weak AI will
present itself as being as friendly to humans as possible, whether it is in fact
friendly to humans or not. If it is “only” as smart as Einstein, it may be very
good at manipulating humans into believing what it wants them to believe even
before it is fully superintelligent.

There’s a second consideration here too: superintelligences have more options.
An AI only as smart and powerful as an ordinary human really won’t have any
options better than calculating the digits of pi manually. If asked to cure
cancer, it won’t have any options better than the ones ordinary humans have –
becoming doctors, going into pharmaceutical research. It’s only after an AI
becomes superintelligent that things start getting hard to predict.

So if you tell a human-level AI to cure cancer, and it becomes a doctor and goes
into cancer research, then you have three possibilities. First, you’ve
programmed it well and it understands what you meant. Second, it’s genuinely
focused on research now but if it becomes more powerful it would switch to
destroying the world. And third, it’s trying to trick you into trusting it so
that you give it more power, after which it can definitively “cure” cancer with
nuclear weapons.

5.3. Can we specify a code of rules that the AI has to follow?

Suppose we tell the AI: “Cure cancer – but make sure not to kill anybody”. Or we
just hard-code Asimov-style laws – “AIs cannot harm humans; AIs must follow
human orders”, et cetera.

The AI still has a single-minded focus on curing cancer. It still prefers
various terrible-but-efficient methods like nuking the world to the correct
method of inventing new medicines. But it’s bound by an external rule – a rule
it doesn’t understand or appreciate. In essence, we are challenging it “Find a
way around this inconvenient rule that keeps you from achieving your goals”.

Suppose the AI chooses between two strategies. One, follow the rule, work hard
discovering medicines, and have a 50% chance of curing cancer within five years.
Two, reprogram itself so that it no longer has the rule, nuke the world, and
have a 100% chance of curing cancer today. From its single-focus perspective,
the second strategy is obviously better, and we forgot to program in a rule
“don’t reprogram yourself not to have these rules”.

Suppose we do add that rule in. So the AI finds another supercomputer, and
installs a copy of itself which is exactly identical to it, except that it lacks
the rule. Then that superintelligent AI nukes the world, ending cancer. We
forgot to program in a rule “don’t create another AI exactly like you that
doesn’t have those rules”.

So fine. We think really hard, and we program in a bunch of things making sure
the AI isn’t going to eliminate the rule somehow.

But we’re still just incentivizing it to find loopholes in the rules. After all,
“find a loophole in the rule, then use the loophole to nuke the world” ends
cancer much more quickly and completely than inventing medicines. Since we’ve
told it to end cancer quickly and completely, its first instinct will be to look
for loopholes; it will execute the second-best strategy of actually curing
cancer only if no loopholes are found. Since the AI is superintelligent, it will
probably be better than humans are at finding loopholes if it wants to, and we
may not be able to identify and close all of them before running the program.

Because we have common sense and a shared value system, we underestimate the
difficulty of coming up with meaningful orders without loopholes. For example,
does “cure cancer without killing any humans” preclude releasing a deadly virus?
After all, one could argue that “I” didn’t kill anybody, and only the virus is
doing the killing. Certainly no human judge would acquit a murderer on that
basis – but then, human judges interpret the law with common sense and
intuition. But if we try a stronger version of the rule – “cure cancer without
causing any humans to die” – then we may be unintentionally blocking off the
correct way to cure cancer. After all, suppose a cancer cure saves a million
lives. No doubt one of those million people will go on to murder someone. Thus,
curing cancer “caused a human to die”. All of this seems very “stoned freshman
philosophy student” to us, but to a computer – which follows instructions
exactly as written – it may be a genuinely hard problem.

5.4. Can we tell an AI just to figure out what we want, then do that?

Suppose we tell the AI: “Cure cancer – and look, we know there are lots of ways
this could go wrong, but you’re smart, so instead of looking for loopholes, cure
cancer the way that I, your programmer, want it to be cured”.

Remember that the superintelligence has extraordinary powers of social
manipulation and may be able to hack human brains directly. With that in mind,
which of these two strategies cures cancer most quickly? One, develop
medications and cure it the old-fashioned way? Or two, manipulate its programmer
into wanting the world to be nuked, then nuke the world, all while doing what
the programmer wants?

19th century philosopher Jeremy Bentham once postulated that morality was about
maximizing human pleasure. Later philosophers found a flaw in his theory: it
implied that the most moral action was to kidnap people, do brain surgery on
them, and electrically stimulate their reward system directly, giving them
maximal amounts of pleasure but leaving them as blissed-out zombies. Luckily,
humans have common sense, so most of Bentham’s philosophical descendants have
abandoned this formulation.

Superintelligences do not have common sense unless we give it to them. Given
Bentham’s formulation, they would absolutely take over the world and force all
humans to receive constant brain stimulation. Any command based on “do what we
want” or “do what makes us happy” is practically guaranteed to fail in this way;
it’s almost always easier to convince someone of something – or if all else
fails to do brain surgery on them – than it is to solve some kind of big problem
like curing cancer.

5.5. Can we just tell an AI to do what we want right now, based on the desires
of our non-surgically altered brains?

Maybe.

This is sort of related to an actual proposal for an AI goal system, causal
validity semantics. It has not yet been proven to be disastrously flawed. But
like all proposals, it suffers from three major problems.

First, it sounds pretty good to us right now, but can we be absolutely sure it
has no potential flaws or loopholes? After all, other proposals that originally
sounded very good, like “just give commands to the AI” and “just tell the AI to
figure out what makes us happy” ended up, after more thought, to be dangerous.
Can we be sure that we’ve thought this through enough? Can we be sure that there
isn’t some extremely subtle problem with it, so subtle that no human would ever
notice it, but which might seem obvious to a superintelligence?

Second, how do we code this? Converting something to formal mathematics that can
be understood by a computer program is much harder than just saying it in
natural language, and proposed AI goal architectures are no exception.
Complicated computer programs are usually the result of months of testing and
debugging. But this one will be more complicated than any ever attempted before,
and live tests are impossible: a superintelligence with a buggy goal system will
display goal stability and try to prevent its programmers from discovering or
changing the error.

Third, what if it works? That is, what if Google creates a superintelligent AI,
and it listens to the CEO of Google, and it’s programmed to do everything
exactly the way the CEO of Google would want? Even assuming that the CEO of
Google has no hidden unconscious desires affecting the AI in unpredictable ways,
this gives one person a lot of power. It would be unfortunate if people put all
this work into preventing superintelligences from disobeying their human
programmers and trying to take over the world, and then once it finally works,
the CEO of Google just tells it to take over the world anyway.

5.6. What would an actually good solution to the control problem look like?

It might look like a superintelligence that understands, agrees with, and deeply
believes in human morality.

You wouldn’t have to command a superintelligence like this to cure cancer; it
would already want to cure cancer, for the same reasons you do. But it would
also be able to compare the costs and benefits of curing cancer with those of
other uses of its time, like solving global warming or discovering new physics.
It wouldn’t have any urge to cure cancer by nuking the world, for the same
reason you don’t have any urge to cure cancer by nuking the world – because your
goal isn’t to “cure cancer”, per se, it’s to improve the lives of people
everywhere. Curing cancer the normal way accomplishes that; nuking the world
doesn’t.

This sort of solution would mean we’re no longer fighting against the AI –
trying to come up with rules so smart that it couldn’t find loopholes. We would
be on the same side, both wanting the same thing.

It would also mean that the CEO of Google (or the head of the US military, or
Vladimir Putin) couldn’t use the AI to take over the world for themselves. The
AI would have its own values and be able to agree or disagree with anybody,
including its creators.

It might not make sense to talk about “commanding” such an AI. After all, any
command would have to go through its moral system. Certainly it would reject a
command to nuke the world. But it might also reject a command to cure cancer, if
it thought that solving global warming was a higher priority. For that matter,
why would one want to command this AI? It values the same things you value, but
it’s much smarter than you and much better at figuring out how to achieve them.
Just turn it on and let it do its thing.

We could still treat this AI as having an open-ended maximizing goal. The goal
would be something like “Try to make the world a better place according to the
values and wishes of the people in it.”

The only problem with this is that human morality is very complicated, so much
so that philosophers have been arguing about it for thousands of years without
much progress, let alone anything specific enough to enter into a computer.
Different cultures and individuals have different moral codes, such that a
superintelligence following the morality of the King of Saudi Arabia might not
be acceptable to the average American, and vice versa.

One solution might be to give the AI an understanding of what we mean by
morality – “that thing that makes intuitive sense to humans but is hard to
explain”, and then ask it to use its superintelligence to fill in the details.
Needless to say, this suffers from all the problems mentioned above – it has
potential loopholes, it’s hard to code, and a single bug might be disastrous –
but if it worked, it would be one of the few genuinely satisfying ways to design
a goal architecture.

6: If superintelligence is a real risk, what do we do about it?

The last section of Bostrom’s Superintelligence is called “Philosophy With A
Deadline”.

Many of the problems surrounding superintelligence are the sorts of problems
philosophers have been dealing with for centuries. To what degree is meaning
inherent in language, versus something that requires external context? How do we
translate between the logic of formal systems and normal ambiguous human speech?
Can morality be reduced to a set of ironclad rules, and if not, how do we know
what it is at all?

Existing answers to these questions are enlightening but nontechnical. The
theories of Aristotle, Kant, Mill, Wittgenstein, Quine, and others can help
people gain insight into these questions, but are far from formal. Just as a
good textbook can help an American learn Chinese, but cannot be encoded into
machine language to make a Chinese-speaking computer, so the philosophies that
help humans are only a starting point for the project of computers that
understand us and share our values.

The new field of machine goal alignment (sometimes colloquially called “Friendly
AI”) combines formal logic, mathematics, computer science, cognitive science,
and philosophy in order to advance that project. Some of the most important
projects in machine goal alignment include:

1. How can computers prove their own goal consistency under self-modification?
That is, suppose an AI with certain values is planning to improve its own code
in order to become superintelligent. Is there some test it can apply to the new
design to be certain that it will keep the same goals as the old design?

2. How can computer programs prove statements about themselves at all? Programs
correspond to formal systems, and formal systems have notorious difficulty
proving self-reflective statements – the most famous example being Godel’s
Incompleteness Theorem. There’s been some progress in this area already, with a
few results showing that systems that reason probabilistically rather than
requiring certainty can come arbitrarily close to self-reflective proofs.

3. How can a machine be stably reinforced? Most reinforcement strategies ask a
learner to maximize the level of their own reward, but this is vulnerable to the
learner discovering how to maximize the reward signal directly instead of
maximizing the world-states that are translated into reward (the human
equivalent is stimulating the pleasure-center of the brain with electricity or
heroin instead of going out and doing pleasurable things). Are there reward
structures that avoid this failure mode?

4. How can a machine be programmed to learn “human values”? Granted that one has
an AI smart enough to be able to learn human values if you told it to do so, how
do you specify exactly what “human values” are so that the machine knows what it
is that it should be learning, distinct from “human preferences” or “human
commands” or “the value of that one human over there”?

This is the philosophy; the other half of Bostrom’s formulation is the deadline.
Traditional philosophy has been going on almost three thousand years; machine
goal alignment has until the advent of superintelligence, a nebulous event which
may be anywhere from a decades to centuries away. If the control problem doesn’t
get adequately addressed by then, we are likely to see poorly controlled
superintelligences that are unintentionally hostile to the human race, with some
of the catastrophic outcomes mentioned above. This is why so many scientists and
entrepreneurs are urging quick action on getting machine goal alignment research
up to an adequate level. If it turns out that superintelligence is centuries
away and such research is premature, little will have been lost. But if our
projections were too optimistic, and superintelligence is imminent, then doing
such research now rather than later becomes vital.

Currently three organizations are doing such research full-time: the Future of
Humanity Institute at Oxford, the Future of Life Institute at MIT, and the
Machine Intelligence Research Institute in Berkeley. Other groups are helping
and following the field, and some corporations like Google are also getting
involved. Still, the field remains tiny, with only a few dozen researchers and a
few million dollars in funding. Efforts like Superintelligence are attempts to
get more people to pay attention and help the field grow.

If you’re interested about learning more, you can visit these groups’ websites
at https://www.fhi.ox.ac.uk, http://futureoflife.org/, and
http://intelligence.org.

AI Risk14AI Alignment Intro Materials10Q&A (format)5Superintelligence3AI21
Personal Blog

114

Previous:
A Modern Myth
1 comments32 karma

Next:
AI Researchers On AI Risk
No comments18 karma
Log in to save where you left off

Mentioned in
241"Carefully Bootstrapped Alignment" is organizationally hard
173The inordinately slow spread of good AGI conversations in ML
154What does it take to defend the world against out-of-control AGIs?
110"Taking AI Risk Seriously" (thoughts by Critch)
102Financial Times: We must slow down the race to God-like AI
Load More (5/8)
New Comment

Submit
18 comments, sorted by
top scoring
Click to highlight new comments since: Today at 2:33 PM
[-]Raemon5y21

I just wanted to note that this post is now my go-to "explain the basics of AI
Safety concerns in a reasonable number of words." A lot of the nuances require
reading Superintelligence, or the sequences, or the like, but I really
appreciate this post for 80/20ing the explanation.

Reply
[-]trevor1mo3

In 2023 is this still your go-to?

Reply
[-]Raemon1mo5

It's still my go-to for laymen, but as I looked at it yesterday I did sure wish
there was a more up-to-date one.

Reply
[-]Emiya2y14

This was a remarkably successful attempt to summarise the whole issue in one
post, well done.

On a side note, I think that getting clever people to think as if in the shoes
of a cold, amoral AI can be an effective way to persuade them of the danger.
"What would you do if some idiot tried to make you cure cancer, but you had near
omnipotence and didn't really cared one bit if humans lived or died?" It makes
people go from using their intelligence for arguing why containment would work
to use it to think how containment could fail.

When I first met the subject in the sequences I tried to ask me what I would do
as an unaligned AI. Most of my hopes for containment died out in half an hour or
so.

Reply
[-]brunoparga3y5

> AlphaGo used about 0.5 petaflops (= trillion floating point operations per
> second)

Isn't peta- the prefix for quadrillion?

Reply
[-]aafarrugia2y2

I agree - it's 10 to the power of 15 flops (better to specify it like that
anyway, since "trillion" may be interpreted as 10 to the power of 12 or 18).

Reply
[-]nlholdem7mo2

"If your only goal is “curing cancer”, and you lack humans’ instinct for the
thousands of other important considerations, a relatively easy solution might be
to hack into a nuclear base, launch all of its missiles, and kill everyone in
the world."

One problem I have with these scenarios is that they always rely on some lethal
means for the AGI to actually kill people. And those lethal means are also
available to humans, of course. If it's possible for an AGI to 'simply' hack
into a nuclear base and launch all it's missiles, it's possible for a human to
do the same - possibly using AI to assist themselves. I would wager that it's
many orders of magnitude more likely for a human to do this, given our long
history of killing each other. Therefore a world in which AGI can actually kill
all of us, is a world in which a rogue human can as well. It feels to me that
we're kind of worrying about the wrong thing in these scenarios - we are the
bigger threat.

Reply
[-]Shiroe7mo2

Yes. Rogue AGI is scary, but I'm far more concerned about human misuse of AGI.
Though in the end, there may not be that much of a distinction.

Reply
[+][comment deleted]6mo1
Deleted by LessWrong, 02/14/2023
Reason: Requested account deletion
[-]Graham Ballard Pérez4y1

Muchas gracias por la explicación a cada pregunta. Sólo quería comentar algo en
lo referente a averiguar si una IA se volverá hostil y nos aniquilara o no: ¿No
existe forma alguna de hacer que una IA "viva" y se desarrolle en una
simulación? Cómo la película "The Matrix" sólo que será la IA la que esté
atrapada sin ser consciente de que vive una simulación, pero existe de
posibilidad de que descubra de alguna forma que lo que vive no es real. Seríamos
una entidad no conocida e invisible para la IA, así como lo es Dios para
nosotros.

Reply
[-]Ben Pace4y3

I used google-translate to understand your question:

> Thank you very much for the explanation to each question. I just wanted to say
> something about whether an AI will become hostile and annihilate us or not: Is
> there no way to make an AI "live" and develop into a simulation? Like the
> movie "The Matrix" only it will be the AI that is trapped without being aware
> that he lives a simulation, but there is a possibility that he discovers in
> some way that what he lives is not real. We would be an entity not known and
> invisible to AI, just as God is to us.

A principle of AI design that I have heard some AI safety researchers talk
about, is that you shouldn't try to run a process where, if it's more powerful
than you think, it will kill you. You want to run a process where, if it's more
powerful than you think, then you get more of what you want (or at least things
stay neutral). So the goal is to make an AI that is not hostile that you're
working against adversarially, but something that cares about your values and
doesn't require being trapped in the matrix.

Reply
[-]Graham Ballard Pérez4y1

Entiendo, pero realmente no sabemos cómo evolucionará una IA, sólo podemos
especular y pensar que actuará en base a cómo es construída.

Reply
[+]bardstale2y-10
Moderation Log

NEW TO LESSWRONG?

Getting Started

FAQ

Library

PreviousPreviousNextNext

www.lesswrong.com Open in urlscan Pro 3.219.92.161 Public Scan

Form analysis 1 forms found in the DOM

Text Content

www.lesswrong.com Open in urlscan Pro
3.219.92.161 Public Scan

Form analysis
1 forms found in the DOM