ITERATED DISTILLATION AND AMPLIFICATION

Ajeya Cotra · Published in AI Alignment · 8 min read · Mar 4, 2018


This is a guest post summarizing Paul Christiano’s proposed scheme for training
machine learning systems that can be robustly aligned to complex and fuzzy
values, which I call Iterated Distillation and Amplification (IDA) here. IDA is
notably similar to AlphaGoZero and expert iteration.

The hope is that if we use IDA to train each learned component of an AI then the
overall AI will remain aligned with the user’s interests while achieving state
of the art performance at runtime — provided that any non-learned components
such as search or logic are also built to preserve alignment and maintain
runtime performance. This document gives a high-level outline of IDA.


MOTIVATION: THE ALIGNMENT/CAPABILITIES TRADEOFF

Assume that we want to train a learner A to perform some complex fuzzy task,
e.g. “Be a good personal assistant.” Assume that A is capable of learning to
perform the task at a superhuman level — that is, if we could perfectly specify
a “personal assistant” objective function and trained A to maximize it, then A
would become a far better personal assistant than any human.

There is a spectrum of possibilities for how we might train A to do this task.
On one end, there are techniques which allow the learner to discover powerful,
novel policies that improve upon human capabilities:

 * Broad reinforcement learning: As A takes actions in the world, we give it a
   relatively sparse reward signal based on how satisfied or dissatisfied we are
   with the eventual consequences. We then allow A to optimize for the expected
   sum of its future rewards (a minimal sketch of this setup follows this list).
 * Broad inverse reinforcement learning: A attempts to infer our deep long-term
   values from our actions, perhaps using a sophisticated model of human
   psychology and irrationality to select which of many possible extrapolations
   is correct.
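
To make the broad-RL end of this spectrum concrete, here is a minimal sketch of
the sparse-reward loop described in the first bullet. The env, policy, and
human_satisfaction callables are hypothetical stand-ins rather than any
particular framework's API; A would then be trained (e.g. by a policy-gradient
method) to maximize the expected value of this return.

def run_episode(env, policy, human_satisfaction):
    """Collect one episode; score it only by our judgment of the eventual consequences."""
    trajectory = []
    state, done = env.reset(), False
    while not done:
        action = policy(state)              # A is free to pick powerful, novel actions
        state, done = env.step(action)      # hypothetical environment interface
        trajectory.append((state, action))
    # Sparse signal: a single scalar for how satisfied we are with the outcome.
    return trajectory, human_satisfaction(trajectory)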

However, it is difficult to specify a broad objective that captures everything
we care about, so in practice A will be optimizing for some proxy that is not
completely aligned with our interests. Even if this proxy objective is “almost”
right, its optimum could be disastrous according to our true values.

On the other end, there are techniques that try to narrowly emulate human
judgments:

 * Imitation learning: We could train A to exactly mimic how an expert would do
   the task, e.g. by training it to fool a discriminative model trying to tell
   apart A’s actions from the human expert’s actions (see the simplified
   imitation sketch after this list).
 * Narrow inverse reinforcement learning: We could train A to infer our
   near-term instrumental values from our actions, with the presumption that our
   actions are roughly optimal according to those values.
 * Narrow reinforcement learning: As A takes actions in the world, we give it a
   dense reward signal based on how reasonable we judge its choices are (perhaps
   we directly reward state-action pairs themselves rather than outcomes in the
   world, as in TAMER). A optimizes for the expected sum of its future rewards.
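
As a concrete (and simplified) illustration of the narrow end, here is a
behavioral-cloning sketch: rather than the discriminator-based setup mentioned
in the first bullet, it simply fits A to predict the expert's actions from
states using supervised learning. The PyTorch calls are standard, but the data
variables (expert_states, expert_actions) are hypothetical.

import torch
import torch.nn as nn

def clone_expert(expert_states, expert_actions, n_actions, epochs=50):
    """Fit a policy to (state, action) pairs demonstrated by the human expert."""
    policy = nn.Sequential(
        nn.Linear(expert_states.shape[1], 64),
        nn.ReLU(),
        nn.Linear(64, n_actions),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        loss = loss_fn(policy(expert_states), expert_actions)  # match the expert's choices
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy  # never rewarded for behavior outside the demonstrated range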

Using these techniques, the risk of misalignment is reduced significantly
(though not eliminated) by restricting agents to the range of known human
behavior — but this introduces severe limitations on capability. This tradeoff
between allowing for novel capabilities and reducing misalignment risk applies
across different learning schemes (with imitation learning generally being
narrowest and lowest risk) as well as within a single scheme.

The motivating problem that IDA attempts to solve: if we are only able to align
agents that narrowly replicate human behavior, how can we build an AGI that is
both aligned and ultimately much more capable than the best humans?


CORE CONCEPT: ANALOGY TO ALPHAGOZERO

The core idea of Paul’s scheme is similar to AlphaGoZero (AGZ): We use a learned
model many times as a subroutine in a more powerful decision-making process, and
then re-train the model to imitate those better decisions.

AGZ’s policy network p is the learned model. At each iteration, AGZ selects
moves by an expensive Monte Carlo Tree Search (MCTS) which uses policy p as its
prior; p is then trained to directly predict the distribution of moves that MCTS
ultimately settles on. In the next iteration, MCTS is run using the new more
accurate p, and p is trained to predict the eventual outcome of that process,
and so on. After enough iterations, a fixed point is reached — p is unable to
learn how running MCTS will change its current probabilities.

MCTS is an amplification of p — it uses p as a subroutine in a larger process
that ultimately makes better moves than p alone could. In turn, p is a
distillation of MCTS: it learns to directly guess the results of running MCTS,
achieving comparable performance while short-cutting the expensive computation.
The idea of IDA is to use the basic iterated distillation and amplification
procedure in a much more general domain.
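
Schematically, one AGZ-style round looks like the loop below. This is a sketch
of the structure rather than DeepMind's implementation: run_mcts (search guided
by the current network) and fit_to_targets (a supervised update on the search
results) are hypothetical helpers.

def agz_round(policy_net, self_play_positions, run_mcts, fit_to_targets):
    """Amplify the policy with search, then distill the search results back into it."""
    targets = []
    for position in self_play_positions:
        # Amplification: MCTS uses policy_net as its prior and settles on a better
        # move distribution than policy_net would produce on its own.
        targets.append((position, run_mcts(policy_net, position)))
    # Distillation: train policy_net to predict the search's move distribution directly.
    fit_to_targets(policy_net, targets)
    return policy_net

Iterating this round until policy_net can no longer anticipate how search would
change its answers is the fixed point described above.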


THE IDA SCHEME

IDA involves repeatedly improving a learned model through an amplification and
distillation process over multiple iterations.


AMPLIFICATION IS INTERACTIVE AND HUMAN-DIRECTED IN IDA

In AGZ, the amplification procedure is Monte Carlo Tree Search — it’s a simple
and well-understood algorithm, and there’s a clear mechanism for how it improves
on the policy network’s original choices (it traverses the game tree more
deeply). But in IDA, amplification is not necessarily a fixed algorithm that can
be written down once and repeatedly applied; it’s an interactive process
directed by human decisions.

In most domains, humans are capable of improving their native capabilities by
delegating to assistants (e.g. because CEOs can delegate tasks to a large team,
they can produce orders of magnitude more output per day than they could on
their own). This means that if our learning procedure can create an adequate
helper for the human, the human can use the AI to amplify their ability — this
human/AI system may be capable of doing things that the human couldn’t manage
on their own.

Below I consider the example of using IDA to build a superhuman personal
assistant. Let A[t] refer to the state of the learned model after the end of
iteration t; the initial agent A[0] is trained by a human overseer H.


EXAMPLE: BUILDING A SUPERHUMAN PERSONAL ASSISTANT

H trains A[0] using a technique from the narrow end of the spectrum, such as
imitation learning. Here we are imagining a much more powerful version of
“imitation learning” than current systems are actually capable of — we assume
that A[0] can acquire nearly human-level capabilities through this process. That
is, the trained A[0] model executes all the tasks of a personal assistant as H
would (including comprehending English instructions, writing emails, putting
together a meeting schedule, etc).

Even though A[0] cannot discover any novel capabilities, it has two key
advantages over H: it can run much faster, and many copies or versions of it can
be run at once. We hope to leverage these advantages to construct a larger
system — involving H and many copies of A[0] — that will substantially improve
on H’s capabilities while preserving alignment with H’s values.

H can use calls to A[0] (along with other tools such as external memory) to
become a better personal assistant. For example, H could assign one copy of A[0]
to figuring out the best time to schedule the client’s recurring team meetings,
another copy to figure out what to order the client for lunch, another copy to
balance the client’s personal budget, etc. H now has the ability to get very
quick solutions to sub-problems that are roughly as good as the ones H would
have come up with on their own over a longer time period, and can combine these
results to make much better decisions than an unaided human.

Let Amplify(H, A[0]) refer to the larger system of H + many copies of A[0] +
aids. Compared to A[0] alone, the Amplify(H, A[0]) system has much higher time
and resource costs but its eventual decisions are much better. Moreover, because
in each of its individual decisions each copy of A[0] continues to act just as a
human personal assistant would act, we can hope that Amplify(H, A[0]) preserves
alignment.
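
A toy rendering of this Amplify(H, A[0]) process as code might look like the
sketch below, where decompose and combine stand in for H's judgment about how
to split up the work and assemble the pieces; these names are illustrative,
not part of Paul's proposal.

def amplify(decompose, combine, assistant, task):
    """H plus many copies of A: split the task, delegate the pieces, assemble the result."""
    subtasks = decompose(task)                        # H decides how to break the problem up
    answers = [assistant(sub) for sub in subtasks]    # each copy of A[0] handles one piece quickly
    return combine(task, answers)                     # H turns the partial results into a final decision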

In the next iteration of training, the Amplify(H, A[0]) system takes over the
role of H as the overseer. A[1] is trained with narrow and safe techniques to
quickly reproduce the results of Amplify(H, A[0]). Because we assumed Amplify(H,
A[0]) was aligned, we can hope that A[1] is also aligned if it is trained using
sufficiently narrow techniques which introduce no new behaviors. A[1] is then
used in Amplify(H, A[1]), which serves as an overseer to train A[2], and so on.


PSEUDOCODE

def IDA(H):
    A <- random initialization
    repeat:
        A <- Distill(Amplify(H, A))

def Distill(overseer):
    """
    Returns an AI trained using narrow, robust techniques to perform
    a task that the overseer already understands how to perform.
    """

def Amplify(human, AI):
    """
    Interactive process in which human uses many calls to AI to
    improve on human's native performance at relevant task(s).
    """


WHAT PROPERTIES MUST HOLD FOR IDA TO WORK?

The IDA scheme is a template with “slots” for Amplify and Distill procedures
that have not been fully specified yet — in fact, they rely on capabilities we
don’t yet have. Because IDA itself is not fully specified, it’s not clear what
minimal set of properties is necessary for it to succeed.


ACHIEVING ALIGNMENT AND HIGH CAPABILITY

That said, here are some general properties which seem necessary — though likely
not sufficient — for IDA agents to achieve robust alignment and high capability:

 1. The Distill procedure robustly preserves alignment: Given an aligned agent H
    we can use narrow safe learning techniques to train a much faster agent A
    which behaves as H would have behaved, without introducing any misaligned
    optimization or losing important aspects of what H values.
 2. The Amplify procedure robustly preserves alignment: Given an aligned agent
    A, it is possible to specify an amplification scheme which calls A multiple
    times as a subroutine in a way that reliably avoids introducing misaligned
    optimization.
 3. At least some human experts are able to iteratively apply amplification to
    achieve arbitrarily high capabilities at the relevant task: a) there is some
    threshold of general capability such that if someone is above this
    threshold, they can eventually solve any problem that an arbitrarily
    intelligent system could solve, provided they can delegate tasks to
    similarly-intelligent assistants and are given arbitrary amounts of memory
    and time; b) at least some human experts are above this threshold of
    generality — given enough time and resources, they can figure out how to use
    AI assistants and tools to improve their capabilities arbitrarily far.

The non-profit Ought is working on gathering more evidence about assumptions 2
and 3.


ACHIEVING COMPETITIVE PERFORMANCE AND EFFICIENCY

Paul aims for IDA agents to be competitive with traditional RL agents in time
and resource costs at runtime — this is a reasonable expectation because an IDA
agent is ultimately just another learned model whose weights were tuned with an
unusual training procedure.

Resource and time cost during training is a more open question; I haven’t
explored the assumptions that would have to hold for the IDA training process to
be practically feasible or resource-competitive with other AI projects.




