ai-alignment.com
Submitted URL: https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616
Effective URL: https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616?gi=aa7da1512ea6
Submission: On April 11 via manual from US — Scanned from DE
ITERATED DISTILLATION AND AMPLIFICATION

Ajeya Cotra · Published in AI Alignment · 8 min read · Mar 4, 2018

This is a guest post summarizing Paul Christiano’s proposed scheme for training machine learning systems that can be robustly aligned to complex and fuzzy values, which I call Iterated Distillation and Amplification (IDA) here. IDA is notably similar to AlphaGoZero and expert iteration.

The hope is that if we use IDA to train each learned component of an AI, then the overall AI will remain aligned with the user’s interests while achieving state-of-the-art performance at runtime, provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance. This document gives a high-level outline of IDA.

MOTIVATION: THE ALIGNMENT/CAPABILITIES TRADEOFF

Assume that we want to train a learner A to perform some complex fuzzy task, e.g. “Be a good personal assistant.” Assume that A is capable of learning to perform the task at a superhuman level: that is, if we could perfectly specify a “personal assistant” objective function and trained A to maximize it, then A would become a far better personal assistant than any human.

There is a spectrum of possibilities for how we might train A to do this task. On one end, there are techniques which allow the learner to discover powerful, novel policies that improve upon human capabilities:

* Broad reinforcement learning: As A takes actions in the world, we give it a relatively sparse reward signal based on how satisfied or dissatisfied we are with the eventual consequences. We then allow A to optimize for the expected sum of its future rewards.
* Broad inverse reinforcement learning: A attempts to infer our deep long-term values from our actions, perhaps using a sophisticated model of human psychology and irrationality to select which of many possible extrapolations is correct.
However, it is difficult to specify a broad objective that captures everything we care about, so in practice A will be optimizing for some proxy that is not completely aligned with our interests. Even if this proxy objective is “almost” right, its optimum could be disastrous according to our true values.

On the other end, there are techniques that try to narrowly emulate human judgments:

* Imitation learning: We could train A to exactly mimic how an expert would do the task, e.g. by training it to fool a discriminative model trying to tell apart A’s actions from the human expert’s actions.
* Narrow inverse reinforcement learning: We could train A to infer our near-term instrumental values from our actions, with the presumption that our actions are roughly optimal according to those values.
* Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices to be (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards.

Using these techniques, the risk of misalignment is reduced significantly (though not eliminated) by restricting agents to the range of known human behavior, but this introduces severe limitations on capability. This tradeoff between allowing for novel capabilities and reducing misalignment risk applies across different learning schemes (with imitation learning generally being narrowest and lowest risk) as well as within a single scheme.

The motivating problem that IDA attempts to solve: if we are only able to align agents that narrowly replicate human behavior, how can we build an AGI that is both aligned and ultimately much more capable than the best humans?
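To make the earlier point about “almost” right proxy objectives concrete, here is a minimal toy sketch. The action range, both value functions, and every name in it are assumptions invented for illustration; nothing here comes from the IDA proposal itself. The proxy agrees with the true objective on every action a cautious human would try (0 through 5), yet its optimum lies far outside that range.

```python
# Toy illustration only: all numbers and names are hypothetical.
ACTIONS = range(0, 11)  # how hard the agent pushes on the objective


def true_value(x: int) -> float:
    """What we actually care about: helpful up to 5, sharply harmful beyond."""
    return float(x) if x <= 5 else 5.0 - 10.0 * (x - 5)


def proxy_value(x: int) -> float:
    """An 'almost right' objective: identical to true_value on 0..5."""
    return float(x)


# A broad optimizer searches the whole action range for the proxy's optimum.
proxy_optimum = max(ACTIONS, key=proxy_value)
# The action we would actually want.
true_optimum = max(ACTIONS, key=true_value)

print("proxy optimum:", proxy_optimum, "true value there:", true_value(proxy_optimum))
print("true optimum:", true_optimum, "true value there:", true_value(true_optimum))
```

In this sketch, restricting the agent to the range where the proxy was checked against human judgment (0 through 5) avoids the bad outcome, at the cost of ruling out any action outside that range. That is the tradeoff the narrow techniques accept.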
CORE CONCEPT: ANALOGY TO ALPHAGOZERO

The core idea of Paul’s scheme is similar to AlphaGoZero (AGZ): We use a learned model many times as a subroutine in a more powerful decision-making process, and then re-train the model to imitate those better decisions.

AGZ’s policy network p is the learned model. At each iteration, AGZ selects moves by an expensive Monte Carlo Tree Search (MCTS) which uses policy p as its prior; p is then trained to directly predict the distribution of moves that MCTS ultimately settles on. In the next iteration, MCTS is run using the new more accurate p, and p is trained to predict the eventual outcome of that process, and so on. After enough iterations, a fixed point is reached: p is unable to learn how running MCTS will change its current probabilities.

MCTS is an amplification of p: it uses p as a subroutine in a larger process that ultimately makes better moves than p alone could. In turn, p is a distillation of MCTS: it learns to directly guess the results of running MCTS, achieving comparable performance while short-cutting the expensive computation. The idea of IDA is to use the basic iterated distillation and amplification procedure in a much more general domain.

THE IDA SCHEME

IDA involves repeatedly improving a learned model through an amplification and distillation process over multiple iterations.

AMPLIFICATION IS INTERACTIVE AND HUMAN-DIRECTED IN IDA

In AGZ, the amplification procedure is Monte Carlo Tree Search: it’s a simple and well-understood algorithm, and there’s a clear mechanism for how it improves on the policy network’s original choices (it traverses the game tree more deeply). But in IDA, amplification is not necessarily a fixed algorithm that can be written down once and repeatedly applied; it’s an interactive process directed by human decisions.

In most domains, humans are capable of improving their native capabilities by delegating to assistants (e.g.
because CEOs can delegate tasks to a large team, they can produce orders of magnitude more output per day than they could on their own). This means if our learning procedure can create an adequate helper for the human, the human can use the AI to amplify their ability: this human/AI system may be capable of doing things that the human couldn’t manage on their own.

Below I consider the example of using IDA to build a superhuman personal assistant. Let A[t] refer to the state of the learned model after the end of iteration t; the initial agent A[0] is trained by a human overseer H.

EXAMPLE: BUILDING A SUPERHUMAN PERSONAL ASSISTANT

H trains A[0] using a technique from the narrow end of the spectrum, such as imitation learning. Here we are imagining a much more powerful version of “imitation learning” than current systems are actually capable of; we assume that A[0] can acquire nearly human-level capabilities through this process. That is, the trained A[0] model executes all the tasks of a personal assistant as H would (including comprehending English instructions, writing emails, putting together a meeting schedule, etc).

Even though A[0] cannot discover any novel capabilities, it has two key advantages over H: it can run much faster, and many copies or versions of it can be run at once. We hope to leverage these advantages to construct a larger system (involving H and many copies of A[0]) that will substantially improve on H’s capabilities while preserving alignment with H’s values.

H can use calls to A[0] (along with other tools such as external memory) to become a better personal assistant. For example, H could assign one copy of A[0] to figuring out the best time to schedule the client’s recurring team meetings, another copy to figure out what to order the client for lunch, another copy to balance the client’s personal budget, etc.
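The delegation step described here can be sketched in a few lines. This is only an illustration under stated assumptions: `assistant_copy` is a hypothetical stand-in for one copy of A[0] (it just echoes the subtask), and `delegate` is a made-up helper name. The point is the shape of the process: H hands out independent subtasks, the copies run concurrently, and H collects the results to combine them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def assistant_copy(subtask: str) -> str:
    """Hypothetical stand-in for one copy of A[0] handling a single subtask."""
    return f"draft answer for: {subtask}"


def delegate(subtasks: List[str], worker: Callable[[str], str]) -> Dict[str, str]:
    """Run one worker copy per subtask concurrently and collect the answers."""
    # max(1, ...) keeps the pool valid even when there are no subtasks.
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        results = list(pool.map(worker, subtasks))  # map preserves input order
    return dict(zip(subtasks, results))


if __name__ == "__main__":
    answers = delegate(
        ["schedule team meeting", "order lunch", "balance budget"],
        assistant_copy,
    )
    for subtask, answer in answers.items():
        print(subtask, "->", answer)
```

The human’s contribution (choosing the subtasks and combining the answers) is deliberately left outside the sketch, since in IDA that part is an interactive process rather than a fixed algorithm.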
H now has the ability to get very quick solutions to sub-problems that are roughly as good as the ones H would have come up with on their own over a longer time period, and can combine these results to make much better decisions than an unaided human.

Let Amplify(H, A[0]) refer to the larger system of H + many copies of A[0] + aids. Compared to A[0] alone, the Amplify(H, A[0]) system has much higher time and resource costs but its eventual decisions are much better. Moreover, because in each of its individual decisions each copy of A[0] continues to act just as a human personal assistant would act, we can hope that Amplify(H, A[0]) preserves alignment.

In the next iteration of training, the Amplify(H, A[0]) system takes over the role of H as the overseer. A[1] is trained with narrow and safe techniques to quickly reproduce the results of Amplify(H, A[0]). Because we assumed Amplify(H, A[0]) was aligned, we can hope that A[1] is also aligned if it is trained using sufficiently narrow techniques which introduce no new behaviors. A[1] is then used in Amplify(H, A[1]), which serves as an overseer to train A[2], and so on.

PSEUDOCODE

    def IDA(H):
        A <- random initialization
        repeat:
            A <- Distill(Amplify(H, A))

    def Distill(overseer):
        """
        Returns an AI trained using narrow, robust techniques to perform
        a task that the overseer already understands how to perform.
        """

    def Amplify(human, AI):
        """
        Interactive process in which human uses many calls to AI to
        improve on human's native performance at relevant task(s).
        """

WHAT PROPERTIES MUST HOLD FOR IDA TO WORK?

The IDA scheme is a template with “slots” for Amplify and Distill procedures that have not been fully specified yet; in fact, they rely on capabilities we don’t yet have. Because IDA itself is not fully specified, it’s not clear what minimal set of properties is necessary for it to succeed.
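Since the pseudocode above leaves Amplify and Distill as unfilled slots, the following is only a toy instantiation meant to show the control flow, not a proposal for how either slot should be filled. Everything specific in it is an assumption: the task (summing a short tuple of small integers), the rule that the overseer can personally add two numbers, the use of a lookup table as a stand-in for narrow training, and the names `VALUES`, `MAX_LEN`, `initial_agent`, `amplify`, `distill` and `ida`. It also starts from an A[0] that already imitates the unaided overseer, as in the personal assistant example, rather than from a random initialization.

```python
from itertools import product
from typing import Callable, Dict, Optional, Tuple

Question = Tuple[int, ...]                    # a tuple of integers to be summed
Agent = Callable[[Question], Optional[int]]   # returns None if it cannot answer

VALUES = (0, 1, 2)   # hypothetical toy alphabet
MAX_LEN = 4          # longest question the distilled agent is trained on


def initial_agent(q: Question) -> Optional[int]:
    """A[0]: imitates the unaided overseer, who can only answer length-1 questions."""
    return q[0] if len(q) == 1 else None


def amplify(agent: Agent) -> Agent:
    """Overseer plus two calls to the agent: split the question, delegate, add."""
    def overseer(q: Question) -> Optional[int]:
        direct = agent(q)
        if direct is not None:
            return direct
        if len(q) < 2:
            return None
        mid = len(q) // 2
        left, right = agent(q[:mid]), agent(q[mid:])
        if left is None or right is None:
            return None
        return left + right   # the single step the overseer performs personally
    return overseer


def distill(overseer: Agent) -> Agent:
    """Stand-in for narrow training: memorise the overseer's answers in a table."""
    table: Dict[Question, int] = {}
    for n in range(1, MAX_LEN + 1):
        for q in product(VALUES, repeat=n):
            answer = overseer(q)
            if answer is not None:
                table[q] = answer
    return lambda q: table.get(q)   # fast lookup, no further calls to the overseer


def ida(iterations: int) -> Agent:
    """Repeat A <- Distill(Amplify(H, A)) for a fixed number of iterations."""
    agent: Agent = initial_agent
    for _ in range(iterations):
        agent = distill(amplify(agent))
    return agent


if __name__ == "__main__":
    for t in range(3):
        print("A[%d] on (1, 2, 2, 1):" % t, ida(t)((1, 2, 2, 1)))
```

In this toy, each iteration extends the length of question the fast agent can answer on its own: A[0] handles length 1, A[1] handles lengths up to 2, and A[2] handles lengths up to 4. The distilled agent never does anything the amplified overseer did not already do, which is the sense in which the distillation step here is “narrow”.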
ACHIEVING ALIGNMENT AND HIGH CAPABILITY

That said, here are some general properties which seem necessary (though likely not sufficient) for IDA agents to achieve robust alignment and high capability:

1. The Distill procedure robustly preserves alignment: Given an aligned agent H we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved, without introducing any misaligned optimization or losing important aspects of what H values.
2. The Amplify procedure robustly preserves alignment: Given an aligned agent A, it is possible to specify an amplification scheme which calls A multiple times as a subroutine in a way that reliably avoids introducing misaligned optimization.
3. At least some human experts are able to iteratively apply amplification to achieve arbitrarily high capabilities at the relevant task:
   a) there is some threshold of general capability such that if someone is above this threshold, they can eventually solve any problem that an arbitrarily intelligent system could solve, provided they can delegate tasks to similarly-intelligent assistants and are given arbitrary amounts of memory and time;
   b) at least some human experts are above this threshold of generality: given enough time and resources, they can figure out how to use AI assistants and tools to improve their capabilities arbitrarily far.

The non-profit Ought is working on gathering more evidence about assumptions 2 and 3.

ACHIEVING COMPETITIVE PERFORMANCE AND EFFICIENCY

Paul aims for IDA agents to be competitive with traditional RL agents in time and resource costs at runtime; this is a reasonable expectation because an IDA agent is ultimately just another learned model whose weights were tuned with an unusual training procedure.
Resource and time cost during training is a more open question; I haven’t explored the assumptions that would have to hold for the IDA training process to be practically feasible or resource-competitive with other AI projects.

WRITTEN BY AJEYA COTRA

Writer for AI Alignment. Research Analyst at the Open Philanthropy Project.