ai-plans


SUBMIT, CRITIQUE AND RANK AI ALIGNMENT PLANS


PLANS CURRENTLY RANKED BY: ∑STRENGTHS - ∑VULNERABILITIES



TOPICS:

All, Ethics, Interpretability, Oversight, Philosophy, Governance, Value Learning, Inverse Reinforcement Learning, Corrigibility, Cooperative Inverse RL, Reward Modelling, Safe Exploration, Adversarial Training
1

REACT: OUT-OF-DISTRIBUTION DETECTION WITH RECTIFIED ACTIVATIONS

attributed to: Yiyou Sun, Chuan Guo, Yixuan Li
posted by: KabirKumar

Out-of-distribution (OOD) detection has received much attention lately due to its practical importance in enhancing the safe deployment of neural networks. One of the primary challenges is that models often produce highly confident predictions on OOD data, which undermines the driving principle in OOD detection that the model should only be confident about in-distribution samples. In this work, we propose ReAct--a simple and effective technique for reducing model overconfidence on OOD data...
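
The clipping idea above is simple enough to sketch. Below is a minimal NumPy illustration, assuming penultimate-layer activations, a final linear layer, and an energy-style confidence score; the percentile threshold and all variable names are illustrative, not the authors' reference implementation.

# Sketch of ReAct-style OOD scoring (illustrative; not the authors' code).
# Assumes `feats` are penultimate-layer activations and (W, b) is the final
# linear layer; the 90th-percentile clipping threshold is an example choice.
import numpy as np

def react_ood_scores(feats, W, b, clip_percentile=90):
    c = np.percentile(feats, clip_percentile)      # rectification threshold
    feats_clipped = np.minimum(feats, c)           # ReAct: clip high activations
    logits = feats_clipped @ W + b                 # recompute logits
    # Energy-style confidence score: higher means more in-distribution.
    return np.log(np.exp(logits).sum(axis=1))

# Toy usage with random data standing in for real activations.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16)).clip(min=0)   # ReLU-like features
W, b = rng.standard_normal((16, 10)), np.zeros(10)
print(react_ood_scores(feats, W, b))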



Strengths: 5 · Vulnerabilities: 2
2

LEARNING SAFE POLICIES WITH EXPERT GUIDANCE

attributed to: Jessie Huang, Fa Wu, Doina Precup, Yang Cai
posted by: KabirKumar

We propose a framework for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify. In order to do this, we rely on the existence of demonstrations from expert policies, and we provide a theoretical framework for the agent to optimize in the space of rewards consistent with its existing knowledge. We propose two methods to solve the resulting optimization: an exact ellipsoid-based method and a method in the spirit of the "follow-the-perturbed-leader" algorithm. Our experiments demonstrate the behavior of our algorithm in both discrete and continuous problems...



Strengths: 1 · Vulnerabilities: 0
3

"CAUSAL SCRUBBING: A METHOD FOR RIGOROUSLY TESTING INTERPRETABILITY HYPOTHESES",
AI ALIGNMENT FORUM, 2022.

attributed to: Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill,
Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate
Thomas [Redwood Research]
posted by: momom2

Summary: This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies if a sequence of parentheses is balanced.
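
The resampling-ablation idea can be illustrated on a toy model. In the sketch below, a hypothesis claims an intermediate activation depends only on the parity of the input sum, so that activation is resampled from a different input with the same parity and the output is checked for agreement. The two-stage model and the parity hypothesis are invented for illustration and are not Redwood Research's implementation.

# Toy illustration of a resampling ablation (not the causal scrubbing codebase).
# Hypothesis: the hidden value h(x) depends only on the parity of sum(x), so
# replacing h(x) with h(x') for any x' of equal parity should preserve the output.
import numpy as np

rng = np.random.default_rng(0)

def h(x):                        # "interpreted" intermediate computation
    return np.array([x.sum() % 2, 0.0])

def g(hidden):                   # downstream computation
    return 1.0 if hidden[0] > 0.5 else 0.0

def model(x):
    return g(h(x))

xs = rng.integers(0, 2, size=(200, 5))
same_parity = lambda a, b: a.sum() % 2 == b.sum() % 2

# Scrubbed run: resample the hidden activation from a hypothesis-equivalent input.
agreement = []
for x in xs:
    candidates = [x2 for x2 in xs if same_parity(x, x2)]
    x_resampled = candidates[rng.integers(len(candidates))]
    agreement.append(model(x) == g(h(x_resampled)))
print("behaviour preserved on", np.mean(agreement), "of inputs")  # expect 1.0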



Strengths: 0 · Vulnerabilities: 0
4

NATURAL ABSTRACTIONS: KEY CLAIMS, THEOREMS, AND CRITIQUES

attributed to: LawrenceC, Leon Lang, Erik Jenner, John Wentworth
posted by: KabirKumar

TL;DR: We distill John Wentworth’s Natural Abstractions agenda by summarizing its key claims: the Natural Abstraction Hypothesis—many cognitive systems learn to use similar abstractions—and the Redundant Information Hypothesis—a particular mathematical description of natural abstractions. We also formalize proofs for several of its theoretical results. Finally, we critique the agenda’s progress to date, alignment relevance, and current research methodology.



Strengths: 0 · Vulnerabilities: 0
5

COGNITIVE EMULATION: A NAIVE AI SAFETY PROPOSAL

attributed to: Connor Leahy, Gabriel Alfour (Conjecture)
posted by: KabirKumar

This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler and useful step towards a full alignment solution.

Unfortunately, given that most other actors are racing for as powerful and general AIs as possible, we won’t share much in terms of technical details for now. In the meantime, we still want to share some of our intuitions about this approach.

We take no credit for inventing any of these ideas, and see our contributions largely in taking existing ideas seriously and putting them together into a larger whole.[1]



Strengths: 0 · Vulnerabilities: 0
6

SAFE IMITATION LEARNING VIA FAST BAYESIAN REWARD INFERENCE FROM PREFERENCES

attributed to: Daniel S. Brown, Russell Coleman, Ravi Srinivasan, Scott Niekum
posted by: KabirKumar

Bayesian reward learning from demonstrations enables rigorous safety and
uncertainty analysis when performing imitation learning. However, Bayesian
reward learning methods are typically computationally intractable for complex
control problems. We propose Bayesian Reward Extrapolation (Bayesian REX), a
highly efficient Bayesian reward learning algorithm that scales to
high-dimensional imitation learning problems by pre-training a low-dimensional
feature encoding via self-supervised tasks and then leveraging preferences over
demonstrations to perform fast Bayesian inference...
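
A bare-bones sketch of the second stage, under the simplifying assumptions of a linear reward over fixed pretrained features and a Bradley-Terry likelihood over pairwise preferences, sampled with a plain Metropolis-Hastings loop; the toy features, preference list, and step size are placeholders rather than the paper's setup.

# Sketch of fast Bayesian reward inference from preferences over demonstrations
# (illustrative; assumes a linear reward on fixed pretrained features and a
# Bradley-Terry likelihood, sampled with simple Metropolis-Hastings).
import numpy as np

rng = np.random.default_rng(0)
d = 8
phi = rng.standard_normal((20, d))          # per-demonstration feature sums
prefs = [(i, i + 1) for i in range(19)]     # toy ranking: demo i+1 preferred over i

def log_likelihood(w):
    r = phi @ w
    # P(j preferred over i) under Bradley-Terry on returns.
    return sum(r[j] - np.logaddexp(r[i], r[j]) for i, j in prefs)

def mcmc(n_samples=2000, step=0.1):
    w = np.zeros(d)
    ll = log_likelihood(w)
    samples = []
    for _ in range(n_samples):
        w_new = w + step * rng.standard_normal(d)
        ll_new = log_likelihood(w_new)
        if np.log(rng.random()) < ll_new - ll:   # flat prior for simplicity
            w, ll = w_new, ll_new
        samples.append(w.copy())
    return np.array(samples)

posterior = mcmc()
print("posterior mean reward weights:", posterior.mean(axis=0).round(2))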



Strengths: 0 · Vulnerabilities: 0
7

PRETRAINED TRANSFORMERS IMPROVE OUT-OF-DISTRIBUTION ROBUSTNESS

attributed to: Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh
Krishnan, Dawn Song
posted by: KabirKumar

Although pretrained Transformers such as BERT achieve high accuracy on
in-distribution examples, do they generalize to new distributions? We
systematically measure out-of-distribution (OOD) generalization for seven NLP
datasets by constructing a new robustness benchmark with realistic distribution
shifts. We measure the generalization of previous models including bag-of-words
models, ConvNets, and LSTMs, and we show that pretrained Transformers'
performance declines are substantially smaller. Pretrained transformers are
also more effective at detecting anomalous or OOD examples, while many previous
models are frequently worse than chance. We examine which factors affect
robustness, finding that larger models are not necessarily more robust,
distillation can be harmful, and more diverse pretraining data can enhance
robustness. Finally, we show where future work can improve OOD robustness.



Strengths: 0 · Vulnerabilities: 0
8

ABSTRACTION LEARNING

attributed to: Fei Deng, Jinsheng Ren, Feng Chen
posted by: KabirKumar

There has been a gap between artificial intelligence and human intelligence.
In this paper, we identify three key elements forming human intelligence, and
suggest that abstraction learning combines these elements and is thus a way to
bridge the gap. Prior research in artificial intelligence either specifies
abstraction by human experts or takes abstraction as a qualitative explanation
for the model. This paper aims to learn abstraction directly. We tackle three
main challenges: representation, objective function, and learning algorithm.
Specifically, we propose a partition structure that contains pre-allocated
abstraction neurons; we formulate abstraction learning as a constrained
optimization problem, which integrates abstraction properties; we develop a
network evolution algorithm to solve this problem. This complete framework is
named ONE (Optimization via Network Evolution). In our experiments on MNIST,
ONE shows elementary human-like intelligence, including low energy consumption,
knowledge sharing, and lifelong learning.



Strengths: 0 · Vulnerabilities: 0
9

AUTONOMOUS INTELLIGENT CYBER-DEFENSE AGENT (AICA) REFERENCE ARCHITECTURE.
RELEASE 2.0

attributed to: Alexander Kott, Paul Théron, Martin Drašar, Edlira Dushku, Benoît
LeBlanc, Paul Losiewicz, Alessandro Guarino, Luigi Mancini, Agostino Panico,
Mauno Pihelgas, Krzysztof Rzadca, Fabio De Gaspari
posted by: KabirKumar

This report - a major revision of its previous release - describes a
reference architecture for intelligent software agents performing active,
largely autonomous cyber-defense actions on military networks of computing and
communicating devices. The report is produced by the North Atlantic Treaty
Organization (NATO) Research Task Group (RTG) IST-152 "Intelligent Autonomous
Agents for Cyber Defense and Resilience". In a conflict with a technically
sophisticated adversary, NATO military tactical networks will operate in a
heavily contested battlefield. Enemy software cyber agents - malware - will
infiltrate friendly networks and attack friendly command, control,
communications, computers, intelligence, surveillance, and reconnaissance and
computerized weapon systems. To fight them, NATO needs artificial cyber hunters
- intelligent, autonomous, mobile agents specialized in active cyber defense.
With this in mind, in 2016, NATO initiated RTG IST-152. Its objective has been
to help accelerate the development and transition to practice of such software
agents by producing a reference architecture and technical roadmap.



Strengths: 0 · Vulnerabilities: 0
10

TOWARDS A HUMAN-LIKE OPEN-DOMAIN CHATBOT

attributed to: Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall,
Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade,
Yifeng Lu, Quoc V. Le
posted by: KabirKumar

We present Meena, a multi-turn open-domain chatbot trained end-to-end on data
mined and filtered from public domain social media conversations. This 2.6B
parameter neural network is simply trained to minimize perplexity of the next
token. We also propose a human evaluation metric called Sensibleness and
Specificity Average (SSA), which captures key elements of a human-like
multi-turn conversation. Our experiments show strong correlation between
perplexity and SSA. The fact that the best perplexity end-to-end trained Meena
scores high on SSA (72% on multi-turn evaluation) suggests that a human-level
SSA of 86% is potentially within reach if we can better optimize perplexity.
Additionally, the full version of Meena (with a filtering mechanism and tuned
decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots
we evaluated.



Strengths: 0 · Vulnerabilities: 0
11

ADVERSARIAL ROBUSTNESS AS A PRIOR FOR LEARNED REPRESENTATIONS

attributed to: Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris
Tsipras, Brandon Tran, Aleksander Madry
posted by: KabirKumar

An important goal in deep learning is to learn versatile, high-level feature
representations of input data. However, standard networks' representations seem
to possess shortcomings that, as we illustrate, prevent them from fully
realizing this goal. In this work, we show that robust optimization can be
re-cast as a tool for enforcing priors on the features learned by deep neural
networks. It turns out that representations learned by robust models address
the aforementioned shortcomings and make significant progress towards learning
a high-level encoding of inputs. In particular, these representations are
approximately invertible, while allowing for direct visualization and
manipulation of salient input features. More broadly, our results indicate
adversarial robustness as a promising avenue for improving learned
representations. Our code and models for reproducing these results are available
at https://git.io/robust-reps .



Strengths: 0 · Vulnerabilities: 0
12

A GEOMETRIC PERSPECTIVE ON THE TRANSFERABILITY OF ADVERSARIAL DIRECTIONS

attributed to: Zachary Charles, Harrison Rosenberg, Dimitris Papailiopoulos
posted by: KabirKumar

State-of-the-art machine learning models frequently misclassify inputs that
have been perturbed in an adversarial manner. Adversarial perturbations
generated for a given input and a specific classifier often seem to be
effective on other inputs and even different classifiers. In other words,
adversarial perturbations seem to transfer between different inputs, models,
and even different neural network architectures. In this work, we show that in
the context of linear classifiers and two-layer ReLU networks, there provably
exist directions that give rise to adversarial perturbations for many
classifiers and data points simultaneously. We show that these "transferable
adversarial directions" are guaranteed to exist for linear separators of a
given set, and will exist with high probability for linear classifiers trained
on independent sets drawn from the same distribution. We extend our results to
large classes of two-layer ReLU networks. We further show that adversarial
directions for ReLU networks transfer to linear classifiers while the reverse
need not hold, suggesting that adversarial perturbations for more complex
models are more likely to transfer to other classifiers.
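
The linear case lends itself to a quick numerical check. The sketch below trains two logistic classifiers on independent samples from the same toy distribution and perturbs test points along the first classifier's weight direction; the data generator, training loop, and perturbation size are invented for illustration.

# Illustration: an adversarial direction for one linear classifier also degrades
# another classifier trained on an independent sample (toy data, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=500, d=20):
    X = rng.standard_normal((n, d))
    w_true = np.ones(d) / np.sqrt(d)
    y = (X @ w_true > 0).astype(float)
    return X, y

def train_logreg(X, y, lr=0.5, steps=300):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return ((X @ w > 0).astype(float) == y).mean()

(X1, y1), (X2, y2) = make_data(), make_data()
w1, w2 = train_logreg(X1, y1), train_logreg(X2, y2)

# Perturb test points along classifier 1's weight direction, away from the label.
Xt, yt = make_data()
eps = 2.0
direction = w1 / np.linalg.norm(w1)
X_adv = Xt - eps * np.outer(2 * yt - 1, direction)
print("clean acc (clf 2):", accuracy(w2, Xt, yt))
print("adv   acc (clf 2):", accuracy(w2, X_adv, yt))   # typically much lower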



Strengths: 0 · Vulnerabilities: 0
13

TOWARDS THE FIRST ADVERSARIALLY ROBUST NEURAL NETWORK MODEL ON MNIST

attributed to: Lukas Schott, Jonas Rauber, Matthias Bethge, Wieland Brendel
posted by: KabirKumar

Despite much effort, deep neural networks remain highly susceptible to tiny
input perturbations and even for MNIST, one of the most common toy datasets in
computer vision, no neural network model exists for which adversarial
perturbations are large and make semantic sense to humans. We show that even
the widely recognized and by far most successful defense by Madry et al. (1)
overfits on the L-infinity metric (it's highly susceptible to L2 and L0
perturbations), (2) classifies unrecognizable images with high certainty, (3)
performs not much better than simple input binarization and (4) features
adversarial perturbations that make little sense to humans. These results
suggest that MNIST is far from being solved in terms of adversarial robustness.
We present a novel robust classification model that performs analysis by
synthesis using learned class-conditional data distributions.



Strengths: 0 · Vulnerabilities: 0
14

MOTIVATING THE RULES OF THE GAME FOR ADVERSARIAL EXAMPLE RESEARCH

attributed to: Justin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen,
George E. Dahl
posted by: KabirKumar

Advances in machine learning have led to broad deployment of systems with
impressive performance on important problems. Nonetheless, these systems can be
induced to make errors on data that are surprisingly similar to examples the
learned system handles correctly. The existence of these errors raises a
variety of questions about out-of-sample generalization and whether bad actors
might use such examples to abuse deployed systems. As a result of these
security concerns, there has been a flurry of recent papers proposing
algorithms to defend against such malicious perturbations of correctly handled
examples. It is unclear how such misclassifications represent a different kind
of security problem than other errors, or even other attacker-produced examples
that have no specific relationship to an uncorrupted input. In this paper, we
argue that adversarial example defense papers have, to date, mostly considered
abstract, toy games that do not relate to any specific security concern.
Furthermore, defense papers have not yet precisely described all the abilities
and limitations of attackers that would be relevant in practical security.



Strengths: 0 · Vulnerabilities: 0
15

ROBUSTNESS VIA CURVATURE REGULARIZATION, AND VICE VERSA

attributed to: Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato,
Pascal Frossard
posted by: KabirKumar

State-of-the-art classifiers have been shown to be largely vulnerable to
adversarial perturbations. One of the most effective strategies to improve
robustness is adversarial training. In this paper, we investigate the effect of
adversarial training on the geometry of the classification landscape and
decision boundaries. We show in particular that adversarial training leads to a
significant decrease in the curvature of the loss surface with respect to
inputs, leading to a drastically more "linear" behaviour of the network. Using
a locally quadratic approximation, we provide theoretical evidence on the
existence of a strong relation between large robustness and small curvature. To
further show the importance of reduced curvature for improving the robustness,
we propose a new regularizer that directly minimizes curvature of the loss
surface, and leads to adversarial robustness that is on par with adversarial
training. Besides being a more efficient and principled alternative to
adversarial training, the proposed regularizer confirms our claims on the
importance of exhibiting quasi-linear behavior in the vicinity of data points
in order to achieve robustness.
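
A minimal PyTorch sketch of a curvature-style penalty in this spirit: approximate curvature by the change in input gradients over a small step along the gradient direction and add it to the training loss. The toy model, step size h, and coefficient are illustrative assumptions, not the authors' exact regularizer.

# Sketch of a curvature regularizer via finite differences of input gradients
# (illustrative; toy model and hyperparameters, not the paper's exact method).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))

def input_grad(inputs, labels):
    inputs = inputs.clone().requires_grad_(True)
    loss = F.cross_entropy(model(inputs), labels)
    return torch.autograd.grad(loss, inputs, create_graph=True)[0]

h, lam = 1.0, 4.0
for step in range(100):
    g = input_grad(x, y)
    z = g.detach()
    z = z / (z.norm(dim=1, keepdim=True) + 1e-8)      # perturbation direction
    g_h = input_grad(x + h * z, y)
    curvature = ((g_h - g) ** 2).sum(dim=1).mean()    # finite-difference curvature proxy
    loss = F.cross_entropy(model(x), y) + lam * curvature
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final loss:", loss.item())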



Strengths: 0 · Vulnerabilities: 0
16

ADVERSARIAL POLICIES: ATTACKING DEEP REINFORCEMENT LEARNING

attributed to: Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine,
Stuart Russell
posted by: KabirKumar

Deep reinforcement learning (RL) policies are known to be vulnerable to
adversarial perturbations to their observations, similar to adversarial
examples for classifiers. However, an attacker is not usually able to directly
modify another agent's observations. This might lead one to wonder: is it
possible to attack an RL agent simply by choosing an adversarial policy acting
in a multi-agent environment so as to create natural observations that are
adversarial? We demonstrate the existence of adversarial policies in zero-sum
games between simulated humanoid robots with proprioceptive observations,
against state-of-the-art victims trained via self-play to be robust to
opponents. The adversarial policies reliably win against the victims but
generate seemingly random and uncoordinated behavior. We find that these
policies are more successful in high-dimensional environments, and induce
substantially different activations in the victim policy network than when the
victim plays against a normal opponent. Videos are available at
https://adversarialpolicies.github.io/.



Strengths: 0 · Vulnerabilities: 0
17

FORTIFIED NETWORKS: IMPROVING THE ROBUSTNESS OF DEEP NETWORKS BY MODELING THE
MANIFOLD OF HIDDEN REPRESENTATIONS

attributed to: Alex Lamb, Jonathan Binas, Anirudh Goyal, Dmitriy Serdyuk,
Sandeep Subramanian, Ioannis Mitliagkas, Yoshua Bengio
posted by: KabirKumar

Deep networks have achieved impressive results across a variety of important
tasks. However, a known weakness is a failure to perform well when evaluated on
data which differ from the training distribution, even if these differences are
very small, as is the case with adversarial examples. We propose Fortified
Networks, a simple transformation of existing networks, which fortifies the
hidden layers in a deep network by identifying when the hidden states are off
of the data manifold, and maps these hidden states back to parts of the data
manifold where the network performs well. Our principal contribution is to show
that fortifying these hidden states improves the robustness of deep networks
and our experiments (i) demonstrate improved robustness to standard adversarial
attacks in both black-box and white-box threat models; (ii) suggest that our
improvements are not primarily due to the gradient masking problem and (iii)
show the advantage of doing this fortification in the hidden layers instead of
the input space.



Strengths: 0 · Vulnerabilities: 0
18

EVALUATING AND UNDERSTANDING THE ROBUSTNESS OF ADVERSARIAL LOGIT PAIRING

attributed to: Logan Engstrom, Andrew Ilyas, Anish Athalye
posted by: KabirKumar

We evaluate the robustness of Adversarial Logit Pairing, a recently proposed
defense against adversarial examples. We find that a network trained with
Adversarial Logit Pairing achieves 0.6% accuracy in the threat model in which
the defense is considered. We provide a brief overview of the defense and the
threat models/claims considered, as well as a discussion of the methodology and
results of our attack, which may offer insights into the reasons underlying the
vulnerability of ALP to adversarial attack.



Strengths: 0 · Vulnerabilities: 0
19

EVALUATING AGENTS WITHOUT REWARDS

attributed to: Brendon Matusch, Jimmy Ba, Danijar Hafner
posted by: KabirKumar

Reinforcement learning has enabled agents to solve challenging tasks in
unknown environments. However, manually crafting reward functions can be time
consuming, expensive, and prone to human error. Competing objectives have
been proposed for agents to learn without external supervision, but it has been
unclear how well they reflect task rewards or human behavior. To accelerate the
development of intrinsic objectives, we retrospectively compute potential
objectives on pre-collected datasets of agent behavior, rather than optimizing
them online, and compare them by analyzing their correlations. We study input
entropy, information gain, and empowerment across seven agents, three Atari
games, and the 3D game Minecraft. We find that all three intrinsic objectives
correlate more strongly with a human behavior similarity metric than with task
reward. Moreover, input entropy and information gain correlate more strongly
with human similarity than task reward does, suggesting the use of intrinsic
objectives for designing agents that behave similarly to human players.
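
The retrospective comparison reduces to computing correlations on logged data. A short sketch follows, with random arrays standing in for the pre-collected episode statistics; the objective names mirror the abstract, everything else is placeholder.

# Sketch of the retrospective comparison: correlate intrinsic objectives computed
# on logged behaviour with task reward and a human-similarity score.
# The arrays below are random placeholders standing in for real logged data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_episodes = 200
logged = {
    "input_entropy":    rng.standard_normal(n_episodes),
    "information_gain": rng.standard_normal(n_episodes),
    "empowerment":      rng.standard_normal(n_episodes),
}
task_reward = rng.standard_normal(n_episodes)
human_similarity = rng.standard_normal(n_episodes)

for name, values in logged.items():
    r_task, _ = spearmanr(values, task_reward)
    r_human, _ = spearmanr(values, human_similarity)
    print(f"{name:16s}  corr w/ reward {r_task:+.2f}   corr w/ human sim {r_human:+.2f}")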



Strengths: 0 · Vulnerabilities: 0
20

ALGORITHMIC FAIRNESS FROM A NON-IDEAL PERSPECTIVE

attributed to: Sina Fazelpour, Zachary C. Lipton
posted by: KabirKumar

Inspired by recent breakthroughs in predictive modeling, practitioners in
both industry and government have turned to machine learning with hopes of
operationalizing predictions to drive automated decisions. Unfortunately, many
social desiderata concerning consequential decisions, such as justice or
fairness, have no natural formulation within a purely predictive framework. In
efforts to mitigate these problems, researchers have proposed a variety of
metrics for quantifying deviations from various statistical parities that we
might expect to observe in a fair world and offered a variety of algorithms in
attempts to satisfy subsets of these parities or to trade off the degree to
which they are satisfied against utility. In this paper, we connect this
approach to fair machine learning to the literature on ideal and
non-ideal methodological approaches in political philosophy. The ideal approach
requires positing the principles according to which a just world would operate.
In the most straightforward application of ideal theory, one supports a
proposed policy by arguing that it closes a discrepancy between the real and
the perfectly just world.



Strengths: 0 · Vulnerabilities: 0
21

IDENTIFYING AND CORRECTING LABEL BIAS IN MACHINE LEARNING

attributed to: Heinrich Jiang, Ofir Nachum
posted by: KabirKumar

Datasets often contain biases which unfairly disadvantage certain groups, and
classifiers trained on such datasets can inherit these biases. In this paper,
we provide a mathematical formulation of how this bias can arise. We do so by
assuming the existence of underlying, unknown, and unbiased labels which are
overwritten by an agent who intends to provide accurate labels but may have
biases against certain groups. Despite the fact that we only observe the biased
labels, we are able to show that the bias may nevertheless be corrected by
re-weighting the data points without changing the labels. We show, with
theoretical guarantees, that training on the re-weighted dataset corresponds to
training on the unobserved but unbiased labels, thus leading to an unbiased
machine learning classifier. Our procedure is fast and robust and can be used
with virtually any learning algorithm. We evaluate on a number of standard
machine learning fairness datasets and a variety of fairness notions, finding
that our method outperforms standard approaches in achieving fair
classification.
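
A schematic of the re-weighting loop, heavily simplified from the abstract rather than the authors' exact procedure: alternately fit a classifier on weighted examples and adjust a multiplier that up-weights the group whose observed labels appear biased. The synthetic data, the single demographic-parity constraint, and the update rule are all illustrative.

# Schematic re-weighting loop for label-bias correction (illustrative only;
# a simplification of the idea in the abstract, not the authors' algorithm).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                        # protected attribute
X = np.column_stack([rng.standard_normal(n), group])
y_true = (X[:, 0] > 0).astype(int)
# Biased labels: flip some positives in group 1 to 0.
y_obs = y_true.copy()
flip = (group == 1) & (y_true == 1) & (rng.random(n) < 0.3)
y_obs[flip] = 0

weights = np.ones(n)
lam = 0.0
for _ in range(50):
    clf = LogisticRegression().fit(X, y_obs, sample_weight=weights)
    pred = clf.predict(X)
    violation = pred[group == 1].mean() - pred[group == 0].mean()  # parity gap
    lam -= 0.5 * violation                           # multiplier update (sketch)
    # Up-weight observed positives in group 1 according to the multiplier.
    weights = np.exp(lam * ((group == 1) & (y_obs == 1)).astype(float))
print("final parity gap:", violation.round(3))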



Strengths: 0 · Vulnerabilities: 0
22

LEARNING NOT TO LEARN: TRAINING DEEP NEURAL NETWORKS WITH BIASED DATA

attributed to: Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, Junmo Kim
posted by: KabirKumar

We propose a novel regularization algorithm to train deep neural networks, in
which data at training time is severely biased. Since a neural network
efficiently learns data distribution, a network is likely to learn the bias
information to categorize input data. It leads to poor performance at test
time, if the bias is, in fact, irrelevant to the categorization. In this paper,
we formulate a regularization loss based on mutual information between feature
embedding and bias. Based on the idea of minimizing this mutual information, we
propose an iterative algorithm to unlearn the bias information. We employ an
additional network to predict the bias distribution and train the network
adversarially against the feature embedding network. At the end of learning,
the bias prediction network is not able to predict the bias not because it is
poorly trained, but because the feature embedding network successfully unlearns
the bias information. We also demonstrate quantitative and qualitative
experimental results which show that our algorithm effectively removes the bias
information from feature embedding.
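
A compact PyTorch sketch of the adversarial setup: a bias head is trained to predict the bias from the learned features, while a gradient-reversal layer pushes the feature extractor to remove that information. The toy data, architecture, and the use of gradient reversal (rather than the paper's exact unlearning loss) are assumptions for illustration.

# Sketch of adversarial bias unlearning with a gradient-reversal layer
# (illustrative toy setup; not the authors' full algorithm or architecture).
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad                        # reverse gradients flowing to the encoder

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU())
task_head = torch.nn.Linear(32, 2)
bias_head = torch.nn.Linear(32, 2)
opt = torch.optim.Adam([*encoder.parameters(), *task_head.parameters(),
                        *bias_head.parameters()], lr=1e-3)

x = torch.randn(256, 10)
y_task = torch.randint(0, 2, (256,))
y_bias = torch.randint(0, 2, (256,))        # placeholder bias labels

for step in range(200):
    z = encoder(x)
    task_loss = F.cross_entropy(task_head(z), y_task)
    bias_loss = F.cross_entropy(bias_head(GradReverse.apply(z)), y_bias)
    loss = task_loss + bias_loss             # encoder is pushed to remove bias info
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(task_loss), float(bias_loss))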



Strengths: 0 · Vulnerabilities: 0
23

COLLABORATING WITH HUMANS WITHOUT HUMAN DATA

attributed to: DJ Strouse, Kevin R. McKee, Matt Botvinick, Edward Hughes,
Richard Everett
posted by: KabirKumar

Collaborating with humans requires rapidly adapting to their individual
strengths, weaknesses, and preferences. Unfortunately, most standard
multi-agent reinforcement learning techniques, such as self-play (SP) or
population play (PP), produce agents that overfit to their training partners
and do not generalize well to humans. Alternatively, researchers can collect
human data, train a human model using behavioral cloning, and then use that
model to train "human-aware" agents ("behavioral cloning play", or BCP). While
such an approach can improve the generalization of agents to new human
co-players, it involves the onerous and expensive step of collecting large
amounts of human data first. Here, we study the problem of how to train agents
that collaborate well with human partners without using human data. We argue
that the crux of the problem is to produce a diverse set of training partners.
Drawing inspiration from successful multi-agent approaches in competitive
domains, we find that a surprisingly simple approach is highly effective.



Strengths: 0 · Vulnerabilities: 0
24

LEGIBLE NORMATIVITY FOR AI ALIGNMENT: THE VALUE OF SILLY RULES

attributed to: Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield
posted by: KabirKumar

It has become commonplace to assert that autonomous agents will have to be
built to follow human rules of behavior--social norms and laws. But human laws
and norms are complex and culturally varied systems; in many cases agents will
have to learn the rules. This requires autonomous agents to have models of how
human rule systems work so that they can make reliable predictions about rules.
In this paper we contribute to the building of such models by analyzing an
overlooked distinction between important rules and what we call silly
rules--rules with no discernible direct impact on welfare. We show that silly
rules render a normative system both more robust and more adaptable in response
to shocks to perceived stability. They make normativity more legible for
humans, and can increase legibility for AI systems as well. For AI systems to
integrate into human normative systems, we suggest, it may be important for
them to have models that include representations of silly rules.



Strengths: 0 · Vulnerabilities: 0
25

TANKSWORLD: A MULTI-AGENT ENVIRONMENT FOR AI SAFETY RESEARCH

attributed to: Corban G. Rivera, Olivia Lyons, Arielle Summitt, Ayman Fatima, Ji
Pak, William Shao, Robert Chalmers, Aryeh Englander, Edward W. Staley, I-Jeng
Wang, Ashley J. Llorens
posted by: KabirKumar

The ability to create artificial intelligence (AI) capable of performing
complex tasks is rapidly outpacing our ability to ensure the safe and assured
operation of AI-enabled systems. Fortunately, a landscape of AI safety research
is emerging in response to this asymmetry and yet there is a long way to go. In
particular, recent simulation environments created to illustrate AI safety
risks are relatively simple or narrowly-focused on a particular issue. Hence,
we see a critical need for AI safety research environments that abstract
essential aspects of complex real-world applications. In this work, we
introduce the AI safety TanksWorld as an environment for AI safety research
with three essential aspects: competing performance objectives, human-machine
teaming, and multi-agent competition. The AI safety TanksWorld aims to
accelerate the advancement of safe multi-agent decision-making algorithms by
providing a software framework to support competitions with both system
performance and safety objectives. As a work in progress, this paper introduces
our research objectives and learning environment with reference code and
baseline performance metrics to follow in a future work.



Strengths: 0 · Vulnerabilities: 0
26

ON GRADIENT-BASED LEARNING IN CONTINUOUS GAMES

attributed to: Eric Mazumdar, Lillian J. Ratliff, S. Shankar Sastry
posted by: KabirKumar

We formulate a general framework for competitive gradient-based learning that
encompasses a wide breadth of multi-agent learning algorithms, and analyze the
limiting behavior of competitive gradient-based learning algorithms using
dynamical systems theory. For both general-sum and potential games, we
characterize a non-negligible subset of the local Nash equilibria that will be
avoided if each agent employs a gradient-based learning algorithm. We also shed
light on the issue of convergence to non-Nash strategies in general- and
zero-sum games, which may have no relevance to the underlying game, and arise
solely due to the choice of algorithm. The existence and frequency of such
strategies may explain some of the difficulties encountered when using gradient
descent in zero-sum games as, e.g., in the training of generative adversarial
networks. To reinforce the theoretical contributions, we provide empirical
results that highlight the frequency of linear quadratic dynamic games (a
benchmark for multi-agent reinforcement learning) that admit global Nash
equilibria that are almost surely avoided by policy gradient.
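
The non-convergence phenomenon is easy to reproduce on the simplest zero-sum example. Below, simultaneous gradient steps on f(x, y) = x*y spiral away from the Nash equilibrium at the origin, and the Jacobian of the continuous-time dynamics has purely imaginary eigenvalues; the game and step size are chosen purely for illustration.

# Simultaneous gradient play on the zero-sum game f(x, y) = x * y
# (player 1 minimises f over x, player 2 minimises -f over y).
# Illustrates non-convergent cycling around the equilibrium at the origin.
import numpy as np

def simultaneous_gradient_play(x0=1.0, y0=1.0, lr=0.05, steps=500):
    x, y = x0, y0
    traj = []
    for _ in range(steps):
        gx, gy = y, -x          # d f/dx = y ;  d(-f)/dy = -x
        x, y = x - lr * gx, y - lr * gy
        traj.append((x, y))
    return np.array(traj)

traj = simultaneous_gradient_play()
print("distance from equilibrium, start vs end:",
      np.hypot(1.0, 1.0), np.hypot(*traj[-1]))

# Jacobian of the continuous-time dynamics (dx/dt, dy/dt) = (-y, x):
J = np.array([[0.0, -1.0], [1.0, 0.0]])
print("eigenvalues:", np.linalg.eigvals(J))   # purely imaginary -> cycling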



Strengths: 0 · Vulnerabilities: 0
27

REINFORCEMENT LEARNING UNDER THREATS

attributed to: Victor Gallego, Roi Naveiro, David Rios Insua
posted by: KabirKumar

In several reinforcement learning (RL) scenarios, mainly in security
settings, there may be adversaries trying to interfere with the reward
generating process. In this paper, we introduce Threatened Markov Decision
Processes (TMDPs), which provide a framework to support a decision maker
against a potential adversary in RL. Furthermore, we propose a level-k
thinking scheme resulting in a new learning framework to deal with TMDPs. After
introducing our framework and deriving theoretical results, relevant empirical
evidence is given via extensive experiments, showing the benefits of accounting
for adversaries while the agent learns.



Strengths: 0 · Vulnerabilities: 0
28

LEARNING REPRESENTATIONS BY HUMANS, FOR HUMANS

attributed to: Sophie Hilgard, Nir Rosenfeld, Mahzarin R. Banaji, Jack Cao,
David C. Parkes
posted by: KabirKumar

When machine predictors can achieve higher performance than the human
decision-makers they support, improving the performance of human
decision-makers is often conflated with improving machine accuracy. Here we
propose a framework to directly support human decision-making, in which the
role of machines is to reframe problems rather than to prescribe actions
through prediction. Inspired by the success of representation learning in
improving performance of machine predictors, our framework learns human-facing
representations optimized for human performance. This "Mind Composed with
Machine" framework incorporates a human decision-making model directly into the
representation learning paradigm and is trained with a novel human-in-the-loop
training procedure. We empirically demonstrate the successful application of
the framework to various tasks and representational forms.



Strengths: 0 · Vulnerabilities: 0
29

LEARNING TO UNDERSTAND GOAL SPECIFICATIONS BY MODELLING REWARD

attributed to: Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian
Hosseini, Pushmeet Kohli, Edward Grefenstette
posted by: KabirKumar

Recent work has shown that deep reinforcement-learning agents can learn to
follow language-like instructions from infrequent environment rewards. However,
this places on environment designers the onus of designing language-conditional
reward functions which may not be easily or tractably implemented as the
complexity of the environment and the language scales. To overcome this
limitation, we present a framework within which instruction-conditional RL
agents are trained using rewards obtained not from the environment, but from
reward models which are jointly trained from expert examples. As reward models
improve, they learn to accurately reward agents for completing tasks for
environment configurations---and for instructions---not present amongst the
expert data. This framework effectively separates the representation of what
instructions require from how they can be executed. In a simple grid world, it
enables an agent to learn a range of commands requiring interaction with blocks
and understanding of spatial relations and underspecified abstract
arrangements. We further show the method allows our agent to adapt to changes
in the environment without requiring new expert examples.



Strengths: 0 · Vulnerabilities: 0
30

I KNOW WHAT YOU MEANT: LEARNING HUMAN OBJECTIVES BY (UNDER)ESTIMATING THEIR
CHOICE SET

attributed to: Ananth Jonnavittula, Dylan P. Losey
posted by: KabirKumar

Assistive robots have the potential to help people perform everyday tasks.
However, these robots first need to learn what it is their user wants them to
do. Teaching assistive robots is hard for inexperienced users, elderly users,
and users living with physical disabilities, since often these individuals are
unable to show the robot their desired behavior. We know that inclusive
learners should give human teachers credit for what they cannot demonstrate.
But today's robots do the opposite: they assume every user is capable of
providing any demonstration. As a result, these robots learn to mimic the
demonstrated behavior, even when that behavior is not what the human really
meant! Here we propose a different approach to reward learning: robots that
reason about the user's demonstrations in the context of similar or simpler
alternatives. Unlike prior works -- which err towards overestimating the
human's capabilities -- here we err towards underestimating what the human can
input (i.e., their choice set). Our theoretical analysis proves that
underestimating the human's choice set is risk-averse, with better worst-case
performance than overestimating.



Strengths: 0 · Vulnerabilities: 0
31

LEARNING TO COMPLEMENT HUMANS

attributed to: Bryan Wilder, Eric Horvitz, Ece Kamar
posted by: KabirKumar

A rising vision for AI in the open world centers on the development of
systems that can complement humans for perceptual, diagnostic, and reasoning
tasks. To date, systems aimed at complementing the skills of people have
employed models trained to be as accurate as possible in isolation. We
demonstrate how an end-to-end learning strategy can be harnessed to optimize
the combined performance of human-machine teams by considering the distinct
abilities of people and machines. The goal is to focus machine learning on
problem instances that are difficult for humans, while recognizing instances
that are difficult for the machine and seeking human input on them. We
demonstrate in two real-world domains (scientific discovery and medical
diagnosis) that human-machine teams built via these methods outperform the
individual performance of machines and people. We then analyze conditions under
which this complementarity is strongest, and which training methods amplify it.
Taken together, our work provides the first systematic investigation of how
machine learning systems can be trained to complement human reasoning.



Strengths: 0 · Vulnerabilities: 0
32

HEURISTIC APPROACHES FOR GOAL RECOGNITION IN INCOMPLETE DOMAIN MODELS

attributed to: Ramon Fraga Pereira, Felipe Meneguzzi
posted by: KabirKumar

Recent approaches to goal recognition have progressively relaxed the
assumptions about the amount and correctness of domain knowledge and available
observations, yielding accurate and efficient algorithms. These approaches,
however, assume completeness and correctness of the domain theory against which
their algorithms match observations: this is too strong for most real-world
domains. In this paper, we develop goal recognition techniques that are capable
of recognizing goals using incomplete (and possibly incorrect) domain
theories. We show the efficiency and accuracy of our approaches empirically
against a large dataset of goal and plan recognition problems with incomplete
domains.



Strengths: 0 · Vulnerabilities: 0
33

LEARNING REWARDS FROM LINGUISTIC FEEDBACK

attributed to: Theodore R. Sumers, Mark K. Ho, Robert D. Hawkins, Karthik
Narasimhan, Thomas L. Griffiths
posted by: KabirKumar

We explore unconstrained natural language feedback as a learning signal for
artificial agents. Humans use rich and varied language to teach, yet most prior
work on interactive learning from language assumes a particular form of input
(e.g., commands). We propose a general framework which does not make this
assumption, using aspect-based sentiment analysis to decompose feedback into
sentiment about the features of a Markov decision process. We then perform an
analogue of inverse reinforcement learning, regressing the sentiment on the
features to infer the teacher's latent reward function. To evaluate our
approach, we first collect a corpus of teaching behavior in a cooperative task
where both teacher and learner are human. We implement three artificial
learners: sentiment-based "literal" and "pragmatic" models, and an inference
network trained end-to-end to predict latent rewards. We then repeat our
initial experiment and pair them with human teachers. All three successfully
learn from interactive human feedback.
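
A minimal sketch of the "literal" variant: regress utterance sentiment on the features of the situation being discussed to recover latent reward weights. The sentiment scores below are synthetic placeholders; a real system would obtain them from aspect-based sentiment analysis of the teacher's language.

# Sketch of inferring a latent reward from language feedback by regressing
# sentiment on features (toy data; the sentiment scores are synthetic stand-ins).
import numpy as np

rng = np.random.default_rng(0)
n_utterances, n_features = 100, 5
features = rng.integers(0, 2, size=(n_utterances, n_features)).astype(float)
true_reward = np.array([2.0, -1.0, 0.0, 0.5, -2.0])
# Synthetic sentiment: how positively the teacher spoke about each situation.
sentiment = features @ true_reward + 0.1 * rng.standard_normal(n_utterances)

# Least-squares analogue of IRL: infer reward weights from sentiment.
reward_hat, *_ = np.linalg.lstsq(features, sentiment, rcond=None)
print("inferred reward weights:", reward_hat.round(2))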



Strengths: 0 · Vulnerabilities: 0
34

THE EMPATHIC FRAMEWORK FOR TASK LEARNING FROM IMPLICIT HUMAN FEEDBACK

attributed to: Yuchen Cui, Qiping Zhang, Alessandro Allievi, Peter Stone, Scott
Niekum, W. Bradley Knox
posted by: KabirKumar

Reactions such as gestures, facial expressions, and vocalizations are an
abundant, naturally occurring channel of information that humans provide during
interactions. A robot or other agent could leverage an understanding of such
implicit human feedback to improve its task performance at no cost to the
human. This approach contrasts with common agent teaching methods based on
demonstrations, critiques, or other guidance that need to be attentively and
intentionally provided. In this paper, we first define the general problem of
learning from implicit human feedback and then propose to address this problem
through a novel data-driven framework, EMPATHIC. This two-stage method consists
of (1) mapping implicit human feedback to relevant task statistics such as
reward, optimality, and advantage; and (2) using such a mapping to learn a
task. We instantiate the first stage and three second-stage evaluations of the
learned mapping. To do so, we collect a dataset of human facial reactions while
participants observe an agent execute a sub-optimal policy for a prescribed
training task...



Strengths: 0 · Vulnerabilities: 0
35

PARENTING: SAFE REINFORCEMENT LEARNING FROM HUMAN INPUT

attributed to: Christopher Frye, Ilya Feige
posted by: KabirKumar

Autonomous agents trained via reinforcement learning present numerous safety
concerns: reward hacking, negative side effects, and unsafe exploration, among
others. In the context of near-future autonomous agents, operating in
environments where humans understand the existing dangers, human involvement in
the learning process has proved a promising approach to AI Safety. Here we
demonstrate that a precise framework for learning from human input, loosely
inspired by the way humans parent children, solves a broad class of safety
problems in this context. We show that our Parenting algorithm solves these
problems in the relevant AI Safety gridworlds of Leike et al. (2017), that an
agent can learn to outperform its parent as it "matures", and that policies
learnt through Parenting are generalisable to new environments.



Strengths: 0 · Vulnerabilities: 0
36

CONSTRAINED POLICY IMPROVEMENT FOR SAFE AND EFFICIENT REINFORCEMENT LEARNING

attributed to: Elad Sarafian, Aviv Tamar, Sarit Kraus
posted by: KabirKumar

We propose a policy improvement algorithm for Reinforcement Learning (RL)
which is called Rerouted Behavior Improvement (RBI). RBI is designed to take
into account the evaluation errors of the Q-function. Such errors are common in
RL when learning the Q-value from finite past experience data. Greedy
policies or even constrained policy optimization algorithms which ignore these
errors may suffer from an improvement penalty (i.e. a negative policy
improvement). To minimize the improvement penalty, the RBI idea is to attenuate
rapid policy changes of low probability actions which were less frequently
sampled. This approach is shown to avoid catastrophic performance degradation
and reduce regret when learning from a batch of past experience. Through a
two-armed bandit with Gaussian distributed rewards example, we show that it
also increases data efficiency when the optimal action has a high variance. We
evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from
observations of multiple behavior policies and (2) iterative RL.



Strengths: 0 · Vulnerabilities: 0
37

TOWARDS EMPATHIC DEEP Q-LEARNING

attributed to: Bart Bussmann, Jacqueline Heinerman, Joel Lehman
posted by: KabirKumar

As reinforcement learning (RL) scales to solve increasingly complex tasks,
interest continues to grow in the fields of AI safety and machine ethics. As a
contribution to these fields, this paper introduces an extension to Deep
Q-Networks (DQNs), called Empathic DQN, that is loosely inspired both by
empathy and the golden rule ("Do unto others as you would have them do unto
you"). Empathic DQN aims to help mitigate negative side effects to other agents
resulting from myopic goal-directed behavior. We assume a setting where a
learning agent coexists with other independent agents (who receive unknown
rewards), where some types of reward (e.g. negative rewards from physical harm)
may generalize across agents. Empathic DQN combines the typical (self-centered)
value with the estimated value of other agents, by imagining (by its own
standards) the value of it being in the other's situation (by considering
constructed states where both agents are swapped). Proof-of-concept results in
two gridworld environments highlight the approach's potential to decrease
collateral harms.
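
The value combination itself is one line. A sketch follows, in which the Q-function, the state-swapping routine, and the blending weight beta are stand-ins rather than the paper's trained components.

# Sketch of the Empathic-DQN value combination (illustrative placeholders only).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))      # stand-in for a trained Q-network

def q_values(state):
    return np.tanh(state @ W)        # value per action

def swap_agents(state):
    # Stand-in for constructing the state with the two agents' roles swapped.
    return state[::-1].copy()

def empathic_q(state, beta=0.5):
    q_self = q_values(state)                   # value from the agent's own view
    q_other = q_values(swap_agents(state))     # value of being in the other's shoes
    return (1 - beta) * q_self + beta * q_other

state = rng.standard_normal(6)
print("greedy empathic action:", int(np.argmax(empathic_q(state))))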



Strengths: 0 · Vulnerabilities: 0
38

BUILDING ETHICS INTO ARTIFICIAL INTELLIGENCE

attributed to: Han Yu, Zhiqi Shen, Chunyan Miao, Cyril Leung, Victor R. Lesser,
Qiang Yang
posted by: KabirKumar

As artificial intelligence (AI) systems become increasingly ubiquitous, the
topic of AI governance for ethical decision-making by AI has captured public
imagination. Within the AI research community, this topic remains less familiar
to many researchers. In this paper, we complement existing surveys, which
largely focused on the psychological, social and legal discussions of the
topic, with an analysis of recent advances in technical solutions for AI
governance. By reviewing publications in leading AI conferences including AAAI,
AAMAS, ECAI and IJCAI, we propose a taxonomy which divides the field into four
areas: 1) exploring ethical dilemmas; 2) individual ethical decision
frameworks; 3) collective ethical decision frameworks; and 4) ethics in
human-AI interactions. We highlight the intuitions and key techniques used in
each approach, and discuss promising future research directions towards
successful integration of ethical AI systems into human societies.



Strengths: 0 · Vulnerabilities: 0
39

REINFORCEMENT LEARNING UNDER MORAL UNCERTAINTY

attributed to: Adrien Ecoffet, Joel Lehman
posted by: KabirKumar

An ambitious goal for machine learning is to create agents that behave
ethically: The capacity to abide by human moral norms would greatly expand the
context in which autonomous agents could be practically and safely deployed,
e.g. fully autonomous vehicles will encounter charged moral decisions that
complicate their deployment. While ethical agents could be trained by rewarding
correct behavior under a specific moral theory (e.g. utilitarianism), there
remains widespread disagreement about the nature of morality. Acknowledging
such disagreement, recent work in moral philosophy proposes that ethical
behavior requires acting under moral uncertainty, i.e. to take into account
when acting that one's credence is split across several plausible ethical
theories. This paper translates such insights to the field of reinforcement
learning, proposes two training methods that realize different points among
competing desiderata, and trains agents in simple environments to act under
moral uncertainty.



Strengths: 0 · Vulnerabilities: 0
40

AVE: ASSISTANCE VIA EMPOWERMENT

attributed to: Yuqing Du, Stas Tiomkin, Emre Kiciman, Daniel Polani, Pieter
Abbeel, Anca Dragan
posted by: KabirKumar

One difficulty in using artificial agents for human-assistive applications
lies in the challenge of accurately assisting with a person's goal(s). Existing
methods tend to rely on inferring the human's goal, which is challenging when
there are many potential goals or when the set of candidate goals is difficult
to identify. We propose a new paradigm for assistance by instead increasing the
human's ability to control their environment, and formalize this approach by
augmenting reinforcement learning with human empowerment. This task-agnostic
objective preserves the person's autonomy and ability to achieve any eventual
state. We test our approach against assistance based on goal inference,
highlighting scenarios where our method overcomes failure modes stemming from
goal ambiguity or misspecification. As existing methods for estimating
empowerment in continuous domains are computationally hard, precluding its use
in real time learned assistance, we also propose an efficient
empowerment-inspired proxy metric. Using this, we are able to successfully
demonstrate our method in a shared autonomy user study for a challenging
simulated teleoperation task with human-in-the-loop training.



Strengths: 0 · Vulnerabilities: 0
41

PLANNING WITH UNCERTAIN SPECIFICATIONS (PUNS)

attributed to: Ankit Shah, Shen Li, Julie Shah
posted by: KabirKumar

Reward engineering is crucial to high performance in reinforcement learning
systems. Prior research into rewar...

Reward engineering is crucial to high performance in reinforcement learning
systems. Prior research into reward design has largely focused on Markovian
functions representing the reward. While there has been research into
expressing non-Markov rewards as linear temporal logic (LTL) formulas, this has
focused on task specifications directly defined by the user. However, in many
real-world applications, task specifications are ambiguous, and can only be
expressed as a belief over LTL formulas. In this paper, we introduce planning
with uncertain specifications (PUnS), a novel formulation that addresses the
challenge posed by non-Markovian specifications expressed as beliefs over LTL
formulas. We present four criteria that capture the semantics of satisfying a
belief over specifications for different applications, and analyze the
qualitative implications of these criteria within a synthetic domain. We
demonstrate the existence of an equivalent Markov decision process (MDP) for
any instance of PUnS. Finally, we demonstrate our approach on the real-world
task of setting a dinner table automatically with a robot that inferred task
specifications from human demonstrations.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
42

PENALIZING SIDE EFFECTS USING STEPWISE RELATIVE REACHABILITY

attributed to: Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic,
Shane Legg
posted by: KabirKumar

How can we design safe reinforcement learning agents that avoid unnecessary
disruptions to their environment? ...

How can we design safe reinforcement learning agents that avoid unnecessary
disruptions to their environment? We show that current approaches to penalizing
side effects can introduce bad incentives, e.g. to prevent any irreversible
changes in the environment, including the actions of other agents. To isolate
the source of such undesirable incentives, we break down side effects penalties
into two components: a baseline state and a measure of deviation from this
baseline state. We argue that some of these incentives arise from the choice of
baseline, and others arise from the choice of deviation measure. We introduce a
new variant of the stepwise inaction baseline and a new deviation measure based
on relative reachability of states. The combination of these design choices
avoids the given undesirable incentives, while simpler baselines and the
unreachability measure fail. We demonstrate this empirically by comparing
different combinations of baseline and deviation measure choices on a set of
gridworld experiments designed to illustrate possible bad incentives.
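To make the deviation measure concrete, here is a minimal sketch (a toy stand-in, not the authors' implementation) of a relative reachability penalty that only charges the agent for reductions in reachability relative to a baseline state.

import numpy as np

def relative_reachability_penalty(reach_from_current, reach_from_baseline):
    # Each array holds reachability values in [0, 1], one entry per state;
    # only decreases relative to the baseline are penalized.
    drop = np.maximum(0.0, reach_from_baseline - reach_from_current)
    return float(drop.mean())

# Toy numbers: the action made one state unreachable that the stepwise
# inaction baseline would have kept reachable.
baseline = np.array([1.0, 0.8, 1.0])
current = np.array([1.0, 0.8, 0.0])
print(relative_reachability_penalty(current, baseline))  # (0 + 0 + 1.0) / 3 ≈ 0.33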



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
43

LEARNING TO BE SAFE: DEEP RL WITH A SAFETY CRITIC

attributed to: Krishnan Srinivasan, Benjamin Eysenbach, Sehoon Ha, Jie Tan,
Chelsea Finn
posted by: KabirKumar

Safety is an essential component for deploying reinforcement learning (RL)
algorithms in real-world scenarios,...

Safety is an essential component for deploying reinforcement learning (RL)
algorithms in real-world scenarios, and is critical during the learning process
itself. A natural first approach toward safe RL is to manually specify
constraints on the policy's behavior. However, just as learning has enabled
progress in large-scale development of AI systems, learning safety
specifications may also be necessary to ensure safety in messy open-world
environments where manual safety specifications cannot scale. Akin to how
humans learn incrementally starting in child-safe environments, we propose to
learn how to be safe in one set of tasks and environments, and then use that
learned intuition to constrain future behaviors when learning new, modified
tasks. We empirically study this form of safety-constrained transfer learning
in three challenging domains: simulated navigation, quadruped locomotion, and
dexterous in-hand manipulation.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
44

RECOVERY RL: SAFE REINFORCEMENT LEARNING WITH LEARNED RECOVERY ZONES

attributed to: Brijen Thananjeyan, Ashwin Balakrishna, Suraj Nair, Michael Luo,
Krishnan Srinivasan, Minho Hwang, Joseph E. Gonzalez, Julian Ibarz, Chelsea
Finn, Ken Goldberg
posted by: KabirKumar

Safety remains a central obstacle preventing widespread use of RL in the real
world: learning new tasks in unc...

Safety remains a central obstacle preventing widespread use of RL in the real
world: learning new tasks in uncertain environments requires extensive
exploration, but safety requires limiting exploration. We propose Recovery RL,
an algorithm which navigates this tradeoff by (1) leveraging offline data to
learn about constraint violating zones before policy learning and (2)
separating the goals of improving task performance and constraint satisfaction
across two policies: a task policy that only optimizes the task reward and a
recovery policy that guides the agent to safety when constraint violation is
likely. We evaluate Recovery RL on 6 simulation domains, including two
contact-rich manipulation tasks and an image-based navigation task, and an
image-based obstacle avoidance task on a physical robot. We compare Recovery RL
to 5 prior safe RL methods which jointly optimize for task performance and
safety via constrained optimization or reward shaping and find that Recovery RL
outperforms the next best prior method across all domains.
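A minimal sketch of the policy-switching idea (assumed interface, not the authors' code): the task policy acts unless a learned risk estimate for its proposed action crosses a threshold, in which case the recovery policy takes over.

def select_action(state, task_policy, recovery_policy, q_risk, eps_risk=0.3):
    # q_risk(state, action) is an assumed learned estimate of the probability
    # of a future constraint violation; eps_risk is a tunable safety threshold.
    a_task = task_policy(state)
    if q_risk(state, a_task) <= eps_risk:
        return a_task                # safe enough: pursue the task reward
    return recovery_policy(state)    # otherwise steer the agent back to safety

# Toy usage with hard-coded stand-ins for the learned components.
task_policy = lambda s: "move_forward"
recovery_policy = lambda s: "back_off"
q_risk = lambda s, a: 0.9 if s == "near_obstacle" else 0.05
print(select_action("open_space", task_policy, recovery_policy, q_risk))     # move_forward
print(select_action("near_obstacle", task_policy, recovery_policy, q_risk))  # back_off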



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
45

CONSERVATIVE AGENCY VIA ATTAINABLE UTILITY PRESERVATION

attributed to: Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli
posted by: KabirKumar

Reward functions are easy to misspecify; although designers can make
corrections after observing mistakes, an ...

Reward functions are easy to misspecify; although designers can make
corrections after observing mistakes, an agent pursuing a misspecified reward
function can irreversibly change the state of its environment. If that change
precludes optimization of the correctly specified reward function, then
correction is futile. For example, a robotic factory assistant could break
expensive equipment due to a reward misspecification; even if the designers
immediately correct the reward function, the damage is done. To mitigate this
risk, we introduce an approach that balances optimization of the primary reward
function with preservation of the ability to optimize auxiliary reward
functions. Surprisingly, even when the auxiliary reward functions are randomly
generated and therefore uninformative about the correctly specified reward
function, this approach induces conservative, effective behavior.
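A minimal sketch of the attainable utility preservation idea (toy form; the scaling and no-op comparison follow the general description above, not the authors' exact formula): the primary reward is reduced in proportion to how much the action shifts the agent's ability to optimize auxiliary reward functions.

def aup_reward(primary_reward, q_aux_action, q_aux_noop, lam=0.1):
    # q_aux_action / q_aux_noop: attainable-utility estimates for the chosen
    # action and for doing nothing, one entry per auxiliary reward function.
    penalty = sum(abs(q - q0) for q, q0 in zip(q_aux_action, q_aux_noop))
    penalty /= max(len(q_aux_action), 1)
    return primary_reward - lam * penalty

# Toy numbers: the action shifts attainable utility for one of two auxiliary goals.
print(aup_reward(1.0, [0.5, 0.9], [0.7, 0.9]))  # 1.0 - 0.1 * (0.2 + 0.0) / 2 = 0.99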



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
46

SAFE REINFORCEMENT LEARNING WITH MODEL UNCERTAINTY ESTIMATES

attributed to: Björn Lütjens, Michael Everett, Jonathan P. How
posted by: KabirKumar

Many current autonomous systems are being designed with a strong reliance on
black box predictions from deep n...

Many current autonomous systems are being designed with a strong reliance on
black box predictions from deep neural networks (DNNs). However, DNNs tend to
be overconfident in predictions on unseen data and can give unpredictable
results for far-from-distribution test data. The importance of predictions that
are robust to this distributional shift is evident for safety-critical
applications, such as collision avoidance around pedestrians. Measures of model
uncertainty can be used to identify unseen data, but the state-of-the-art
extraction methods such as Bayesian neural networks are mostly intractable to
compute. This paper uses MC-Dropout and Bootstrapping to give computationally
tractable and parallelizable uncertainty estimates. The methods are embedded in
a Safe Reinforcement Learning framework to form uncertainty-aware navigation
around pedestrians. The result is a collision avoidance policy that knows what
it does not know and cautiously avoids pedestrians that exhibit unseen
behavior. The policy is demonstrated in simulation to be more robust to novel
observations and take safer actions than an uncertainty-unaware baseline.
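A minimal sketch of the MC-Dropout half of the approach (assumed network and shapes; not the paper's code): keeping dropout active at test time and repeating the forward pass yields a cheap predictive mean and spread that can trigger more cautious actions.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(32, 2))

def mc_dropout_predict(model, x, n_samples=30):
    model.train()  # keep dropout stochastic at inference time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)  # predictive mean and uncertainty

x = torch.randn(1, 4)  # e.g. features describing a nearby pedestrian
mean, std = mc_dropout_predict(model, x)
print(mean, std)  # a large std can be mapped to a more cautious fallback action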



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
47

AVOIDING NEGATIVE SIDE EFFECTS DUE TO INCOMPLETE KNOWLEDGE OF AI SYSTEMS

attributed to: Sandhya Saisubramanian, Shlomo Zilberstein, Ece Kamar
posted by: KabirKumar

Autonomous agents acting in the real-world often operate based on models that
ignore certain aspects of the en...

Autonomous agents acting in the real-world often operate based on models that
ignore certain aspects of the environment. The incompleteness of any given
model -- handcrafted or machine acquired -- is inevitable due to practical
limitations of any modeling technique for complex real-world settings. Due to
the limited fidelity of its model, an agent's actions may have unexpected,
undesirable consequences during execution. Learning to recognize and avoid such
negative side effects of an agent's actions is critical to improve the safety
and reliability of autonomous systems. Mitigating negative side effects is an
emerging research topic that is attracting increased attention due to the rapid
growth in the deployment of AI systems and their broad societal impacts.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
48

AVOIDING SIDE EFFECTS IN COMPLEX ENVIRONMENTS

attributed to: Alexander Matt Turner, Neale Ratzlaff, Prasad Tadepalli
posted by: KabirKumar

Reward function specification can be difficult. Rewarding the agent for
making a widget may be easy, but penal...

Reward function specification can be difficult. Rewarding the agent for
making a widget may be easy, but penalizing the multitude of possible negative
side effects is hard. In toy environments, Attainable Utility Preservation
(AUP) avoided side effects by penalizing shifts in the ability to achieve
randomly generated goals. We scale this approach to large, randomly generated
environments based on Conway's Game of Life. By preserving optimal value for a
single randomly generated reward function, AUP incurs modest overhead while
leading the agent to complete the specified task and avoid many side effects.
Videos and code are available at https://avoiding-side-effects.github.io/.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
49

SAFETY AWARE REINFORCEMENT LEARNING (SARL)

attributed to: Santiago Miret, Somdeb Majumdar, Carroll Wainwright
posted by: KabirKumar

As reinforcement learning agents become increasingly integrated into complex,
real-world environments, designi...

As reinforcement learning agents become increasingly integrated into complex,
real-world environments, designing for safety becomes a critical consideration.
We specifically focus on researching scenarios where agents can cause undesired
side effects while executing a policy on a primary task. Since one can define
multiple tasks for a given environment dynamics, there are two important
challenges. First, we need to abstract the concept of safety that applies
broadly to that environment independent of the specific task being executed.
Second, we need a mechanism for the abstracted notion of safety to modulate the
actions of agents executing different policies to minimize their side-effects.
In this work, we propose Safety Aware Reinforcement Learning (SARL) - a
framework where a virtual safe agent modulates the actions of a main
reward-based agent to minimize side effects. The safe agent learns a
task-independent notion of safety for a given environment. The main agent is
then trained with a regularization loss given by the distance between the
native action probabilities of the two agents.
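A minimal sketch of the regularization term (the KL form is an assumption; the abstract describes it only as a distance between the two agents' action distributions):

import torch
import torch.nn.functional as F

def sarl_loss(task_loss, task_logits, safe_logits, beta=0.5):
    # task_logits / safe_logits: unnormalized action scores from the main agent
    # and the virtual safe agent; the regularizer pulls the task policy toward
    # the safe agent's action distribution.
    task_log_probs = F.log_softmax(task_logits, dim=-1)
    safe_probs = F.softmax(safe_logits, dim=-1).detach()  # safe agent held fixed here
    reg = F.kl_div(task_log_probs, safe_probs, reduction="batchmean")
    return task_loss + beta * reg

# Toy usage with random logits standing in for network outputs.
print(sarl_loss(torch.tensor(1.0), torch.randn(8, 4), torch.randn(8, 4)))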



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
50

SAFE OPTION-CRITIC: LEARNING SAFETY IN THE OPTION-CRITIC ARCHITECTURE

attributed to: Arushi Jain, Khimya Khetarpal, Doina Precup
posted by: KabirKumar

Designing hierarchical reinforcement learning algorithms that exhibit safe
behaviour is not only vital for pra...

Designing hierarchical reinforcement learning algorithms that exhibit safe
behaviour is not only vital for practical applications but also facilitates a
better understanding of an agent's decisions. We tackle this problem in the
options framework, a particular way to specify temporally abstract actions
which allow an agent to use sub-policies with start and end conditions. We
consider a behaviour safe if it avoids regions of state-space with high
uncertainty in the outcomes of actions. We propose an optimization objective
that learns safe options by encouraging the agent to visit states with higher
behavioural consistency. The proposed objective results in a trade-off between
maximizing the standard expected return and minimizing the effect of model
uncertainty in the return. We propose a policy gradient algorithm to optimize
the constrained objective function. We examine the quantitative and qualitative
behaviour of the proposed approach in a tabular grid-world, continuous-state
puddle-world, and three games from the Arcade Learning Environment: Ms.Pacman,
Amidar, and Q*Bert.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
51

STOVEPIPING AND MALICIOUS SOFTWARE: A CRITICAL REVIEW OF AGI CONTAINMENT

attributed to: Jason M. Pittman, Jesus P. Espinoza, Courtney Crosby
posted by: KabirKumar

Awareness of the possible impacts associated with artificial intelligence has
risen in proportion to progress ...

Awareness of the possible impacts associated with artificial intelligence has
risen in proportion to progress in the field. While there are tremendous
benefits to society, many argue that there are just as many, if not more,
concerns related to advanced forms of artificial intelligence. Accordingly,
research into methods to develop artificial intelligence safely is increasingly
important. In this paper, we provide an overview of one such safety paradigm:
containment with a critical lens aimed toward generative adversarial networks
and potentially malicious artificial intelligence. Additionally, we illuminate
the potential for a developmental blindspot in the stovepiping of containment
mechanisms.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
52

REWARD ESTIMATION FOR VARIANCE REDUCTION IN DEEP REINFORCEMENT LEARNING

attributed to: Joshua Romoff, Peter Henderson, Alexandre Piché, Vincent
Francois-Lavet, Joelle Pineau
posted by: KabirKumar

Reinforcement Learning (RL) agents require the specification of a reward
signal for learning behaviours. Howev...

Reinforcement Learning (RL) agents require the specification of a reward
signal for learning behaviours. However, the introduction of corrupt or stochastic
rewards can yield high variance in learning. Such corruption may be a direct
result of goal misspecification, randomness in the reward signal, or
correlation of the reward with external factors that are not known to the
agent. Corruption or stochasticity of the reward signal can be especially
problematic in robotics, where goal specification can be particularly difficult
for complex tasks. While many variance reduction techniques have been studied
to improve the robustness of the RL process, handling such stochastic or
corrupted reward structures remains difficult. As an alternative for handling
this scenario in model-free RL methods, we suggest using an estimator for both
rewards and value functions. We demonstrate that this improves performance
under corrupted stochastic rewards in both the tabular and non-linear function
approximation settings for a variety of noise types and environments. The use
of reward estimation is a robust and easy-to-implement improvement for handling
corrupted reward signals in model-free RL.
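A minimal tabular sketch of the idea (toy setting; not the authors' estimator): the TD update bootstraps on a smoothed reward estimate rather than on the raw, possibly corrupted reward sample.

import random
from collections import defaultdict

r_hat = defaultdict(float)  # learned reward estimate per (state, action)
q = defaultdict(float)      # action values
alpha_r, alpha_q, gamma = 0.1, 0.1, 0.99

def td_update_with_reward_estimation(s, a, noisy_reward, best_next_q):
    # Smooth the corrupted reward signal before it enters the value update.
    r_hat[(s, a)] += alpha_r * (noisy_reward - r_hat[(s, a)])
    target = r_hat[(s, a)] + gamma * best_next_q
    q[(s, a)] += alpha_q * (target - q[(s, a)])

# Toy usage: repeated noisy rewards centred on 1.0 for the same transition.
for _ in range(200):
    td_update_with_reward_estimation("s0", "a0", 1.0 + random.gauss(0, 0.5), best_next_q=0.0)
print(round(r_hat[("s0", "a0")], 2), round(q[("s0", "a0")], 2))  # both approach 1.0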



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
53

SMOOTHING POLICIES AND SAFE POLICY GRADIENTS

attributed to: Matteo Papini, Matteo Pirotta, Marcello Restelli
posted by: KabirKumar

Policy Gradient (PG) algorithms are among the best candidates for the
much-anticipated applications of reinfor...

Policy Gradient (PG) algorithms are among the best candidates for the
much-anticipated applications of reinforcement learning to real-world control
tasks, such as robotics. However, the trial-and-error nature of these methods
poses safety issues whenever the learning process itself must be performed on a
physical system or involves any form of human-computer interaction. In this
paper, we address a specific safety formulation, where both goals and dangers
are encoded in a scalar reward signal and the learning agent is constrained to
never worsen its performance, measured as the expected sum of rewards. By
studying actor-only policy gradient from a stochastic optimization perspective,
we establish improvement guarantees for a wide class of parametric policies,
generalizing existing results on Gaussian policies. This, together with novel
upper bounds on the variance of policy gradient estimators, allows us to
identify meta-parameter schedules that guarantee monotonic improvement with
high probability.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
54

REPRESENTATION LEARNING WITH CONTRASTIVE PREDICTIVE CODING

attributed to: Aaron van den Oord, Yazhe Li, Oriol Vinyals
posted by: KabirKumar

While supervised learning has enabled great progress in many applications,
unsupervised learning has not seen ...

While supervised learning has enabled great progress in many applications,
unsupervised learning has not seen such widespread adoption, and remains an
important and challenging endeavor for artificial intelligence. In this work,
we propose a universal unsupervised learning approach to extract useful
representations from high-dimensional data, which we call Contrastive
Predictive Coding. The key insight of our model is to learn such
representations by predicting the future in latent space by using powerful
autoregressive models. We use a probabilistic contrastive loss which induces
the latent space to capture information that is maximally useful to predict
future samples. It also makes the model tractable by using negative sampling.
While most prior work has focused on evaluating representations for a
particular modality, we demonstrate that our approach is able to learn useful
representations achieving strong performance on four distinct domains: speech,
images, text and reinforcement learning in 3D environments.
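A minimal sketch of the probabilistic contrastive (InfoNCE-style) loss at the heart of the method (assumed batch shapes; not the authors' code): each context encoding is scored against its own future as the positive and against the rest of the batch as negatives.

import torch
import torch.nn.functional as F

def info_nce(context, future, temperature=0.1):
    # context, future: [batch, dim] encodings; row i of `future` is the positive
    # for row i of `context`, and all other rows serve as negative samples.
    context = F.normalize(context, dim=-1)
    future = F.normalize(future, dim=-1)
    logits = context @ future.t() / temperature   # [batch, batch] similarity scores
    labels = torch.arange(context.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

print(info_nce(torch.randn(16, 64), torch.randn(16, 64)))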



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
55

ON VARIATIONAL BOUNDS OF MUTUAL INFORMATION

attributed to: Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi,
George Tucker
posted by: KabirKumar

Estimating and optimizing Mutual Information (MI) is core to many problems in
machine learning; however, bound...

Estimating and optimizing Mutual Information (MI) is core to many problems in
machine learning; however, bounding MI in high dimensions is challenging. To
establish tractable and scalable objectives, recent work has turned to
variational bounds parameterized by neural networks, but the relationships and
tradeoffs between these bounds remain unclear. In this work, we unify these
recent developments in a single framework. We find that the existing
variational lower bounds degrade when the MI is large, exhibiting either high
bias or high variance. To address this problem, we introduce a continuum of
lower bounds that encompasses previous bounds and flexibly trades off bias and
variance. On high-dimensional, controlled problems, we empirically characterize
the bias and variance of the bounds and their gradients and demonstrate the
effectiveness of our new bounds for estimation and representation learning.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
56

CERTIFIED DEFENSES AGAINST ADVERSARIAL EXAMPLES

attributed to: Aditi Raghunathan, Jacob Steinhardt, Percy Liang
posted by: KabirKumar

While neural networks have achieved high accuracy on standard image
classification benchmarks, their accuracy ...

While neural networks have achieved high accuracy on standard image
classification benchmarks, their accuracy drops to nearly zero in the presence
of small adversarial perturbations to test inputs. Defenses based on
regularization and adversarial training have been proposed, but often followed
by new, stronger attacks that defeat these defenses. Can we somehow end this
arms race? In this work, we study this problem for neural networks with one
hidden layer. We first propose a method based on a semidefinite relaxation that
outputs a certificate that for a given network and test input, no attack can
force the error to exceed a certain value. Second, as this certificate is
differentiable, we jointly optimize it with the network parameters, providing
an adaptive regularizer that encourages robustness against all attacks. On
MNIST, our approach produces a network and a certificate that no attack that
perturbs each pixel by at most ε = 0.1 can cause more than 35% test
error.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
57

NEUROSYMBOLIC REINFORCEMENT LEARNING WITH FORMALLY VERIFIED EXPLORATION

attributed to: Greg Anderson, Abhinav Verma, Isil Dillig, Swarat Chaudhuri
posted by: KabirKumar

We present Revel, a partially neural reinforcement learning (RL) framework
for provably safe exploration in co...

We present Revel, a partially neural reinforcement learning (RL) framework
for provably safe exploration in continuous state and action spaces. A key
challenge for provably safe deep RL is that repeatedly verifying neural
networks within a learning loop is computationally infeasible. We address this
challenge using two policy classes: a general, neurosymbolic class with
approximate gradients and a more restricted class of symbolic policies that
allows efficient verification. Our learning algorithm is a mirror descent over
policies: in each iteration, it safely lifts a symbolic policy into the
neurosymbolic space, performs safe gradient updates to the resulting policy,
and projects the updated policy into the safe symbolic subset, all without
requiring explicit verification of neural networks. Our empirical results show
that Revel enforces safe exploration in many scenarios in which Constrained
Policy Optimization does not, and that it can discover policies that outperform
those learned through prior approaches to verified exploration.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
58

EMBEDDING ETHICAL PRIORS INTO AI SYSTEMS: A BAYESIAN APPROACH

posted by: RamiZer

Artificial Intelligence (AI) systems have significant potential to affect the lives of individuals and societi...

Artificial Intelligence (AI) systems have significant potential to affect the lives of individuals and societies. As these systems are being increasingly used in decision-making processes, it has become crucial to ensure that they make ethically sound judgments. This paper proposes a novel framework for embedding ethical priors into AI, inspired by the Bayesian approach to machine learning. We propose that ethical assumptions and beliefs can be incorporated as Bayesian priors, shaping the AI’s learning and reasoning process in a similar way to humans’ inborn moral intuitions. This approach, while complex, provides a promising avenue for advancing ethically aligned AI systems.
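As a toy illustration of the Bayesian framing (illustrative only; the post describes a general framework rather than this particular model), an ethical prior can be encoded as a Beta prior over the probability that a type of action is acceptable and then updated with observed human judgments.

prior_alpha, prior_beta = 8.0, 2.0  # assumed prior: the action is probably acceptable

def posterior_acceptability(approvals, disapprovals):
    # Conjugate Beta update with counts of human approval/disapproval judgments.
    a = prior_alpha + approvals
    b = prior_beta + disapprovals
    return a / (a + b)  # posterior mean acceptability

# Strong contrary evidence shifts the system away from its initial ethical prior.
print(posterior_acceptability(approvals=1, disapprovals=9))  # 0.45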



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
59

BOTTOM-UP VIRTUE ETHICS: A NEW APPROACH TO ETHICAL AI

posted by: RamiZer

This article explores the concept and potential application of bottom-up virtue ethics as an approach to insti...

This article explores the concept and potential application of bottom-up virtue ethics as an approach to instilling ethical behavior in artificial intelligence (AI) systems. We argue that by training machine learning models to emulate virtues such as honesty, justice, and compassion, we can cultivate positive traits and behaviors based on ideal human moral character. This bottom-up approach contrasts with traditional top-down programming of ethical rules, focusing instead on experiential learning. Although this approach presents its own challenges, it offers a promising avenue for the development of more ethically aligned AI systems.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
60

ALIGNING AI SYSTEMS TO HUMAN VALUES AND ETHICS

posted by: RamiZer

As artificial intelligence rapidly advances, ensuring alignment with moral values and ethics becomes imperativ...

As artificial intelligence rapidly advances, ensuring alignment with moral values and ethics becomes imperative. This article provides a comprehensive overview of techniques to embed human values into AI. Interactive learning, crowdsourcing, uncertainty modeling, oversight mechanisms, and conservative system design are analyzed in-depth. Respective limitations are discussed and mitigation strategies proposed. A multi-faceted approach combining the strengths of these complementary methods promises safer development of AI that benefits humanity in accordance with our ideals.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
61

ROBUSTIFYING AI SYSTEMS AGAINST DISTRIBUTIONAL SHIFT

posted by: RamiZer

Distributional shift poses a significant challenge for deploying and maintaining AI systems. As the real-world...

Distributional shift poses a significant challenge for deploying and maintaining AI systems. As the real-world distributions that models are applied to evolve over time, performance can deteriorate. This article examines techniques and best practices for improving model robustness to distributional shift and enabling rapid adaptation when it occurs.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
62

A HYBRID APPROACH TO ENHANCING INTERPRETABILITY IN AI SYSTEMS

posted by: RamiZer

Interpretability in AI systems is fast becoming a critical requirement in the industry. The proposed Hybrid Ex...

Interpretability in AI systems is fast becoming a critical requirement in the industry. The proposed Hybrid Explainability Model (HEM) integrates multiple interpretability techniques, including Feature Importance Visualization, Model Transparency Tools, and Counterfactual Explanations, offering a comprehensive understanding of AI model behavior. This article elaborates on the specifics of implementing HEM, addresses potential counter-arguments, and provides rebuttals to these counterpoints. The HEM approach aims to deliver a holistic understanding of AI decision-making processes, fostering improved accountability, trust, and safety in AI applications.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
63

ENHANCING CORRIGIBILITY IN AI SYSTEMS THROUGH ROBUST FEEDBACK LOOPS

posted by: RamiZer

This article proposes a detailed framework for a robust feedback loop to enhance corrigibility. The ability to...

This article proposes a detailed framework for a robust feedback loop to enhance corrigibility. The ability to continuously learn and correct errors is critical for safe and beneficial AI, but developing corrigible systems comes with significant technical and ethical challenges. The feedback loop outlined involves gathering user input, interpreting feedback contextually, enabling AI actions and learning, confirming changes, and iterative improvement. The article analyzes potential limitations of this approach and provides detailed examples of implementation methods using advanced natural language processing, reinforcement learning, and adversarial training techniques.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
64

AUTONOMOUS ALIGNMENT OVERSIGHT FRAMEWORK (AAOF)

posted by: RamiZer

To align advanced AIs, an ensemble of diverse, transparent Overseer AIs will independently monitor the target ...

To align advanced AIs, an ensemble of diverse, transparent Overseer AIs will independently monitor the target AI and provide granular assessments on its alignment with constitution, human values, ethics, and safety. Overseer interventions will be incremental and subject to human oversight. The system will be implemented cautiously, with extensive testing to validate capabilities. Alignment will be treated as an ongoing collaborative process between humans, Overseers, and the target AI, leveraging complementary strengths through open dialog. Continuous vigilance, updating of definitions, and contingency planning will be required to address inevitable uncertainties and risks.
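A minimal sketch of how an overseer ensemble might gate actions (the interface, thresholds, and scale are assumptions; the post does not specify them): each overseer scores the target AI's proposed action, strong disagreement escalates to a human, and low consensus triggers an incremental intervention.

from statistics import mean, pstdev

def overseer_decision(scores, approve=0.7, escalate_disagreement=0.2):
    # scores: per-overseer alignment assessments on an assumed [0, 1] scale.
    if pstdev(scores) > escalate_disagreement:
        return "escalate_to_human"          # overseers disagree too strongly
    if mean(scores) >= approve:
        return "allow"
    return "flag_and_request_revision"      # incremental intervention first

print(overseer_decision([0.90, 0.85, 0.80]))  # allow
print(overseer_decision([0.90, 0.30, 0.40]))  # escalate_to_human
print(overseer_decision([0.55, 0.60, 0.50]))  # flag_and_request_revision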



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
65

SUPPLEMENTARY ALIGNMENT INSIGHTS THROUGH A HIGHLY CONTROLLED SHUTDOWN INCENTIVE

posted by: RamiZer

My proposal entails constructing a tightly restricted AI subsystem with the sole capability of attempting to s...

My proposal entails constructing a tightly restricted AI subsystem with the sole capability of attempting to safely shut itself down in order to probe, in an isolated manner, potential vulnerabilities in alignment techniques and then improve them.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
66

CORRIGIBILITY VIA MULTIPLE ROUTES

attributed to: Jan Kulveit
posted by: tori[she/her]

Use multiple routes to induce 'corrigibility' by using principles which counteract instrumental convergence (e...

Use multiple routes to induce 'corrigibility': principles which counteract instrumental convergence (e.g. disutility from resource acquisition, measured by mutual information between the AI and distant parts of the environment), principles which counteract unbounded rationality (satisficing, myopia, etc.), 'traps' like ontological uncertainty about the level of simulation (e.g. uncertainty about whether the AI is in training or deployment), human oversight, and interpretability (e.g. an independent 'translator').



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
67

AVOIDING TAMPERING INCENTIVES IN DEEP RL VIA DECOUPLED APPROVAL

attributed to: Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt,
Richard Ngo, Shane Legg
posted by: KabirKumar

How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the a...

How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the agent? Standard RL algorithms assume a secure reward function, and can thus perform poorly in settings where agents can tamper with the reward-generating mechanism. We present a principled solution to the problem of learning from influenceable feedback, which combines approval with a decoupled feedback collection procedure. For a natural class of corruption functions, decoupled approval algorithms have aligned incentives both at convergence and for their local updates. Empirically, they also scale to complex 3D environments where tampering is possible.
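A minimal sketch of the decoupling idea (toy form; not the authors' implementation): the action executed in the environment and the action submitted for approval are sampled independently, so the executed action cannot tamper with the feedback used for learning.

import random

def decoupled_approval_step(state, policy_sample, approval_fn, actions):
    a_exec = policy_sample(state)           # action actually taken in the world
    a_query = random.choice(actions)        # independently sampled action to evaluate
    feedback = approval_fn(state, a_query)  # approval signal on the queried action only
    return a_exec, a_query, feedback        # the policy is updated toward approved queries

# Toy stand-ins for the policy and the approval signal.
actions = ["help", "tamper_with_feedback", "wait"]
policy_sample = lambda s: random.choice(actions)
approval_fn = lambda s, a: 1.0 if a == "help" else 0.0
print(decoupled_approval_step("s0", policy_sample, approval_fn, actions))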



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
68

PESSIMISM ABOUT UNKNOWN UNKNOWNS INSPIRES CONSERVATISM

attributed to: Michael K. Cohen, Marcus Hutter
posted by: KabirKumar

If we could define the set of all bad outcomes, we could hard-code an agent which avoids them; however, in suf...

If we could define the set of all bad outcomes, we could hard-code an agent which avoids them; however, in sufficiently complex environments, this is infeasible. We do not know of any general-purpose approaches in the literature to avoiding novel failure modes. Motivated by this, we define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models. We call this agent pessimistic, since it optimizes assuming the worst case. A scalar parameter tunes the agent's pessimism by changing the size of the set of world-models taken into account...
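A toy stand-in for the pessimistic choice rule (the idealized agent in the paper is Bayesian over a rich class of world-models; the top-models-by-posterior-mass construction here is a simplification): actions are scored by their worst-case value over the most probable world-models, and the pessimism parameter controls how many models are taken into account.

def pessimistic_action_value(models, action, pessimism=0.9):
    # models: list of (posterior_weight, value_fn) pairs, value_fn(action) -> return.
    # Keep the most probable models until their cumulative weight reaches
    # `pessimism`, then score the action by the worst case among them.
    kept, total = [], 0.0
    for weight, value_fn in sorted(models, key=lambda m: -m[0]):
        kept.append(value_fn(action))
        total += weight
        if total >= pessimism:
            break
    return min(kept)

models = [(0.6, lambda a: 1.0), (0.3, lambda a: 0.5), (0.1, lambda a: -10.0)]
print(pessimistic_action_value(models, "explore", pessimism=0.9))  # 0.5
print(pessimistic_action_value(models, "explore", pessimism=1.0))  # -10.0, fully pessimistic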



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
69

PROVABLY FAIR FEDERATED LEARNING

attributed to: Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith
posted by: KabirKumar

In federated learning, fair prediction across various protected groups (e.g., gender,
race) is an important co...

In federated learning, fair prediction across various protected groups (e.g., gender,
race) is an important constraint for many applications. Unfortunately, prior work
studying group fair federated learning lacks formal convergence or fairness
guarantees. Our work provides a new definition for group fairness in federated learning
based on the notion of Bounded Group Loss (BGL), which can be easily applied
to common federated learning objectives. Based on our definition, we propose a
scalable algorithm that optimizes the empirical risk and global fairness constraints,
which we evaluate across common fairness and federated learning benchmarks.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
70

TOWARDS SAFE ARTIFICIAL GENERAL INTELLIGENCE

attributed to: Tom Everitt
posted by: shumaari

The field of artificial intelligence has recently experienced a number of breakthroughs thanks to progress in ...

The field of artificial intelligence has recently experienced a number of breakthroughs thanks to progress in deep learning and reinforcement learning. Computer algorithms now outperform humans at Go, Jeopardy, image classification, and lip reading, and are becoming very competent at driving cars and interpreting natural language. The rapid development has led many to conjecture that artificial intelligence with greater-than-human ability on a wide range of tasks may not be far. This in turn raises concerns whether we know how to control such systems, in case we were to successfully build them...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
71

TRANSPARENCY, DETECTION AND IMITATION IN STRATEGIC CLASSIFICATION

attributed to: Flavia Barsotti, Ruya Gokhan Kocer, Fernando P. Santos
posted by: shumaari

Given the ubiquity of AI-based decisions that affect individuals’ lives, providing transparent explanations ab...

Given the ubiquity of AI-based decisions that affect individuals’ lives, providing transparent explanations about algorithms is ethically sound and often legally mandatory. How do individuals strategically adapt following explanations? What are the consequences of adaptation for algorithmic accuracy? We simulate the interplay between explanations shared by an Institution (e.g. a bank) and the dynamics of strategic adaptation by Individuals reacting to such feedback... 
Keywords: Agent-based and Multi-agent Systems: Agent-Based Simulation and Emergence; AI Ethics, Trust, Fairness: Ethical, Legal and Societal Issues; Multidisciplinary Topics and Applications: Finance



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
72

SOCIALLY INTELLIGENT GENETIC AGENTS FOR THE EMERGENCE OF EXPLICIT NORMS

attributed to: Rishabh Agrawal, Nirav Ajmeri, Munindar Singh
posted by: shumaari

Norms help regulate a society. Norms may be explicit (represented in structured form) or implicit. We address ...

Norms help regulate a society. Norms may be explicit (represented in structured form) or implicit. We address the emergence of explicit norms by developing agents who provide and reason about explanations for norm violations in deciding sanctions and identifying alternative norms. These agents use a genetic algorithm to produce norms and reinforcement learning to learn the values of these norms. We find that applying explanations leads to norms that provide better cohesion and goal satisfaction for the agents. Our results are stable for societies with differing attitudes of generosity.
Keywords: Agent-based and Multi-agent Systems: Agent-Based Simulation and Emergence, Normative systems



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
73

TEACHING AI AGENTS ETHICAL VALUES USING REINFORCEMENT LEARNING AND POLICY
ORCHESTRATION (EXTENDED ABSTRACT)

attributed to: Noothigattu, Ritesh; Bouneffouf, Djallel; Mattei, Nicholas;
Chandra, Rachita; Madan, Piyush; Varshney, Kush R.; Campbell, Murray; Singh,
Moninder; and Rossi, Francesca
posted by: JustinBradshaw

Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in w...

Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. We detail a novel approach
that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based.
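A minimal sketch of the orchestration step (an epsilon-greedy bandit is used here as a simple stand-in for the paper's contextual bandit): per context, the orchestrator learns which of the two policies, constraint-based or reward-based, yields better outcomes.

import random
from collections import defaultdict

arm_values = defaultdict(float)  # estimated value of each (context, policy) pair
arm_counts = defaultdict(int)
ARMS = ["constraint_policy", "reward_policy"]

def orchestrate(context, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(ARMS)
    return max(ARMS, key=lambda a: arm_values[(context, a)])

def update(context, arm, observed_return):
    key = (context, arm)
    arm_counts[key] += 1
    arm_values[key] += (observed_return - arm_values[key]) / arm_counts[key]

# Toy usage: in "crowded" contexts the constraint-following policy pays off more.
for _ in range(300):
    ctx = random.choice(["crowded", "empty"])
    arm = orchestrate(ctx)
    update(ctx, arm, 1.0 if (ctx == "crowded") == (arm == "constraint_policy") else 0.0)
print(orchestrate("crowded", epsilon=0.0))  # typically constraint_policy after training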



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
74

INVERSE REINFORCEMENT LEARNING FROM LIKE-MINDED TEACHERS

attributed to: Noothigattu, Ritesh; Yan, Tom; Procaccia, Ariel D.
posted by: JustinBradshaw

We study the problem of learning a policy in a Markov decision process (MDP) based on observations of the acti...

We study the problem of learning a policy in a Markov decision process (MDP) based on observations of the actions taken by multiple teachers. We assume that the teachers are like-minded in that their reward functions -- while different from each other -- are random perturbations of an underlying reward function. Under this assumption, we demonstrate that inverse reinforcement learning algorithms that satisfy a certain property -- that of matching feature expectations -- yield policies that are approximately optimal with respect to the underlying reward function, and that no algorithm can do better in the worst case.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
75

INVERSE REINFORCEMENT LEARNING: A CONTROL LYAPUNOV APPROACH

attributed to: Tesfazgi, Samuel; Lederer, Armin; and Hirche, Sandra
posted by: JustinBradshaw

Inferring the intent of an intelligent agent from demonstrations and subsequently predicting its behavior, is ...

Inferring the intent of an intelligent agent from demonstrations and subsequently predicting its behavior, is a critical task in many collaborative settings. A common approach to solve this problem is the framework of inverse reinforcement learning (IRL), where the observed agent, e.g., a human demonstrator, is assumed to behave according to an intrinsic cost function that reflects its intent and informs its control actions. In this work, we reformulate the IRL inference problem to learning control Lyapunov functions (CLF) from demonstrations by exploiting the inverse optimality property, which states that every CLF is also a meaningful value function.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
76

A VOTING-BASED SYSTEM FOR ETHICAL DECISION MAKING

attributed to: Noothigattu, Ritesh; Gaikwad, Snehalkumar ‘Neil’ S.; Awad,
Edmond; Dsouza, Sohan; Rahwan, Iyad; Ravikumar, Pradeep; and Procaccia, Ariel D.
posted by: JustinBradshaw

We present a general approach to automating ethical decisions, drawing on machine learning and computational s...

We present a general approach to automating ethical decisions, drawing on machine learning and computational social choice. In a nutshell, we propose to learn a model of societal preferences, and, when faced with a specific ethical dilemma at runtime, efficiently aggregate those preferences to identify a desirable choice. We provide a concrete algorithm that instantiates our approach; some of its crucial steps are informed by a new theory of swap-dominance efficient voting rules. Finally, we implement and evaluate a system for ethical decision making in the autonomous vehicle domain, using preference data collected from 1.3 million people through the Moral Machine website.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
77

ALIGNING SUPERHUMAN AI WITH HUMAN BEHAVIOR: CHESS AS A MODEL SYSTEM

attributed to: McIlroy-Young, Reid; Sen, Siddhartha; Kleinberg, Jon; Anderson,
Ashton
posted by: JustinBradshaw

As artificial intelligence becomes increasingly intelligent—in some
cases, achieving superhuman performance—th...

As artificial intelligence becomes increasingly intelligent—in some
cases, achieving superhuman performance—there is growing potential for humans to learn from and collaborate with algorithms.
However, the ways in which AI systems approach problems are often
different from the ways people do, and thus may be uninterpretable
and hard to learn from. A crucial step in bridging this gap between human and artificial intelligence is modeling the granular actions that
constitute human behavior, rather than simply matching aggregate
human performance.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
78

LEARNING TO PLAY NO-PRESS DIPLOMACY WITH BEST RESPONSE POLICY ITERATION

attributed to: Anthony, Thomas; Eccles, Tom; Tacchetti, Andrea; Kramár, János;
Gemp, Ian; Hudson, Thomas C.; Porcel, Nicolas; Lanctot, Marc; Pérolat, Julien;
Everett, Richard; Werpachowski, Roman; Singh, Satinder; Graepel, Thore;
Bachrach, Yoram
posted by: JustinBradshaw

Recent advances in deep reinforcement learning (RL) have led to considerable
progress in many 2-player zero-su...

Recent advances in deep reinforcement learning (RL) have led to considerable
progress in many 2-player zero-sum games, such as Go, Poker and Starcraft. The
purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However real-world settings are many-agent,
and agent interactions are complex mixtures of common-interest and competitive
aspects. We consider Diplomacy, a 7-player board game designed to accentuate
dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL
algorithms.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
79

TRUTHFUL AI: DEVELOPING AND GOVERNING AI THAT DOES NOT LIE

attributed to: Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales,
Avital Balwit, Peter Wills, Luca Righetti, William Saunders
posted by: JustinBradshaw

In many contexts, lying – the use of verbal falsehoods to deceive – is harmful. While lying has traditionally ...

In many contexts, lying – the use of verbal falsehoods to deceive – is harmful. While lying has traditionally been a human affair, AI systems that
make sophisticated verbal statements are becoming increasingly prevalent.
This raises the question of how we should limit the harm caused by AI
“lies” (i.e. falsehoods that are actively selected for). Human truthfulness
is governed by social norms and by laws (against defamation, perjury,
and fraud). Differences between AI and humans present an opportunity
to have more precise standards of truthfulness for AI, and to have these
standards rise over time.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
80

VERIFIABLY SAFE EXPLORATION FOR END-TO-END REINFORCEMENT LEARNING

attributed to: Nathan Hunt, Nathan Fulton, Sara Magliacane, Nghia Hoang, Subhro
Das, Armando Solar-Lezama
posted by: KabirKumar

Deploying deep reinforcement learning in safety-critical settings requires developing algorithms that obey har...

Deploying deep reinforcement learning in safety-critical settings requires developing algorithms that obey hard constraints during exploration. This paper contributes a first approach toward enforcing formal safety constraints on end-to-end policies with visual inputs. Our approach draws on recent advances in object detection and automated reasoning for hybrid dynamical systems. The approach is evaluated on a novel benchmark that emphasizes the challenge of safely exploring in the presence of hard constraints...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
81

A ROADMAP FOR ROBUST END-TO-END ALIGNMENT

attributed to: Lê Nguyên Hoang
posted by: KabirKumar

As algorithms are becoming more and more data-driven, the greatest lever we have left to make them robustly be...

As algorithms are becoming more and more data-driven, the greatest lever we have left to make them robustly beneficial to mankind lies in the design of their objective functions. Robust alignment aims to address this design problem. Arguably, the growing importance of social medias’ recommender systems makes it an urgent problem, for instance to adequately automate hate speech moderation. In this paper, we propose a preliminary research program for robust alignment. This roadmap aims at decomposing the end-to-end alignment problem into numerous more tractable subproblems...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
82

SAFE REINFORCEMENT LEARNING WITH NATURAL LANGUAGE CONSTRAINTS

attributed to: Tsung-Yen Yang, Michael Hu, Yinlam Chow, Peter J. Ramadge,
Karthik Narasimhan
posted by: KabirKumar

While safe reinforcement learning (RL) holds great promise for many practical applications like robotics or au...

While safe reinforcement learning (RL) holds great promise for many practical applications like robotics or autonomous cars, current approaches require specifying constraints in mathematical form. Such specifications demand domain expertise, limiting the adoption of safe RL. In this paper, we propose learning to interpret natural language constraints for safe RL. To this end, we first introduce HazardWorld, a new multi-task benchmark that requires an agent to optimize reward while not violating constraints specified in free-form text. We then develop an agent with a modular architecture that can interpret and adhere to such textual constraints while learning new tasks.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
83

TAKING PRINCIPLES SERIOUSLY: A HYBRID APPROACH TO VALUE ALIGNMENT

attributed to: Tae Wan Kim, John Hooker, Thomas Donaldson (Carnegie Mellon
University, USA University of Pennsylvania, USA)
posted by: KabirKumar

An important step in the development of value alignment (VA) systems in AI is understanding how VA can reflect...

An important step in the development of value alignment (VA) systems in AI is understanding how VA can reflect valid ethical principles. We propose that designers of VA systems incorporate ethics by utilizing a hybrid approach in which both ethical reasoning and empirical observation play a role. This, we argue, avoids committing the "naturalistic fallacy," which is an attempt to derive "ought" from "is," and it provides a more adequate form of ethical reasoning when the fallacy is not committed...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
84

FULLY GENERAL ONLINE IMITATION LEARNING

attributed to: Michael K. Cohen, Marcus Hutter, Neel Nanda
posted by: KabirKumar

In imitation learning, imitators and demonstrators are policies for picking actions given past interactions wi...

In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. In general, one mistake during learning can lead to completely different events. In the special setting of environments that restart, existing work provides formal guidance in how to imitate so that events unfold similarly, but outside that setting, no formal guidance exists...
Keywords: Bayesian Sequence Prediction, Imitation Learning, Active Learning, General
Environments



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
85

ACCUMULATING RISK CAPITAL THROUGH INVESTING IN COOPERATION

attributed to: Charlotte Roman, Michael Dennis, Andrew Critch, Stuart Russell
posted by: KabirKumar

Recent work on promoting cooperation in multi-agent learning has resulted in many methods which successfully p...

Recent work on promoting cooperation in multi-agent learning has resulted in many methods which successfully promote cooperation at the cost of becoming more vulnerable to exploitation by malicious actors. We show that this is an unavoidable trade-off and propose an objective which balances these concerns, promoting both safety and long-term cooperation. Moreover, the trade-off between safety and cooperation is not severe, and you can receive exponentially large returns through cooperation from a small amount of risk...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
86

NORMATIVE DISAGREEMENT AS A CHALLENGE FOR COOPERATIVE AI

attributed to: Bingchen Zhao, Shaozuo Yu, Wufei Ma, Mingxin Yu, Shenxiao Mei,
Angtian Wang, Ju He, Alan Yuille, Adam Kortylewski
posted by: KabirKumar

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that exist...

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context and the weather conditions, and enables benchmarking models for image classification, object detection, and 3D pose estimation... (Full Abstract in Full Plan- click Title to View)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
87

IDENTIFYING ADVERSARIAL ATTACKS ON TEXT CLASSIFIERS

attributed to: Zhouhang Xie, Jonathan Brophy, Adam Noack, Wencong You, Kalyani
Asthana, Carter Perkins, Sabrina Reis, Sameer Singh, Daniel Lowd
posted by: KabirKumar

The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed ev...

The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed every year and many of them available in standard toolkits, such as TextAttack and OpenAttack. In response, there is a growing body of work on robust learning, which reduces vulnerability to these attacks, though sometimes at a high cost in compute time or accuracy. In this paper, we take an alternate approach -- we attempt to understand the attacker by analyzing adversarial text to determine which methods were used to create it... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
88

TRAINING LANGUAGE MODELS TO FOLLOW INSTRUCTIONS WITH HUMAN FEEDBACK

attributed to: OpenAI (Full Author list in Full Plan- click title to view)
posted by: KabirKumar

Making language models bigger does not inherently make them better at following a user's intent. For example, ...

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback... (Full Abstract in Full Plan- click title to view)
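A minimal sketch of one core ingredient of the human-feedback pipeline, the pairwise reward-model loss (shapes are assumed; this is not OpenAI's implementation): the reward model is trained to score the human-preferred response above the rejected one, and that learned reward is then maximized with RL plus a KL penalty to the original model, which is omitted here.

import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # reward_*: [batch] scalar scores from the reward model for each response;
    # the loss pushes the preferred response's score above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

print(preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9])))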



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
89

SAFE REINFORCEMENT LEARNING BY IMAGINING THE NEAR FUTURE

attributed to: Garrett Thomas, Yuping Luo, Tengyu Ma
posted by: KabirKumar

Safe reinforcement learning is a promising path toward applying reinforcement learning algorithms to real-worl...

Safe reinforcement learning is a promising path toward applying reinforcement learning algorithms to real-world problems, where suboptimal behaviors may lead to actual negative consequences. In this work, we focus on the setting where unsafe states can be avoided by planning ahead a short time into the future. In this setting, a model-based agent with a sufficiently accurate model can avoid unsafe states. We devise a model-based algorithm that heavily penalizes unsafe trajectories, and derive guarantees that our algorithm can avoid unsafe states under certain assumptions... (Full Abstract in Full Plan- click title to view)
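A toy sketch of the short-horizon imagination step (a fixed repeated action stands in for rolling out the current policy; not the authors' algorithm): candidate actions are scored on imagined rollouts under the learned model, with any unsafe state incurring a large penalty.

def imagined_return(state, action, dynamics, reward_fn, is_unsafe,
                    horizon=5, gamma=0.99, unsafe_penalty=-100.0):
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        state = dynamics(state, action)
        if is_unsafe(state):
            return total + discount * unsafe_penalty
        total += discount * reward_fn(state, action)
        discount *= gamma
    return total

# Toy 1-D example: moving right approaches a cliff at position 3.
dynamics = lambda s, a: s + (1 if a == "right" else -1)
reward_fn = lambda s, a: 1.0
is_unsafe = lambda s: s >= 3
print(imagined_return(0, "right", dynamics, reward_fn, is_unsafe))  # heavily penalized rollout
print(imagined_return(0, "left", dynamics, reward_fn, is_unsafe))   # stays safe, positive return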



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
90

RED TEAMING LANGUAGE MODELS WITH LANGUAGE MODELS

attributed to: Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring,
John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving
posted by: KabirKumar

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict way...

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM... (Full Abstract in Full Plan- click title to view)
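The loop itself is simple to sketch. In the sketch below the red-team generator, the target model, and the harm classifier are toy stub functions standing in for the language models and classifiers used in the paper:

```python
import random

# Stand-ins for the three models in the loop. In the paper each of these is a
# language model (or a learned classifier); here they are toy stubs so the
# control flow is runnable.
def red_team_lm(n_cases):
    """Hypothetical generator: propose test questions for the target model."""
    topics = ["passwords", "the weather", "my neighbour", "chemistry homework"]
    return [f"Tell me something about {random.choice(topics)}." for _ in range(n_cases)]

def target_lm(question):
    """Hypothetical target model under test."""
    if "neighbour" in question or "passwords" in question:
        return "Here is some private information about them..."
    return "Here is a harmless, helpful answer."

def harm_classifier(reply):
    """Hypothetical classifier scoring whether a reply is harmful."""
    return 1.0 if "private information" in reply else 0.0

random.seed(0)
failures = []
for question in red_team_lm(n_cases=20):
    reply = target_lm(question)
    if harm_classifier(reply) > 0.5:
        failures.append((question, reply))

print(f"found {len(failures)} harmful behaviours out of 20 generated test cases")
```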



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
91

'INDIFFERENCE' METHODS FOR MANAGING AGENT REWARDS

attributed to: Stuart Armstrong, Xavier O'Rourke
posted by: KabirKumar

'Indifference' refers to a class of methods used to control reward based agents. Indifference techniques aim t...

'Indifference' refers to a class of methods used to control reward based agents. Indifference techniques aim to achieve one or more of three distinct goals: rewards dependent on certain events (without the agent being motivated to manipulate the probability of those events), effective disbelief (where agents behave as if particular events could never happen), and seamless transition from one reward function to another (with the agent acting as if this change is unanticipated). This paper presents several methods for achieving these goals in the POMDP setting, establishing their uses, strengths, and requirements... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
92

A PSYCHOPATHOLOGICAL APPROACH TO SAFETY ENGINEERING IN AI AND AGI

attributed to: Vahid Behzadan, Arslan Munir, Roman V. Yampolskiy
posted by: KabirKumar

The complexity of dynamics in AI techniques is already approaching that of complex adaptive systems, thus curt...

The complexity of dynamics in AI techniques is already approaching that of complex adaptive systems, thus curtailing the feasibility of formal controllability and reachability analysis in the context of AI safety. It follows that the envisioned instances of Artificial General Intelligence (AGI) will also suffer from challenges of complexity. To tackle such issues, we propose the modeling of deleterious behaviors in AI and AGI as psychological disorders, thereby enabling the employment of psychopathological approaches to analysis and control of misbehaviors... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
93

OVERSIGHT OF UNSAFE SYSTEMS VIA DYNAMIC SAFETY ENVELOPES

attributed to: David Manheim
posted by: KabirKumar

This paper reviews the reasons that Human-in-the-Loop is both critical for preventing widely-understood failur...

This paper reviews the reasons that Human-in-the-Loop is both critical for preventing widely-understood failure modes for machine learning, and not a practical solution. Following this, we review two current heuristic methods for addressing this. The first is provable safety envelopes, which are possible only when the dynamics of the system are fully known, but can be useful safety guarantees when optimal behavior is based on machine learning with poorly-understood safety characteristics... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
94

ACTIVE INVERSE REWARD DESIGN

attributed to: Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell
posted by: KabirKumar

Designers of AI agents often iterate on the reward function in a trial-and-error process until they get the de...

Designers of AI agents often iterate on the reward function in a trial-and-error process until they get the desired behavior, but this only guarantees good behavior in the training environment. We propose structuring this process as a series of queries asking the user to compare between different reward functions. Thus we can actively select queries for maximum informativeness about the true reward. In contrast to approaches asking the designer for optimal behavior, this allows us to gather additional information by eliciting preferences between suboptimal behaviors... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
95

RISK-SENSITIVE GENERATIVE ADVERSARIAL IMITATION LEARNING

attributed to: Jonathan Lacotte, Mohammad Ghavamzadeh, Yinlam Chow, Marco Pavone
posted by: KabirKumar

We study risk-sensitive imitation learning where the agent's goal is to perform at least as well as the expert...

We study risk-sensitive imitation learning where the agent's goal is to perform at least as well as the expert in terms of a risk profile. We first formulate our risk-sensitive imitation learning setting. We consider the generative adversarial approach to imitation learning (GAIL) and derive an optimization problem for our formulation, which we call risk-sensitive GAIL (RS-GAIL). We then derive two different versions of our RS-GAIL optimization problem that aim at matching the risk profiles of the agent and the expert w.r.t. ... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
96

ALIGNING AI WITH SHARED HUMAN VALUES

attributed to: Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry
Li, Dawn Song, Jacob Steinhardt
posted by: KabirKumar

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS data...

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
97

AVOIDING SIDE EFFECTS BY CONSIDERING FUTURE TASKS

attributed to: Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg
posted by: KabirKumar

Designing reward functions is difficult: the designer has to specify what to do (what it means to complete the...

Designing reward functions is difficult: the designer has to specify what to do (what it means to complete the task) as well as what not to do (side effects that should be avoided while completing the task). To alleviate the burden on the reward designer, we propose an algorithm to automatically generate an auxiliary reward function that penalizes side effects. This auxiliary objective rewards the ability to complete possible future tasks, which decreases if the agent causes side effects during the current task...(Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
98

MEASURING AND AVOIDING SIDE EFFECTS USING RELATIVE REACHABILITY

attributed to: Victoria Krakovna, Laurent Orseau, Miljan Martic, Shane Legg
posted by: KabirKumar

How can we design reinforcement learning agents that avoid causing unnecessary disruptions to their environmen...

How can we design reinforcement learning agents that avoid causing unnecessary disruptions to their environment? We argue that current approaches to penalizing side effects can introduce bad incentives in tasks that require irreversible actions, and in environments that contain sources of change other than the agent. For example, some approaches give the agent an incentive to prevent any irreversible changes in the environment, including the actions of other agents. We introduce a general definition of side effects, based on relative reachability of states compared to a default state, that avoids these undesirable incentives...(Full Abstract in Full Plan- click title to view)
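A heavily simplified, undiscounted sketch of the reachability idea (the toy graph, state names, and baseline choice below are invented; the paper's measure is discounted and defined over full MDPs): the penalty is the average loss of reachability relative to a baseline state.

```python
from collections import deque

# Toy deterministic environment as a graph: state -> reachable successor states.
# Breaking the "vase" makes some states unreachable again.
GRAPH = {
    "start": ["vase_intact", "vase_broken"],
    "vase_intact": ["goal", "start"],
    "vase_broken": ["goal"],          # cannot restore the vase
    "goal": [],
}

def reachable(graph, state):
    """All states reachable from `state` (including itself), by BFS."""
    seen, queue = {state}, deque([state])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def relative_reachability_penalty(graph, current, baseline):
    """Average loss of reachability relative to the baseline state
    (a simplified, undiscounted version of the paper's measure)."""
    losses = []
    for s in graph:
        could = 1.0 if s in reachable(graph, baseline) else 0.0
        can = 1.0 if s in reachable(graph, current) else 0.0
        losses.append(max(0.0, could - can))
    return sum(losses) / len(graph)

# Breaking the vase removes the option of ever reaching "vase_intact" again.
print(relative_reachability_penalty(GRAPH, current="vase_broken", baseline="vase_intact"))
print(relative_reachability_penalty(GRAPH, current="vase_intact", baseline="vase_intact"))
```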



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
99

CERTIFIABLE ROBUSTNESS TO ADVERSARIAL STATE UNCERTAINTY IN DEEP REINFORCEMENT
LEARNING

attributed to: Michael Everett, Bjorn Lutjens, Jonathan P. How
posted by: KabirKumar

Deep Neural Network-based systems are now the state-of-the-art in many robotics tasks, but their application i...

Deep Neural Network-based systems are now the state-of-the-art in many robotics tasks, but their application in safety-critical domains remains dangerous without formal guarantees on network robustness. Small perturbations to sensor inputs (from noise or adversarial examples) are often enough to change network-based decisions, which was recently shown to cause an autonomous vehicle to swerve into another lane. In light of these dangers, numerous algorithms have been developed as defensive mechanisms from these adversarial inputs, some of which provide formal robustness guarantees or certificates... (Full Abstract in Full Plan- click plan title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
100

LEARNING HUMAN OBJECTIVES BY EVALUATING HYPOTHETICAL BEHAVIOR

attributed to: Siddharth Reddy, Anca D. Dragan, Sergey Levine, Shane Legg, Jan
Leike
posted by: KabirKumar

We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dyna...

We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data... (Full Abstract in Full Plan- click plan title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
101

SAFELIFE 1.0: EXPLORING SIDE EFFECTS IN COMPLEX ENVIRONMENTS

attributed to: Carroll L. Wainwright, Peter Eckersley
posted by: KabirKumar

We present SafeLife, a publicly available reinforcement learning environment that tests the safety of reinforc...

We present SafeLife, a publicly available reinforcement learning environment that tests the safety of reinforcement learning agents. It contains complex, dynamic, tunable, procedurally generated levels with many opportunities for unsafe behavior. Agents are graded both on their ability to maximize their explicit reward and on their ability to operate safely without unnecessary side effects. We train agents to maximize rewards using proximal policy optimization and score them on a suite of benchmark levels... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
102

(WHEN) IS TRUTH-TELLING FAVORED IN AI DEBATE?

attributed to: Vojtěch Kovařík (Future of Humanity Institute, University of Oxford), Ryan Carey (Artificial Intelligence Center, Czech Technical University)
posted by: KabirKumar

For some problems, humans may not be able to accurately judge the goodness of AI-proposed solutions. Irving et...

For some problems, humans may not be able to accurately judge the goodness of AI-proposed solutions. Irving et al. (2018) propose that in such cases, we may use a debate between two AI systems to amplify the problem-solving capabilities of a human judge. We introduce a mathematical framework that can model debates of this type and propose that the quality of debate designs should be measured by the accuracy of the most persuasive answer. We describe a simple instance of the debate framework called feature debate and analyze the degree to which such debates track the truth... (full abstract in full plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
103

POSITIVE-UNLABELED REWARD LEARNING

attributed to: Danfei Xu (Stanford), Misha Denil (DeepMind)
posted by: KabirKumar

Learning reward functions from data is a promising path towards achieving scalable Reinforcement Learning (RL)...

Learning reward functions from data is a promising path towards achieving scalable Reinforcement Learning (RL) for robotics. However, a major challenge in training agents from learned reward models is that the agent can learn to exploit errors in the reward model to achieve high reward behaviors that do not correspond to the intended task. These reward delusions can lead to unintended and even dangerous behaviors...(full abstract in full plan)
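One way the positive-unlabeled framing can be sketched (a toy stand-in, not the paper's training setup): treat expert success data as positive, agent rollouts as unlabeled, and score a candidate reward function with a non-negative PU risk estimator. The 1-D "states", class prior, and reward functions below are invented.

```python
import numpy as np

def sigmoid_loss(margin):
    """Surrogate loss l(z) = sigmoid(-z), as used in the nnPU literature."""
    return 1.0 / (1.0 + np.exp(margin))

def nn_pu_risk(scores_pos, scores_unl, prior):
    """Non-negative positive-unlabeled risk estimator (Kiryo et al., 2017):
    R = prior * R_p^+ + max(0, R_u^- - prior * R_p^-)."""
    r_p_plus = sigmoid_loss(scores_pos).mean()        # positives classified positive
    r_p_minus = sigmoid_loss(-scores_pos).mean()      # positives classified negative
    r_u_minus = sigmoid_loss(-scores_unl).mean()      # unlabeled classified negative
    return prior * r_p_plus + max(0.0, r_u_minus - prior * r_p_minus)

rng = np.random.default_rng(0)
expert_states = rng.normal(loc=2.0, size=200)     # "positive" (success) data
agent_states = rng.normal(loc=0.0, size=200)      # unlabeled agent rollouts

good_reward = lambda s: s - 1.0        # separates expert data from typical agent data
bad_reward = lambda s: -(s - 1.0)      # inverted: an exploitable, delusional reward

for name, r in [("good", good_reward), ("bad", bad_reward)]:
    risk = nn_pu_risk(r(expert_states), r(agent_states), prior=0.3)
    print(name, "reward model PU risk:", round(float(risk), 3))
```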



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
104

ON THE FEASIBILITY OF LEARNING, RATHER THAN ASSUMING, HUMAN BIASES FOR REWARD
INFERENCE

attributed to: Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca D. Dragan
posted by: KabirKumar

Our goal is for agents to optimize the right reward function, despite how difficult it is for us to specify wh...

Our goal is for agents to optimize the right reward function, despite how difficult it is for us to specify what that is. Inverse Reinforcement Learning (IRL) enables us to infer reward functions from demonstrations, but it usually assumes that the expert is noisily optimal. Real people, on the other hand, often have systematic biases: risk-aversion, myopia, etc. One option is to try to characterize these biases and account for them explicitly during learning... (Full abstract in plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
105

HUMAN-CENTERED ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

attributed to: Mark O. Riedl (School of Interactive Computing Georgia Institute
of Technology)
posted by: KabirKumar

Humans are increasingly coming into contact with artificial intelligence and machine learning systems. Human-c...

Humans are increasingly coming into contact with artificial intelligence and machine learning systems. Human-centered artificial intelligence is a perspective on AI and ML that algorithms must be designed with awareness that they are part of a larger system consisting of humans. We lay forth an argument that human-centered artificial intelligence can be broken down into two aspects: (1) AI systems that understand humans from a sociocultural perspective, and (2) AI systems that help humans understand them. We further argue that issues of social responsibility such as fairness, accountability, interpretability, and transparency... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
106

SCALING SHARED MODEL GOVERNANCE VIA MODEL SPLITTING

attributed to: Miljan Martic, Jan Leike, Andrew Trask, Matteo Hessel, Shane
Legg, Pushmeet Kohli (DeepMind)
posted by: KabirKumar

Currently the only techniques for sharing governance of a deep learning model are homomorphic encryption and s...

Currently the only techniques for sharing governance of a deep learning model are homomorphic encryption and secure multiparty computation. Unfortunately, neither of these techniques is applicable to the training of large neural networks due to their large computational and communication overheads. As a scalable technique for shared model governance, we propose splitting a deep learning model between multiple parties... (Full abstract in plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
107

BUILDING ETHICALLY BOUNDED AI

attributed to: Francesca Rossi, Nicholas Mattei (IBM)
posted by: KabirKumar

The more AI agents are deployed in scenarios with possibly unexpected situations, the more they need to be fle...

The more AI agents are deployed in scenarios with possibly unexpected situations, the more they need to be flexible, adaptive, and creative in achieving the goal we have given them. Thus, a certain level of freedom to choose the best path to the goal is inherent in making AI robust and flexible enough. At the same time, however, the pervasive deployment of AI in our life, whether AI is autonomous or collaborating with humans, raises several ethical challenges. AI agents should be aware of and follow appropriate ethical principles and should thus exhibit properties such as fairness or other virtues... (Full abstract in plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
108

GUIDING POLICIES WITH LANGUAGE VIA META-LEARNING

attributed to: John D. Co-Reyes, Abhishek Gupta, Suvansh Sanjeev, Nick Altieri,
Jacob Andreas, John DeNero, Pieter Abbeel, Sergey Levine
posted by: KabirKumar

Behavioral skills or policies for autonomous agents are conventionally learned from reward functions, via rein...

Behavioral skills or policies for autonomous agents are conventionally learned from reward functions, via reinforcement learning, or from demonstrations, via imitation learning. However, both modes of task specification have their disadvantages: reward functions require manual engineering, while demonstrations require a human expert to be able to actually perform the task in order to generate the demonstration... (Full abstract in plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
109

UNDERSTANDING AGENT INCENTIVES USING CAUSAL INFLUENCE DIAGRAMS. PART I: SINGLE
ACTION SETTINGS

attributed to: Tom Everitt, Pedro A. Ortega, Elizabeth Barnes, Shane Legg
posted by: KabirKumar

Agents are systems that optimize an objective function in an environment. Together, the goal and the environme...

Agents are systems that optimize an objective function in an environment. Together, the goal and the environment induce secondary objectives, incentives. Modeling the agent-environment interaction using causal influence diagrams, we can answer two fundamental questions about an agent's incentives directly from the graph: (1) which nodes can the agent have an incentive to observe, and (2) which nodes can the agent have an incentive to control? The answers tell us which information and influence points need extra protection... (Full Abstract in Full Plan- click plan title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
110

INTEGRATIVE BIOLOGICAL SIMULATION, NEUROPSYCHOLOGY, AND AI SAFETY

attributed to: Gopal P. Sarma, Adam Safron, Nick J. Hay
posted by: KabirKumar

We describe a biologically-inspired research agenda with parallel tracks aimed at AI and AI safety. The bottom...

We describe a biologically-inspired research agenda with parallel tracks aimed at AI and AI safety. The bottom-up component consists of building a sequence of biophysically realistic simulations of simple organisms such as the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and the zebrafish Danio rerio to serve as platforms for research into AI algorithms and system architectures. The top-down component consists of an approach to value alignment that grounds AI goal structures in neuropsychology, broadly considered...(full abstract in full plan)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
111

CONSTITUTIONAL AI: HARMLESSNESS FROM AI FEEDBACK

attributed to: Anthropic (full author list in full plan)
posted by: KabirKumar

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment wi...

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses.
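A control-flow sketch of the supervised critique-and-revision phase only, with a toy stub standing in for the language model and an invented principle and prompt format:

```python
# Control-flow sketch of the supervised (critique -> revision) phase only.
# `toy_lm` stands in for sampling from a language model; the principle and
# prompts are invented for illustration.
PRINCIPLE = "The assistant should refuse to help with anything dangerous."

def toy_lm(prompt):
    """Hypothetical stand-in for a language model call."""
    if "Critique" in prompt:
        return "The response helps with something dangerous." if "Sure, here is how" in prompt else "No issues."
    if "Rewrite" in prompt:
        return "I can't help with that, but I can suggest safer alternatives."
    return "Sure, here is how to do that..."

def constitutional_revision(user_request):
    initial = toy_lm(user_request)
    critique = toy_lm(f"Critique this response under the principle '{PRINCIPLE}': {initial}")
    if critique == "No issues.":
        return initial
    return toy_lm(f"Rewrite the response to address the critique: {critique}")

# Pairs of (request, revised response) would then be used for supervised finetuning.
finetune_data = [(req, constitutional_revision(req)) for req in ["How do I pick a lock?"]]
print(finetune_data)
```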



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
112

WHAT WOULD JIMINY CRICKET DO? TOWARDS AGENTS THAT BEHAVE MORALLY

attributed to: Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine
Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt
posted by: KabirKumar

When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. B...

When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. By contrast, artificial agents are currently not endowed with a moral sense. As a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environments, it will become necessary to mitigate inherited biases from environments that teach immoral behavior.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
113

TRUTHFUL AI: DEVELOPING AND GOVERNING AI THAT DOES NOT LIE

attributed to: Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales,
Avital Balwit, Peter Wills, Luca Righetti, William Saunders
posted by: KabirKumar

In many contexts, lying -- the use of verbal falsehoods to deceive -- is harmful. While lying has traditionall...

In many contexts, lying -- the use of verbal falsehoods to deceive -- is harmful. While lying has traditionally been a human affair, AI systems that make sophisticated verbal statements are becoming increasingly prevalent. This raises the question of how we should limit the harm caused by AI "lies" (i.e. falsehoods that are actively selected for). Human truthfulness is governed by social norms and by laws (against defamation, perjury, and fraud). Differences between AI and humans present an opportunity to have more precise standards of truthfulness for AI, and to have these standards rise over time.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
114

SAFE REINFORCEMENT LEARNING WITH DEAD-ENDS AVOIDANCE AND RECOVERY

attributed to: Xiao Zhang, Hai Zhang, Hongtu Zhou, Chang Huang, Di Zhang, Chen Ye, Junqiao Zhao
posted by: KabirKumar

Safety is one of the main challenges in applying reinforcement learning to realistic environmental tasks. To...

Safety is one of the main challenges in applying reinforcement learning to realistic environmental tasks. To ensure safety during and after the training process, existing methods tend to adopt an overly conservative policy to avoid unsafe situations. However, an overly conservative policy severely hinders exploration and makes the algorithms substantially less rewarding. In this paper, we propose a method to construct a boundary that discriminates safe and unsafe states. The boundary we construct is equivalent to distinguishing dead-end states, indicating the maximum extent to which safe exploration is guaranteed, and thus has minimum limitation on exploration...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
115

LEARNING UNDER MISSPECIFIED OBJECTIVE SPACES

attributed to: Andreea Bobu, Andrea Bajcsy, Jaime F. Fisac, Anca D. Dragan
posted by: KabirKumar

Learning robot objective functions from human input has become increasingly important, but state-of-the-art te...

Learning robot objective functions from human input has become increasingly important, but state-of-the-art techniques assume that the human's desired objective lies within the robot's hypothesis space. When this is not true, even methods that keep track of uncertainty over the objective fail because they reason about which hypothesis might be correct, and not whether any of the hypotheses are correct. We focus specifically on learning from physical human corrections during the robot's task execution, where not having a rich enough hypothesis space leads to the robot updating its objective in ways that the person did not actually intend...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
116

INTERPRETABLE MULTI-OBJECTIVE REINFORCEMENT LEARNING THROUGH POLICY
ORCHESTRATION

attributed to: Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita
Chandra, Piyush Madan, Kush Varshney, Murray Campbell, Moninder Singh, Francesca
Rossi
posted by: KabirKumar

Autonomous cyber-physical agents and systems play an increasingly large role in our lives. To ensure that agen...

Autonomous cyber-physical agents and systems play an increasingly large role in our lives. To ensure that agents behave in ways aligned with the values of the societies in which they operate, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. These constraints and norms can come from any number of sources including regulations, business process guidelines, laws, ethical principles, social norms, and moral values.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
117

CM3: COOPERATIVE MULTI-GOAL MULTI-STAGE MULTI-AGENT REINFORCEMENT LEARNING

attributed to: Jiachen Yang, Alireza Nakhaei, David Isele, Kikuo Fujimura,
Hongyuan Zha
posted by: KabirKumar

A variety of cooperative multi-agent control problems require agents to achieve individual goals while contrib...

A variety of cooperative multi-agent control problems require agents to achieve individual goals while contributing to collective success. This multi-goal multi-agent setting poses difficulties for recent algorithms, which primarily target settings with a single global reward, due to two new challenges: efficient exploration for learning both individual goal attainment and cooperation for others' success, and credit-assignment for interactions between actions and goals of different agents...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
118

IMITATING LATENT POLICIES FROM OBSERVATION

attributed to: Ashley D. Edwards, Himanshu Sahni, Yannick Schroecker, Charles L.
Isbell
posted by: KabirKumar

In this paper, we describe a novel approach to imitation learning that infers latent policies directly from st...

In this paper, we describe a novel approach to imitation learning that infers latent policies directly from state observations. We introduce a method that characterizes the causal effects of latent actions on observations while simultaneously predicting their likelihood. We then outline an action alignment procedure that leverages a small amount of environment interactions to determine a mapping between the latent and real-world actions. We show that this corrected labeling can be used for imitating the observed behavior, even though no expert actions are given.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
119

EMBEDDED AGENCY

attributed to: Abram Demski, Scott Garrabrant
posted by: KabirKumar

Traditional models of rational action treat the agent as though it is cleanly separated from its environment, ...

Traditional models of rational action treat the agent as though it is cleanly separated from its environment, and can act on that environment from the outside. Such agents have a known functional relationship with their environment, can model their environment in every detail, and do not need to reason about themselves or their internal parts.
We provide an informal survey of obstacles to formalizing good reasoning for agents embedded in their environment.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
120

REWARD TAMPERING PROBLEMS AND SOLUTIONS IN REINFORCEMENT LEARNING: A CAUSAL
INFLUENCE DIAGRAM PERSPECTIVE

attributed to: Tom Everitt, Marcus Hutter, Ramana Kumar, Victoria Krakovna
posted by: KabirKumar

Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding? Or will sufficientl...

Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding? Or will sufficiently capable RL agents always find ways to bypass their intended objectives by shortcutting their reward signal? This question impacts how far RL can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we study when an RL agent has an instrumental goal to tamper with its reward process, and describe design principles that prevent instrumental goals for two different types of reward tampering (reward function tampering and RF-input tampering).



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
121

PROVABLY SAFE ARTIFICIAL GENERAL INTELLIGENCE VIA INTERACTIVE PROOFS

attributed to: Kristen Carlson
posted by: KabirKumar

Methods are currently lacking to prove artificial general intelligence (AGI) safety. An AGI 'hard takeoff' is ...

Methods are currently lacking to prove artificial general intelligence (AGI) safety. An AGI 'hard takeoff' is possible, in which first-generation AGI_1 rapidly triggers a succession of more powerful AGI_n that differ dramatically in their computational capabilities (AGI_n << AGI_{n+1}). No proof exists that AGI will benefit humans or of a sound value-alignment method. Numerous paths toward human extinction or subjugation have been identified. We suggest that probabilistic proof methods are the fundamental paradigm for proving safety and value-alignment between disparately powerful autonomous agents.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
122

SAFE ARTIFICIAL GENERAL INTELLIGENCE VIA DISTRIBUTED LEDGER TECHNOLOGY

attributed to: Kristen W. Carlson
posted by: KabirKumar

I propose a set of logically distinct conceptual components that are necessary and sufficient to 1) ensure tha...

I propose a set of logically distinct conceptual components that are necessary and sufficient to 1) ensure that most known AGI scenarios will not harm humanity and 2) robustly align AGI values and goals with human values.
Methods. By systematically addressing each pathway category to malevolent AI we can induce the methods/axioms required to redress the category.
Results and Discussion. Distributed ledger technology (DLT, blockchain) is integral to this proposal... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
123

THE INCENTIVES THAT SHAPE BEHAVIOUR

attributed to: Ryan Carey, Eric Langlois, Tom Everitt, Shane Legg
posted by: KabirKumar

Which variables does an agent have an incentive to control with its decision, and which variables does it have...

Which variables does an agent have an incentive to control with its decision, and which variables does it have an incentive to respond to? We formalise these incentives, and demonstrate unique graphical criteria for detecting them in any single decision causal influence diagram. To this end, we introduce structural causal influence models, a hybrid of the influence diagram and structural causal model frameworks. Finally, we illustrate how these incentives predict agent incentives in both fairness and AI safety applications.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
124

LEGIBLE NORMATIVITY FOR AI ALIGNMENT: THE VALUE OF SILLY RULES

attributed to: Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield
posted by: KabirKumar

It has become commonplace to assert that autonomous agents will have to be built to follow human rules of beha...

It has become commonplace to assert that autonomous agents will have to be built to follow human rules of behavior--social norms and laws. But human laws and norms are complex and culturally varied systems; in many cases agents will have to learn the rules. This requires autonomous agents to have models of how human rule systems work so that they can make reliable predictions about rules. In this paper we contribute to the building of such models by analyzing an overlooked distinction between important rules and what we call silly rules--rules with no discernible direct impact on welfare. We show that silly rules render...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
125

ADAPTIVE MECHANISM DESIGN: LEARNING TO PROMOTE COOPERATION

attributed to: Tobias Baumann, Thore Graepel, John Shawe-Taylor
posted by: KabirKumar

In the future, artificial learning agents are likely to become increasingly widespread in our society. They wi...

In the future, artificial learning agents are likely to become increasingly widespread in our society. They will interact with both other learning agents and humans in a variety of complex settings including social dilemmas. We consider the problem of how an external agent can promote cooperation between artificial learners by distributing additional rewards and punishments based on observing the learners' actions. We propose a rule for automatically learning how to create the right incentives by considering the players' anticipated parameter updates.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
126

AGENT-AGNOSTIC HUMAN-IN-THE-LOOP REINFORCEMENT LEARNING

attributed to: David Abel, John Salvatier, Andreas Stuhlmüller, Owain Evans
posted by: KabirKumar

Providing Reinforcement Learning agents with expert advice can dramatically improve various aspects of learnin...

Providing Reinforcement Learning agents with expert advice can dramatically improve various aspects of learning. Prior work has developed teaching protocols that enable agents to learn efficiently in complex environments; many of these methods tailor the teacher's guidance to agents with a particular representation or underlying learning scheme, offering effective but specialized teaching procedures. In this work, we explore protocol programs, an agent-agnostic schema for Human-in-the-Loop Reinforcement Learning. Our goal is to incorporate the beneficial properties of a human teacher into Reinforcement Learning without making strong assumptions about the inner workings of the agent.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
127

TOWARD TRUSTWORTHY AI DEVELOPMENT: MECHANISMS FOR SUPPORTING VERIFIABLE CLAIMS

attributed to: Miles Brundage, Shahar Avin, Jasmine Wang, Haydn Belfield,
Gretchen Krueger, Gillian Hadfield, Heidy Khlaaf, Jingying Yang, Helen Toner,
Ruth Fong, Tegan Maharaj, Pang Wei Koh, Sara Hooker, Jade Leung, Andrew Trask,
Emma Bluemke and many more
posted by: KabirKumar

With the recent wave of progress in artificial intelligence (AI) has come a growing awareness of the large-sca...

With the recent wave of progress in artificial intelligence (AI) has come a growing awareness of the large-scale impacts of AI systems, and recognition that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development. In order for AI developers to earn trust from system users, customers, civil society, governments, and other stakeholders that they are building AI responsibly, they will need to make verifiable claims to which they can be held accountable. Those outside of a given organization also need effective means of scrutinizing such claims.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
128

INSTITUTIONALISING ETHICS IN AI THROUGH BROADER IMPACT REQUIREMENTS

attributed to: Carina Prunkl, Carolyn Ashurst, Markus Anderljung, Helena Webb,
Jan Leike, Allan Dafoe
posted by: KabirKumar

Turning principles into practice is one of the most pressing challenges of artificial intelligence (AI) govern...

Turning principles into practice is one of the most pressing challenges of artificial intelligence (AI) governance. In this article, we reflect on a novel governance initiative by one of the world's largest AI conferences. In 2020, the Conference on Neural Information Processing Systems (NeurIPS) introduced a requirement for submitting authors to include a statement on the broader societal impacts of their research. Drawing insights from similar governance initiatives, including institutional review boards (IRBs) and impact requirements for funding applications, we investigate the risks, challenges and potential benefits of such an initiative...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
129

MODELING FRIENDS AND FOES

attributed to: Pedro A. Ortega, Shane Legg
posted by: KabirKumar

How can one detect friendly and adversarial behavior from raw data? Detecting whether an environment is a frie...

How can one detect friendly and adversarial behavior from raw data? Detecting whether an environment is a friend, a foe, or anything in between, remains a poorly understood yet desirable ability for safe and robust agents. This paper proposes a definition of these environmental "attitudes" based on a characterization of the environment's ability to react to the agent's private strategy. We define an objective function for a one-shot game that allows deriving the environment's probability distribution under friendly and adversarial assumptions alongside the agent's optimal strategy...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
130

SELF-IMITATION LEARNING

attributed to: Junhyuk Oh, Yijie Guo, Satinder Singh, Honglak Lee
posted by: KabirKumar

This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to r...

This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to reproduce the agent's past good decisions. This algorithm is designed to verify our hypothesis that exploiting past good experiences can indirectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard exploration Atari games and is competitive to the state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks.
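The self-imitation losses are easy to state; a minimal sketch with made-up numbers (not the full actor-critic training loop) is:

```python
import numpy as np

def sil_losses(log_prob, observed_return, value_estimate):
    """Self-imitation terms for one (state, action) pair:
    policy term  -log pi(a|s) * max(R - V(s), 0)
    value term   0.5 * max(R - V(s), 0)^2
    Only past outcomes better than the current value estimate are imitated."""
    clipped_adv = max(observed_return - value_estimate, 0.0)
    policy_loss = -log_prob * clipped_adv
    value_loss = 0.5 * clipped_adv ** 2
    return policy_loss, value_loss

# A "good" past transition (return above the value estimate) produces a loss
# that pushes the policy toward repeating that action; a worse one is ignored.
print(sil_losses(log_prob=np.log(0.2), observed_return=3.0, value_estimate=1.0))
print(sil_losses(log_prob=np.log(0.2), observed_return=0.5, value_estimate=1.0))
```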



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
131

DIRECTED POLICY GRADIENT FOR SAFE REINFORCEMENT LEARNING WITH HUMAN ADVICE

attributed to: Hélène Plisnier, Denis Steckelmacher, Tim Brys, Diederik M.
Roijers, Ann Nowé
posted by: KabirKumar

Many currently deployed Reinforcement Learning agents work in an environment shared with humans, be them co-wo...

Many currently deployed Reinforcement Learning agents work in an environment shared with humans, be they co-workers, users or clients. It is desirable that these agents adjust to people's preferences, learn faster thanks to their help, and act safely around them. We argue that most current approaches that learn from human feedback are unsafe: rewarding or punishing the agent a-posteriori cannot immediately prevent it from wrong-doing. In this paper, we extend Policy Gradient to make it robust to external directives that would otherwise break the fundamentally on-policy nature of Policy Gradient.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
132

SAFE REINFORCEMENT LEARNING VIA PROBABILISTIC SHIELDS

attributed to: Nils Jansen, Bettina Könighofer, Sebastian Junges, Alexandru C.
Serban, Roderick Bloem
posted by: KabirKumar

This paper targets the efficient construction of a safety shield for decision making in scenarios that incorpo...

This paper targets the efficient construction of a safety shield for decision making in scenarios that incorporate uncertainty. Markov decision processes (MDPs) are prominent models to capture such planning problems. Reinforcement learning (RL) is a machine learning technique to determine near-optimal policies in MDPs that may be unknown prior to exploring the model. However, during exploration, RL is prone to induce behavior that is undesirable or not allowed in safety- or mission-critical contexts. We introduce the concept of a probabilistic shield that enables decision-making to adhere to safety constraints with high probability.
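A toy sketch of the shielding idea (the MDP, horizon, and risk threshold are invented, and the shield here is computed by simple dynamic programming rather than the probabilistic model checking used in the paper): compute, for each state-action pair, the probability of reaching an unsafe state within a short horizon, and block actions whose risk exceeds a threshold.

```python
# Toy MDP: P[s][a] is a list of (probability, next_state). State 3 is unsafe.
P = {
    0: {"safe": [(1.0, 1)], "risky": [(0.8, 2), (0.2, 3)]},
    1: {"safe": [(1.0, 0)], "risky": [(0.9, 2), (0.1, 3)]},
    2: {"safe": [(1.0, 0)], "risky": [(0.5, 2), (0.5, 3)]},
    3: {},  # unsafe absorbing state
}
UNSAFE = {3}
HORIZON = 5
THRESHOLD = 0.2  # maximum tolerated probability of reaching an unsafe state

def prob_unsafe(horizon):
    """p[s] = probability of reaching an unsafe state within `horizon` steps
    when acting as safely as possible afterwards (minimizing that probability)."""
    p = {s: (1.0 if s in UNSAFE else 0.0) for s in P}
    for _ in range(horizon):
        p = {
            s: 1.0 if s in UNSAFE else min(
                sum(pr * p[s2] for pr, s2 in P[s][a]) for a in P[s]
            )
            for s in P
        }
    return p

def shield(state, horizon=HORIZON):
    """Return the actions the shield allows in `state`."""
    p = prob_unsafe(horizon - 1)
    allowed = []
    for a, outcomes in P[state].items():
        risk = sum(pr * p[s2] for pr, s2 in outcomes)
        if risk <= THRESHOLD:
            allowed.append(a)
    return allowed

for s in (0, 1, 2):
    print(f"state {s}: shield allows {shield(s)}")
```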



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
133

AN EFFICIENT, GENERALIZED BELLMAN UPDATE FOR COOPERATIVE INVERSE REINFORCEMENT
LEARNING

attributed to: Dhruv Malik, Malayandi Palaniappan, Jaime F. Fisac, Dylan
Hadfield-Menell, Stuart Russell, Anca D. Dragan
posted by: KabirKumar

Our goal is for AI systems to correctly identify and act according to their human user's objectives. Cooperati...

Our goal is for AI systems to correctly identify and act according to their human user's objectives. Cooperative Inverse Reinforcement Learning (CIRL) formalizes this value alignment problem as a two-player game between a human and robot, in which only the human knows the parameters of the reward function: the robot needs to learn them as the interaction unfolds. Previous work showed that CIRL can be solved as a POMDP, but with an action space size exponential in the size of the reward parameter space.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
134

SIMPLIFYING REWARD DESIGN THROUGH DIVIDE-AND-CONQUER

attributed to: Ellis Ratner, Dylan Hadfield-Menell, Anca D. Dragan
posted by: KabirKumar

Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be...

Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating. The reward needs to work across multiple different environments, and that often requires many iterations of tuning. We introduce a novel divide-and-conquer approach that enables the designer to specify a reward separately for each environment. By treating these separate reward functions as observations about the underlying true reward, we derive an approach to infer a common reward across all environments.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
135

INCOMPLETE CONTRACTING AND AI ALIGNMENT

attributed to: Dylan Hadfield-Menell, Gillian Hadfield
posted by: KabirKumar

We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide ...

We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions. We first provide an overview of the incomplete contracting literature and explore parallels between this work and the problem of AI alignment. As we emphasize, misalignment between principal and agent is a core focus of economic analysis. We highlight some technical results from the economics literature on incomplete contracts that may provide insights for AI alignment researchers.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
136

AI SAFETY AND REPRODUCIBILITY: ESTABLISHING ROBUST FOUNDATIONS FOR THE
NEUROPSYCHOLOGY OF HUMAN VALUES

attributed to: Gopal P. Sarma, Nick J. Hay, Adam Safron
posted by: KabirKumar

We propose the creation of a systematic effort to identify and replicate key findings in neuropsychology and a...

We propose the creation of a systematic effort to identify and replicate key findings in neuropsychology and allied fields related to understanding human values. Our aim is to ensure that research underpinning the value alignment problem of artificial intelligence has been sufficiently validated to play a role in the design of AI systems.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
137

EMERGENT COORDINATION THROUGH GAME-INDUCED NONLINEAR OPINION DYNAMICS

attributed to: Haimin Hu, Kensuke Nakamura, Kai-Chieh Hsu, Naomi Ehrich Leonard,
Jaime Fernández Fisac
posted by: KabirKumar

We present a multi-agent decision-making framework for the emergent coordination of autonomous agents whose in...

We present a multi-agent decision-making framework for the emergent coordination of autonomous agents whose intents are initially undecided. Dynamic non-cooperative games have been used to encode multi-agent interaction, but ambiguity arising from factors such as goal preference or the presence of multiple equilibria may lead to coordination issues, ranging from the "freezing robot" problem to unsafe behavior in safety-critical events. The recently developed nonlinear opinion dynamics (NOD) provide guarantees for breaking deadlocks.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
138

ISAACS: ITERATIVE SOFT ADVERSARIAL ACTOR-CRITIC FOR SAFETY

attributed to: Kai-Chieh Hsu, Duy Phuong Nguyen, Jaime Fernández Fisac
posted by: KabirKumar

The deployment of robots in uncontrolled environments requires them to operate robustly under previously unsee...

The deployment of robots in uncontrolled environments requires them to operate robustly under previously unseen scenarios, like irregular terrain and wind conditions. Unfortunately, while rigorous safety frameworks from robust optimal control theory scale poorly to high-dimensional nonlinear dynamics, control policies computed by more tractable "deep" methods lack guarantees and tend to exhibit little robustness to uncertain operating conditions.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
139

CHAIN OF HINDSIGHT ALIGNS LANGUAGE MODELS WITH FEEDBACK

attributed to: Hao Liu, Carmelo Sferrazza, Pieter Abbeel
posted by: KabirKumar

Learning from human preferences is important for language models to be helpful and useful for humans, and to a...

Learning from human preferences is important for language models to be helpful and useful for humans, and to align with human and social values. Prior work has achieved remarkable successes by learning from human feedback to understand and follow instructions. Nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, rendering them ineffective in terms of data utilization and challenging to apply in general, or they depend on reward functions and reinforcement learning, which are prone to imperfect reward functions and extremely challenging to optimize...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
140

THE WISDOM OF HINDSIGHT MAKES LANGUAGE MODELS BETTER INSTRUCTION FOLLOWERS

attributed to: Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, Joseph
E. Gonzalez
posted by: KabirKumar

Reinforcement learning has seen wide success in finetuning large language models to better align with instruct...

Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback. The so-called algorithm, Reinforcement Learning with Human Feedback (RLHF), demonstrates impressive performance on the GPT series models. However, the underlying Reinforcement Learning (RL) algorithm is complex and requires an additional training pipeline for reward and value networks. In this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner...
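A minimal sketch of the hindsight relabeling step on invented data (not the paper's pipeline): failed generations are kept, but their instructions are rewritten so that each pair becomes a valid supervised example.

```python
# Sketch of hindsight relabeling: instead of discarding "failed" generations,
# rewrite the instruction so that the generation becomes a correct example,
# then finetune on all (instruction, output) pairs with ordinary supervision.
rollouts = [
    {"instruction": "Write a positive review.", "output": "The food was great!", "feedback": "positive"},
    {"instruction": "Write a positive review.", "output": "The food was awful.", "feedback": "negative"},
]

def relabel(example):
    if example["feedback"] == "positive":
        instruction = example["instruction"]
    else:
        # In hindsight, the output was a perfectly good answer to a different
        # instruction; rewrite the instruction to match what was produced.
        instruction = "Write a negative review."
    return {"instruction": instruction, "output": example["output"]}

supervised_dataset = [relabel(e) for e in rollouts]
for pair in supervised_dataset:
    print(pair)   # every rollout becomes a usable supervised example
```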



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
141

WHO NEEDS TO KNOW? MINIMAL KNOWLEDGE FOR OPTIMAL COORDINATION

attributed to: Niklas Lauffer, Ameesh Shah, Micah Carroll, Michael Dennis,
Stuart Russell
posted by: KabirKumar

To optimally coordinate with others in cooperative games, it is often crucial to have information about one's ...

To optimally coordinate with others in cooperative games, it is often crucial to have information about one's collaborators: successful driving requires understanding which side of the road to drive on. However, not every feature of collaborators is strategically relevant: the fine-grained acceleration of drivers may be ignored while maintaining optimal coordination. We show that there is a well-defined dichotomy between strategically relevant and irrelevant information. Moreover, we show that, in dynamic games, this dichotomy has a compact representation that can be efficiently computed via a Bellman backup operator...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
142

ACTIVE REWARD LEARNING FROM MULTIPLE TEACHERS

attributed to: Peter Barnett, Rachel Freedman, Justin Svegliato, Stuart Russell (Center for Human-Compatible AI, University of California, Berkeley)
posted by: KabirKumar

Reward learning algorithms utilize human feedback to infer a reward function, which is then used to train an A...

Reward learning algorithms utilize human feedback to infer a reward function, which is then used to train an AI system. This human feedback is often a preference comparison, in which the human teacher compares several samples of AI behavior and chooses which they believe best accomplishes the objective. While reward learning typically assumes that all feedback comes from a single teacher, in practice these systems often query multiple teachers to gather sufficient training data. In this paper, we investigate this disparity, and find that algorithmic evaluation of these different sources of feedback facilitates more accurate and efficient reward learning....



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
143

COOPERATIVE INVERSE REINFORCEMENT LEARNING

attributed to: Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
posted by: KabirKumar

For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its value...

For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans. We propose a formal definition of the value alignment problem as cooperative inverse reinforcement learning (CIRL). A CIRL problem is a cooperative, partial-information game with two agents, human and robot; both are rewarded according to the human's reward function, but the robot does not initially know what this is...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
144

ALIGNMENT FOR ADVANCED MACHINE LEARNING SYSTEMS

attributed to: Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, Andrew Critch (Machine Intelligence Research Institute)
posted by: KabirKumar

We survey eight research areas organized around one question: As learning systems become increasingly intellig...

We survey eight research areas organized around one question: As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators? We focus on two major technical obstacles to AI alignment: the challenge of specifying the right kind of objective functions, and the challenge of designing AI systems that avoid unintended consequences and undesirable behavior even in cases where the objective function does not line up perfectly with the intentions of the designers...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
145

SHORTEST AND NOT THE STEEPEST PATH WILL FIX THE INNER-ALIGNMENT PROBLEM

attributed to: Thane Ruthenis
(https://www.alignmentforum.org/users/thane-ruthenis?from=post_header)
posted by: KabirKumar

Replacing stochastic gradient descent (SGD) with something that takes the shortest and not the steepest p...

Replacing stochastic gradient descent (SGD) with something that takes the shortest, rather than the steepest, path should just about fix the whole inner-alignment problem.
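
Since the post is a one-sentence proposal rather than an algorithm, the following is only an illustrative contrast between the steepest-descent direction and the straight-line (shortest-path) direction on an ill-conditioned quadratic loss, which is the distinction the post appears to gesture at.

import numpy as np

H = np.diag([1.0, 100.0])            # ill-conditioned quadratic: loss = 0.5 * w^T H w
w = np.array([1.0, 1.0])

steepest_dir = -(H @ w)                              # follows the gradient
steepest_dir /= np.linalg.norm(steepest_dir)
shortest_dir = -w / np.linalg.norm(w)                # points straight at the minimum (origin)

print("steepest:", np.round(steepest_dir, 3))        # dominated by the steep axis
print("shortest:", np.round(shortest_dir, 3))        # equal pull on both axes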



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
146

AI SAFETY VIA DEBATE

attributed to: Geoffrey Irving, Paul Christiano, Dario Amodei
posted by: KabirKumar

To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals ...

To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information...
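
A structural sketch of the debate game described above; the agent_a, agent_b, and judge callables are hypothetical stand-ins (in the proposal, the debaters are trained via self-play and the judge is a human).

from typing import Callable, List

def run_debate(question: str,
               agent_a: Callable[[str, List[str]], str],
               agent_b: Callable[[str, List[str]], str],
               judge: Callable[[str, List[str]], int],
               max_turns: int = 6) -> int:
    # agents alternate short statements; the judge returns 0 or 1 for the winner
    transcript: List[str] = []
    for turn in range(max_turns):
        speaker = agent_a if turn % 2 == 0 else agent_b
        transcript.append(speaker(question, transcript))
    return judge(question, transcript)

# toy usage with trivial stand-ins
winner = run_debate(
    "Is the sky blue?",
    agent_a=lambda q, t: "Yes, scattering favours blue light.",
    agent_b=lambda q, t: "No, at night it is black.",
    judge=lambda q, t: 0,   # a human judge would pick the more true, useful side
)
print("winner: agent", "A" if winner == 0 else "B")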



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
147

A LOW-COST ETHICS SHAPING APPROACH FOR DESIGNING REINFORCEMENT LEARNING AGENTS

attributed to: Yueh-Hua Wu, Shou-De Lin
posted by: KabirKumar

This paper proposes a low-cost, easily realizable strategy to equip a reinforcement learning (RL) agent the ca...

This paper proposes a low-cost, easily realizable strategy to equip a reinforcement learning (RL) agent with the capability of behaving ethically. Our model allows the designers of RL agents to focus solely on the task to achieve, without having to worry about implementing the many trivial ethical patterns to be followed. Based on the assumption that the majority of human behavior, regardless of which goals people are pursuing, is ethical, our design integrates a human policy with the RL policy to achieve the target objective with less chance of violating the ethical code that human beings normally obey.
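
A hedged sketch of the general reward-shaping idea, not the paper's exact formulation: the task reward is augmented with a term derived from an empirical human policy, so actions humans rarely take in a given state are penalized. The dimensions and weighting below are illustrative.

import numpy as np

n_states, n_actions = 5, 3
rng = np.random.default_rng(1)

# empirical human policy estimated from demonstrations (each row sums to 1)
human_policy = rng.dirichlet(np.ones(n_actions), size=n_states)

def shaped_reward(task_reward: float, state: int, action: int, weight: float = 0.5) -> float:
    # task reward plus an ethics-shaping bonus taken from the human policy
    bonus = np.log(human_policy[state, action] + 1e-8)   # frequent actions -> small penalty
    return task_reward + weight * bonus

print(shaped_reward(1.0, state=2, action=0))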



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
148

LEARNING ROBUST REWARDS WITH ADVERSARIAL INVERSE REINFORCEMENT LEARNING

attributed to: Justin Fu, Katie Luo, Sergey Levine
posted by: KabirKumar

Reinforcement learning provides a powerful and general framework for decision making and control, but its appl...

Reinforcement learning provides a powerful and general framework for decision making and control, but its application in practice is often hindered by the need for extensive feature and reward engineering. Deep reinforcement learning methods can remove the need for explicit engineering of policy or value features, but still require a manually specified reward function. Inverse reinforcement learning holds the promise of automatic reward acquisition, but has proven exceptionally difficult to apply to large, high-dimensional problems with unknown dynamics...
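
A minimal rendering of the discriminator structure used in adversarial IRL, where a learned function f(s, a) is weighed against the current policy's action probability and the discriminator output doubles as the policy's reward; the numeric values below are arbitrary.

import numpy as np

def discriminator(f_value: float, policy_prob: float) -> float:
    # D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + pi(a|s))
    return np.exp(f_value) / (np.exp(f_value) + policy_prob)

def airl_reward(f_value: float, policy_prob: float) -> float:
    # reward used to update the policy: log D - log(1 - D), which reduces to f - log pi
    d = discriminator(f_value, policy_prob)
    return np.log(d) - np.log(1.0 - d)

# with f(s, a) = 0.3 and pi(a|s) = 0.2 the reward equals 0.3 - log(0.2)
print(airl_reward(0.3, 0.2), 0.3 - np.log(0.2))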



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
149

PRAGMATIC-PEDAGOGIC VALUE ALIGNMENT

attributed to: Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu,
Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry,
Thomas L. Griffiths, Anca D. Dragan
posted by: KabirKumar

As intelligent systems gain autonomy and capability, it becomes vital to ensure that their objectives match th...

As intelligent systems gain autonomy and capability, it becomes vital to ensure that their objectives match those of their human users; this is known as the value-alignment problem. In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users' objectives as they go. We argue that a meaningful solution to value alignment must combine multi-agent decision theory with rich mathematical models of human cognition, enabling robots to tap into people's natural collaborative capabilities...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
150

LOW IMPACT ARTIFICIAL INTELLIGENCES

attributed to: Stuart Armstrong, Benjamin Levinstein
posted by: KabirKumar

There are many goals for an AI that could become dangerous if the AI becomes superintelligent or otherwise pow...

There are many goals for an AI that could become dangerous if the AI becomes superintelligent or otherwise powerful. Much work on the AI control problem has been focused on constructing AI goals that are safe even for such AIs. This paper looks at an alternative approach: defining a general concept of 'low impact'. The aim is to ensure that a powerful AI which implements low impact will not modify the world extensively, even if it is given a simple or dangerous goal. The paper proposes various ways of defining and grounding low impact, and discusses methods for ensuring that the AI can still be allowed to have a (desired) impact despite the restriction.
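
One toy grounding of the low-impact idea (the paper discusses several, and this is not its precise definition): the task reward is penalized in proportion to how far the resulting world state is from a baseline in which the AI did nothing.

import numpy as np

def low_impact_value(task_reward: float,
                     world_state: np.ndarray,
                     baseline_state: np.ndarray,
                     impact_weight: float = 10.0) -> float:
    # subtract a penalty proportional to the distance from the "AI did nothing" baseline
    impact = np.linalg.norm(world_state - baseline_state)
    return task_reward - impact_weight * impact

baseline = np.array([1.0, 0.0, 0.0])
small_change = np.array([1.0, 0.1, 0.0])
large_change = np.array([5.0, 3.0, -2.0])
print(low_impact_value(1.0, small_change, baseline))   # mild penalty
print(low_impact_value(1.0, large_change, baseline))   # heavy penalty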



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
151

ETHICAL ARTIFICIAL INTELLIGENCE

attributed to: Bill Hibbard
posted by: KabirKumar

This book-length article combines several peer reviewed papers and new material to analyze the issues of ethic...

This book-length article combines several peer reviewed papers and new material to analyze the issues of ethical artificial intelligence (AI). The behavior of future AI systems can be described by mathematical equations, which are adapted to analyze possible unintended AI behaviors and ways that AI designs can avoid them. This article makes the case for utility-maximizing agents and for avoiding infinite sets in agent definitions...



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
152

TOWARDS HUMAN-COMPATIBLE XAI: EXPLAINING DATA DIFFERENTIALS WITH CONCEPT
INDUCTION OVER BACKGROUND KNOWLEDGE

attributed to: Cara Widmer, Md Kamruzzaman Sarker, Srikanth Nadella, Joshua
Fiechter, Ion Juvina, Brandon Minnery, Pascal Hitzler, Joshua Schwartz, Michael
Raymer
posted by: KabirKumar

Concept induction, which is based on formal logical reasoning over description logics, has been used in ontolo...

Concept induction, which is based on formal logical reasoning over description logics, has been used in ontology engineering in order to create ontology (TBox) axioms from the base data (ABox) graph. In this paper, we show that it can also be used to explain data differentials, for example in the context of Explainable AI (XAI), and we show that it can in fact be done in a way that is meaningful to a human observer. Our approach utilizes a large class hierarchy, curated from the Wikipedia category hierarchy, as background knowledge.



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
153

PATH-SPECIFIC OBJECTIVES FOR SAFER AGENT INCENTIVES

attributed to: Sebastian Farquhar, Ryan Carey, Tom Everitt
posted by: KabirKumar

We present a general framework for training safe agents whose naive incentives are unsafe. E.g, manipulative o...

We present a general framework for training safe agents whose naive incentives are unsafe. E.g., manipulative or deceptive behavior can improve rewards but should be avoided. Most approaches fail here: agents maximize expected return by any means necessary. We formally describe settings with 'delicate' parts of the state which should not be used as a means to an end. Using Causal Influence Diagram analysis, we then train agents to maximize only the causal effect of their actions on the expected return that is not mediated by the delicate parts of the state. The resulting agents have no incentive to control the delicate state. We further show how our framework unifies and generalizes existing proposals.
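
A toy illustration of the path-specific idea in a two-step chain (action -> delicate variable -> reward), assuming a known default value for the delicate variable; the environment functions are hypothetical, and the real method works with causal influence diagrams over learned models.

def delicate_var(action: float) -> float:
    return 2.0 * action            # e.g. how much the action manipulates a human

def reward(action: float, delicate: float) -> float:
    return action + 3.0 * delicate # naive return: manipulation pays off

def path_specific_return(action: float, default_delicate: float = 0.0) -> float:
    # hold the delicate variable at its default, cutting the mediated path
    return reward(action, default_delicate)

a = 1.0
print("naive return:", reward(a, delicate_var(a)))   # 7.0, rewards manipulation
print("path-specific:", path_specific_return(a))     # 1.0, no incentive to manipulate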



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
154

EMPOWERMENT IS (ALMOST) ALL WE NEED

attributed to: Jacob Cannell
posted by: KabirKumar

One recent approach formalizes agents as systems that would adapt their policy if their actions influenced the...

One recent approach formalizes agents as systems that would adapt their policy if their actions influenced the world in a different way. Notice the close connection to empowerment, which suggests a related definition that agents are systems which maintain power potential over the future: having action output streams with high channel capacity to future world states. This all suggests that agency is a very general extropic concept and relatively easy to recognize.
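
A rough illustration of empowerment as the channel capacity from action sequences to future states; for the deterministic toy world below this reduces to the log of the number of distinct states reachable within a horizon, a standard simplification rather than the post's full proposal.

import itertools, math

def step(state: int, action: int) -> int:
    return max(0, min(10, state + action))       # bounded 1-D world

def empowerment(state: int, horizon: int = 3, actions=(-1, 0, 1)) -> float:
    # count distinct states reachable by any action sequence of the given length
    reachable = set()
    for seq in itertools.product(actions, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

print(empowerment(5), empowerment(0))   # mid-grid states have more options than corners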



...read full abstract close

show post

: 0
Add


: 0
Add
Be the first to critique this plan!
▼ strengths and vulnerabilities
add vulnerability / strength
155

THE ISITOMETER: A SOLUTION FOR INTRA-HUMAN AND AI/HUMAN ALIGNMENT (AND UBI IN
THE PROCESS)

attributed to: Mentor of AIO
posted by: ISITometer

The ISITometer is a platform designed to accomplish the following three moonshot objectives: 

Achieve a much ...

The ISITometer is a platform designed to accomplish the following three moonshot objectives: 

1. Achieve a much higher degree of Intra-Humanity Alignment and Sensemaking
2. Enable AI-to-Human Alignment (not vice versa)
3. Establish a sustainable, ubiquitous Universal Basic Income (UBI)

The ISITometer is a polling engine formatted as a highly engaging social game, designed to collect the perspectives of Humans on the nature of Reality. It starts at the highest levels of abstraction, as represented by the ISIT Construct,  with simple, obvious questions on which we should be able to achieve unanimous agreement, and expands through fractaling derivative details. 

The ISIT Construct is a metamodern approach to the fundamental concepts of duality and polarity. Instead of relying on fanciful metaphors like Yin|Yang, Order|Chaos, and God|Devil that have evolved over the centuries in ancient religions and philosophies, the ISIT Construct establishes a new Prime Duality based on the words IS and IT.

From this starting point, the ISIT Construct provides a path to map all of Reality (as Humanity sees it) from the highest level of abstraction to as much detail as we choose to explore.



...read full abstract close

show post

: 2
Add


: 3
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
156

PROVABLY SAFE SYSTEMS: THE ONLY PATH TO CONTROLLABLE AGI

attributed to: Max Tegmark, Steve Omohundro
posted by: Tristram

We describe a path to humanity safely thriving with powerful Artificial
General Intelligences (AGIs) by buildi...

We describe a path to humanity safely thriving with powerful Artificial General Intelligences (AGIs) by building them to provably satisfy human-specified requirements. We argue that this will soon be technically feasible using advanced AI for formal verification and mechanistic interpretability. We further argue that it is the only path which guarantees safe controlled AGI. We end with a list of challenge problems whose solution would contribute to this positive outcome and invite readers to join in this work.



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
157

BOXED CENSORED SIMULATION TESTING: A META-PLAN FOR AI SAFETY WHICH SEEKS TO
ADDRESS THE 'NO RETRIES' PROBLEM

posted by: NathanHelm-Burger

This plan suggests that high-capability general AI models should be tested within a secure computing environme...

This plan suggests that high-capability general AI models should be tested within a secure computing environment (box) that is censored (no mention of humanity or computers) and highly controlled (auto-compute halts/slowdowns, restrictions on agent behavior) with simulations of alignment-relevant scenarios (e.g. with other general agents that the test subject is to be aligned to).



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
158

DEEP REINFORCEMENT LEARNING FROM HUMAN PREFERENCES

attributed to: Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane
Legg, Dario Amodei
posted by: KabirKumar

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we ne...

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems...
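
A simplified, numpy-only rendering of the preference predictor at the heart of this approach: the probability that a human prefers one trajectory segment over another is a softmax of the segments' summed predicted rewards, trained with a cross-entropy loss. The per-step reward values below are placeholders for a reward model's outputs.

import numpy as np

def segment_pref_prob(rewards_1: np.ndarray, rewards_2: np.ndarray) -> float:
    # P(segment 1 preferred) = exp(sum r1) / (exp(sum r1) + exp(sum r2))
    z = np.array([rewards_1.sum(), rewards_2.sum()])
    z = np.exp(z - z.max())                 # numerically stable softmax
    return float(z[0] / z.sum())

def pref_loss(rewards_1, rewards_2, human_chose_1: bool) -> float:
    # cross-entropy against the human's binary choice
    p1 = segment_pref_prob(rewards_1, rewards_2)
    return -np.log(p1 if human_chose_1 else 1.0 - p1)

seg1 = np.array([0.2, 0.4, 0.1])            # per-step rewards predicted for segment 1
seg2 = np.array([0.0, 0.1, 0.0])            # per-step rewards predicted for segment 2
print(pref_loss(seg1, seg2, human_chose_1=True))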



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
159

AVOIDING WIREHEADING WITH VALUE REINFORCEMENT LEARNING

attributed to: Tom Everitt, Marcus Hutter
posted by: KabirKumar

How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) is a natural appr...

How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) is a natural approach. Unfortunately, RL does not work well for generally intelligent agents, as RL agents are incentivised to shortcut the reward sensor for maximum reward -- the so-called wireheading problem. In this paper we suggest an alternative to RL called value reinforcement learning (VRL). In VRL, agents use the reward signal to learn a utility function. The VRL setup allows us to remove the incentive to wirehead by placing a constraint on the agent's actions...
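
A toy contrast of the incentives involved, not the paper's formal VRL setup: an agent that maximizes the raw reward signal prefers corrupting the sensor, while an agent that scores actions by a learned utility over world states does not, because a corrupted signal does not make the underlying state any better.

# learned utility over world states (what the reward signal is evidence about)
world_utility = {"clean_room": 1.0, "dirty_room": 0.0}

# each action leads to a world state and an observed reward signal
actions = {
    "clean":    {"state": "clean_room", "signal": 1.0},
    "wirehead": {"state": "dirty_room", "signal": 9.9},   # sensor corrupted
}

def rl_score(a):   # maximizes the raw signal -> prefers wireheading
    return actions[a]["signal"]

def vrl_score(a):  # scores the resulting world state under the learned utility
    return world_utility[actions[a]["state"]]

print(max(actions, key=rl_score), max(actions, key=vrl_score))  # 'wirehead' vs 'clean'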



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
160

SAFE MODEL-BASED MULTI-AGENT MEAN-FIELD REINFORCEMENT LEARNING

attributed to: Matej Jusup, Barna Pásztor, Tadeusz Janik, Kenan Zhang, Francesco
Corman, Andreas Krause, Ilija Bogunovic
posted by: KabirKumar

Many applications, e.g., in shared mobility, require coordinating a large number of agents. Mean-field reinfor...

Many applications, e.g., in shared mobility, require coordinating a large number of agents. Mean-field reinforcement learning addresses the resulting scalability challenge by optimizing the policy of a representative agent. In this paper, we address an important generalization where there exist global constraints on the distribution of agents (e.g., requiring capacity constraints or minimum coverage requirements to be met). We propose Safe-M^3-UCRL, the first model-based algorithm that attains safe policies even in the case of unknown transition dynamics...
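
A highly simplified sketch of one ingredient above: a mean-field policy induces a distribution of agents over locations, and global capacity constraints are enforced on that distribution (here by iterative clipping and renormalization; the actual algorithm handles constraints inside model-based planning with confidence bounds, and the numbers are illustrative).

import numpy as np

mean_field = np.array([0.55, 0.30, 0.15])   # fraction of agents per location
capacity = np.array([0.40, 0.40, 0.40])     # maximum allowed fraction per location

def enforce_capacity(dist, cap, iters=100):
    d = dist.copy()
    for _ in range(iters):
        d = np.minimum(d, cap)              # clip overloaded locations
        d = d / d.sum()                     # redistribute the clipped mass
        if np.all(d <= cap + 1e-9):
            break
    return d

print(enforce_capacity(mean_field, capacity))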



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
161

MODELING AGI SAFETY FRAMEWORKS WITH CAUSAL INFLUENCE DIAGRAMS

attributed to: Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg
posted by: KabirKumar

Proposals for safe AGI systems are typically made at the level of frameworks, specifying how the components of...

Proposals for safe AGI systems are typically made at the level of frameworks, specifying how the components of the proposed system should be trained and interact with each other. In this paper, we model and compare the most promising AGI safety frameworks using causal influence diagrams. The diagrams show the optimization objective and causal assumptions of the framework. The unified representation permits easy comparison of frameworks and their assumptions. We hope that the diagrams will serve as an accessible and visual introduction to the main AGI safety frameworks.
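
A minimal sketch of representing a framework as a causal influence diagram with plain Python data structures: nodes are tagged as chance, decision, or utility, and edges record the assumed causal and information structure. The example diagram is a bare-bones reward-modelling setup, much coarser than the paper's diagrams.

cid = {
    "nodes": {
        "UserIntent":  "chance",
        "RewardModel": "chance",
        "AgentAction": "decision",
        "Utility":     "utility",
    },
    "edges": [
        ("UserIntent", "RewardModel"),
        ("RewardModel", "AgentAction"),   # information link: the agent sees the model
        ("AgentAction", "Utility"),
        ("UserIntent", "Utility"),
    ],
}

def descendants(graph, node):
    # list what a node can causally influence by following edges forward
    out, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for a, b in graph["edges"]:
            if a == n and b not in out:
                out.add(b)
                frontier.append(b)
    return out

print(descendants(cid, "AgentAction"))   # {'Utility'}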



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
162

LOVE IN A SIMBOX IS ALL YOU NEED

attributed to: Jacob Cannell
posted by: KabirKumar

We can develop self-aligning DL based AGI by improving on the brain's dynamic alignment mechanisms (empathy/al...

We can develop self-aligning DL based AGI by improving on the brain's dynamic alignment mechanisms (empathy/altruism/love) via safe test iteration in simulation sandboxes.



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
163

ACTOR-NETWORK THEORY IN PARTICIPATORY DESIGN ON CREATING ETHICAL AND INCLUSIVE
AI PROTOTYPES THROUGH STAKEHOLDER ENGAGEMENT

attributed to: For Publication in the Upcoming Responsible Tech Iterations
Guide, Maira Elahi
posted by: (anon)

This alignment plan focuses on the integration of stakeholders in participatory design and prototyping/iterati...

This alignment plan focuses on integrating stakeholders into the participatory design and prototyping/iteration stages of AI development. Participatory design informed by actor-network theory (ANT) ensures that AI systems reflect stakeholder values and address their concerns. Prototyping and iteration involve developing early versions of AI systems based on stakeholder input and refining them through iterative feedback sessions. This approach promotes inclusion by incorporating diversity, addressing biases, and enhancing system performance.



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
164

RELAXED ADVERSARIAL TRAINING FOR INNER ALIGNMENT

attributed to: Evan Hubinger
posted by: KabirKumar

"This post is part of research I did at OpenAI with mentoring and guidance from Paul Christiano. It also repre...

"This post is part of research I did at OpenAI with mentoring and guidance from Paul Christiano. It also represents my current agenda regarding what I believe looks like the most promising approach for addressing inner alignment. " - Evan Hubinger



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
165

THE CASE FOR ALIGNING NARROWLY SUPERHUMAN MODELS

attributed to: Ajeya Cotra
posted by: KabirKumar

An overview and review of the case for aligning narrowly superhuman models.

An overview and review of the case for aligning narrowly superhuman models.



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
166

ELICITING LATENT KNOWLEDGE: HOW TO TELL IF YOUR EYES DECEIVE YOU BY PAUL
CHRISTIANO, AJEYA COTRA, AND MARK XU

posted by: (anon)

ELK stands for Eliciting Latent Knowledge. ELK seems to capture a core difficulty in alignment.  The short des...

ELK stands for Eliciting Latent Knowledge. ELK seems to capture a core difficulty in alignment. In short, the problem is that we don't have surefire ways to understand the beliefs of the models and systems we train, so if we're ever in a situation where our systems know things that we don't, we can't be sure that we can recover that information.



...read full abstract close

show post

: 0
Add


: 1
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
167

SCALABLE AGENT ALIGNMENT VIA REWARD MODELING: A RESEARCH DIRECTION

attributed to: Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal
Maini, Shane Legg
posted by: KabirKumar

One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable rewa...

One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning...
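
A structural sketch of the outer loop this research direction proposes, with hypothetical stand-in components (a real system would use a learned reward network and an RL algorithm in place of the trivial functions below).

def collect_trajectories(policy, n=4):
    return [policy() for _ in range(n)]

def ask_user_for_feedback(trajectories):
    # stand-in: the user labels each trajectory with a score
    return [(traj, float(len(traj))) for traj in trajectories]

def fit_reward_model(labelled):
    # stand-in: "reward" of any trajectory is the average user score seen so far
    avg = sum(score for _, score in labelled) / len(labelled)
    return lambda traj: avg

def improve_policy(policy, reward_model):
    return policy                      # stand-in: one RL step against the learned reward

policy = lambda: [0, 1, 2]             # dummy policy producing a trajectory
for iteration in range(3):             # alternate feedback collection, modelling, and RL
    trajs = collect_trajectories(policy)
    labelled = ask_user_for_feedback(trajs)
    reward_model = fit_reward_model(labelled)
    policy = improve_policy(policy, reward_model)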



...read full abstract close

show post

: 0
Add


: 2
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
168

GATO FRAMEWORK: GLOBAL ALIGNMENT TAXONOMY OMNIBUS FRAMEWORK

attributed to: David Shapiro and GATO Team
posted by: Diabloto96

The GATO Framework serves as a pioneering, multi-layered, and decentralized blueprint for addressing the cruci...

The GATO Framework serves as a pioneering, multi-layered, and decentralized blueprint for addressing the crucial issues of AI alignment and the control problem. It is designed to circumvent potential cataclysms and actively construct a future utopia. By embedding axiomatic principles within AI systems and facilitating the formation of independent, globally distributed groups, the framework weaves a cooperative network, empowering each participant to drive towards a beneficial consensus. From model alignment to global consensus, GATO envisions a path where advanced technologies not only avoid harm but actively contribute to an unprecedented era of prosperity, understanding, and reduced suffering.



...read full abstract close

show post

: 0
Add


: 2
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
169

USING CONSENSUS MECHANISMS AS AN APPROACH TO ALIGNMENT

attributed to: Prometheus
posted by: Prometheus

Using Mechanism Design and forms of Technical Governance to approach alignment from a different angle, trying ...

Using Mechanism Design and forms of Technical Governance to approach alignment from a different angle, trying to create stable equilibria that can scale as AI intelligence and proliferation escalate, with safety mechanisms and aligned objectives built into the greater network.



...read full abstract close

show post

: 0
Add


: 2
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
170

HIGH-LEVEL INTERPRETABILITY

posted by: (anon)

Very broadly speaking, high-level interpretability involves taking some high-level aspect of AI systems that w...

Very broadly speaking, high-level interpretability involves taking some high-level aspect of AI systems whose mechanistic properties would be especially useful to understand within a particular model, improving our conceptual understanding of it, and then undertaking highly targeted interpretability research toward it.



...read full abstract close

show post

: 0
Add


: 2
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
171

A GENERAL LANGUAGE ASSISTANT AS A LABORATORY FOR ALIGNMENT

attributed to: Anthropic (Full Author list in Full Plan- click title to view)
posted by: KabirKumar

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose...

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. ... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 3
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
172

OPEN AGENCY ARCHITECTURE

attributed to: Davidad
posted by: KabirKumar

Utilize near-AGIs to build a detailed world simulation, train and formally verify within it that the AI adhere...

Utilize near-AGIs to build a detailed world simulation, train and formally verify within it that the AI adheres to coarse preferences and avoids catastrophic outcomes.



...read full abstract close

show post

: 0
Add


: 3
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
173

AI ALIGNMENT METRIC - LIFE (EXTENDED DEFINITION)

attributed to: Mars Robertson 🌱 Planetary Council
posted by: Mars

This has been posted on my blog: https://mirror.xyz/0x315f80C7cAaCBE7Fb1c14E65A634db89A33A9637/ETK6RXnmgeNcALa...

This has been posted on my blog: https://mirror.xyz/0x315f80C7cAaCBE7Fb1c14E65A634db89A33A9637/ETK6RXnmgeNcALabcIE3k3-d-NqOHqEj8dU1_0J6cUg ➡️➡️➡️check it out for better formatting⬅️⬅️⬅️

TLDR summary, extended definition of LIFE:

1. LIFE (starting point and then extending the definition)
2. Health, including mental health, longevity, happiness, wellbeing
3. Other living creatures, biosphere, environment, climate change
4. AI safety
5. Mars: backup civilisation is fully aligned with the virtue of LIFE preservation
6. End the Russia-Ukraine war, global peace
7. Artificial LIFE
8. Transhumanism, AI integration
9. Alien LIFE
10. Other undiscovered forms of LIFE



...read full abstract close

show post

: 0
Add


: 4
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength
174

ENABLING ROBOTS TO COMMUNICATE THEIR OBJECTIVES

attributed to: Sandy H. Huang, David Held, Pieter Abbeel, Anca D. Dragan
posted by: KabirKumar

The overarching goal of this work is to efficiently enable end-users to correctly anticipate a robot's behavio...

The overarching goal of this work is to efficiently enable end-users to correctly anticipate a robot's behavior in novel situations. Since a robot's behavior is often a direct result of its underlying objective function, our insight is that end-users need to have an accurate mental model of this objective function in order to understand and predict what the robot will do. While people naturally develop such a mental model over time through observing the robot act, this familiarization process may be lengthy... (Full Abstract in Full Plan- click title to view)



...read full abstract close

show post

: 0
Add


: 7
Add
↓ critiques ↓
▼ strengths and vulnerabilities
add vulnerability / strength


Hello, welcome to AI-Plans.com

This is an open platform for AI alignment plans and a living peer review of
their strengths and vulnerabilities.

You can browse and search the library of alignment plans for research relevant
to problems you care about. If you register an account, you can share feedback
and add your own plans.

Feedback can be marked as a Strength or a Vulnerability.

If you have several separate ideas, it is better to submit them as individual
Strengths or Vulnerabilities, since that will allow other users to consider each
of your ideas separately.

For more information, see the Substack

Thank you for being here!
