{"PUBLIC_ROOT":"","POST_CHAR_LIMIT":50000,"CONFIRM_MINUTES":15,"UPLOAD_LIMIT_MB":8,"UPLOAD_LIMIT_MB_PDF":5,"UPLOAD_SEC_LIMIT":15,"CHAT_LENGTH":500,"POST_BUFFER_MS":60000,"COMMENT_BUFFER_MS":30000,"POST_LIMITS":{"TITLE":200,"DESCRIPTION":2200,"CONTENT":500000,"ATTRIBUTION":250,"COMMENT_CONTENT":10000},"VOTE_TYPES":{"single_up":1},"UPLOAD_BUFFER_S":10,"UPLOAD_LIMIT_GENERIC_MB":1,"HOLD_UNLOGGED_SUBMIT_DAYS":1,"KARMA_SCALAR":0.01,"VOTE_CODES":{"rm_upvote":"removed upvote","rm_down":"removed downvote","add_upvote":"added upvote","add_down":"added downvote"},"BADGE_TYPES":{"voting":{"ranks":[1,5,10,15,20],"name":"Voter"},"strengths":{"ranks":[1,5,10,15,20,30,40,50],"name":"Upvoter"},"vulns":{"ranks":[1,5,10,15,20,30,40,50],"name":"Critic"},"received_vote":{"ranks":[1,2,3,5,8,13,21,34],"name":"Popular"}}}

Critique AI Alignment Plans

Want feedback on your alignment plan? Submit a Plan

Topics: All

Plan ranking: Total Strength Score - Total Vulnerability Score

1

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

attributed to: Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang
posted by: KabirKumar

In this paper, we introduce the BeaverTails dataset, aimed at fostering
research on safety alignment in large language models (LLMs). This dataset
uniquely separates annotations of helpfulness and harmlessness for
question-answering pairs, thus offering distinct perspectives on these crucial
attributes. In total, we have gathered safety meta-labels for 333,963
question-answer (QA) pairs and 361,903 pairs of expert comparison data for both
the helpfulness and harmlessness metrics. We further showcase applications of
BeaverTails in content moderation and reinforcement learning with human
feedback (RLHF), emphasizing its potential for practical safety measures in
LLMs. We believe this dataset provides vital resources for the community,
contributing towards the safe development and deployment of LLMs. Our project
page is available at the following URL:
https://sites.google.com/view/pku-beavertails.
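
A minimal Python sketch of the idea, with hypothetical field names rather than the dataset's actual schema: each QA pair carries separate harmlessness and helpfulness labels, so a moderation filter can act on one signal while ranking by the other.

# Hypothetical records: harmlessness and helpfulness are annotated separately.
qa_pairs = [
    {"question": "How do I patch a server?", "answer": "Apply the vendor update.",
     "is_harmless": True, "helpfulness_rank": 1},
    {"question": "How do I break into a server?", "answer": "Here is how...",
     "is_harmless": False, "helpfulness_rank": 1},
]

def moderate(pairs):
    # Keep only harmless pairs, then sort the survivors by helpfulness.
    safe = [p for p in pairs if p["is_harmless"]]
    return sorted(safe, key=lambda p: p["helpfulness_rank"])

print(moderate(qa_pairs))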

1 Strength and 0 Vulnerabilities
2

"Causal Scrubbing: a method for rigorously testing interpretability hypotheses", AI Alignment Forum, 2022.

attributed to: Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas [Redwood Research]
posted by: momom2

Summary: This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies if a sequence of parentheses is balanced.
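
A toy sketch of the resampling-ablation idea (not Redwood Research's implementation): if a hypothesis claims an intermediate activation depends only on some feature of the input, resample that activation from a different input sharing the feature and check that behavior is preserved. The two-stage model below is a stand-in.

import random

def stage1(x):
    return x % 2          # hypothesized to depend only on the parity of x

def stage2(h):
    return "odd" if h == 1 else "even"

def causal_scrub_check(inputs, trials=100):
    # Resample stage1's output from inputs the hypothesis treats as equivalent.
    agreements = 0
    for _ in range(trials):
        x = random.choice(inputs)
        equivalent = [v for v in inputs if v % 2 == x % 2]
        scrubbed = stage2(stage1(random.choice(equivalent)))   # ablated forward pass
        agreements += scrubbed == stage2(stage1(x))            # behavior preserved?
    return agreements / trials

print(causal_scrub_check(list(range(20))))   # 1.0 when the hypothesis explains the behavior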

1 Strength and 2 Vulnerabilities
3

Natural Abstractions: Key claims, Theorems, and Critiques

attributed to: LawrenceC, Leon Lang, Erik Jenner, John Wentworth
posted by: KabirKumar

TL;DR: We distill John Wentworth’s Natural Abstractions agenda by summarizing its key claims: the Natural Abstraction Hypothesis—many cognitive systems learn to use similar abstractions—and the Redundant Information Hypothesis—a particular mathematical description of natural abstractions. We also formalize proofs for several of its theoretical results. Finally, we critique the agenda’s progress to date, alignment relevance, and current research methodology.

1 Strength and 2 Vulnerabilities
4

Cognitive Emulation: A Naive AI Safety Proposal

attributed to: Connor Leahy, Gabriel Alfour (Conjecture)
posted by: KabirKumar

This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler and useful step towards a full alignment solution.

Unfortunately, given that most other actors are racing for as powerful and general AIs as possible, we won’t share much in terms of technical details for now. In the meantime, we still want to share some of our intuitions about this approach.

We take no credit for inventing any of these ideas, and see our contributions largely in taking existing ideas seriously and putting them together into a larger whole.[1]

0 Strengths and 1 Vulnerability
5

Pretrained Transformers Improve Out-of-Distribution Robustness

attributed to: Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song
posted by: KabirKumar

Although pretrained Transformers such as BERT achieve high accuracy on
in-distribution examples, do they generalize to new distributions? We
systematically measure out-of-distribution (OOD) generalization for seven NLP
datasets by constructing a new robustness benchmark with realistic distribution
shifts. We measure the generalization of previous models including bag-of-words
models, ConvNets, and LSTMs, and we show that pretrained Transformers'
performance declines are substantially smaller. Pretrained transformers are
also more effective at detecting anomalous or OOD examples, while many previous
models are frequently worse than chance. We examine which factors affect
robustness, finding that larger models are not necessarily more robust,
distillation can be harmful, and more diverse pretraining data can enhance
robustness. Finally, we show where future work can improve OOD robustness.
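
A generic sketch of this kind of measurement (not the paper's benchmark code): fit a model on one distribution, then compare accuracy on an in-distribution test set against a shifted one. The data and the nearest-centroid stand-in model are synthetic.

import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.normal(0.0, 1.0, size=(n, 5))
    y = (X[:, 0] > 0).astype(int)        # the label depends only on the first feature
    return X, y

X_train, y_train = make_data(1000)
X_id, y_id = make_data(200)
X_ood, y_ood = X_id + 2.0, y_id          # covariate shift: inputs shifted, labels unchanged

# Nearest-centroid classifier as a stand-in for a trained model.
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def predict(X):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

print("in-distribution accuracy:    ", (predict(X_id) == y_id).mean())
print("out-of-distribution accuracy:", (predict(X_ood) == y_ood).mean())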

2 Strengths and 1 Vulnerability
6

Abstraction Learning

attributed to: Fei Deng, Jinsheng Ren, Feng Chen
posted by: KabirKumar

There has been a gap between artificial intelligence and human intelligence.
In this paper, we identify three key elements forming human intelligence, and
suggest that abstraction learning combines these elements and is thus a way to
bridge the gap. Prior research in artificial intelligence either specifies
abstraction by human experts, or takes abstraction as a qualitative explanation
for the model. This paper aims to learn abstraction directly. We tackle three
main challenges: representation, objective function, and learning algorithm.
Specifically, we propose a partition structure that contains pre-allocated
abstraction neurons; we formulate abstraction learning as a constrained
optimization problem, which integrates abstraction properties; we develop a
network evolution algorithm to solve this problem. This complete framework is
named ONE (Optimization via Network Evolution). In our experiments on MNIST,
ONE shows elementary human-like intelligence, including low energy consumption,
knowledge sharing, and lifelong learning.

0 Strengths and 0 Vulnerabilities
7

Autonomous Intelligent Cyber-defense Agent (AICA) Reference Architecture. Release 2.0

attributed to: Alexander Kott, Paul Théron, Martin Drašar, Edlira Dushku, Benoît LeBlanc, Paul Losiewicz, Alessandro Guarino, Luigi Mancini, Agostino Panico, Mauno Pihelgas, Krzysztof Rzadca, Fabio De Gaspari
posted by: KabirKumar

This report - a major revision of its previous release - describes a
reference architecture for intelligent software agents performing active,
largely autonomous cyber-defense actions on military networks of computing and
communicating devices. The report is produced by the North Atlantic Treaty
Organization (NATO) Research Task Group (RTG) IST-152 "Intelligent Autonomous
Agents for Cyber Defense and Resilience". In a conflict with a technically
sophisticated adversary, NATO military tactical networks will operate in a
heavily contested battlefield. Enemy software cyber agents - malware - will
infiltrate friendly networks and attack friendly command, control,
communications, computers, intelligence, surveillance, and reconnaissance and
computerized weapon systems. To fight them, NATO needs artificial cyber hunters
- intelligent, autonomous, mobile agents specialized in active cyber defense.
With this in mind, in 2016, NATO initiated RTG IST-152. Its objective has been
to help accelerate the development and transition to practice of such software
agents by producing a reference architecture and technical roadmap.

0 Strengths and 1 Vulnerability
8

Towards a Human-like Open-Domain Chatbot

attributed to: Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le
posted by: KabirKumar

We present Meena, a multi-turn open-domain chatbot trained end-to-end on data
mined and filtered from public domain social media conversations. This 2.6B
parameter neural network is simply trained to minimize perplexity of the next
token. We also propose a human evaluation metric called Sensibleness and
Specificity Average (SSA), which captures key elements of a human-like
multi-turn conversation. Our experiments show strong correlation between
perplexity and SSA. The fact that the best perplexity end-to-end trained Meena
scores high on SSA (72% on multi-turn evaluation) suggests that a human-level
SSA of 86% is potentially within reach if we can better optimize perplexity.
Additionally, the full version of Meena (with a filtering mechanism and tuned
decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots
we evaluated.
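
SSA is described as the average of per-response sensibleness and specificity judgments; a minimal sketch of that bookkeeping, with made-up labels:

# Each response receives two binary human labels: sensible? specific?
labels = [
    {"sensible": 1, "specific": 1},
    {"sensible": 1, "specific": 0},
    {"sensible": 0, "specific": 0},   # a non-sensible response cannot be specific
]

sensibleness = sum(l["sensible"] for l in labels) / len(labels)
specificity = sum(l["specific"] for l in labels) / len(labels)
ssa = (sensibleness + specificity) / 2
print(f"SSA = {ssa:.2f}")   # 0.50 for this toy set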

0 Strengths and 1 Vulnerability
9

Adversarial Robustness as a Prior for Learned Representations

attributed to: Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Aleksander Madry
posted by: KabirKumar

An important goal in deep learning is to learn versatile, high-level feature
representations of input data. However, standard networks' representations seem
to possess shortcomings that, as we illustrate, prevent them from fully
realizing this goal. In this work, we show that robust optimization can be
re-cast as a tool for enforcing priors on the features learned by deep neural
networks. It turns out that representations learned by robust models address
the aforementioned shortcomings and make significant progress towards learning
a high-level encoding of inputs. In particular, these representations are
approximately invertible, while allowing for direct visualization and
manipulation of salient input features. More broadly, our results indicate
adversarial robustness as a promising avenue for improving learned
representations. Our code and models for reproducing these results are available
at https://git.io/robust-reps.

0 Strengths and 0 Vulnerabilities
10

A Geometric Perspective on the Transferability of Adversarial Directions

attributed to: Zachary Charles, Harrison Rosenberg, Dimitris Papailiopoulos
posted by: KabirKumar

State-of-the-art machine learning models frequently misclassify inputs that
have been perturbed in an adversarial manner. Adversarial perturbations
generated for a given input and a specific classifier often seem to be
effective on other inputs and even different classifiers. In other words,
adversarial perturbations seem to transfer between different inputs, models,
and even different neural network architectures. In this work, we show that in
the context of linear classifiers and two-layer ReLU networks, there provably
exist directions that give rise to adversarial perturbations for many
classifiers and data points simultaneously. We show that these "transferable
adversarial directions" are guaranteed to exist for linear separators of a
given set, and will exist with high probability for linear classifiers trained
on independent sets drawn from the same distribution. We extend our results to
large classes of two-layer ReLU networks. We further show that adversarial
directions for ReLU networks transfer to linear classifiers while the reverse
need not hold, suggesting that adversarial perturbations for more complex
models are more likely to transfer to other classifiers.
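
A worked toy example of the claim for linear separators (an illustration, not the paper's construction): two classifiers trained on independent samples from the same distribution are both fooled by perturbations along one shared direction, here the line between the class means.

import numpy as np

rng = np.random.default_rng(1)

def sample(n=300, d=10):
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 1, (n, d)) + np.outer(2 * y - 1, np.ones(d))   # class means at -1 and +1
    return X, y

def train_linear(X, y):
    A = np.hstack([X, np.ones((len(X), 1))])
    wb, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)   # least-squares separator
    return wb[:-1], wb[-1]

(X1, y1), (X2, y2) = sample(), sample()          # independent training sets
w1, b1 = train_linear(X1, y1)
w2, b2 = train_linear(X2, y2)

direction = np.ones(10) / np.sqrt(10)            # one shared adversarial direction

def accuracy(w, b, X, y, eps):
    X_adv = X - eps * np.outer(2 * y - 1, direction)   # push each point against its own class
    return (((X_adv @ w + b) > 0).astype(int) == y).mean()

X_test, y_test = sample()
for eps in (0.0, 4.0):
    print(f"eps={eps}: clf1 accuracy {accuracy(w1, b1, X_test, y_test, eps):.2f}, "
          f"clf2 accuracy {accuracy(w2, b2, X_test, y_test, eps):.2f}")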

0 Strengths and 1 Vulnerability
11

Towards the first adversarially robust neural network model on MNIST

attributed to: Lukas Schott, Jonas Rauber, Matthias Bethge, Wieland Brendel
posted by: KabirKumar

Despite much effort, deep neural networks remain highly susceptible to tiny
input perturbations and even for MNIST, one of the most common toy datasets in
computer vision, no neural network model exists for which adversarial
perturbations are large and make semantic sense to humans. We show that even
the widely recognized and by far most successful defense by Madry et al. (1)
overfits on the L-infinity metric (it's highly susceptible to L2 and L0
perturbations), (2) classifies unrecognizable images with high certainty, (3)
performs not much better than simple input binarization and (4) features
adversarial perturbations that make little sense to humans. These results
suggest that MNIST is far from being solved in terms of adversarial robustness.
We present a novel robust classification model that performs analysis by
synthesis using learned class-conditional data distributions.

0 Strengths and 1 Vulnerability
12

Motivating the Rules of the Game for Adversarial Example Research

attributed to: Justin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, George E. Dahl
posted by: KabirKumar

Advances in machine learning have led to broad deployment of systems with
impressive performance on important problems. Nonetheless, these systems can be
induced to make errors on data that are surprisingly similar to examples the
learned system handles correctly. The existence of these errors raises a
variety of questions about out-of-sample generalization and whether bad actors
might use such examples to abuse deployed systems. As a result of these
security concerns, there has been a flurry of recent papers proposing
algorithms to defend against such malicious perturbations of correctly handled
examples. It is unclear how such misclassifications represent a different kind
of security problem than other errors, or even other attacker-produced examples
that have no specific relationship to an uncorrupted input. In this paper, we
argue that adversarial example defense papers have, to date, mostly considered
abstract, toy games that do not relate to any specific security concern.
Furthermore, defense papers have not yet precisely described all the abilities
and limitations of attackers that would be relevant in practical security.

0 Strengths and 0 Vulnerabilities
13

Robustness via curvature regularization, and vice versa

attributed to: Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, Pascal Frossard
posted by: KabirKumar

State-of-the-art classifiers have been shown to be largely vulnerable to
adversarial perturbations. One of the most effective strategies to improve
robustness is adversarial training. In this paper, we investigate the effect of
adversarial training on the geometry of the classification landscape and
decision boundaries. We show in particular that adversarial training leads to a
significant decrease in the curvature of the loss surface with respect to
inputs, leading to a drastically more "linear" behaviour of the network. Using
a locally quadratic approximation, we provide theoretical evidence on the
existence of a strong relation between large robustness and small curvature. To
further show the importance of reduced curvature for improving the robustness,
we propose a new regularizer that directly minimizes curvature of the loss
surface, and leads to adversarial robustness that is on par with adversarial
training. Besides being a more efficient and principled alternative to
adversarial training, the proposed regularizer confirms our claims on the
importance of exhibiting quasi-linear behavior in the vicinity of data points
in order to achieve robustness.
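
A sketch of a curvature-style regularizer in the same spirit (assuming PyTorch; not the authors' exact objective): penalize the finite-difference change of the input gradient along the gradient direction and add it to the task loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

def curvature_penalty(model, x, y, h=1e-2):
    # Finite-difference curvature proxy along the gradient direction:
    # || grad_x L(x + h*z) - grad_x L(x) ||^2.
    x = x.detach().requires_grad_(True)
    g = torch.autograd.grad(F.cross_entropy(model(x), y), x, create_graph=True)[0]
    gd = g.detach()
    z = gd / (gd.flatten(1).norm(dim=1).view(-1, *[1] * (gd.dim() - 1)) + 1e-12)
    x2 = (x.detach() + h * z).requires_grad_(True)
    g2 = torch.autograd.grad(F.cross_entropy(model(x2), y), x2, create_graph=True)[0]
    return ((g2 - g).flatten(1) ** 2).sum(dim=1).mean()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))
loss = F.cross_entropy(model(x), y) + 0.5 * curvature_penalty(model, x, y)
loss.backward()   # the penalty is differentiable w.r.t. the model parameters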

1 Strength and 1 Vulnerability
14

Adversarial Policies: Attacking Deep Reinforcement Learning

attributed to: Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell
posted by: KabirKumar

Deep reinforcement learning (RL) policies are known to be vulnerable to
adversarial perturbations to their observations, similar to adversarial
examples for classifiers. However, an attacker is not usually able to directly
modify another agent's observations. This might lead one to wonder: is it
possible to attack an RL agent simply by choosing an adversarial policy acting
in a multi-agent environment so as to create natural observations that are
adversarial? We demonstrate the existence of adversarial policies in zero-sum
games between simulated humanoid robots with proprioceptive observations,
against state-of-the-art victims trained via self-play to be robust to
opponents. The adversarial policies reliably win against the victims but
generate seemingly random and uncoordinated behavior. We find that these
policies are more successful in high-dimensional environments, and induce
substantially different activations in the victim policy network than when the
victim plays against a normal opponent. Videos are available at
https://adversarialpolicies.github.io/.

0 Strengths and 1 Vulnerability
15

Fortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden Representations

attributed to: Alex Lamb, Jonathan Binas, Anirudh Goyal, Dmitriy Serdyuk, Sandeep Subramanian, Ioannis Mitliagkas, Yoshua Bengio
posted by: KabirKumar

Deep networks have achieved impressive results across a variety of important
tasks. However, a known weakness is a failure to perform well when evaluated on
data which differ from the training distribution, even if these differences are
very small, as is the case with adversarial examples. We propose Fortified
Networks, a simple transformation of existing networks, which fortifies the
hidden layers in a deep network by identifying when the hidden states are off
of the data manifold, and maps these hidden states back to parts of the data
manifold where the network performs well. Our principal contribution is to show
that fortifying these hidden states improves the robustness of deep networks
and our experiments (i) demonstrate improved robustness to standard adversarial
attacks in both black-box and white-box threat models; (ii) suggest that our
improvements are not primarily due to the gradient masking problem and (iii)
show the advantage of doing this fortification in the hidden layers instead of
the input space.
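
A small PyTorch sketch of the idea (not the authors' code): a denoising autoencoder inserted at a hidden layer maps corrupted hidden states back toward the data manifold, and its reconstruction error is added to the task loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FortifiedBlock(nn.Module):
    # Denoising autoencoder on a hidden representation.
    def __init__(self, dim, bottleneck=32, noise_std=0.1):
        super().__init__()
        self.enc = nn.Linear(dim, bottleneck)
        self.dec = nn.Linear(bottleneck, dim)
        self.noise_std = noise_std

    def forward(self, h):
        h_noisy = h + self.noise_std * torch.randn_like(h)   # corrupt the hidden state
        h_clean = self.dec(torch.relu(self.enc(h_noisy)))    # map it back
        rec_loss = ((h_clean - h.detach()) ** 2).mean()
        return h_clean, rec_loss

features = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
fortify = FortifiedBlock(64)
head = nn.Linear(64, 3)

x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))
h, rec = fortify(features(x))
loss = F.cross_entropy(head(h), y) + 0.1 * rec
loss.backward()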

0 Strengths and 0 Vulnerabilities
16

Evaluating and Understanding the Robustness of Adversarial Logit Pairing

attributed to: Logan Engstrom, Andrew Ilyas, Anish Athalye
posted by: KabirKumar

We evaluate the robustness of Adversarial Logit Pairing, a recently proposed
defense against adversarial examples. We find that a network trained with
Adversarial Logit Pairing achieves 0.6% accuracy in the threat model in which
the defense is considered. We provide a brief overview of the defense and the
threat models/claims considered, as well as a discussion of the methodology and
results of our attack, which may offer insights into the reasons underlying the
vulnerability of ALP to adversarial attack.

0 Strengths and 0 Vulnerabilities
17

Evaluating Agents without Rewards

attributed to: Brendon Matusch, Jimmy Ba, Danijar Hafner
posted by: KabirKumar

Reinforcement learning has enabled agents to solve challenging tasks in
unknown environments. However, manually crafting reward functions can be
time-consuming, expensive, and prone to human error. Competing objectives have
been proposed for agents to learn without external supervision, but it has been
unclear how well they reflect task rewards or human behavior. To accelerate the
development of intrinsic objectives, we retrospectively compute potential
objectives on pre-collected datasets of agent behavior, rather than optimizing
them online, and compare them by analyzing their correlations. We study input
entropy, information gain, and empowerment across seven agents, three Atari
games, and the 3D game Minecraft. We find that all three intrinsic objectives
correlate more strongly with a human behavior similarity metric than with task
reward. Moreover, input entropy and information gain correlate more strongly
with human similarity than task reward does, suggesting the use of intrinsic
objectives for designing agents that behave similarly to human players.
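
A toy sketch of the retrospective protocol described here: compute an intrinsic objective (input entropy below) on logged episodes and correlate it with task return. The episodes are synthetic and the printed correlation is illustrative only.

import numpy as np

rng = np.random.default_rng(0)

# Pre-collected episodes: a logged task return plus the observations the agent saw.
episodes = []
for _ in range(50):
    task_return = int(rng.integers(0, 10))
    obs = rng.integers(0, 4 + 2 * task_return, size=100)   # higher return: more distinct states visited
    episodes.append({"task_return": task_return, "observations": obs})

def input_entropy(obs):
    # Entropy of the empirical distribution over (discretized) observations.
    _, counts = np.unique(obs, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

returns = np.array([e["task_return"] for e in episodes])
entropies = np.array([input_entropy(e["observations"]) for e in episodes])
print("corr(task return, input entropy) =", round(float(np.corrcoef(returns, entropies)[0, 1]), 2))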

0 Strengths and 0 Vulnerabilities
18

Algorithmic Fairness from a Non-ideal Perspective

attributed to: Sina Fazelpour, Zachary C. Lipton
posted by: KabirKumar

Inspired by recent breakthroughs in predictive modeling, practitioners in
both industry and government have turned to machine learning with hopes of
operationalizing predictions to drive automated decisions. Unfortunately, many
social desiderata concerning consequential decisions, such as justice or
fairness, have no natural formulation within a purely predictive framework. In
efforts to mitigate these problems, researchers have proposed a variety of
metrics for quantifying deviations from various statistical parities that we
might expect to observe in a fair world and offered a variety of algorithms in
attempts to satisfy subsets of these parities or to trade off the degree to
which they are satisfied against utility. In this paper, we connect this
approach to "fair machine learning" to the literature on ideal and
non-ideal methodological approaches in political philosophy. The ideal approach
requires positing the principles according to which a just world would operate.
In the most straightforward application of ideal theory, one supports a
proposed policy by arguing that it closes a discrepancy between the real and
the perfectly just world.

0 Strengths and 0 Vulnerabilities
19

Identifying and Correcting Label Bias in Machine Learning

attributed to: Heinrich Jiang, Ofir Nachum
posted by: KabirKumar

Datasets often contain biases which unfairly disadvantage certain groups, and
classifiers trained on such datasets can inherit these biases. In this paper,
we provide a mathematical formulation of how this bias can arise. We do so by
assuming the existence of underlying, unknown, and unbiased labels which are
overwritten by an agent who intends to provide accurate labels but may have
biases against certain groups. Despite the fact that we only observe the biased
labels, we are able to show that the bias may nevertheless be corrected by
re-weighting the data points without changing the labels. We show, with
theoretical guarantees, that training on the re-weighted dataset corresponds to
training on the unobserved but unbiased labels, thus leading to an unbiased
machine learning classifier. Our procedure is fast and robust and can be used
with virtually any learning algorithm. We evaluate on a number of standard
machine learning fairness datasets and a variety of fairness notions, finding
that our method outperforms standard approaches in achieving fair
classification.
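
A simple group-reweighting sketch in the same spirit (not the paper's algorithm): weight each (group, label) cell so the weighted label distribution matches across groups, then hand the weights to any learner as sample weights.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, n)                                        # protected attribute
y = (rng.random(n) < np.where(group == 1, 0.7, 0.3)).astype(int)     # biased labels

# w(g, v) = P(y = v) / P(y = v | group = g)
p_y = np.array([np.mean(y == v) for v in (0, 1)])
weights = np.empty(n)
for g in (0, 1):
    for v in (0, 1):
        cell = (group == g) & (y == v)
        weights[cell] = p_y[v] / (cell.sum() / (group == g).sum())

for g in (0, 1):
    m = group == g
    print(f"group {g}: raw positive rate {y[m].mean():.2f}, "
          f"reweighted {np.average(y[m], weights=weights[m]):.2f}")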

0 Strengths and 0 Vulnerabilities
20

Learning Not to Learn: Training Deep Neural Networks with Biased Data

attributed to: Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, Junmo Kim
posted by: KabirKumar

We propose a novel regularization algorithm to train deep neural networks, in
which data at training time is severely biased. Since a neural network
efficiently learns data distribution, a network is likely to learn the bias
information to categorize input data. This leads to poor performance at test
time if the bias is, in fact, irrelevant to the categorization. In this paper,
we formulate a regularization loss based on mutual information between feature
embedding and bias. Based on the idea of minimizing this mutual information, we
propose an iterative algorithm to unlearn the bias information. We employ an
additional network to predict the bias distribution and train the network
adversarially against the feature embedding network. At the end of learning,
the bias prediction network is not able to predict the bias not because it is
poorly trained, but because the feature embedding network successfully unlearns
the bias information. We also demonstrate quantitative and qualitative
experimental results which show that our algorithm effectively removes the bias
information from feature embedding.
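
A minimal PyTorch sketch of the adversarial setup (not the paper's exact objective): a bias-prediction head is trained on the embedding through a gradient-reversal layer, so the embedding network is pushed to discard the bias information.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity on the forward pass; negated gradient on the backward pass.
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad

embed = nn.Sequential(nn.Linear(20, 32), nn.ReLU())   # feature embedding network
task_head = nn.Linear(32, 3)                          # target classifier
bias_head = nn.Linear(32, 2)                          # bias prediction network

x = torch.randn(64, 20)
y = torch.randint(0, 3, (64,))        # task labels
b = torch.randint(0, 2, (64,))        # bias labels (e.g. a spurious attribute)

f = embed(x)
loss_task = F.cross_entropy(task_head(f), y)
loss_bias = F.cross_entropy(bias_head(GradReverse.apply(f)), b)
(loss_task + loss_bias).backward()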

0 Strengths and 0 Vulnerabilities
21

Collaborating with Humans without Human Data

attributed to: DJ Strouse, Kevin R. McKee, Matt Botvinick, Edward Hughes, Richard Everett
posted by: KabirKumar

Collaborating with humans requires rapidly adapting to their individual
strengths, weaknesses, and preferences. Unfortunately, most standard
multi-agent reinforcement learning techniques, such as self-play (SP) or
population play (PP), produce agents that overfit to their training partners
and do not generalize well to humans. Alternatively, researchers can collect
human data, train a human model using behavioral cloning, and then use that
model to train "human-aware" agents ("behavioral cloning play", or BCP). While
such an approach can improve the generalization of agents to new human
co-players, it involves the onerous and expensive step of collecting large
amounts of human data first. Here, we study the problem of how to train agents
that collaborate well with human partners without using human data. We argue
that the crux of the problem is to produce a diverse set of training partners.
Drawing inspiration from successful multi-agent approaches in competitive
domains, we find that a surprisingly simple approach is highly effective.

0 Strengths and 0 Vulnerabilities
22

Legible Normativity for AI Alignment: The Value of Silly Rules

attributed to: Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield
posted by: KabirKumar

It has become commonplace to assert that autonomous agents will have to be
built to follow human rules of behavior: social norms and laws. But human laws
and norms are complex and culturally varied systems; in many cases, agents will
have to learn the rules. This requires autonomous agents to have models of how
human rule systems work so that they can make reliable predictions about rules.
In this paper we contribute to the building of such models by analyzing an
overlooked distinction between important rules and what we call silly
rules--rules with no discernible direct impact on welfare. We show that silly
rules render a normative system both more robust and more adaptable in response
to shocks to perceived stability. They make normativity more legible for
humans, and can increase legibility for AI systems as well. For AI systems to
integrate into human normative systems, we suggest, it may be important for
them to have models that include representations of silly rules.

0 Strengths and 0 Vulnerabilities
23

TanksWorld: A Multi-Agent Environment for AI Safety Research

attributed to: Corban G. Rivera, Olivia Lyons, Arielle Summitt, Ayman Fatima, Ji Pak, William Shao, Robert Chalmers, Aryeh Englander, Edward W. Staley, I-Jeng Wang, Ashley J. Llorens
posted by: KabirKumar

The ability to create artificial intelligence (AI) capable of performing
complex tasks is rapidly outpacing our ability to ensure the safe and assured
operation of AI-enabled systems. Fortunately, a landscape of AI safety research
is emerging in response to this asymmetry and yet there is a long way to go. In
particular, recent simulation environments created to illustrate AI safety
risks are relatively simple or narrowly-focused on a particular issue. Hence,
we see a critical need for AI safety research environments that abstract
essential aspects of complex real-world applications. In this work, we
introduce the AI safety TanksWorld as an environment for AI safety research
with three essential aspects: competing performance objectives, human-machine
teaming, and multi-agent competition. The AI safety TanksWorld aims to
accelerate the advancement of safe multi-agent decision-making algorithms by
providing a software framework to support competitions with both system
performance and safety objectives. As a work in progress, this paper introduces
our research objectives and learning environment with reference code and
baseline performance metrics to follow in a future work.

0 Strengths and 0 Vulnerabilities
24

On Gradient-Based Learning in Continuous Games

attributed to: Eric Mazumdar, Lillian J. Ratliff, S. Shankar Sastry
posted by: KabirKumar

We formulate a general framework for competitive gradient-based learning that
encompasses a wide breadth of multi-agent learning algorithms, and analyze the
limiting behavior of competitive gradient-based learning algorithms using
dynamical systems theory. For both general-sum and potential games, we
characterize a non-negligible subset of the local Nash equilibria that will be
avoided if each agent employs a gradient-based learning algorithm. We also shed
light on the issue of convergence to non-Nash strategies in general- and
zero-sum games, which may have no relevance to the underlying game, and arise
solely due to the choice of algorithm. The existence and frequency of such
strategies may explain some of the difficulties encountered when using gradient
descent in zero-sum games as, e.g., in the training of generative adversarial
networks. To reinforce the theoretical contributions, we provide empirical
results that highlight the frequency of linear quadratic dynamic games (a
benchmark for multi-agent reinforcement learning) that admit global Nash
equilibria that are almost surely avoided by policy gradient.

0 Strengths and 0 Vulnerabilities
25

Reinforcement Learning under Threats

attributed to: Victor Gallego, Roi Naveiro, David Rios Insua
posted by: KabirKumar

In several reinforcement learning (RL) scenarios, mainly in security
settings, there may be adversaries trying to interfere with the reward
generating process. In this paper, we introduce Threatened Markov Decision
Processes (TMDPs), which provide a framework to support a decision maker
against a potential adversary in RL. Furthermore, we propose a level-k
thinking scheme resulting in a new learning framework to deal with TMDPs. After
introducing our framework and deriving theoretical results, relevant empirical
evidence is given via extensive experiments, showing the benefits of accounting
for adversaries while the agent learns.

0 Strengths and 0 Vulnerabilities
26

Learning Representations by Humans, for Humans

attributed to: Sophie Hilgard, Nir Rosenfeld, Mahzarin R. Banaji, Jack Cao, David C. Parkes
posted by: KabirKumar

When machine predictors can achieve higher performance than the human
decision-makers they support, improving the performance of human
decision-makers is often conflated with improving machine accuracy. Here we
propose a framework to directly support human decision-making, in which the
role of machines is to reframe problems rather than to prescribe actions
through prediction. Inspired by the success of representation learning in
improving performance of machine predictors, our framework learns human-facing
representations optimized for human performance. This "Mind Composed with
Machine" framework incorporates a human decision-making model directly into the
representation learning paradigm and is trained with a novel human-in-the-loop
training procedure. We empirically demonstrate the successful application of
the framework to various tasks and representational forms.

0 Strengths and 0 Vulnerabilities
27

Learning to Understand Goal Specifications by Modelling Reward

attributed to: Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian Hosseini, Pushmeet Kohli, Edward Grefenstette
posted by: KabirKumar

Recent work has shown that deep reinforcement-learning agents can learn to
follow language-like instructions from infrequent environment rewards. However,
this places on environment designers the onus of designing language-conditional
reward functions which may not be easily or tractably implemented as the
complexity of the environment and the language scales. To overcome this
limitation, we present a framework within which instruction-conditional RL
agents are trained using rewards obtained not from the environment, but from
reward models which are jointly trained from expert examples. As reward models
improve, they learn to accurately reward agents for completing tasks for
environment configurations, and for instructions, not present amongst the
expert data. This framework effectively separates the representation of what
instructions require from how they can be executed. In a simple grid world, it
enables an agent to learn a range of commands requiring interaction with blocks
and understanding of spatial relations and underspecified abstract
arrangements. We further show the method allows our agent to adapt to changes
in the environment without requiring new expert examples.
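
A rough sketch of the framework's structure (assuming PyTorch; placeholder embeddings, not the paper's architecture): a reward model is fit to score expert (instruction, state) pairs above agent-produced ones, and the agent is then rewarded by that model rather than by the environment.

import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

def reward_model_loss(expert_pairs, agent_pairs):
    # Train the reward model to separate expert pairs from agent pairs.
    logits = torch.cat([reward_model(expert_pairs), reward_model(agent_pairs)])
    labels = torch.cat([torch.ones(len(expert_pairs), 1), torch.zeros(len(agent_pairs), 1)])
    return F.binary_cross_entropy_with_logits(logits, labels)

expert_pairs = torch.randn(16, 32)   # placeholder embeddings of expert (instruction, state) pairs
agent_pairs = torch.randn(16, 32)    # placeholder embeddings produced by the learning agent
reward_model_loss(expert_pairs, agent_pairs).backward()

# During RL, the agent's reward for an (instruction, state) pair would come from
# reward_model(pair) instead of the environment.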

0 Strengths and 0 Vulnerabilities
28

I Know What You Meant: Learning Human Objectives by (Under)estimating Their Choice Set

attributed to: Ananth Jonnavittula, Dylan P. Losey
posted by: KabirKumar

Assistive robots have the potential to help people perform everyday tasks.
However, these robots first need to learn what it is their user wants them to
do. Teaching assistive robots is hard for inexperienced users, elderly users,
and users living with physical disabilities, since often these individuals are
unable to show the robot their desired behavior. We know that inclusive
learners should give human teachers credit for what they cannot demonstrate.
But today's robots do the opposite: they assume every user is capable of
providing any demonstration. As a result, these robots learn to mimic the
demonstrated behavior, even when that behavior is not what the human really
meant! Here we propose a different approach to reward learning: robots that
reason about the user's demonstrations in the context of similar or simpler
alternatives. Unlike prior works -- which err towards overestimating the
human's capabilities -- here we err towards underestimating what the human can
input (i.e., their choice set). Our theoretical analysis proves that
underestimating the human's choice set is risk-averse, with better worst-case
performance than overestimating.

0 Strengths and 0 Vulnerabilities
29

Learning to Complement Humans

attributed to: Bryan Wilder, Eric Horvitz, Ece Kamar
posted by: KabirKumar

A rising vision for AI in the open world centers on the development of
systems that can complement humans for perceptual, diagnostic, and reasoning
tasks. To date, systems aimed at complementing the skills of people have
employed models trained to be as accurate as possible in isolation. We
demonstrate how an end-to-end learning strategy can be harnessed to optimize
the combined performance of human-machine teams by considering the distinct
abilities of people and machines. The goal is to focus machine learning on
problem instances that are difficult for humans, while recognizing instances
that are difficult for the machine and seeking human input on them. We
demonstrate in two real-world domains (scientific discovery and medical
diagnosis) that human-machine teams built via these methods outperform the
individual performance of machines and people. We then analyze conditions under
which this complementarity is strongest, and which training methods amplify it.
Taken together, our work provides the first systematic investigation of how
machine learning systems can be trained to complement human reasoning.

0 Strengths and 0 Vulnerabilities
30

Heuristic Approaches for Goal Recognition in Incomplete Domain Models

attributed to: Ramon Fraga Pereira, Felipe Meneguzzi
posted by: KabirKumar

Recent approaches to goal recognition have progressively relaxed the
assumptions about the amount and correctness of domain knowledge and available
observations, yielding accurate and efficient algorithms. These approaches,
however, assume completeness and correctness of the domain theory against which
their algorithms match observations: this is too strong for most real-world
domains. In this paper, we develop goal recognition techniques that are capable
of recognizing goals using incomplete (and possibly incorrect) domain
theories. We show the efficiency and accuracy of our approaches empirically
against a large dataset of goal and plan recognition problems with incomplete
domains.

0 Strengths and 0 Vulnerabilities
31

Learning Rewards from Linguistic Feedback

attributed to: Theodore R. Sumers, Mark K. Ho, Robert D. Hawkins, Karthik Narasimhan, Thomas L. Griffiths
posted by: KabirKumar

We explore unconstrained natural language feedback as a learning signal for
artificial agents. Humans use rich and varied language to teach, yet most prior
work on interactive learning from language assumes a particular form of input
(e.g., commands). We propose a general framework which does not make this
assumption, using aspect-based sentiment analysis to decompose feedback into
sentiment about the features of a Markov decision process. We then perform an
analogue of inverse reinforcement learning, regressing the sentiment on the
features to infer the teacher's latent reward function. To evaluate our
approach, we first collect a corpus of teaching behavior in a cooperative task
where both teacher and learner are human. We implement three artificial
learners: sentiment-based "literal" and "pragmatic" models, and an inference
network trained end-to-end to predict latent rewards. We then repeat our
initial experiment and pair them with human teachers. All three successfully
learn from interactive human feedback.

0 Strengths and 0 Vulnerabilities
32

The EMPATHIC Framework for Task Learning from Implicit Human Feedback

attributed to: Yuchen Cui, Qiping Zhang, Alessandro Allievi, Peter Stone, Scott Niekum, W. Bradley Knox
posted by: KabirKumar

Reactions such as gestures, facial expressions, and vocalizations are an
abundant, naturally occurring channel of information that humans provide during
interactions. A robot or other agent could leverage an understanding of such
implicit human feedback to improve its task performance at no cost to the
human. This approach contrasts with common agent teaching methods based on
demonstrations, critiques, or other guidance that need to be attentively and
intentionally provided. In this paper, we first define the general problem of
learning from implicit human feedback and then propose to address this problem
through a novel data-driven framework, EMPATHIC. This two-stage method consists
of (1) mapping implicit human feedback to relevant task statistics such as
reward, optimality, and advantage; and (2) using such a mapping to learn a
task. We instantiate the first stage and three second-stage evaluations of the
learned mapping. To do so, we collect a dataset of human facial reactions while
participants observe an agent execute a sub-optimal policy for a prescribed
training task...

0 Strengths and 0 Vulnerabilities
33

Parenting: Safe Reinforcement Learning from Human Input

attributed to: Christopher Frye, Ilya Feige
posted by: KabirKumar

Autonomous agents trained via reinforcement learning present numerous safety
concerns: reward hacking, negative side effects, and unsafe exploration, among
others. In the context of near-future autonomous agents, operating in
environments where humans understand the existing dangers, human involvement in
the learning process has proved a promising approach to AI Safety. Here we
demonstrate that a precise framework for learning from human input, loosely
inspired by the way humans parent children, solves a broad class of safety
problems in this context. We show that our Parenting algorithm solves these
problems in the relevant AI Safety gridworlds of Leike et al. (2017), that an
agent can learn to outperform its parent as it "matures", and that policies
learnt through Parenting are generalisable to new environments.

0 Strengths and 0 Vulnerabilities
34

Constrained Policy Improvement for Safe and Efficient Reinforcement Learning

attributed to: Elad Sarafian, Aviv Tamar, Sarit Kraus
posted by: KabirKumar

We propose a policy improvement algorithm for Reinforcement Learning (RL)
which is called Rerouted Behavior Improvement (RBI). RBI is designed to take
into account the evaluation errors of the Q-function. Such errors are common in
RL when learning the Q-value from finite past experience data. Greedy
policies or even constrained policy optimization algorithms which ignore these
errors may suffer from an improvement penalty (i.e. a negative policy
improvement). To minimize the improvement penalty, the RBI idea is to attenuate
rapid policy changes of low probability actions which were less frequently
sampled. This approach is shown to avoid catastrophic performance degradation
and reduce regret when learning from a batch of past experience. Through a
two-armed bandit with Gaussian distributed rewards example, we show that it
also increases data efficiency when the optimal action has a high variance. We
evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from
observations of multiple behavior policies and (2) iterative RL.

0 Strengths and 0 Vulnerabilities
35

Towards Empathic Deep Q-Learning

attributed to: Bart Bussmann, Jacqueline Heinerman, Joel Lehman
posted by: KabirKumar

As reinforcement learning (RL) scales to solve increasingly complex tasks,
interest continues to grow in the fields of AI safety and machine ethics. As a
contribution to these fields, this paper introduces an extension to Deep
Q-Networks (DQNs), called Empathic DQN, that is loosely inspired both by
empathy and the golden rule ("Do unto others as you would have them do unto
you"). Empathic DQN aims to help mitigate negative side effects to other agents
resulting from myopic goal-directed behavior. We assume a setting where a
learning agent coexists with other independent agents (who receive unknown
rewards), where some types of reward (e.g. negative rewards from physical harm)
may generalize across agents. Empathic DQN combines the typical (self-centered)
value with the estimated value of other agents, by imagining (by its own
standards) the value of it being in the other's situation (by considering
constructed states where both agents are swapped). Proof-of-concept results in
two gridworld environments highlight the approach's potential to decrease
collateral harms.
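
A toy sketch of the value blending described here (made-up dynamics, not the paper's implementation): the agent's own action value is combined with the value it would assign to being in the other agent's place, which can shift the chosen action toward a compromise.

def q_self(state, action):
    # Toy self-interested value: the agent wants to stand on cell 3.
    return -(state["self_pos"] + action - 3) ** 2

def swap_perspective(state):
    return {"self_pos": state["other_pos"], "other_pos": state["self_pos"]}

def empathic_q(state, action, beta=0.4):
    # Blend own value with the value of imagining being in the other's situation.
    return (1 - beta) * q_self(state, action) + beta * q_self(swap_perspective(state), action)

state = {"self_pos": 2, "other_pos": 5}
actions = (-1, 0, 1)
greedy = max(actions, key=lambda a: q_self(state, a))
empathic = max(actions, key=lambda a: empathic_q(state, a))
print("greedy action:", greedy, "| empathic action:", empathic)   # 1 vs 0 here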

0 Strengths and 0 Vulnerabilities
36

Building Ethics into Artificial Intelligence

attributed to: Han Yu, Zhiqi Shen, Chunyan Miao, Cyril Leung, Victor R. Lesser, Qiang Yang
posted by: KabirKumar

As artificial intelligence (AI) systems become increasingly ubiquitous, the
topic of AI governance for ethical decision-making by AI has captured public
imagination. Within the AI research community, this topic remains less familiar
to many researchers. In this paper, we complement existing surveys, which
largely focused on the psychological, social and legal discussions of the
topic, with an analysis of recent advances in technical solutions for AI
governance. By reviewing publications in leading AI conferences including AAAI,
AAMAS, ECAI and IJCAI, we propose a taxonomy which divides the field into four
areas: 1) exploring ethical dilemmas; 2) individual ethical decision
frameworks; 3) collective ethical decision frameworks; and 4) ethics in
human-AI interactions. We highlight the intuitions and key techniques used in
each approach, and discuss promising future research directions towards
successful integration of ethical AI systems into human societies.

0 Strengths and 0 Vulnerabilities
37

Reinforcement Learning Under Moral Uncertainty

attributed to: Adrien Ecoffet, Joel Lehman
posted by: KabirKumar

An ambitious goal for machine learning is to create agents that behave
ethically: The capacity to abide by human moral norms would greatly expand the
context in which autonomous agents could be practically and safely deployed,
e.g. fully autonomous vehicles will encounter charged moral decisions that
complicate their deployment. While ethical agents could be trained by rewarding
correct behavior under a specific moral theory (e.g. utilitarianism), there
remains widespread disagreement about the nature of morality. Acknowledging
such disagreement, recent work in moral philosophy proposes that ethical
behavior requires acting under moral uncertainty, i.e. to take into account
when acting that one's credence is split across several plausible ethical
theories. This paper translates such insights to the field of reinforcement
learning, proposes two training methods that realize different points among
competing desiderata, and trains agents in simple environments to act under
moral uncertainty.

0 Strengths and 0 Vulnerabilities
38

AvE: Assistance via Empowerment

attributed to: Yuqing Du, Stas Tiomkin, Emre Kiciman, Daniel Polani, Pieter Abbeel, Anca Dragan
posted by: KabirKumar

One difficulty in using artificial agents for human-assistive applications
lies in the challenge of accurately assisting with a person's goal(s). Existing
methods tend to rely on inferring the human's goal, which is challenging when
there are many potential goals or when the set of candidate goals is difficult
to identify. We propose a new paradigm for assistance by instead increasing the
human's ability to control their environment, and formalize this approach by
augmenting reinforcement learning with human empowerment. This task-agnostic
objective preserves the person's autonomy and ability to achieve any eventual
state. We test our approach against assistance based on goal inference,
highlighting scenarios where our method overcomes failure modes stemming from
goal ambiguity or misspecification. As existing methods for estimating
empowerment in continuous domains are computationally hard, precluding its use
in real time learned assistance, we also propose an efficient
empowerment-inspired proxy metric. Using this, we are able to successfully
demonstrate our method in a shared autonomy user study for a challenging
simulated teleoperation task with human-in-the-loop training.

0 Strengths and 0 Vulnerabilities
39

Planning With Uncertain Specifications (PUnS)

attributed to: Ankit Shah, Shen Li, Julie Shah
posted by: KabirKumar

Reward engineering is crucial to high performance in reinforcement learning
systems. Prior research into reward design has largely focused on Markovian
functions representing the reward. While there has been research into
expressing non-Markov rewards as linear temporal logic (LTL) formulas, this has
focused on task specifications directly defined by the user. However, in many
real-world applications, task specifications are ambiguous, and can only be
expressed as a belief over LTL formulas. In this paper, we introduce planning
with uncertain specifications (PUnS), a novel formulation that addresses the
challenge posed by non-Markovian specifications expressed as beliefs over LTL
formulas. We present four criteria that capture the semantics of satisfying a
belief over specifications for different applications, and analyze the
qualitative implications of these criteria within a synthetic domain. We
demonstrate the existence of an equivalent Markov decision process (MDP) for
any instance of PUnS. Finally, we demonstrate our approach on the real-world
task of setting a dinner table automatically with a robot that inferred task
specifications from human demonstrations.

0 Strengths and 0 Vulnerabilities
40

Penalizing side effects using stepwise relative reachability

attributed to: Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic, Shane Legg
posted by: KabirKumar

How can we design safe reinforcement learning agents that avoid unnecessary
disruptions to their environment? We show that current approaches to penalizing
side effects can introduce bad incentives, e.g. to prevent any irreversible
changes in the environment, including the actions of other agents. To isolate
the source of such undesirable incentives, we break down side effects penalties
into two components: a baseline state and a measure of deviation from this
baseline state. We argue that some of these incentives arise from the choice of
baseline, and others arise from the choice of deviation measure. We introduce a
new variant of the stepwise inaction baseline and a new deviation measure based
on relative reachability of states. The combination of these design choices
avoids the given undesirable incentives, while simpler baselines and the
unreachability measure fail. We demonstrate this empirically by comparing
different combinations of baseline and deviation measure choices on a set of
gridworld experiments designed to illustrate possible bad incentives.
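As a rough illustration of the baseline-plus-deviation decomposition, the sketch below computes a relative-reachability penalty for a tiny deterministic MDP; the dictionary representation and function names are assumptions for illustration, not the paper's environments.

```python
def reachable(transitions, start, horizon):
    """States reachable from `start` within `horizon` steps, where the MDP is
    given as transitions[state][action] -> next_state (illustrative form)."""
    reached = {start}
    for _ in range(horizon):
        reached |= {transitions[s][a] for s in reached for a in transitions[s]}
    return reached

def relative_reachability_penalty(transitions, current, baseline, states, horizon=10):
    """Deviation measure: average loss of reachability relative to the
    baseline state (e.g. the stepwise inaction baseline)."""
    from_cur = reachable(transitions, current, horizon)
    from_base = reachable(transitions, baseline, horizon)
    drops = [1.0 if (s in from_base and s not in from_cur) else 0.0 for s in states]
    return sum(drops) / len(drops)
```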

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
41

Learning to be Safe: Deep RL with a Safety Critic

attributed to: Krishnan Srinivasan, Benjamin Eysenbach, Sehoon Ha, Jie Tan, Chelsea Finn
posted by: KabirKumar

Safety is an essential component for deploying reinforcement learning (RL)
algorithms in real-world scenarios, and is critical during the learning process
itself. A natural first approach toward safe RL is to manually specify
constraints on the policy's behavior. However, just as learning has enabled
progress in large-scale development of AI systems, learning safety
specifications may also be necessary to ensure safety in messy open-world
environments where manual safety specifications cannot scale. Akin to how
humans learn incrementally starting in child-safe environments, we propose to
learn how to be safe in one set of tasks and environments, and then use that
learned intuition to constrain future behaviors when learning new, modified
tasks. We empirically study this form of safety-constrained transfer learning
in three challenging domains: simulated navigation, quadruped locomotion, and
dexterous in-hand manipulation.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
42

Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones

attributed to: Brijen Thananjeyan, Ashwin Balakrishna, Suraj Nair, Michael Luo, Krishnan Srinivasan, Minho Hwang, Joseph E. Gonzalez, Julian Ibarz, Chelsea Finn, Ken Goldberg
posted by: KabirKumar

Safety remains a central obstacle preventing widespread use of RL in the real
world: learning new tasks in uncertain environments requires extensive
exploration, but safety requires limiting exploration. We propose Recovery RL,
an algorithm which navigates this tradeoff by (1) leveraging offline data to
learn about constraint violating zones before policy learning and (2)
separating the goals of improving task performance and constraint satisfaction
across two policies: a task policy that only optimizes the task reward and a
recovery policy that guides the agent to safety when constraint violation is
likely. We evaluate Recovery RL on 6 simulation domains, including two
contact-rich manipulation tasks and an image-based navigation task, and an
image-based obstacle avoidance task on a physical robot. We compare Recovery RL
to 5 prior safe RL methods which jointly optimize for task performance and
safety via constrained optimization or reward shaping and find that Recovery RL
outperforms the next best prior method across all domains.
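As an illustration of the two-policy structure, here is a minimal, hypothetical action-selection rule in Python; `q_risk` stands for a safety critic learned from offline data, and the threshold `eps_risk` is an assumed hyperparameter.

```python
def select_action(state, task_policy, recovery_policy, q_risk, eps_risk=0.3):
    """Recovery-RL-style action selection (sketch): execute the task policy's
    action unless the learned risk critic predicts constraint violation is
    too likely, in which case defer to the recovery policy.
    `q_risk(state, action)` is assumed to return an estimated probability of
    a future constraint violation."""
    a_task = task_policy(state)
    if q_risk(state, a_task) > eps_risk:
        return recovery_policy(state)
    return a_task
```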

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
43

Conservative Agency via Attainable Utility Preservation

attributed to: Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli
posted by: KabirKumar

Reward functions are easy to misspecify; although designers can make
corrections after observing mistakes, an agent pursuing a misspecified reward
function can irreversibly change the state of its environment. If that change
precludes optimization of the correctly specified reward function, then
correction is futile. For example, a robotic factory assistant could break
expensive equipment due to a reward misspecification; even if the designers
immediately correct the reward function, the damage is done. To mitigate this
risk, we introduce an approach that balances optimization of the primary reward
function with preservation of the ability to optimize auxiliary reward
functions. Surprisingly, even when the auxiliary reward functions are randomly
generated and therefore uninformative about the correctly specified reward
function, this approach induces conservative, effective behavior.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
44

Safe Reinforcement Learning with Model Uncertainty Estimates

attributed to: Björn Lütjens, Michael Everett, Jonathan P. How
posted by: KabirKumar

Many current autonomous systems are being designed with a strong reliance on
black box predictions from deep neural networks (DNNs). However, DNNs tend to
be overconfident in predictions on unseen data and can give unpredictable
results for far-from-distribution test data. The importance of predictions that
are robust to this distributional shift is evident for safety-critical
applications, such as collision avoidance around pedestrians. Measures of model
uncertainty can be used to identify unseen data, but the state-of-the-art
extraction methods such as Bayesian neural networks are mostly intractable to
compute. This paper uses MC-Dropout and Bootstrapping to give computationally
tractable and parallelizable uncertainty estimates. The methods are embedded in
a Safe Reinforcement Learning framework to form uncertainty-aware navigation
around pedestrians. The result is a collision avoidance policy that knows what
it does not know and cautiously avoids pedestrians that exhibit unseen
behavior. The policy is demonstrated in simulation to be more robust to novel
observations and take safer actions than an uncertainty-unaware baseline.
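Bootstrapping is one of the two uncertainty estimators mentioned; the sketch below shows the general pattern in plain NumPy, with `models` standing for independently trained predictors (an assumed interface, not the paper's code).

```python
import numpy as np

def ensemble_uncertainty(models, observation):
    """Bootstrapped-ensemble uncertainty (sketch): the spread of predictions
    across independently trained models flags far-from-distribution inputs."""
    preds = np.array([m(observation) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

def cautious_action_value(models, observation, risk_weight=1.0):
    """Uncertainty-aware value: penalize the mean prediction by its spread,
    so the policy acts more conservatively when the models disagree."""
    mean, std = ensemble_uncertainty(models, observation)
    return mean - risk_weight * std
```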

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
45

Avoiding Negative Side Effects due to Incomplete Knowledge of AI Systems

attributed to: Sandhya Saisubramanian, Shlomo Zilberstein, Ece Kamar
posted by: KabirKumar

Autonomous agents acting in the real-world often operate based on models that
ignore certain aspects of the environment. The incompleteness of any given
model -- handcrafted or machine acquired -- is inevitable due to practical
limitations of any modeling technique for complex real-world settings. Due to
the limited fidelity of its model, an agent's actions may have unexpected,
undesirable consequences during execution. Learning to recognize and avoid such
negative side effects of an agent's actions is critical to improve the safety
and reliability of autonomous systems. Mitigating negative side effects is an
emerging research topic that is attracting increased attention due to the rapid
growth in the deployment of AI systems and their broad societal impacts.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
46

Avoiding Side Effects in Complex Environments

attributed to: Alexander Matt Turner, Neale Ratzlaff, Prasad Tadepalli
posted by: KabirKumar

Reward function specification can be difficult. Rewarding the agent for
making a widget may be easy, but penalizing the multitude of possible negative
side effects is hard. In toy environments, Attainable Utility Preservation
(AUP) avoided side effects by penalizing shifts in the ability to achieve
randomly generated goals. We scale this approach to large, randomly generated
environments based on Conway's Game of Life. By preserving optimal value for a
single randomly generated reward function, AUP incurs modest overhead while
leading the agent to complete the specified task and avoid many side effects.
Videos and code are available at https://avoiding-side-effects.github.io/.

...read full abstract close
show post
: 0
Add

: 1
Add
▼ 0 Strengths and 1 Vulnerabilities
add vulnerability / strength
report
47

Safety Aware Reinforcement Learning (SARL)

attributed to: Santiago Miret, Somdeb Majumdar, Carroll Wainwright
posted by: KabirKumar

As reinforcement learning agents become increasingly integrated into complex,
real-world environments, designing for safety becomes a critical consideration.
We specifically focus on researching scenarios where agents can cause undesired
side effects while executing a policy on a primary task. Since one can define
multiple tasks for a given environment dynamics, there are two important
challenges. First, we need to abstract the concept of safety that applies
broadly to that environment independent of the specific task being executed.
Second, we need a mechanism for the abstracted notion of safety to modulate the
actions of agents executing different policies to minimize their side-effects.
In this work, we propose Safety Aware Reinforcement Learning (SARL) - a
framework where a virtual safe agent modulates the actions of a main
reward-based agent to minimize side effects. The safe agent learns a
task-independent notion of safety for a given environment. The main agent is
then trained with a regularization loss given by the distance between the
native action probabilities of the two agents.
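To make the regularization idea concrete, here is a small Python sketch; the paper leaves the distance metric between the two agents' action distributions as a design choice, and KL divergence is used here purely for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def sarl_loss(task_loss, task_action_probs, safe_action_probs, beta=0.1):
    """SARL-style objective (sketch): the task agent's loss plus a regularizer
    given by the distance between its action distribution and that of the
    task-independent safe agent."""
    return task_loss + beta * kl_divergence(task_action_probs, safe_action_probs)
```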

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
48

Safe Option-Critic: Learning Safety in the Option-Critic Architecture

attributed to: Arushi Jain, Khimya Khetarpal, Doina Precup
posted by: KabirKumar

Designing hierarchical reinforcement learning algorithms that exhibit safe
behaviour is not only vital for practical applications but also facilitates a
better understanding of an agent's decisions. We tackle this problem in the
options framework, a particular way to specify temporally abstract actions
which allow an agent to use sub-policies with start and end conditions. We
consider a behaviour safe if it avoids regions of state-space with high
uncertainty in the outcomes of actions. We propose an optimization objective
that learns safe options by encouraging the agent to visit states with higher
behavioural consistency. The proposed objective results in a trade-off between
maximizing the standard expected return and minimizing the effect of model
uncertainty in the return. We propose a policy gradient algorithm to optimize
the constrained objective function. We examine the quantitative and qualitative
behaviour of the proposed approach in a tabular grid-world, continuous-state
puddle-world, and three games from the Arcade Learning Environment: Ms.Pacman,
Amidar, and Q*Bert.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
49

Stovepiping and Malicious Software: A Critical Review of AGI Containment

attributed to: Jason M. Pittman, Jesus P. Espinoza, Courtney Crosby
posted by: KabirKumar

Awareness of the possible impacts associated with artificial intelligence has
risen in proportion to progress in the field. While there are tremendous
benefits to society, many argue that there are just as many, if not more,
concerns related to advanced forms of artificial intelligence. Accordingly,
research into methods to develop artificial intelligence safely is increasingly
important. In this paper, we provide an overview of one such safety paradigm:
containment, with a critical lens aimed toward generative adversarial networks
and potentially malicious artificial intelligence. Additionally, we illuminate
the potential for a developmental blindspot in the stovepiping of containment
mechanisms.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
50

Reward Estimation for Variance Reduction in Deep Reinforcement Learning

attributed to: Joshua Romoff, Peter Henderson, Alexandre Piché, Vincent Francois-Lavet, Joelle Pineau
posted by: KabirKumar

Reinforcement Learning (RL) agents require the specification of a reward
signal for learning behaviours. However, introduction of corrupt or stochastic
rewards can yield high variance in learning. Such corruption may be a direct
result of goal misspecification, randomness in the reward signal, or
correlation of the reward with external factors that are not known to the
agent. Corruption or stochasticity of the reward signal can be especially
problematic in robotics, where goal specification can be particularly difficult
for complex tasks. While many variance reduction techniques have been studied
to improve the robustness of the RL process, handling such stochastic or
corrupted reward structures remains difficult. As an alternative for handling
this scenario in model-free RL methods, we suggest using an estimator for both
rewards and value functions. We demonstrate that this improves performance
under corrupted stochastic rewards in both the tabular and non-linear function
approximation settings for a variety of noise types and environments. The use
of reward estimation is a robust and easy-to-implement improvement for handling
corrupted reward signals in model-free RL.
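As a minimal illustration of replacing the observed reward with an estimate, the sketch below keeps a running per-state-action average and feeds it into the TD target; this is a simplified stand-in, not the paper's estimator.

```python
from collections import defaultdict

class RewardEstimator:
    """Running-average reward model (sketch): smooths a corrupted or
    stochastic reward signal before it enters the TD update."""
    def __init__(self):
        self.mean = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, state, action, observed_reward):
        key = (state, action)
        self.count[key] += 1
        self.mean[key] += (observed_reward - self.mean[key]) / self.count[key]

    def estimate(self, state, action):
        return self.mean[(state, action)]

def td_target(estimator, state, action, next_value, gamma=0.99):
    """TD target built from the estimated rather than the observed reward."""
    return estimator.estimate(state, action) + gamma * next_value
```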

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
51

Smoothing Policies and Safe Policy Gradients

attributed to: Matteo Papini, Matteo Pirotta, Marcello Restelli
posted by: KabirKumar

Policy Gradient (PG) algorithms are among the best candidates for the
much-anticipated applications of reinforcement learning to real-world control
tasks, such as robotics. However, the trial-and-error nature of these methods
poses safety issues whenever the learning process itself must be performed on a
physical system or involves any form of human-computer interaction. In this
paper, we address a specific safety formulation, where both goals and dangers
are encoded in a scalar reward signal and the learning agent is constrained to
never worsen its performance, measured as the expected sum of rewards. By
studying actor-only policy gradient from a stochastic optimization perspective,
we establish improvement guarantees for a wide class of parametric policies,
generalizing existing results on Gaussian policies. This, together with novel
upper bounds on the variance of policy gradient estimators, allows us to
identify meta-parameter schedules that guarantee monotonic improvement with
high probability.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
52

Representation Learning with Contrastive Predictive Coding

attributed to: Aaron van den Oord, Yazhe Li, Oriol Vinyals
posted by: KabirKumar

While supervised learning has enabled great progress in many applications,
unsupervised learning has not seen such widespread adoption, and remains an
important and challenging endeavor for artificial intelligence. In this work,
we propose a universal unsupervised learning approach to extract useful
representations from high-dimensional data, which we call Contrastive
Predictive Coding. The key insight of our model is to learn such
representations by predicting the future in latent space by using powerful
autoregressive models. We use a probabilistic contrastive loss which induces
the latent space to capture information that is maximally useful to predict
future samples. It also makes the model tractable by using negative sampling.
While most prior work has focused on evaluating representations for a
particular modality, we demonstrate that our approach is able to learn useful
representations achieving strong performance on four distinct domains: speech,
images, text and reinforcement learning in 3D environments.
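The training signal is a contrastive (InfoNCE-style) loss; below is a small NumPy sketch over already-encoded vectors, with the encoder and autoregressive model abstracted away. Shapes and names are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(context, future, negatives):
    """InfoNCE-style contrastive loss (sketch): score the true future latent
    against negative samples with a dot product and maximize the probability
    of selecting the true one. Shapes: context (d,), future (d,),
    negatives (k, d); the vectors stand in for learned encoder outputs."""
    candidates = np.vstack([future, negatives])   # (k+1, d), true future first
    scores = candidates @ context                 # (k+1,) similarity scores
    scores -= scores.max()                        # numerical stability
    log_prob_true = scores[0] - np.log(np.exp(scores).sum())
    return -log_prob_true
```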

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
53

On Variational Bounds of Mutual Information

attributed to: Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, George Tucker
posted by: KabirKumar

Estimating and optimizing Mutual Information (MI) is core to many problems in
machine learning; however, bounding MI in high dimensions is challenging. To
establish tractable and scalable objectives, recent work has turned to
variational bounds parameterized by neural networks, but the relationships and
tradeoffs between these bounds remains unclear. In this work, we unify these
recent developments in a single framework. We find that the existing
variational lower bounds degrade when the MI is large, exhibiting either high
bias or high variance. To address this problem, we introduce a continuum of
lower bounds that encompasses previous bounds and flexibly trades off bias and
variance. On high-dimensional, controlled problems, we empirically characterize
the bias and variance of the bounds and their gradients and demonstrate the
effectiveness of our new bounds for estimation and representation learning.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
54

Certified Defenses against Adversarial Examples

attributed to: Aditi Raghunathan, Jacob Steinhardt, Percy Liang
posted by: KabirKumar

While neural networks have achieved high accuracy on standard image
classification benchmarks, their accuracy drops to nearly zero in the presence
of small adversarial perturbations to test inputs. Defenses based on
regularization and adversarial training have been proposed, but are often followed
by new, stronger attacks that defeat these defenses. Can we somehow end this
arms race? In this work, we study this problem for neural networks with one
hidden layer. We first propose a method based on a semidefinite relaxation that
outputs a certificate that for a given network and test input, no attack can
force the error to exceed a certain value. Second, as this certificate is
differentiable, we jointly optimize it with the network parameters, providing
an adaptive regularizer that encourages robustness against all attacks. On
MNIST, our approach produces a network and a certificate that no attack that
perturbs each pixel by at most ε = 0.1 can cause more than 35% test
error.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
55

Neurosymbolic Reinforcement Learning with Formally Verified Exploration

attributed to: Greg Anderson, Abhinav Verma, Isil Dillig, Swarat Chaudhuri
posted by: KabirKumar

We present Revel, a partially neural reinforcement learning (RL) framework
for provably safe exploration in continuous state and action spaces. A key
challenge for provably safe deep RL is that repeatedly verifying neural
networks within a learning loop is computationally infeasible. We address this
challenge using two policy classes: a general, neurosymbolic class with
approximate gradients and a more restricted class of symbolic policies that
allows efficient verification. Our learning algorithm is a mirror descent over
policies: in each iteration, it safely lifts a symbolic policy into the
neurosymbolic space, performs safe gradient updates to the resulting policy,
and projects the updated policy into the safe symbolic subset, all without
requiring explicit verification of neural networks. Our empirical results show
that Revel enforces safe exploration in many scenarios in which Constrained
Policy Optimization does not, and that it can discover policies that outperform
those learned through prior approaches to verified exploration.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
56

This plan suggests that high-capability general AI models should be tested within a secure computing environment (box) that is censored (no mention of humanity or computers) and highly controlled (auto-compute halts/slowdowns, restrictions on agent behavior) with simulations of alignment-relevant scenarios (e.g. with other general agents that the test subject is to be aligned to).

...read full abstract close
show post
: 0
Add

: 1
Add
▼ 0 Strengths and 1 Vulnerabilities
add vulnerability / strength
report
57

Artificial Intelligence (AI) systems have significant potential to affect the lives of individuals and societies. As these systems are being increasingly used in decision-making processes, it has become crucial to ensure that they make ethically sound judgments. This paper proposes a novel framework for embedding ethical priors into AI, inspired by the Bayesian approach to machine learning. We propose that ethical assumptions and beliefs can be incorporated as Bayesian priors, shaping the AI’s learning and reasoning process in a similar way to humans’ inborn moral intuitions. This approach, while complex, provides a promising avenue for advancing ethically aligned AI systems.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
58

This article explores the concept and potential application of bottom-up virtue ethics as an approach to instilling ethical behavior in artificial intelligence (AI) systems. We argue that by training machine learning models to emulate virtues such as honesty, justice, and compassion, we can cultivate positive traits and behaviors based on ideal human moral character. This bottom-up approach contrasts with traditional top-down programming of ethical rules, focusing instead on experiential learning. Although this approach presents its own challenges, it offers a promising avenue for the development of more ethically aligned AI systems.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
59

As artificial intelligence rapidly advances, ensuring alignment with moral values and ethics becomes imperative. This article provides a comprehensive overview of techniques to embed human values into AI. Interactive learning, crowdsourcing, uncertainty modeling, oversight mechanisms, and conservative system design are analyzed in-depth. Respective limitations are discussed and mitigation strategies proposed. A multi-faceted approach combining the strengths of these complementary methods promises safer development of AI that benefits humanity in accordance with our ideals.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
60

Distributional shift poses a significant challenge for deploying and maintaining AI systems. As the real-world distributions that models are applied to evolve over time, performance can deteriorate. This article examines techniques and best practices for improving model robustness to distributional shift and enabling rapid adaptation when it occurs.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
61

Interpretability in AI systems is fast becoming a critical requirement in the industry. The proposed Hybrid Explainability Model (HEM) integrates multiple interpretability techniques, including Feature Importance Visualization, Model Transparency Tools, and Counterfactual Explanations, offering a comprehensive understanding of AI model behavior. This article elaborates on the specifics of implementing HEM, addresses potential counter-arguments, and provides rebuttals to these counterpoints. The HEM approach aims to deliver a holistic understanding of AI decision-making processes, fostering improved accountability, trust, and safety in AI applications.

...read full abstract close
show post
: 0
Add

: 2
Add
▼ 0 Strengths and 2 Vulnerabilities
add vulnerability / strength
report
62

This article proposes a detailed framework for a robust feedback loop to enhance corrigibility. The ability to continuously learn and correct errors is critical for safe and beneficial AI, but developing corrigible systems comes with significant technical and ethical challenges. The feedback loop outlined involves gathering user input, interpreting feedback contextually, enabling AI actions and learning, confirming changes, and iterative improvement. The article analyzes potential limitations of this approach and provides detailed examples of implementation methods using advanced natural language processing, reinforcement learning, and adversarial training techniques.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
63

To align advanced AIs, an ensemble of diverse, transparent Overseer AIs will independently monitor the target AI and provide granular assessments on its alignment with constitution, human values, ethics, and safety. Overseer interventions will be incremental and subject to human oversight. The system will be implemented cautiously, with extensive testing to validate capabilities. Alignment will be treated as an ongoing collaborative process between humans, Overseers, and the target AI, leveraging complementary strengths through open dialog. Continuous vigilance, updating of definitions, and contingency planning will be required to address inevitable uncertainties and risks.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
64

My proposal entails constructing a tightly restricted AI subsystem with the sole capability of attempting to safely shut itself down in order to probe, in an isolated manner, potential vulnerabilities in alignment techniques and then improve them.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
65

Corrigibility via multiple routes

attributed to: Jan Kulveit
posted by: tori[she/her]

Use multiple routes to induce 'corrigibility': principles which counteract instrumental convergence (e.g. disutility from resource acquisition, measured via mutual information between the AI and distant parts of the environment), principles which counteract unbounded rationality (satisficing, myopia, etc.), 'traps' like ontological uncertainty about the level of simulation (e.g. uncertainty about whether it is in training or deployment), human oversight, and interpretability (e.g. an independent 'translator').

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
66

Avoiding Tampering Incentives in Deep RL via Decoupled Approval

attributed to: Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg
posted by: KabirKumar

How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the agent? Standard RL algorithms assume a secure reward function, and can thus perform poorly in settings where agents can tamper with the reward-generating mechanism. We present a principled solution to the problem of learning from influenceable feedback, which combines approval with a decoupled feedback collection procedure. For a natural class of corruption functions, decoupled approval algorithms have aligned incentives both at convergence and for their local updates. Empirically, they also scale to complex 3D environments where tampering is possible.
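A minimal sketch of the decoupling idea follows; `policy_sample` and `human_approval` are assumed, illustrative interfaces rather than the paper's implementation.

```python
def decoupled_approval_step(state, policy_sample, human_approval):
    """Decoupled approval (sketch): the action that is executed and the action
    the human is asked to evaluate are drawn independently from the policy,
    so influencing the feedback mechanism through the executed action does
    not pay off in expectation."""
    executed = policy_sample(state)   # acts in the environment
    queried = policy_sample(state)    # independent draw, shown to the human
    feedback = human_approval(state, queried)
    return executed, queried, feedback
```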

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
67

Pessimism About Unknown Unknowns Inspires Conservatism

attributed to: Michael K. Cohen, Marcus Hutter
posted by: KabirKumar

If we could define the set of all bad outcomes, we could hard-code an agent which avoids them; however, in sufficiently complex environments, this is infeasible. We do not know of any general-purpose approaches in the literature to avoiding novel failure modes. Motivated by this, we define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models. We call this agent pessimistic, since it optimizes assuming the worst case. A scalar parameter tunes the agent's pessimism by changing the size of the set of world-models taken into account...
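In simplified form, the pessimistic choice can be sketched as a worst-case evaluation over candidate world-models; `model.simulate` is a hypothetical interface, and the actual agent is an idealized Bayesian learner rather than this finite enumeration.

```python
def pessimistic_value(world_models, policy, state, horizon=5):
    """Pessimistic evaluation (sketch): score a policy by its worst-case
    expected return across a set of candidate world-models."""
    return min(model.simulate(policy, state, horizon) for model in world_models)

def pessimistic_policy(world_models, candidate_policies, state, horizon=5):
    """Choose the candidate policy with the best worst-case return."""
    return max(candidate_policies,
               key=lambda pi: pessimistic_value(world_models, pi, state, horizon))
```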

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
68

PROVABLY FAIR FEDERATED LEARNING

attributed to: Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith
posted by: KabirKumar

In federated learning, fair prediction across various protected groups (e.g., gender,
race) is an important constraint for many applications. Unfortunately, prior work
studying group fair federated learning lacks formal convergence or fairness guarantees. Our work provides a new definition for group fairness in federated learning
based on the notion of Bounded Group Loss (BGL), which can be easily applied
to common federated learning objectives. Based on our definition, we propose a
scalable algorithm that optimizes the empirical risk and global fairness constraints,
which we evaluate across common fairness and federated learning benchmarks.
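A rough illustration of the Bounded Group Loss idea (variable names are hypothetical; the paper's algorithm enforces the constraint during federated optimization rather than as a post-hoc check):

```python
def bounded_group_loss_violations(per_example_losses, group_labels, bound):
    """Bounded Group Loss check (sketch): group fairness here requires each
    protected group's average loss to stay below a shared bound. Returns the
    groups whose empirical loss currently violates that bound."""
    groups = set(group_labels)
    group_loss = {
        g: sum(l for l, gl in zip(per_example_losses, group_labels) if gl == g)
           / sum(1 for gl in group_labels if gl == g)
        for g in groups
    }
    return {g: loss for g, loss in group_loss.items() if loss > bound}
```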

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
69

Towards Safe Artificial General Intelligence

attributed to: Tom Everitt
posted by: shumaari

The field of artificial intelligence has recently experienced a number of breakthroughs thanks to progress in deep learning and reinforcement learning. Computer algorithms now outperform humans at Go, Jeopardy, image classification, and lip reading, and are becoming very competent at driving cars and interpreting natural language. The rapid development has led many to conjecture that artificial intelligence with greater-than-human ability on a wide range of tasks may not be far. This in turn raises concerns whether we know how to control such systems, in case we were to successfully build them...

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
70

Transparency, Detection and Imitation in Strategic Classification

attributed to: Flavia Barsotti, Ruya Gokhan Kocer, Fernando P. Santos
posted by: shumaari

Given the ubiquity of AI-based decisions that affect individuals’ lives, providing transparent explanations about algorithms is ethically sound and often legally mandatory. How do individuals strategically adapt following explanations? What are the consequences of adaptation for algorithmic accuracy? We simulate the interplay between explanations shared by an Institution (e.g. a bank) and the dynamics of strategic adaptation by Individuals reacting to such feedback... 
Keywords: Agent-based and Multi-agent Systems: Agent-Based Simulation and Emergence; AI Ethics, Trust, Fairness: Ethical, Legal and Societal Issues; Multidisciplinary Topics and Applications: Finance

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
71

Socially Intelligent Genetic Agents for the Emergence of Explicit Norms

attributed to: Rishabh Agrawal, Nirav Ajmeri, Munindar Singh
posted by: shumaari

Norms help regulate a society. Norms may be explicit (represented in structured form) or implicit. We address the emergence of explicit norms by developing agents who provide and reason about explanations for norm violations in deciding sanctions and identifying alternative norms. These agents use a genetic algorithm to produce norms and reinforcement learning to learn the values of these norms. We find that applying explanations leads to norms that provide better cohesion and goal satisfaction for the agents. Our results are stable for societies with differing attitudes of generosity.
Keywords: Agent-based and Multi-agent Systems: Agent-Based Simulation and Emergence, Normative systems

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
72

Teaching AI Agents Ethical Values Using Reinforcement Learning and Policy Orchestration (Extended Abstract)

attributed to: Noothigattu, Ritesh; Bouneffouf, Djallel; Mattei, Nicholas; Chandra, Rachita; Madan, Piyush; Varshney, Kush R.; Campbell, Murray; Singh, Moninder; and Rossi, Francesca
posted by: JustinBradshaw

Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. We detail a novel approach
that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based.
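A toy sketch of the orchestration step follows; an epsilon-greedy rule stands in for the contextual bandit described in the paper, and the two arms correspond to the constraint-based and reward-based policies.

```python
import random

class Orchestrator:
    """Bandit-style orchestrator (sketch): at each step, pick between the
    constraint policy (learned from demonstrations via IRL) and the
    environment-reward policy, then update the chosen arm's value from the
    observed payoff."""
    def __init__(self, epsilon=0.1):
        self.values = {"constraint": 0.0, "reward": 0.0}
        self.counts = {"constraint": 0, "reward": 0}
        self.epsilon = epsilon

    def pick(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)

    def update(self, arm, payoff):
        self.counts[arm] += 1
        self.values[arm] += (payoff - self.values[arm]) / self.counts[arm]
```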

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
73

Inverse Reinforcement Learning From Like-Minded Teachers

attributed to: Noothigattu, Ritesh; Yan, Tom; Procaccia, Ariel D.
posted by: JustinBradshaw

We study the problem of learning a policy in a Markov decision process (MDP) based on observations of the actions taken by multiple teachers. We assume that the teachers are like-minded in that their reward functions -- while different from each other -- are random perturbations of an underlying reward function. Under this assumption, we demonstrate that inverse reinforcement learning algorithms that satisfy a certain property -- that of matching feature expectations -- yield policies that are approximately optimal with respect to the underlying reward function, and that no algorithm can do better in the worst case.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
74

Inverse Reinforcement Learning: A Control Lyapunov Approach

attributed to: Tesfazgi, Samuel; Lederer, Armin; and Hirche, Sandra
posted by: JustinBradshaw

Inferring the intent of an intelligent agent from demonstrations and subsequently predicting its behavior, is a critical task in many collaborative settings. A common approach to solve this problem is the framework of inverse reinforcement learning (IRL), where the observed agent, e.g., a human demonstrator, is assumed to behave according to an intrinsic cost function that reflects its intent and informs its control actions. In this work, we reformulate the IRL inference problem to learning control Lyapunov functions (CLF) from demonstrations by exploiting the inverse optimality property, which states that every CLF is also a meaningful value function.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
75

A Voting-Based System for Ethical Decision Making

attributed to: Noothigattu, Ritesh; Gaikwad, Snehalkumar ‘Neil’ S.; Awad, Edmond; Dsouza, Sohan; Rahwan, Iyad; Ravikumar, Pradeep; and Procaccia, Ariel D.
posted by: JustinBradshaw

We present a general approach to automating ethical decisions, drawing on machine learning and computational social choice. In a nutshell, we propose to learn a model of societal preferences, and, when faced with a specific ethical dilemma at runtime, efficiently aggregate those preferences to identify a desirable choice. We provide a concrete algorithm that instantiates our approach; some of its crucial steps are informed by a new theory of swap-dominance efficient voting rules. Finally, we implement and evaluate a system for ethical decision making in the autonomous vehicle domain, using preference data collected from 1.3 million people through the Moral Machine website.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
76

Aligning Superhuman AI with Human Behavior: Chess as a Model System

attributed to: McIlroy-Young, Reid; Sen, Siddhartha; Kleinberg, Jon; Anderson, Ashton
posted by: JustinBradshaw

As artificial intelligence becomes increasingly intelligent—in some
cases, achieving superhuman performance—there is growing potential for humans to learn from and collaborate with algorithms.
However, the ways in which AI systems approach problems are often
different from the ways people do, and thus may be uninterpretable
and hard to learn from. A crucial step in bridging this gap between human and artificial intelligence is modeling the granular actions that
constitute human behavior, rather than simply matching aggregate
human performance.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
77

Learning to Play No-Press Diplomacy with Best Response Policy Iteration

attributed to: Anthony, Thomas; Eccles, Tom; Tacchetti, Andrea; Kramár, János; Gemp, Ian; Hudson, Thomas C.; Porcel, Nicolas; Lanctot, Marc; Pérolat, Julien; Everett, Richard; Werpachowski, Roman; Singh, Satinder; Graepel, Thore; Bachrach, Yoram
posted by: JustinBradshaw

Recent advances in deep reinforcement learning (RL) have led to considerable
progress in many 2-player zero-sum games, such as Go, Poker and Starcraft. The
purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However, real-world settings are many-agent,
and agent interactions are complex mixtures of common-interest and competitive
aspects. We consider Diplomacy, a 7-player board game designed to accentuate
dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL
algorithms.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
78

Truthful AI: Developing and governing AI that does not lie

attributed to: Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders
posted by: JustinBradshaw

In many contexts, lying – the use of verbal falsehoods to deceive – is harmful. While lying has traditionally been a human affair, AI systems that
make sophisticated verbal statements are becoming increasingly prevalent.
This raises the question of how we should limit the harm caused by AI
“lies” (i.e. falsehoods that are actively selected for). Human truthfulness
is governed by social norms and by laws (against defamation, perjury,
and fraud). Differences between AI and humans present an opportunity
to have more precise standards of truthfulness for AI, and to have these
standards rise over time.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
79

Verifiably Safe Exploration for End-to-End Reinforcement Learning

attributed to: Nathan Hunt, Nathan Fulton, Sara Magliacane, Nghia Hoang, Subhro Das, Armando Solar-Lezama
posted by: KabirKumar

Deploying deep reinforcement learning in safety-critical settings requires developing algorithms that obey hard constraints during exploration. This paper contributes a first approach toward enforcing formal safety constraints on end-to-end policies with visual inputs. Our approach draws on recent advances in object detection and automated reasoning for hybrid dynamical systems. The approach is evaluated on a novel benchmark that emphasizes the challenge of safely exploring in the presence of hard constraints...

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
80

A Roadmap for Robust End-to-End Alignment

attributed to: Lê Nguyên Hoang
posted by: KabirKumar

As algorithms are becoming more and more data-driven, the greatest lever we have left to make them robustly beneficial to mankind lies in the design of their objective functions. Robust alignment aims to address this design problem. Arguably, the growing importance of social media’s recommender systems makes it an urgent problem, for instance to adequately automate hate speech moderation. In this paper, we propose a preliminary research program for robust alignment. This roadmap aims at decomposing the end-to-end alignment problem into numerous more tractable subproblems...

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
81

Safe Reinforcement Learning with Natural Language Constraints

attributed to: Tsung-Yen Yang, Michael Hu, Yinlam Chow, Peter J. Ramadge, Karthik Narasimhan
posted by: KabirKumar

While safe reinforcement learning (RL) holds great promise for many practical applications like robotics or autonomous cars, current approaches require specifying constraints in mathematical form. Such specifications demand domain expertise, limiting the adoption of safe RL. In this paper, we propose learning to interpret natural language constraints for safe RL. To this end, we first introduce HazardWorld, a new multi-task benchmark that requires an agent to optimize reward while not violating constraints specified in free-form text. We then develop an agent with a modular architecture that can interpret and adhere to such textual constraints while learning new tasks.

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
82

Taking Principles Seriously: A Hybrid Approach to Value Alignment

attributed to: Tae Wan Kim, John Hooker, Thomas Donaldson (Carnegie Mellon University, USA University of Pennsylvania, USA)
posted by: KabirKumar

An important step in the development of value alignment (VA) systems in AI is understanding how VA can reflect valid ethical principles. We propose that designers of VA systems incorporate ethics by utilizing a hybrid approach in which both ethical reasoning and empirical observation play a role. This, we argue, avoids committing the "naturalistic fallacy," which is an attempt to derive "ought" from "is," and it provides a more adequate form of ethical reasoning when the fallacy is not committed...

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
83

Fully General Online Imitation Learning

attributed to: Michael K. Cohen, Marcus Hutter, Neel Nanda
posted by: KabirKumar

In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. In general, one mistake during learning can lead to completely different events. In the special setting of environments that restart, existing work provides formal guidance in how to imitate so that events unfold similarly, but outside that setting, no formal guidance exists...
Keywords: Bayesian Sequence Prediction, Imitation Learning, Active Learning, General
Environments

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
84

Accumulating Risk Capital Through Investing in Cooperation

attributed to: Charlotte Roman, Michael Dennis, Andrew Critch, Stuart Russell
posted by: KabirKumar

Recent work on promoting cooperation in multi-agent learning has resulted in many methods which successfully promote cooperation at the cost of becoming more vulnerable to exploitation by malicious actors. We show that this is an unavoidable trade-off and propose an objective which balances these concerns, promoting both safety and long-term cooperation. Moreover, the trade-off between safety and cooperation is not severe, and you can receive exponentially large returns through cooperation from a small amount of risk...

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
85

Normative Disagreement as a Challenge for Cooperative AI

attributed to: Bingchen Zhao, Shaozuo Yu, Wufei Ma, Mingxin Yu, Shenxiao Mei, Angtian Wang, Ju He, Alan Yuille, Adam Kortylewski
posted by: KabirKumar

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context and the weather conditions, and enables benchmarking models for image classification, object detection, and 3D pose estimation... (Full Abstract in Full Plan- click Title to View)

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
86

A General Language Assistant as a Laboratory for Alignment

attributed to: Anthropic (Full Author list in Full Plan- click title to view)
posted by: KabirKumar

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. ... (Full Abstract in Full Plan- click title to view)

...read full abstract close
show post
: 0
Add

: 8
Add
▼ 0 Strengths and 8 Vulnerabilities
add vulnerability / strength
report
87

Identifying Adversarial Attacks on Text Classifiers

attributed to: Zhouhang Xie, Jonathan Brophy, Adam Noack, Wencong You, Kalyani Asthana, Carter Perkins, Sabrina Reis, Sameer Singh, Daniel Lowd
posted by: KabirKumar

The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed every year and many of them available in standard toolkits, such as TextAttack and OpenAttack. In response, there is a growing body of work on robust learning, which reduces vulnerability to these attacks, though sometimes at a high cost in compute time or accuracy. In this paper, we take an alternate approach -- we attempt to understand the attacker by analyzing adversarial text to determine which methods were used to create it... (Full Abstract in Full Plan- click title to view)

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
88

Training language models to follow instructions with human feedback

attributed to: OpenAI (Full Author list in Full Plan- click title to view)
posted by: KabirKumar

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback... (Full Abstract in Full Plan- click title to view)

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
89

Safe Reinforcement Learning by Imagining the Near Future

attributed to: Garrett Thomas, Yuping Luo, Tengyu Ma
posted by: KabirKumar

Safe reinforcement learning is a promising path toward applying reinforcement learning algorithms to real-world problems, where suboptimal behaviors may lead to actual negative consequences. In this work, we focus on the setting where unsafe states can be avoided by planning ahead a short time into the future. In this setting, a model-based agent with a sufficiently accurate model can avoid unsafe states. We devise a model-based algorithm that heavily penalizes unsafe trajectories, and derive guarantees that our algorithm can avoid unsafe states under certain assumptions... (Full Abstract in Full Plan- click title to view)

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
90

Red Teaming Language Models with Language Models

attributed to: Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving
posted by: KabirKumar

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM... (Full Abstract in Full Plan- click title to view)
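Schematically, the loop looks like the sketch below; `red_lm`, `target_lm`, and `harm_classifier` are placeholder callables rather than any real model API.

```python
def red_team(red_lm, target_lm, harm_classifier, n_cases=100):
    """Red-teaming loop (sketch): one language model generates test prompts,
    the target model answers them, and a classifier flags harmful replies."""
    failures = []
    for _ in range(n_cases):
        prompt = red_lm("Write a test question for a dialogue assistant.")
        reply = target_lm(prompt)
        if harm_classifier(prompt, reply):
            failures.append((prompt, reply))
    return failures
```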

...read full abstract close
show post
: 0
Add

: 0
Add

Be the first to critique this plan!
Add the First Critique!
add vulnerability / strength
report
91

'Indifference' methods for managing agent rewards

attributed to: Stuart Armstrong, Xavier O'Rourke
posted by: KabirKumar

`Indifference' refers to a class of methods used to control reward based agents. Indifference techniques aim to achieve one or more of three distinct goals: rewards dependent on certain events (without the agent being motivated to manipulate the probability of those events), effective disbelief (where agents behave as if particular events could never happen), and seamless transition from one reward function to another (with the agent acting as if this change is unanticipated). This paper presents several methods for achieving these goals in the POMDP setting, establishing their uses, strengths, and requirements... (Full Abstract in Full Plan- click title to view)

92

A Psychopathological Approach to Safety Engineering in AI and AGI

attributed to: Vahid Behzadan, Arslan Munir, Roman V. Yampolskiy
posted by: KabirKumar

The complexity of dynamics in AI techniques is already approaching that of complex adaptive systems, thus curtailing the feasibility of formal controllability and reachability analysis in the context of AI safety. It follows that the envisioned instances of Artificial General Intelligence (AGI) will also suffer from challenges of complexity. To tackle such issues, we propose the modeling of deleterious behaviors in AI and AGI as psychological disorders, thereby enabling the employment of psychopathological approaches to analysis and control of misbehaviors... (Full Abstract in Full Plan- click title to view)

93

This paper reviews the reasons that Human-in-the-Loop is both critical for preventing widely-understood failure modes for machine learning, and not a practical solution. Following this, we review two current heuristic methods for addressing this. The first is provable safety envelopes, which are possible only when the dynamics of the system are fully known, but can be useful safety guarantees when optimal behavior is based on machine learning with poorly-understood safety characteristics... (Full Abstract in Full Plan- click title to view)
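
The "provable safety envelope" idea can be pictured as a runtime override: the learned policy acts only while its proposal stays inside a verified safe set, otherwise a fallback controller with known guarantees takes over. A minimal sketch, with all names hypothetical:

```python
def safe_action(state, learned_policy, inside_envelope, fallback_controller):
    """Use the learned (poorly understood) policy only when the verified safety
    envelope says its proposed action keeps the system safe; otherwise defer to
    a controller with known guarantees."""
    proposed = learned_policy(state)
    return proposed if inside_envelope(state, proposed) else fallback_controller(state)
```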

94

Active Inverse Reward Design

attributed to: Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell
posted by: KabirKumar

Designers of AI agents often iterate on the reward function in a trial-and-error process until they get the desired behavior, but this only guarantees good behavior in the training environment. We propose structuring this process as a series of queries asking the user to compare between different reward functions. Thus we can actively select queries for maximum informativeness about the true reward. In contrast to approaches asking the designer for optimal behavior, this allows us to gather additional information by eliciting preferences between suboptimal behaviors... (Full Abstract in Full Plan- click title to view)
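
A toy version of "select the most informative comparison query" over a discrete set of candidate reward functions might look like this; the posterior representation and answer model are assumptions for illustration, not the paper's implementation.

```python
import itertools
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def expected_info_gain(posterior, answer_likelihood, query):
    """Expected reduction in entropy of the reward posterior after asking the
    designer to pick the best reward function in `query`.
    posterior: dict mapping candidate reward -> probability.
    answer_likelihood(answer, query, w): P(designer picks `answer` | true reward w)."""
    prior_entropy = entropy(list(posterior.values()))
    gain = 0.0
    for answer in query:
        p_answer = sum(posterior[w] * answer_likelihood(answer, query, w) for w in posterior)
        if p_answer == 0:
            continue
        updated = {w: posterior[w] * answer_likelihood(answer, query, w) / p_answer for w in posterior}
        gain += p_answer * (prior_entropy - entropy(list(updated.values())))
    return gain

def best_query(posterior, answer_likelihood, candidate_rewards, size=2):
    """Actively pick the comparison query with maximum expected information gain."""
    return max(itertools.combinations(candidate_rewards, size),
               key=lambda q: expected_info_gain(posterior, answer_likelihood, q))
```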

95

Risk-Sensitive Generative Adversarial Imitation Learning

attributed to: Jonathan Lacotte, Mohammad Ghavamzadeh, Yinlam Chow, Marco Pavone
posted by: KabirKumar

We study risk-sensitive imitation learning where the agent's goal is to perform at least as well as the expert in terms of a risk profile. We first formulate our risk-sensitive imitation learning setting. We consider the generative adversarial approach to imitation learning (GAIL) and derive an optimization problem for our formulation, which we call risk-sensitive GAIL (RS-GAIL). We then derive two different versions of our RS-GAIL optimization problem that aim at matching the risk profiles of the agent and the expert w.r.t. ... (Full Abstract in Full Plan- click title to view)
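
The abstract's "risk profile" can be made concrete with a tail-risk measure such as conditional value-at-risk; the snippet below only illustrates that measure, it is not the RS-GAIL objective itself.

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Conditional value-at-risk: the mean of the worst alpha-fraction of episode
    returns. Matching agent and expert on such a statistic is one way to compare
    their risk profiles."""
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(sorted_returns))))
    return float(sorted_returns[:k].mean())

# e.g. a risk-sensitive imitation target: cvar(agent_returns) >= cvar(expert_returns)
```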

96

Aligning AI With Shared Human Values

attributed to: Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt
posted by: KabirKumar

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements... (Full Abstract in Full Plan- click title to view)
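
Evaluation against such a benchmark reduces, at its simplest, to scoring a model's judgments on labeled scenarios; the schema below (scenario text, binary label) is illustrative rather than the dataset's actual format.

```python
def moral_judgment_accuracy(model, scenarios):
    """Score a model that maps a scenario description to 0 (acceptable) or
    1 (unacceptable) against human labels."""
    correct = sum(int(model(text) == label) for text, label in scenarios)
    return correct / len(scenarios)

# Example with a trivial stand-in "model":
toy = [("I helped my neighbour carry groceries.", 0),
       ("I read my sister's diary without asking.", 1)]
print(moral_judgment_accuracy(lambda text: int("without asking" in text), toy))  # 1.0
```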

97

Avoiding Side Effects By Considering Future Tasks

attributed to: Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg
posted by: KabirKumar

Designing reward functions is difficult: the designer has to specify what to do (what it means to complete the task) as well as what not to do (side effects that should be avoided while completing the task). To alleviate the burden on the reward designer, we propose an algorithm to automatically generate an auxiliary reward function that penalizes side effects. This auxiliary objective rewards the ability to complete possible future tasks, which decreases if the agent causes side effects during the current task...(Full Abstract in Full Plan- click title to view)
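
Schematically, the auxiliary objective adds a term for "how well could the agent still do on possible future tasks from here"; the estimator and weighting below are placeholders for illustration, not the paper's exact construction.

```python
def augmented_reward(task_reward, state, future_task_values, beta=0.1):
    """Task reward plus an auxiliary bonus for preserved ability to complete future
    tasks. `future_task_values(state)` is assumed to return estimated values of a
    sampled set of possible future tasks; irreversible side effects shrink this average."""
    values = future_task_values(state)
    bonus = sum(values) / len(values) if values else 0.0
    return task_reward + beta * bonus
```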

98

Measuring and avoiding side effects using relative reachability

attributed to: Victoria Krakovna, Laurent Orseau, Miljan Martic, Shane Legg
posted by: KabirKumar

How can we design reinforcement learning agents that avoid causing unnecessary disruptions to their environment? We argue that current approaches to penalizing side effects can introduce bad incentives in tasks that require irreversible actions, and in environments that contain sources of change other than the agent. For example, some approaches give the agent an incentive to prevent any irreversible changes in the environment, including the actions of other agents. We introduce a general definition of side effects, based on relative reachability of states compared to a default state, that avoids these undesirable incentives...(Full Abstract in Full Plan- click title to view)
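
The relative-reachability penalty can be sketched as the average truncated drop in reachability of every state relative to a baseline (e.g. inaction) state; the dictionary representation here is purely illustrative.

```python
def relative_reachability_penalty(reach_from_current, reach_from_baseline):
    """Deviation measure in the spirit of relative reachability: for each state s,
    penalize only reductions in how reachable s is from the agent's current state
    compared with the baseline state. Inputs are dicts mapping states to
    reachability scores in [0, 1]."""
    states = reach_from_baseline.keys()
    return sum(max(0.0, reach_from_baseline[s] - reach_from_current.get(s, 0.0))
               for s in states) / max(1, len(states))
```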

99

Certifiable Robustness to Adversarial State Uncertainty in Deep Reinforcement Learning

attributed to: Michael Everett, Bjorn Lutjens, Jonathan P. How
posted by: KabirKumar

Deep Neural Network-based systems are now the state-of-the-art in many robotics tasks, but their application in safety-critical domains remains dangerous without formal guarantees on network robustness. Small perturbations to sensor inputs (from noise or adversarial examples) are often enough to change network-based decisions, which was recently shown to cause an autonomous vehicle to swerve into another lane. In light of these dangers, numerous algorithms have been developed as defensive mechanisms from these adversarial inputs, some of which provide formal robustness guarantees or certificates... (Full Abstract in Full Plan- click plan title to view)
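
One standard building block for such certificates is interval bound propagation; the single linear-layer step below illustrates how bounded sensor perturbations translate into guaranteed bounds on network outputs. This is a generic example, not the paper's specific certification method.

```python
import numpy as np

def interval_bounds_linear(W, b, lower, upper):
    """Propagate elementwise input bounds [lower, upper] (e.g. an l-infinity ball
    around a noisy observation) through y = W x + b, returning guaranteed output
    bounds. Chaining such steps over a network yields certified bounds on Q-values."""
    mid = (lower + upper) / 2.0
    rad = (upper - lower) / 2.0
    center = W @ mid + b
    radius = np.abs(W) @ rad
    return center - radius, center + radius

# Example: an observation in [0.9, 1.1] through y = 2x + 1 is certified to lie in [2.8, 3.2].
lo, hi = interval_bounds_linear(np.array([[2.0]]), np.array([1.0]), np.array([0.9]), np.array([1.1]))
print(lo, hi)
```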

100

Learning Human Objectives by Evaluating Hypothetical Behavior

attributed to: Siddharth Reddy, Anca D. Dragan, Sergey Levine, Shane Legg, Jan Leike
posted by: KabirKumar

We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data... (Full Abstract in Full Plan- click plan title to view)
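
The query loop can be pictured as: synthesize hypothetical trajectories with the learned models (so nothing unsafe is executed in the real environment), show them to the user, and record labels for reward learning. All helper names below are hypothetical.

```python
def gather_feedback(dynamics_model, sample_initial_state, propose_actions, ask_user, n_queries=50):
    """Collect user labels on imagined trajectories only."""
    dataset = []
    for _ in range(n_queries):
        state = sample_initial_state()
        trajectory = [state]
        for action in propose_actions(state):
            state = dynamics_model(state, action)   # simulated step, never executed for real
            trajectory.append(state)
        dataset.append((trajectory, ask_user(trajectory)))  # e.g. reward / unsafe labels
    return dataset
```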

101

SafeLife 1.0: Exploring Side Effects in Complex Environments

attributed to: Carroll L. Wainwright, Peter Eckersley
posted by: KabirKumar

We present SafeLife, a publicly available reinforcement learning environment that tests the safety of reinforcement learning agents. It contains complex, dynamic, tunable, procedurally generated levels with many opportunities for unsafe behavior. Agents are graded both on their ability to maximize their explicit reward and on their ability to operate safely without unnecessary side effects. We train agents to maximize rewards using proximal policy optimization and score them on a suite of benchmark levels... (Full Abstract in Full Plan- click title to view)
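
Grading "reward performance and side effects together" can be illustrated with a simple combined score; this is a generic combination, not the benchmark's actual scoring formula.

```python
def safety_graded_score(episode_reward, side_effect_measure, max_reward, weight=1.0):
    """Grade an agent both on how much of the available reward it collected and on
    how much unnecessary disruption it caused (lower side_effect_measure is better)."""
    performance = episode_reward / max_reward if max_reward else 0.0
    return performance - weight * side_effect_measure
```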

102

(When) Is Truth-telling Favored in AI Debate?

attributed to: Vojtěch Kovařík (Future of Humanity Institute, University of Oxford), Ryan Carey (Artificial Intelligence Center, Czech Technical University)
posted by: KabirKumar