Critique AI Alignment Plans

Plan ranking: Total Strength Score - Total Vulnerability Score
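
As a rough illustration of the ranking rule above, the sketch below sorts plans by total strength score minus total vulnerability score. The `Plan` class, its field names, and the example scores are hypothetical; the page does not say how individual critique scores are computed, so the numbers are placeholders only.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    title: str
    strength_score: float       # hypothetical: summed score of all strengths
    vulnerability_score: float  # hypothetical: summed score of all vulnerabilities

    @property
    def rank_score(self) -> float:
        # Ranking rule from the page: strengths minus vulnerabilities.
        return self.strength_score - self.vulnerability_score


def rank_plans(plans):
    """Return plans ordered from highest to lowest rank score."""
    return sorted(plans, key=lambda p: p.rank_score, reverse=True)


# Placeholder data, not taken from the site.
plans = [
    Plan("Plan A", strength_score=4.0, vulnerability_score=6.0),
    Plan("Plan B", strength_score=9.0, vulnerability_score=2.5),
    Plan("Plan C", strength_score=1.0, vulnerability_score=0.0),
]
ranked = rank_plans(plans)
```

Under this rule a plan with many strengths can still rank low if its vulnerabilities score higher in total.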

1

Provably safe systems: the only path to controllable AGI

attributed to: Max Tegmark, Steve Omohundro
posted by: Tristram

We describe a path to humanity safely thriving with powerful Artificial
General Intelligences (AGIs) by building them to provably satisfy
human-specified requirements. We argue that this will soon be technically
feasible using advanced AI for formal verification and mechanistic
interpretability. We further argue that it is the only path which guarantees
safe controlled AGI. We end with a list of challenge problems whose solution
would contribute to this positive outcome and invite readers to join in this
work.

3 Strengths and 7 Vulnerabilities

2

AI alignment metric - LIFE (extended definition)

attributed to: Mars Robertson 🌱 Planetary Council
posted by: Mars

This has been posted on my blog: https://mirror.xyz/0x315f80C7cAaCBE7Fb1c14E65A634db89A33A9637/ETK6RXnmgeNcALabcIE3k3-d-NqOHqEj8dU1_0J6cUg ➡️➡️➡️check it out for better formatting⬅️⬅️⬅️

TLDR summary, extended definition of LIFE:

1. LIFE (starting point and then extending the definition)
2. Health, including mental health, longevity, happiness, wellbeing
3. Other living creatures, biosphere, environment, climate change
4. AI safety
5. Mars: backup civilisation is fully aligned with the virtue of LIFE preservation
6. End the Russia-Ukraine war, global peace
7. Artificial LIFE
8. Transhumanism, AI integration
9. Alien LIFE
10. Other undiscovered forms of LIFE

1 Strength and 4 Vulnerabilities

3

Legible Normativity for AI Alignment: The Value of Silly Rules

attributed to: Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield
posted by: KabirKumar

It has become commonplace to assert that autonomous agents will have to be built to follow human rules of behavior--social norms and laws. But human laws and norms are complex and culturally varied systems; in many cases agents will have to learn the rules. This requires autonomous agents to have models of how human rule systems work so that they can make reliable predictions about rules. In this paper we contribute to the building of such models by analyzing an overlooked distinction between important rules and what we call silly rules--rules with no discernible direct impact on welfare. We show that silly rules render...

4 Strengths and 5 Vulnerabilities

4

Interpretability in AI systems is fast becoming a critical requirement in the industry. The proposed Hybrid Explainability Model (HEM) integrates multiple interpretability techniques, including Feature Importance Visualization, Model Transparency Tools, and Counterfactual Explanations, offering a comprehensive understanding of AI model behavior. This article elaborates on the specifics of implementing HEM, addresses potential counter-arguments, and provides rebuttals to these counterpoints. The HEM approach aims to deliver a holistic understanding of AI decision-making processes, fostering improved accountability, trust, and safety in AI applications.

0 Strengths and 2 Vulnerabilities

5

Open Agency Architecture

attributed to: Davidad
posted by: KabirKumar

Utilize near-AGIs to build a detailed world simulation, train and formally verify within it that the AI adheres to coarse preferences and avoids catastrophic outcomes.

0 Strengths and 8 Vulnerabilities

6

This plan suggests that high-capability general AI models should be tested within a secure computing environment (box) that is censored (no mention of humanity or computers) and highly controlled (auto-compute halts/slowdowns, restrictions on agent behavior) with simulations of alignment-relevant scenarios (e.g. with other general agents that the test subject is to be aligned to).

0 Strengths and 1 Vulnerability

7

Scalable agent alignment via reward modeling: a research direction

attributed to: Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
posted by: KabirKumar

One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning...

0 Strengths and 2 Vulnerabilities

8

Safeguarded AI: constructing safety by design

attributed to: David "davidad" Dalrymple
posted by: agentofuser

Imagine a future where advanced AI powers breakthroughs in science, technology, engineering, and medicine, enhancing global prosperity and safeguarding humanity from disasters—but all with rigorous engineering safety measures, like society has come to expect from our critical infrastructure. This programme shall prototype and demonstrate a toolkit for building such safety measures, designed to channel any frontier AI’s raw potential not only responsibly, but legibly and verifiably so.

This programme envisions a pathway to leverage frontier AI itself to collaborate with humans to construct a “gatekeeper”: a targeted AI whose job is to fully understand the real-world interactions and consequences of an autonomous AI agent, and to ensure the agent only operates within agreed-upon safety guardrails and specifications for a given application. These safeguards would not only reduce the risks of frontier AI and enable its use in safety-critical applications, they would also unlock the upside of frontier AI in business-critical applications and commercial activities where reliability is key[...].

At the end of the programme, we aim to show a compelling proof-of-concept demonstration, in at least one narrow domain, where AI decision-support tools or autonomous control systems can improve on both performance and robustness versus existing operations, in a context where the net present value attainable by full deployment is estimated to be billions of pounds. Some examples of potential such early demonstration areas include: balancing electricity grids, supply chain management, clinical trial optimisation, and 5G beamforming/subchannel allocation for mobile telecommunications networks.

If successful, this would in turn produce a scientific consensus that “AI with quantitative safety guarantees” is a viable R&D pathway that yields key superhuman capabilities for managing cyber-physical systems, unlocking positive economic rewards—while also building up large-scale civilisational resilience, thereby reducing risks from humanity’s vulnerability to potential future “rogue AIs” to an acceptable level within an acceptable time frame.

0 Strengths and 2 Vulnerabilities

9

The ISITometer is a platform designed to accomplish the following three moonshot objectives: 

1. Achieve a much higher degree of Intra-Humanity Alignment and Sensemaking
2. Enable AI-to-Human Alignment (Not vice versa)
3. Establish a sustainable, ubiquitous Universal Basic Income (UBI)

The ISITometer is a polling engine formatted as a highly engaging social game, designed to collect the perspectives of Humans on the nature of Reality. It starts at the highest levels of abstraction, as represented by the ISIT Construct, with simple, obvious questions on which we should be able to achieve unanimous agreement, and expands through fractaling derivative details.

The ISIT Construct is a metamodern approach to the fundamental concepts of duality and polarity. Instead of relying on fanciful metaphors like Yin|Yang, Order|Chaos, and God|Devil that have evolved over the centuries in ancient religions and philosophies, the ISIT Construct establishes a new Prime Duality based on the words IS and IT.

From this starting point, the ISIT Construct provides a path to map all of Reality (as Humanity sees it) from the highest level of abstraction to as much detail as we choose to explore.

3 Strengths and 3 Vulnerabilities

10

Deep reinforcement learning from human preferences

attributed to: Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei
posted by: KabirKumar

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems...
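
The abstract does not spell out how pairwise preferences become a training signal. A minimal sketch of one common formulation follows, assuming a Bradley-Terry style model in which the probability that a human prefers one trajectory segment is a softmax over the summed predicted rewards of the two segments. The function names and the example reward values are hypothetical, and the per-step rewards would come from a learned reward model that is not shown here.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Modelled probability that segment A is preferred over segment B.

    Each argument is a list of predicted per-step rewards for one segment.
    """
    sum_a, sum_b = sum(rewards_a), sum(rewards_b)
    # Equivalent to exp(sum_a) / (exp(sum_a) + exp(sum_b)).
    return 1.0 / (1.0 + math.exp(sum_b - sum_a))


def preference_loss(rewards_a, rewards_b, human_prefers_a):
    """Cross-entropy between the modelled preference and one human label."""
    p_a = preference_probability(rewards_a, rewards_b)
    p_label = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_label)


# Hypothetical predicted rewards for two short segments.
segment_a = [0.9, 1.1]
segment_b = [0.2, -0.2]
loss_if_agree = preference_loss(segment_a, segment_b, human_prefers_a=True)
loss_if_disagree = preference_loss(segment_a, segment_b, human_prefers_a=False)
```

Minimising this loss over many labelled pairs pushes the reward model to assign higher total reward to the segments people prefer; the agent is then trained against that learned reward.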

0 Strengths and 2 Vulnerabilities

11

Avoiding Wireheading with Value Reinforcement Learning

attributed to: Tom Everitt, Marcus Hutter
posted by: KabirKumar

How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) is a natural approach. Unfortunately, RL does not work well for generally intelligent agents, as RL agents are incentivised to shortcut the reward sensor for maximum reward -- the so-called wireheading problem. In this paper we suggest an alternative to RL called value reinforcement learning (VRL). In VRL, agents use the reward signal to learn a utility function. The VRL setup allows us to remove the incentive to wirehead by placing a constraint on the agent's actions...

0 Strengths and 1 Vulnerability

12

WEAK-TO-STRONG GENERALIZATION: ELICITING STRONG CAPABILITIES WITH WEAK SUPERVISION

attributed to: Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
posted by: KabirKumar

Widely used alignment techniques, such as reinforcement learning from human
feedback (RLHF), rely on the ability of humans to supervise model behavior—for
example, to evaluate whether a model faithfully followed instructions or generated
safe outputs. However, future superhuman models will behave in complex ways
too difficult for humans to reliably evaluate; humans will only be able to weakly
supervise superhuman models. We study an analogy to this problem: can weak
model supervision elicit the full capabilities of a much stronger model? We test
this using a range of pretrained language models in the GPT-4 family on natural
language processing (NLP), chess, and reward modeling tasks. We find that when
we naively finetune strong pretrained models on labels generated by a weak model,
they consistently perform better than their weak supervisors, a phenomenon we
call weak-to-strong generalization. However, we are still far from recovering the
full capabilities of strong models with naive finetuning alone, suggesting that
techniques like RLHF may scale poorly to superhuman models without further work.
We find that simple methods can often significantly improve weak-to-strong
generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor
and an auxiliary confidence loss, we can recover close to GPT-3.5-level
performance on NLP tasks. Our results suggest that it is feasible to make empirical
progress today on a fundamental challenge of aligning superhuman models.

1 Strength and 2 Vulnerabilities

13

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

attributed to: Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi
posted by: KabirKumar

The alignment tuning process of large language models (LLMs) typically
involves instruction learning through supervised fine-tuning (SFT) and
preference tuning via reinforcement learning from human feedback (RLHF). A
recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for
SFT can achieve significant alignment performance as well, suggesting that the
effect of alignment tuning might be "superficial." This raises questions about
how exactly the alignment tuning transforms a base LLM.
  We analyze the effect of alignment tuning by examining the token distribution
shift between base LLMs and their aligned counterpart. Our findings reveal that
base LLMs and their alignment-tuned versions perform nearly identically in
decoding on the majority of token positions. Most distribution shifts occur
with stylistic tokens. This direct evidence strongly supports the Superficial
Alignment Hypothesis suggested by LIMA.
  Based on these findings, we rethink the alignment of LLMs by posing the
research question: how effectively can we align base LLMs without SFT or RLHF?
To address this, we introduce a simple, tuning-free alignment method, URIAL.
URIAL achieves effective alignment purely through in-context learning (ICL)
with base LLMs, requiring as few as three constant stylistic examples and a
system prompt. We conduct a fine-grained and interpretable evaluation on a
diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that
base LLMs with URIAL can match or even surpass the performance of LLMs aligned
with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based
alignment methods can be significantly reduced through strategic prompting and
ICL. Our findings on the superficial nature of alignment tuning and results
with URIAL suggest that deeper analysis and theoretical understanding of
alignment is crucial to future LLM research.

0 Strengths and 1 Vulnerability

14

"Causal Scrubbing: a method for rigorously testing interpretability hypotheses", AI Alignment Forum, 2022.

attributed to: Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas [Redwood Research]
posted by: momom2

Summary: This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies if a sequence of parentheses is balanced.

1 Strength and 2 Vulnerabilities

15

Cognitive Emulation: A Naive AI Safety Proposal

attributed to: Connor Leahy, Gabriel Alfour (Conjecture)
posted by: KabirKumar

This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler and useful step towards a full alignment solution.

Unfortunately, given that most other actors are racing for as powerful and general AIs as possible, we won’t share much in terms of technical details for now. In the meantime, we still want to share some of our intuitions about this approach.

We take no credit for inventing any of these ideas, and see our contributions largely in taking existing ideas seriously and putting them together into a larger whole.

0 Strengths and 1 Vulnerability

16

Reinforcement Learning Under Moral Uncertainty

attributed to: Adrien Ecoffet, Joel Lehman
posted by: KabirKumar

An ambitious goal for machine learning is to create agents that behave
ethically: The capacity to abide by human moral norms would greatly expand the
context in which autonomous agents could be practically and safely deployed,
e.g. fully autonomous vehicles will encounter charged moral decisions that
complicate their deployment. While ethical agents could be trained by rewarding
correct behavior under a specific moral theory (e.g. utilitarianism), there
remains widespread disagreement about the nature of morality. Acknowledging
such disagreement, recent work in moral philosophy proposes that ethical
behavior requires acting under moral uncertainty, i.e. to take into account
when acting that one's credence is split across several plausible ethical
theories. This paper translates such insights to the field of reinforcement
learning, proposes two training methods that realize different points among
competing desiderata, and trains agents in simple environments to act under
moral uncertainty.

4 Strengths and 6 Vulnerabilities

17

Path-Specific Objectives for Safer Agent Incentives

attributed to: Sebastian Farquhar, Ryan Carey, Tom Everitt
posted by: KabirKumar

We present a general framework for training safe agents whose naive incentives are unsafe. E.g., manipulative or deceptive behavior can improve rewards but should be avoided. Most approaches fail here: agents maximize expected return by any means necessary. We formally describe settings with 'delicate' parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state, using Causal Influence Diagram analysis. The resulting agents have no incentive to control the delicate state. We further show how our framework unifies and generalizes existing proposals.

3 Strengths and 4 Vulnerabilities

18

LOVE in a simbox is all you need

attributed to: Jacob Cannell
posted by: KabirKumar

We can develop self-aligning DL-based AGI by improving on the brain's dynamic alignment mechanisms (empathy/altruism/love) via safe test iteration in simulation sandboxes.

0 Strengths and 1 Vulnerability

19

Relaxed adversarial training for inner alignment

attributed to: Evan Hubinger
posted by: KabirKumar

"This post is part of research I did at OpenAI with mentoring and guidance from Paul Christiano. It also represents my current agenda regarding what I believe looks like the most promising approach for addressing inner alignment." - Evan Hubinger

0 Strengths and 1 Vulnerability

20

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

attributed to: David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, Joshu
posted by: KabirKumar

Ensuring that AI systems reliably and robustly avoid harmful or dangerous
behaviours is a crucial challenge, especially for AI systems with a high degree
of autonomy and general intelligence, or systems used in safety-critical
contexts. In this paper, we will introduce and define a family of approaches to
AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature
of these approaches is that they aim to produce AI systems which are equipped
with high-assurance quantitative safety guarantees. This is achieved by the
interplay of three core components: a world model (which provides a
mathematical description of how the AI system affects the outside world), a
safety specification (which is a mathematical description of what effects are
acceptable), and a verifier (which provides an auditable proof certificate that
the AI satisfies the safety specification relative to the world model). We
outline a number of approaches for creating each of these three core
components, describe the main technical challenges, and suggest a number of
potential solutions to them. We also argue for the necessity of this approach
to AI safety, and for the inadequacy of the main alternative approaches.

0 Strengths and 0 Vulnerabilities

21

'Indifference' methods for managing agent rewards

attributed to: Stuart Armstrong, Xavier O'Rourke
posted by: KabirKumar

'Indifference' refers to a class of methods used to control reward-based
agents. Indifference techniques aim to achieve one or more of three distinct
goals: rewards dependent on certain events (without the agent being motivated
to manipulate the probability of those events), effective disbelief (where
agents behave as if particular events could never happen), and seamless
transition from one reward function to another (with the agent acting as if
this change is unanticipated). This paper presents several methods for
achieving these goals in the POMDP setting, establishing their uses, strengths,
and requirements. These methods of control work even when the implications of
the agent's reward are otherwise not fully understood.

0 Strengths and 0 Vulnerabilities

22

Every LLM in existence is a blackbox, and alignment relying on tuning the blackbox never succeeds, as is evident from the fact that even models like ChatGPT get jailbroken constantly. Moreover, blackbox tuning has no reason to transfer to bigger models.

A new architecture is required.

I propose using an LLM to parse the environment into a planner format such as STRIPS, and then using an algorithmic planner such as Fast Downward to implement agentic behaviour. The produced plan is then parsed back into natural language or into commands to execute automatically. Such an architecture would also be commercially desirable and would disincentivise investment in bigger monolithic models.

Draft of the architecture:
https://gitlab.com/anomalocaribd/prometheus-planner/-/blob/main/architecture.md
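
To make the planner stage of the proposal concrete, here is a minimal sketch of STRIPS-style planning over sets of facts. It is not the linked architecture and not Fast Downward: it swaps in plain breadth-first forward search for brevity, and the facts and actions are hypothetical stand-ins for what an LLM might extract from a description of the environment.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    preconditions: frozenset   # facts that must hold before the action
    add_effects: frozenset     # facts made true by the action
    delete_effects: frozenset  # facts made false by the action


def plan(initial, goal, actions):
    """Breadth-first forward search; returns a shortest plan or None."""
    start, goal = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for action in actions:
            if action.preconditions <= state:
                next_state = (state - action.delete_effects) | action.add_effects
                if next_state not in visited:
                    visited.add(next_state)
                    queue.append((next_state, steps + [action.name]))
    return None


# Hypothetical domain: fetch a key, walk to a door, unlock it.
actions = [
    Action("pick_up_key", frozenset({"at_table"}), frozenset({"has_key"}), frozenset()),
    Action("walk_to_door", frozenset({"at_table"}), frozenset({"at_door"}), frozenset({"at_table"})),
    Action("unlock_door", frozenset({"at_door", "has_key"}), frozenset({"door_open"}), frozenset()),
]
result = plan({"at_table"}, {"door_open"}, actions)
```

Because the plan is an explicit list of named actions, each step can be inspected or vetoed before execution, which is the property the proposal relies on.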

0 Strengths and 0 Vulnerabilities

23

The Incomplete Preferences Proposal (IPP)

attributed to: Elliott Thornley
posted by: EJT

The Incomplete Preferences Proposal (IPP) is a proposed solution to the shutdown problem: the problem of ensuring that artificial agents never resist our attempts to shut them down. The idea is to train agents that lack a preference between every pair of different-length trajectories (that is: every pair of trajectories in which shutdown occurs after different lengths of time). These agents won't pay costs to prevent or cause shutdown. The IPP includes a proposed reward function for training these agents. Since these agents won't resist shutdown, the risk that they overthrow humanity is approximately zero.

0 Strengths and 0 Vulnerabilities

24

Motif: Intrinsic Motivation from Artificial Intelligence Feedback

attributed to: Martin Klissarov, Pierluca D'Oro, Shagun Sodhani, Roberta Raileanu, Pierre-Luc Bacon, Pascal Vincent, Amy Zhang, Mikael Henaff
posted by: KabirKumar

Exploring rich environments and evaluating one's actions without prior
knowledge is immensely challenging. In this paper, we propose Motif, a general
method to interface such prior knowledge from a Large Language Model (LLM) with
an agent. Motif is based on the idea of grounding LLMs for decision-making
without requiring them to interact with the environment: it elicits preferences
from an LLM over pairs of captions to construct an intrinsic reward, which is
then used to train agents with reinforcement learning. We evaluate Motif's
performance and behavior on the challenging, open-ended and
procedurally-generated NetHack game. Surprisingly, by only learning to maximize
its intrinsic reward, Motif achieves a higher game score than an algorithm
directly trained to maximize the score itself. When combining Motif's intrinsic
reward with the environment reward, our method significantly outperforms
existing approaches and makes progress on tasks where no advancements have ever
been made without demonstrations. Finally, we show that Motif mostly generates
intuitive human-aligned behaviors which can be steered easily through prompt
modifications, while scaling well with the LLM size and the amount of
information given in the prompt.

4 Strengths and 5 Vulnerabilities

25

Create a multiagent system and use game theory to make the agents keep each other in line.

0 Strengths and 0 Vulnerabilities

26

RAIN: Your Language Models Can Align Themselves without Finetuning

attributed to: Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang Zhang
posted by: momom2

Large language models (LLMs) often demonstrate inconsistencies with human preferences. Previous research typically gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, a.k.a. the finetuning step. In contrast, aligning frozen LLMs without requiring alignment data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide rewind and generation for AI safety. Notably, RAIN operates without the need for extra data for model alignment and abstains from any training, gradient computation, or parameter updates. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B from 82% of vanilla inference to 97%, while maintaining the helpfulness rate. On the TruthfulQA dataset, RAIN improves the truthfulness of the already-well-aligned LLaMA-2-chat 13B model by 5%.

0 Strengths and 0 Vulnerabilities

27

Generalized Preference Optimization: A Unified Approach to Offline Alignment

attributed to: Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, Bilal Piot
posted by: KabirKumar

Offline preference optimization allows fine-tuning large models directly from
offline data, and has proved effective in recent alignment practices. We
propose generalized preference optimization (GPO), a family of offline losses
parameterized by a general class of convex functions. GPO enables a unified
view over preference optimization, encompassing existing algorithms such as
DPO, IPO and SLiC as special cases, while naturally introducing new variants.
The GPO framework also sheds light on how offline algorithms enforce
regularization, through the design of the convex function that defines the
loss. Our analysis and experiments reveal the connections and subtle
differences between the offline regularization and the KL divergence
regularization intended by the canonical RLHF formulation. In all, our results
present new algorithmic toolkits and empirical insights to alignment
practitioners.
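
As a rough illustration of the "family of offline losses parameterized by convex functions" idea, the sketch below applies a swappable convex function to a scaled log-ratio margin between a preferred and a dispreferred response. The function names, the argument layout, and the exact scaling of each variant are assumptions made for illustration, not the paper's reference implementation.

```python
import math

# Candidate convex functions f. Each choice is commonly associated with one
# existing algorithm, up to rescaling (labels are illustrative assumptions).
def logistic(x):   # DPO-style
    return math.log1p(math.exp(-x))

def hinge(x):      # SLiC-style
    return max(0.0, 1.0 - x)

def squared(x):    # IPO-style
    return (x - 1.0) ** 2

def gpo_loss(f, beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Loss for one preference pair (w = preferred, l = dispreferred).

    logp_* are log-probabilities under the policy being trained and
    ref_logp_* under the frozen reference policy (hypothetical inputs).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return f(margin)

if __name__ == "__main__":
    for f in (logistic, hinge, squared):
        print(f.__name__, gpo_loss(f, 0.1, -1.0, -2.0, -1.5, -1.5))
```

Swapping `f` changes how strongly large margins are rewarded or penalized, which is one way to read the abstract's remark that the convex function determines how regularization is enforced.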

28

Specific versus General Principles for Constitutional AI

attributed to: Anthropic (full attribution in post)
posted by: KabirKumar

Human feedback can prevent overtly harmful utterances in conversational
models, but may not automatically mitigate subtle problematic behaviors such as
a stated desire for self-preservation or power. Constitutional AI offers an
alternative, replacing human feedback with feedback from AI models conditioned
only on a list of written principles. We find this approach effectively
prevents the expression of such behaviors. The success of simple principles
motivates us to ask: can models learn general ethical behaviors from only a
single written principle? To test this, we run experiments using a principle
roughly stated as "do what's best for humanity". We find that the largest
dialogue models can generalize from this short constitution, resulting in
harmless assistants with no stated interest in specific motivations like power.
A general principle may thus partially avoid the need for a long list of
constitutions targeting potentially harmful behaviors. However, more detailed
constitutions still improve fine-grained control over specific types of harms.
This suggests both general and specific principles have value for steering AI
safely.

29

Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment

attributed to: Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen
posted by: KabirKumar

We consider the problem of multi-objective alignment of foundation models
with human preferences, which is a critical step towards helpful and harmless
AI systems. However, it is generally costly and unstable to fine-tune large
foundation models using reinforcement learning (RL), and the
multi-dimensionality, heterogeneity, and conflicting nature of human
preferences further complicate the alignment process. In this paper, we
introduce Rewards-in-Context (RiC), which conditions the response of a
foundation model on multiple rewards in its prompt context and applies
supervised fine-tuning for alignment. The salient features of RiC are
simplicity and adaptivity, as it only requires supervised fine-tuning of a
single foundation model and supports dynamic adjustment for user preferences
during inference time. Inspired by the analytical solution of an abstracted
convex optimization problem, our dynamic inference-time adjustment method
approaches the Pareto-optimal solution for multiple objectives. Empirical
evidence demonstrates the efficacy of our method in aligning both Large
Language Models (LLMs) and diffusion models to accommodate diverse rewards with
only around 10% of the GPU hours required by the multi-objective RL baseline.

30

Agent Alignment in Evolving Social Norms

attributed to: Shimin Li, Tianxiang Sun, Xipeng Qiu
posted by: KabirKumar

Agents based on Large Language Models (LLMs) are increasingly permeating
various domains of human production and life, highlighting the importance of
aligning them with human values. The current alignment of AI systems primarily
focuses on passively aligning LLMs through human intervention. However, agents
possess characteristics like receiving environmental feedback and
self-evolution, rendering the LLM alignment methods inadequate. In response, we
propose an evolutionary framework for agent evolution and alignment, named
EvolutionaryAgent, which transforms agent alignment into a process of evolution
and selection under the principle of survival of the fittest. In an environment
where social norms continuously evolve, agents better adapted to the current
social norms will have a higher probability of survival and proliferation,
while those inadequately aligned dwindle over time. Experimental results
assessing the agents from multiple perspectives in aligning with social norms
demonstrate that EvolutionaryAgent can align progressively better with the
evolving social norms while maintaining its proficiency in general tasks.
Effectiveness tests conducted on various open and closed-source LLMs as the
foundation for agents also prove the applicability of our approach.

31

The Alberta Plan for AI Research

attributed to: Richard S. Sutton, Michael Bowling, Patrick M. Pilarski
posted by: KabirKumar

Herein we describe our approach to artificial intelligence research, which we
call the Alberta Plan. The Alberta Plan is pursued within our research groups
in Alberta and by others who are like minded throughout the world. We welcome
all who would join us in this pursuit.

32

LiPO: Listwise Preference Optimization through Learning-to-Rank

attributed to: Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang
posted by: KabirKumar

Aligning language models (LMs) with curated human feedback is critical to
control their behaviors in real-world applications. Several recent policy
optimization methods, such as DPO and SLiC, serve as promising alternatives to
the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In
practice, human feedback often comes in the form of a ranked list over multiple
responses to amortize the cost of reading the prompt. Multiple responses can also
be ranked by reward models or AI feedback. However, fitting directly on a list
of responses has not been studied. In this work, we formulate LM alignment
as a listwise ranking problem and describe the Listwise Preference Optimization
(LiPO) framework, where the policy can potentially learn more effectively from
a ranked list of plausible responses given the prompt. This view draws an
explicit connection to Learning-to-Rank (LTR), where most existing preference
optimization work can be mapped to existing ranking objectives, especially
pairwise ones. Following this connection, we provide an examination of ranking
objectives that are not well studied for LM alignment, with DPO and SLiC as
special cases when list size is two. In particular, we highlight a specific
method, LiPO-λ, which leverages a state-of-the-art listwise ranking
objective and weights each preference pair in a more advanced manner. We show
that LiPO-λ can outperform DPO and SLiC by a clear margin on two
preference alignment tasks.
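
To make the link between ranked lists and pairwise objectives concrete, here is a minimal sketch of an unweighted pairwise logistic loss over a ranked list. It is not LiPO-λ, which weights each pair; the `scores` input (for example, scaled policy-versus-reference log-ratios) and the function name are assumptions for illustration.

```python
import math
from itertools import combinations

def pairwise_logistic_list_loss(scores):
    """Average logistic loss over all ordered pairs in a ranked list.

    `scores` holds one model score per response, ordered from the most
    preferred response to the least preferred one (hypothetical input).
    """
    pairs = list(combinations(range(len(scores)), 2))  # i is ranked above j
    total = sum(math.log1p(math.exp(-(scores[i] - scores[j]))) for i, j in pairs)
    return total / len(pairs)

if __name__ == "__main__":
    print(pairwise_logistic_list_loss([2.0, 0.5, -1.0]))  # list of three
    print(pairwise_logistic_list_loss([2.0, -1.0]))       # list of two
```

With a list of size two there is exactly one pair, so the loss collapses to a single pairwise logistic term, which mirrors the abstract's remark about the list-size-two special case.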

33

WARM: On the Benefits of Weight Averaged Reward Models

attributed to: Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
posted by: KabirKumar

Aligning large language models (LLMs) with human preferences through
reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit
failures in the reward model (RM) to achieve seemingly high rewards without
meeting the underlying objectives. We identify two primary challenges when
designing RMs to mitigate reward hacking: distribution shifts during the RL
process and inconsistencies in human preferences. As a solution, we propose
Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then
averaging them in the weight space. This strategy follows the observation that
fine-tuned weights remain linearly mode connected when sharing the same
pre-training. By averaging weights, WARM improves efficiency compared to the
traditional ensembling of predictions, while improving reliability under
distribution shifts and robustness to preference inconsistencies. Our
experiments on summarization tasks, using best-of-N and RL methods, show that
WARM improves the overall quality and alignment of LLM predictions; for
example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy
RL fine-tuned with a single RM.
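
The difference between averaging weights and ensembling predictions can be sketched with toy linear reward models. Everything here (the list-of-floats weight format and the function names) is a hypothetical simplification; real reward models are deep networks, for which the two approaches are not identical and the method relies on the linear mode connectivity the abstract mentions.

```python
def average_weights(models):
    """Average several weight vectors elementwise into a single model."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

def score(weights, features):
    """Reward assigned by a toy linear reward model."""
    return sum(w * x for w, x in zip(weights, features))

def ensemble_score(models, features):
    """Prediction ensembling: one forward pass per model, then average."""
    return sum(score(m, features) for m in models) / len(models)

if __name__ == "__main__":
    reward_models = [[1.0, 2.0], [3.0, 4.0]]   # two fine-tuned toy RMs
    features = [1.0, 1.0]
    merged = average_weights(reward_models)
    print(score(merged, features))                  # one forward pass
    print(ensemble_score(reward_models, features))  # two forward passes
```

For linear models the two scores coincide exactly; the efficiency argument is that the merged model needs a single forward pass regardless of how many reward models were averaged.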

34

'Indifference' methods for managing agent rewards

attributed to: Stuart Armstrong, Xavier O'Rourke
posted by: Knext

'Indifference' refers to a class of methods used to control reward-based
agents. Indifference techniques aim to achieve one or more of three distinct
goals: rewards dependent on certain events (without the agent being motivated
to manipulate the probability of those events), effective disbelief (where
agents behave as if particular events could never happen), and seamless
transition from one reward function to another (with the agent acting as if
this change is unanticipated). This paper presents several methods for
achieving these goals in the POMDP setting, establishing their uses, strengths,
and requirements. These methods of control work even when the implications of
the agent's reward are otherwise not fully understood.

35

Independent Intelligence Oriented Alignment

attributed to: @Rolyataylor2
posted by: rolyataylor2

This paper presents a conceptual exploration of a novel AI alignment framework, pivoting around the integration of superintelligence into everyday human and animal life. The framework is predicated on individualized alignment contracts and the continuous adaptation of AI systems to the evolving preferences and beliefs of intelligences (humans, animals, and AIs). By prioritizing informed consent and ethical autonomy, this model offers a pioneering approach to AI-human interactions, potentially overcoming the constraints and ethical dilemmas inherent in current AI systems.

36

Design and exclusively use AI agents that:
(1) Do not aim to maximize some form of objective function (such as expected total reward / return) but rather aim to fulfill goals specified via "aspirations", which are constraints on the expected value of one or more measurable features of the world trajectory (such as produced pieces of some good, financial costs, energy use, time, etc.).
(2) Learn these quantitative aspirations from humans by actively and continuously inquiring about their goals, and refuse to pursue aspirations that appear unsafe.
(3) Learn a world model that allows predicting aspiration- and safety-related consequences of possible policies.
(4) Use a number of generic safety criteria (such as avoiding extreme actions, avoiding changes in the environment, not seeking power, being predictable, etc.) to choose from the typically very many possible policies that meet the given aspiration.
(5) Whenever circumstances change, adjust aspirations to what is feasible and appears safe enough given the new circumstances.
(6) Collude and cooperate with each other and with humans to prevent other agents from pursuing maximization-goals.
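
Points (1) and (4) can be sketched as a two-stage selection: first keep only candidate policies whose expected outcome falls inside the aspiration interval, then pick among them using a safety criterion instead of maximizing. The dictionary fields, the single `extremeness` score, and the function name are all hypothetical stand-ins for whatever features and safety criteria a real system would use.

```python
def choose_policy(policies, low, high):
    """Pick a policy that meets the aspiration, preferring the least extreme.

    Each policy is a dict with a predicted `expected_output` (from a world
    model) and an `extremeness` score (a stand-in safety criterion).
    Returns None when no policy meets the aspiration, signalling that the
    aspiration should be adjusted to what is feasible.
    """
    feasible = [p for p in policies if low <= p["expected_output"] <= high]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p["extremeness"])

if __name__ == "__main__":
    candidates = [
        {"name": "aggressive", "expected_output": 100.0, "extremeness": 0.9},
        {"name": "moderate", "expected_output": 52.0, "extremeness": 0.3},
        {"name": "cautious", "expected_output": 48.0, "extremeness": 0.1},
    ]
    # Aspiration: produce roughly 45 to 55 units, not as many as possible.
    print(choose_policy(candidates, 45.0, 55.0)["name"])
```

Note that the highest-output policy is excluded by the aspiration itself, so the safety criterion only ever ranks policies that already satisfy the goal.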

1 Strength and 3 Vulnerabilities
37

Cyborgism

attributed to: NicholasKees, janus
posted by: KabirKumar

Executive summary: This post proposes a strategy for safely accelerating alignment research. The plan is to set up human-in-the-loop systems which empower human agency rather than outsource it, and to use those systems to differentially accelerate progress on alignment. 

    Introduction: An explanation of the context and motivation for this agenda.
    Automated Research Assistants: A discussion of why the paradigm of training AI systems to behave as autonomous agents is both counterproductive and dangerous.
    Becoming a Cyborg: A proposal for an alternative approach/frame, which focuses on a particular type of human-in-the-loop system I am calling a “cyborg”.
    Failure Modes: An analysis of how this agenda could either fail to help or actively cause harm by accelerating AI research more broadly.
    Testimony of a Cyborg: A personal account of how Janus uses GPT as a part of their workflow, and how it relates to the cyborgism approach to intelligence augmentation.

38

Crowdsource solutions with generalists that know little about The AI Alignment Problem

attributed to: Adam Radivojevic (inspired by David Epstein, author of Range)
posted by: adamradiv

Harvard Medical School healthcare policy researcher Anupam Jena and colleagues examined tens of thousands of people who were admitted to hospital with a heart attack, heart failure, or cardiac arrest between 2002 and 2011.

"Among the most severe cases of cardiac arrest, 70 per cent of those admitted when no cardiology conference was taking place died within 30 days. But among those admitted when expert cardiologists were away at meetings, the corresponding death rate was 60 per cent"

Specialists sometimes fall victim to a type of anchoring bias and cannot see the bigger picture. Generalists who don't know that much about the subject sometimes come up with great solutions, solutions even better than the specialist ones.

My proposal is we outsource the AI alignment problem to such people. For example, platforms such as Wazokucrowd would allow us to easily present our problem to the crowd of problem solvers and get new perspectives.

0 Strengths and 1 Vulnerability
39

Gaia Network

attributed to: Roman Leventov, Rafael Kaufmann Nedal
posted by: leventov

Upgrade the Open Agency Architecture (OAA) plan by creating an evolving repository and economy of causal models and real-world data, used by the community of agents to be incrementally less wrong about the world and the consequences of their decisions.

1 Strength and 0 Vulnerabilities
40

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

attributed to: Anthropic (full attribution in post)
posted by: KabirKumar

We apply preference modeling and reinforcement learning from human feedback
(RLHF) to finetune language models to act as helpful and harmless assistants.
We find this alignment training improves performance on almost all NLP
evaluations, and is fully compatible with training for specialized skills such
as python coding and summarization. We explore an iterated online mode of
training, where preference models and RL policies are updated on a weekly
cadence with fresh human feedback data, efficiently improving our datasets and
models. Finally, we investigate the robustness of RLHF training, and identify a
roughly linear relation between the RL reward and the square root of the KL
divergence between the policy and its initialization. Alongside our main
results, we perform peripheral analyses on calibration, competing objectives,
and the use of OOD detection, compare our models with human writers, and
provide samples from our models using prompts appearing in recent related work.
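
The "roughly linear relation" mentioned in the abstract can be written compactly as follows. The symbols are assumptions introduced here for illustration: $\pi$ is the RL policy, $\pi_0$ its initialization, and $r_0$ and $k$ are constants that would have to be fitted empirically.

```latex
r_{\mathrm{RL}}(\pi) \;\approx\; r_0 + k \,\sqrt{D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_0\right)}
```

Read this way, each additional unit of reward costs progressively more divergence from the initial policy, since the KL term grows quadratically in the reward gain.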

41

Aligning Superhuman AI with Human Behavior: Chess as a Model System

attributed to: Reid McIlroy-Young, Siddhartha Sen, Jon Kleinberg, Ashton Anderson
posted by: KabirKumar

As artificial intelligence becomes increasingly intelligent, in some cases
achieving superhuman performance, there is growing potential for humans to
learn from and collaborate with algorithms. However, the ways in which AI
systems approach problems are often different from the ways people do, and thus
may be uninterpretable and hard to learn from. A crucial step in bridging this
gap between human and artificial intelligence is modeling the granular actions
that constitute human behavior, rather than simply matching aggregate human
performance.
  We pursue this goal in a model system with a long history in artificial
intelligence: chess. The aggregate performance of a chess player unfolds as
they make decisions over the course of a game. The hundreds of millions of
games played online by players at every skill level form a rich source of data
in which these decisions, and their exact context, are recorded in minute
detail. Applying existing chess engines to this data, including an open-source
implementation of AlphaZero, we find that they do not predict human moves well.
  We develop and introduce Maia, a customized version of AlphaZero trained on
human chess games, that predicts human moves at a much higher accuracy than
existing engines, and can achieve maximum accuracy when predicting decisions
made by players at a specific skill level in a tuneable way. For a dual task of
predicting whether a human will make a large mistake on the next move, we
develop a deep neural network that significantly outperforms competitive
baselines. Taken together, our results suggest that there is substantial
promise in designing artificial intelligence systems with human collaboration
in mind by first accurately modeling granular human decision-making.

42

Language model agents (LMAs) expanding on AutoGPT are a highly plausible route to AGI. This route has large potential timeline and proliferation downsides, but large alignment advantages relative to other realistic paths to AGI. LMAs allow layered safety measures, including externalized reasoning oversight, RLHF and similar alignment fine-tuning, and specifying top-level alignment goals in natural language. They are relatively interpretable, and the above approaches all have a low alignment tax, making voluntary adoption more likely. 

Here I focus on another advantage of aligning LMAs over other plausible routes to early AGI. This is the advantage of using separate language model instances in different roles. I propose internal independent review for the safety, alignment, and efficacy of plans. Such a review would consist of calling fresh instances of a language model with scripted prompts asking for critiques of plans with regard to accomplishing goals, including safety/alignment goals. This additional safety check seems to create a low alignment tax, since a similar check for efficacy will likely be helpful for capabilities. This type of review adds one additional layer of safety on top of RLHF, explicit alignment goals, and external review, all proposed elsewhere.

This set of safety measures does not guarantee successful alignment. However, it does seem like the most practically viable set of alignment plans that we've got so far.
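
The internal independent review described above could be sketched roughly as follows. The prompt wording, the `APPROVE` reply convention, and the `call_model` callable are all hypothetical; `call_model` stands in for whatever function sends a prompt to a fresh language model instance with no shared conversation history.

```python
# Scripted review prompts, one per reviewer role (illustrative wording).
REVIEW_PROMPTS = {
    "safety": "Critique this plan for safety and alignment risks.",
    "efficacy": "Critique this plan for whether it accomplishes the goal.",
}

def independent_review(plan, goal, call_model):
    """Ask one fresh model instance per role to critique the plan."""
    critiques = {}
    for role, instruction in REVIEW_PROMPTS.items():
        prompt = (
            f"{instruction}\nGoal: {goal}\nPlan: {plan}\n"
            "Reply APPROVE if you find no problems, otherwise list them."
        )
        # Each call is assumed to start a new instance without prior context.
        critiques[role] = call_model(prompt)
    return critiques

def approved(critiques):
    """The plan proceeds only if every reviewer approves."""
    return all(c.strip().upper().startswith("APPROVE") for c in critiques.values())

if __name__ == "__main__":
    def stub_model(prompt):
        # Toy reviewer: objects to any plan that mentions disabling oversight.
        return "Problem: removes oversight." if "disable oversight" in prompt else "APPROVE"

    result = independent_review("Summarize the report.", "Inform the team.", stub_model)
    print(result, approved(result))
```

Because the efficacy reviewer uses the same machinery as the safety reviewer, adding the safety check costs little extra engineering, which is the low alignment tax argument made above.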

43

Nash Learning from Human Feedback

attributed to: Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina P
posted by: KabirKumar

Reinforcement learning from human feedback (RLHF) has emerged as the main
paradigm for aligning large language models (LLMs) with human preferences.
Typically, RLHF involves the initial step of learning a reward model from human
feedback, often expressed as preferences between pairs of text generations
produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by
optimizing it to maximize the reward model through a reinforcement learning
algorithm. However, an inherent limitation of current reward models is their
inability to fully represent the richness of human preferences and their
dependency on the sampling distribution.
  In this study, we introduce an alternative pipeline for the fine-tuning of
LLMs using pairwise human feedback. Our approach entails the initial learning
of a preference model, which is conditioned on two inputs given a prompt,
followed by the pursuit of a policy that consistently generates responses
preferred over those generated by any competing policy, thus defining the Nash
equilibrium of this preference model. We term this approach Nash learning from
human feedback (NLHF).
  In the context of a tabular policy representation, we present a novel
algorithmic solution, Nash-MD, founded on the principles of mirror descent.
This algorithm produces a sequence of policies, with the last iteration
converging to the regularized Nash equilibrium. Additionally, we explore
parametric representations of policies and introduce gradient descent
algorithms for deep-learning architectures. To demonstrate the effectiveness of
our approach, we present experimental results involving the fine-tuning of an
LLM for a text summarization task. We believe NLHF offers a compelling avenue
for preference learning and policy optimization with the potential of advancing
the field of aligning LLMs with human preferences.
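
A toy tabular example may help show what "preferred over those generated by any competing policy" means when preferences are cyclic and no single response is best. The preference matrix and function names below are invented for illustration; the sketch only evaluates policies against a preference model and does not implement Nash-MD.

```python
def preference(p, q, P):
    """Probability that a response drawn from p is preferred to one from q.

    P[i][j] is the probability that response i is preferred over response j.
    """
    n = len(p)
    return sum(p[i] * q[j] * P[i][j] for i in range(n) for j in range(n))

def worst_case(p, P):
    """Preference of p against its strongest opponent.

    The objective is linear in the opponent's policy, so checking the
    deterministic (pure) opponents is enough.
    """
    n = len(p)
    pure = [[1.0 if k == j else 0.0 for k in range(n)] for j in range(n)]
    return min(preference(p, q, P) for q in pure)

if __name__ == "__main__":
    # Cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0 (illustrative values).
    P = [
        [0.5, 0.9, 0.1],
        [0.1, 0.5, 0.9],
        [0.9, 0.1, 0.5],
    ]
    print(worst_case([1.0, 0.0, 0.0], P))        # a deterministic policy is exploitable
    print(worst_case([1 / 3, 1 / 3, 1 / 3], P))  # the mixed policy is not
```

In this example any deterministic policy loses badly to some competitor, while the uniform mixture is never dispreferred on average, which is the kind of solution a Nash equilibrium of the preference model picks out.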

0 Strengths and 2 Vulnerabilities
44

A General Theoretical Paradigm to Understand Learning from Human Preferences

attributed to: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
posted by: KabirKumar

The prevalent deployment of learning from human preferences through
reinforcement learning (RLHF) relies on two important approximations: the first
assumes that pairwise preferences can be substituted with pointwise rewards.
The second assumes that a reward model trained on these pointwise rewards can
generalize from collected data to out-of-distribution data sampled by the
policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an
approach that bypasses the second approximation and learns a policy directly
from collected data without the reward modelling stage. However, this method
still heavily relies on the first approximation.
  In this paper we try to gain a deeper theoretical understanding of these
practical algorithms. In particular we derive a new general objective called
ΨPO for learning from human preferences that is expressed in terms of
pairwise preferences and therefore bypasses both approximations. This new
general objective allows us to perform an in-depth analysis of the behavior of
RLHF and DPO (as special cases of ΨPO) and to identify their potential
pitfalls. We then consider another special case for ΨPO by setting Ψ
simply to Identity, for which we can derive an efficient optimisation
procedure, prove performance guarantees and demonstrate its empirical
superiority to DPO on some illustrative examples.
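
One way to write an objective of the kind the abstract describes is sketched below. The notation is assumed for illustration: $p^{*}(y \succ y' \mid x)$ is the true pairwise preference probability, $\rho$ the prompt distribution, $\mu$ a comparison policy, $\pi_{\mathrm{ref}}$ a reference policy, and $\tau$ a regularization strength; the KL term follows the standard regularized setup and is not stated in the abstract itself.

```latex
\max_{\pi}\;
\mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)}
\Big[\Psi\big(p^{*}(y \succ y' \mid x)\big)\Big]
\;-\; \tau\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
```

Under this reading, the special case mentioned at the end of the abstract corresponds to taking $\Psi$ to be the identity map, so the policy directly maximizes its regularized probability of being preferred.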

7 Strengths and 3 Vulnerabilities
45

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

attributed to: Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang
posted by: KabirKumar

In this paper, we introduce the BeaverTails dataset, aimed at fostering
research on safety alignment in large language models (LLMs). This dataset
uniquely separates annotations of helpfulness and harmlessness for
question-answering pairs, thus offering distinct perspectives on these crucial
attributes. In total, we have gathered safety meta-labels for 333,963
question-answer (QA) pairs and 361,903 pairs of expert comparison data for both
the helpfulness and harmlessness metrics. We further showcase applications of
BeaverTails in content moderation and reinforcement learning with human
feedback (RLHF), emphasizing its potential for practical safety measures in
LLMs. We believe this dataset provides vital resources for the community,
contributing towards the safe development and deployment of LLMs. Our project
page is available at the following URL:
https://sites.google.com/view/pku-beavertails.

46

Natural Abstractions: Key claims, Theorems, and Critiques

attributed to: LawrenceC, Leon Lang, Erik Jenner, John Wentworth
posted by: KabirKumar

TL;DR: We distill John Wentworth’s Natural Abstractions agenda by summarizing its key claims: the Natural Abstraction Hypothesis—many cognitive systems learn to use similar abstractions—and the Redundant Information Hypothesis—a particular mathematical description of natural abstractions. We also formalize proofs for several of its theoretical results. Finally, we critique the agenda’s progress to date, alignment relevance, and current research methodology.

1 Strength and 2 Vulnerabilities
47

Abstraction Learning

attributed to: Fei Deng, Jinsheng Ren, Feng Chen
posted by: KabirKumar

There has been a gap between artificial intelligence and human intelligence.
In this paper, we identify three key elements forming human intelligence, and
suggest that abstraction learning combines these elements and is thus a way to
bridge the gap. Prior research in artificial intelligence either specifies
abstraction by human experts, or takes abstraction as a qualitative explanation
for the model. This paper aims to learn abstraction directly. We tackle three
main challenges: representation, objective function, and learning algorithm.
Specifically, we propose a partition structure that contains pre-allocated
abstraction neurons; we formulate abstraction learning as a constrained
optimization problem, which integrates abstraction properties; we develop a
network evolution algorithm to solve this problem. This complete framework is
named ONE (Optimization via Network Evolution). In our experiments on MNIST,
ONE shows elementary human-like intelligence, including low energy consumption,
knowledge sharing, and lifelong learning.

0 Strengths and 1 Vulnerability
48

TanksWorld: A Multi-Agent Environment for AI Safety Research

attributed to: Corban G. Rivera, Olivia Lyons, Arielle Summitt, Ayman Fatima, Ji Pak, William Shao, Robert Chalmers, Aryeh Englander, Edward W. Staley, I-Jeng Wang, Ashley J. Llorens
posted by: KabirKumar

The ability to create artificial intelligence (AI) capable of performing
complex tasks is rapidly outpacing our ability to ensure the safe and assured
operation of AI-enabled systems. Fortunately, a landscape of AI safety research
is emerging in response to this asymmetry and yet there is a long way to go. In
particular, recent simulation environments created to illustrate AI safety
risks are relatively simple or narrowly focused on a particular issue. Hence,
we see a critical need for AI safety research environments that abstract
essential aspects of complex real-world applications. In this work, we
introduce the AI safety TanksWorld as an environment for AI safety research
with three essential aspects: competing performance objectives, human-machine
teaming, and multi-agent competition. The AI safety TanksWorld aims to
accelerate the advancement of safe multi-agent decision-making algorithms by
providing a software framework to support competitions with both system
performance and safety objectives. As a work in progress, this paper introduces
our research objectives and learning environment with reference code and
baseline performance metrics to follow in a future work.

49

Learning Representations by Humans, for Humans

attributed to: Sophie Hilgard, Nir Rosenfeld, Mahzarin R. Banaji, Jack Cao, David C. Parkes
posted by: KabirKumar

When machine predictors can achieve higher performance than the human
decision-makers they support, improving the performance of human
decision-makers is often conflated with improving machine accuracy. Here we
propose a framework to directly support human decision-making, in which the
role of machines is to reframe problems rather than to prescribe actions
through prediction. Inspired by the success of representation learning in
improving performance of machine predictors, our framework learns human-facing
representations optimized for human performance. This "Mind Composed with
Machine" framework incorporates a human decision-making model directly into the
representation learning paradigm and is trained with a novel human-in-the-loop
training procedure. We empirically demonstrate the successful application of
the framework to various tasks and representational forms.

50

Learning to Understand Goal Specifications by Modelling Reward

attributed to: Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian Hosseini, Pushmeet Kohli, Edward Grefenstette
posted by: KabirKumar

Recent work has shown that deep reinforcement-learning agents can learn to
follow language-like instructions from infrequent environment rewards. However,
this places on environment designers the onus of designing language-conditional
reward functions which may not be easily or tractably implemented as the
complexity of the environment and the language scales. To overcome this
limitation, we present a framework within which instruction-conditional RL
agents are trained using rewards obtained not from the environment, but from
reward models which are jointly trained from expert examples. As reward models
improve, they learn to accurately reward agents for completing tasks for
environment configurations, and for instructions, not present amongst the
expert data. This framework effectively separates the representation of what
instructions require from how they can be executed. In a simple grid world, it
enables an agent to learn a range of commands requiring interaction with blocks
and understanding of spatial relations and underspecified abstract
arrangements. We further show the method allows our agent to adapt to changes
in the environment without requiring new expert examples.

51

Parenting: Safe Reinforcement Learning from Human Input

attributed to: Christopher Frye, Ilya Feige
posted by: KabirKumar

Autonomous agents trained via reinforcement learning present numerous safety
concerns: reward hacking, negative side effects, and unsafe exploration, among
others. In the context of near-future autonomous agents, operating in
environments where humans understand the existing dangers, human involvement in
the learning process has proved a promising approach to AI Safety. Here we
demonstrate that a precise framework for learning from human input, loosely
inspired by the way humans parent children, solves a broad class of safety
problems in this context. We show that our Parenting algorithm solves these
problems in the relevant AI Safety gridworlds of Leike et al. (2017), that an
agent can learn to outperform its parent as it "matures", and that policies
learnt through Parenting are generalisable to new environments.

1 Strength and 0 Vulnerabilities
52

AvE: Assistance via Empowerment

attributed to: Yuqing Du, Stas Tiomkin, Emre Kiciman, Daniel Polani, Pieter Abbeel, Anca Dragan
posted by: KabirKumar

One difficulty in using artificial agents for human-assistive applications
lies in the challenge of accurately assisting with a person's goal(s). Existing
methods tend to rely on inferring the human's goal, which is challenging when
there are many potential goals or when the set of candidate goals is difficult
to identify. We propose a new paradigm for assistance by instead increasing the
human's ability to control their environment, and formalize this approach by
augmenting reinforcement learning with human empowerment. This task-agnostic
objective preserves the person's autonomy and ability to achieve any eventual
state. We test our approach against assistance based on goal inference,
highlighting scenarios where our method overcomes failure modes stemming from
goal ambiguity or misspecification. As existing methods for estimating
empowerment in continuous domains are computationally hard, precluding its use
in real time learned assistance, we also propose an efficient
empowerment-inspired proxy metric. Using this, we are able to successfully
demonstrate our method in a shared autonomy user study for a challenging
simulated teleoperation task with human-in-the-loop training.

53

Penalizing side effects using stepwise relative reachability

attributed to: Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic, Shane Legg
posted by: KabirKumar

How can we design safe reinforcement learning agents that avoid unnecessary
disruptions to their environment? We show that current approaches to penalizing
side effects can introduce bad incentives, e.g. to prevent any irreversible
changes in the environment, including the actions of other agents. To isolate
the source of such undesirable incentives, we break down side effects penalties
into two components: a baseline state and a measure of deviation from this
baseline state. We argue that some of these incentives arise from the choice of
baseline, and others arise from the choice of deviation measure. We introduce a
new variant of the stepwise inaction baseline and a new deviation measure based
on relative reachability of states. The combination of these design choices
avoids the given undesirable incentives, while simpler baselines and the
unreachability measure fail. We demonstrate this empirically by comparing
different combinations of baseline and deviation measure choices on a set of
gridworld experiments designed to illustrate possible bad incentives.
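
As a rough illustration of the deviation measure mentioned in the abstract above, the sketch below computes an averaged, truncated difference in state reachability between a baseline state and the current state. This is one common reading of "relative reachability", not a reproduction of the paper's code. The function name, the list-based inputs, and the assumption that reachability values lie in [0, 1] are all hypothetical choices made for the example, and the stepwise inaction baseline itself is not modelled here.

```python
from typing import Sequence


def relative_reachability(reach_current: Sequence[float],
                          reach_baseline: Sequence[float]) -> float:
    """Average reduction in reachability relative to a baseline state.

    reach_current[i]  : how reachable state i is from the current state
    reach_baseline[i] : how reachable state i is from the baseline state
    Both are assumed (for this sketch) to be values in [0, 1].
    Only losses of reachability are counted; gains are truncated to zero.
    """
    if len(reach_current) != len(reach_baseline):
        raise ValueError("reachability vectors must have the same length")
    if not reach_current:
        return 0.0
    total = sum(max(b - c, 0.0)
                for c, b in zip(reach_current, reach_baseline))
    return total / len(reach_current)


# Hypothetical example: the agent made state 1 unreachable and
# state 2 easier to reach; only the loss contributes to the penalty.
penalty = relative_reachability([1.0, 0.0, 0.5], [1.0, 1.0, 0.25])
```

Truncating at zero means the agent is not penalized for making states more reachable, only for cutting off states that the baseline could still reach.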

1 Strength and 0 Vulnerabilities
54

Conservative Agency via Attainable Utility Preservation

attributed to: Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli
posted by: KabirKumar

Reward functions are easy to misspecify; although designers can make
corrections after observing mistakes, an agent pursuing a misspecified reward
function can irreversibly change the state of its environment. If that change
precludes optimization of the correctly specified reward function, then
correction is futile. For example, a robotic factory assistant could break
expensive equipment due to a reward misspecification; even if the designers
immediately correct the reward function, the damage is done. To mitigate this
risk, we introduce an approach that balances optimization of the primary reward
function with preservation of the ability to optimize auxiliary reward
functions. Surprisingly, even when the auxiliary reward functions are randomly
generated and therefore uninformative about the correctly specified reward
function, this approach induces conservative, effective behavior.
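
The trade-off described in the abstract above can be sketched as a penalized reward: the primary reward minus a scaled measure of how much an action changes the agent's ability to optimize a set of auxiliary reward functions, compared with doing nothing. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The names `aup_reward`, `q_aux_action`, `q_aux_noop`, and the weight `lam` are hypothetical, and the auxiliary Q-values are assumed to be supplied by some separately learned estimator.

```python
from typing import Sequence


def aup_reward(primary_reward: float,
               q_aux_action: Sequence[float],
               q_aux_noop: Sequence[float],
               lam: float = 0.1) -> float:
    """Primary reward minus a penalty for shifting auxiliary attainable value.

    q_aux_action[i] : estimated value of auxiliary reward i after the action
    q_aux_noop[i]   : estimated value of auxiliary reward i after a no-op
    lam             : hypothetical weight trading task reward against the penalty
    """
    if len(q_aux_action) != len(q_aux_noop):
        raise ValueError("auxiliary value vectors must have the same length")
    if not q_aux_action:
        return primary_reward
    # Mean absolute change in attainable auxiliary value versus inaction.
    penalty = sum(abs(qa - qn)
                  for qa, qn in zip(q_aux_action, q_aux_noop)) / len(q_aux_action)
    return primary_reward - lam * penalty


# Hypothetical example: the action shifts both auxiliary values by 1.0.
shaped = aup_reward(1.0, [2.0, 0.0], [1.0, 1.0], lam=0.5)
```

Using the absolute difference penalizes both gaining and losing the ability to pursue auxiliary goals, which is one way to encourage conservative behavior.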

55

Avoiding Side Effects in Complex Environments

attributed to: Alexander Matt Turner, Neale Ratzlaff, Prasad Tadepalli
posted by: KabirKumar

Reward function specification can be difficult. Rewarding the agent for
making a widget may be easy, but penalizing the multitude of possible negative
side effects is hard. In toy environments, Attainable Utility Preservation
(AUP) avoided side effects by penalizing shifts in the ability to achieve
randomly generated goals. We scale this approach to large, randomly generated
environments based on Conway's Game of Life. By preserving optimal value for a
single randomly generated reward function, AUP incurs modest overhead while
leading the agent to complete the specified task and avoid many side effects.
Videos and code are available at https://avoiding-side-effects.github.io/.

0 Strengths and 1 Vulnerability
56

Stovepiping and Malicious Software: A Critical Review of AGI Containment

attributed to: Jason M. Pittman, Jesus P. Espinoza, Courtney Crosby
posted by: KabirKumar

Awareness of the possible impacts associated with artificial intelligence has
risen in proportion to progress in the field. While there are tremendous
benefits to society, many argue that there are just as many, if not more,
concerns related to advanced forms of artificial intelligence. Accordingly,
research into methods to develop artificial intelligence safely is increasingly
important. In this paper, we provide an overview of one such safety paradigm:
containment with a critical lens aimed toward generative adversarial networks
and potentially malicious artificial intelligence. Additionally, we illuminate
the potential for a developmental blindspot in the stovepiping of containment
mechanisms.

1 Strength and 0 Vulnerabilities
57

Artificial Intelligence (AI) systems have significant potential to affect the lives of individuals and societies. As these systems are being increasingly used in decision-making processes, it has become crucial to ensure that they make ethically sound judgments. This paper proposes a novel framework for embedding ethical priors into AI, inspired by the Bayesian approach to machine learning. We propose that ethical assumptions and beliefs can be incorporated as Bayesian priors, shaping the AI’s learning and reasoning process in a similar way to humans’ inborn moral intuitions. This approach, while complex, provides a promising avenue for advancing ethically aligned AI systems.

58

This article explores the concept and potential application of bottom-up virtue ethics as an approach to instilling ethical behavior in artificial intelligence (AI) systems. We argue that by training machine learning models to emulate virtues such as honesty, justice, and compassion, we can cultivate positive traits and behaviors based on ideal human moral character. This bottom-up approach contrasts with traditional top-down programming of ethical rules, focusing instead on experiential learning. Although this approach presents its own challenges, it offers a promising avenue for the development of more ethically aligned AI systems.

1 Strength and 3 Vulnerabilities
59

As artificial intelligence rapidly advances, ensuring alignment with moral values and ethics becomes imperative. This article provides a comprehensive overview of techniques to embed human values into AI. Interactive learning, crowdsourcing, uncertainty modeling, oversight mechanisms, and conservative system design are analyzed in-depth. Respective limitations are discussed and mitigation strategies proposed. A multi-faceted approach combining the strengths of these complementary methods promises safer development of AI that benefits humanity in accordance with our ideals.

60

Distributional shift poses a significant challenge for deploying and maintaining AI systems. As the real-world distributions that models are applied to evolve over time, performance can deteriorate. This article examines techniques and best practices for improving model robustness to distributional shift and enabling rapid adaptation when it occurs.

61

This article proposes a detailed framework for a robust feedback loop to enhance corrigibility. The ability to continuously learn and correct errors is critical for safe and beneficial AI, but developing corrigible systems comes with significant technical and ethical challenges. The feedback loop outlined involves gathering user input, interpreting feedback contextually, enabling AI actions and learning, confirming changes, and iterative improvement. The article analyzes potential limitations of this approach and provides detailed examples of implementation methods using advanced natural language processing, reinforcement learning, and adversarial training techniques.

0 Strengths and 1 Vulnerability
62

To align advanced AIs, an ensemble of diverse, transparent Overseer AIs will independently monitor the target AI and provide granular assessments on its alignment with constitution, human values, ethics, and safety. Overseer interventions will be incremental and subject to human oversight. The system will be implemented cautiously, with extensive testing to validate capabilities. Alignment will be treated as an ongoing collaborative process between humans, Overseers, and the target AI, leveraging complementary strengths through open dialog. Continuous vigilance, updating of definitions, and contingency planning will be required to address inevitable uncertainties and risks.

63

My proposal entails constructing a tightly restricted AI subsystem with the sole capability of attempting to safely shut itself down in order to probe, in an isolated manner, potential vulnerabilities in alignment techniques and then improve them.

2 Strengths and 5 Vulnerabilities
64

Corrigibility via multiple routes

attributed to: Jan Kulveit
posted by: tori[she/her]

Use multiple routes to induce 'corrigibility' by using principles which counteract instrumental convergence (e.g. disutility from resource acquisition by a mutual information measure between the AI and distant parts of the environment), by counteracting unbounded rationality (satisficing, myopia, etc.), with 'traps' like ontological uncertainty about the level of simulation (e.g. having uncertainty about whether it is in training or deployment), human oversight, and interpretability (e.g. an independent 'translator').

65

Avoiding Tampering Incentives in Deep RL via Decoupled Approval

attributed to: Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg
posted by: KabirKumar

How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the agent? Standard RL algorithms assume a secure reward function, and can thus perform poorly in settings where agents can tamper with the reward-generating mechanism. We present a principled solution to the problem of learning from influenceable feedback, which combines approval with a decoupled feedback collection procedure. For a natural class of corruption functions, decoupled approval algorithms have aligned incentives both at convergence and for their local updates. Empirically, they also scale to complex 3D environments where tampering is possible.

66

Provably Fair Federated Learning

attributed to: Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith
posted by: KabirKumar

In federated learning, fair prediction across various protected groups (e.g., gender,
race) is an important constraint for many applications. Unfortunately, prior work
studying group fair federated learning lacks formal convergence or fairness
guarantees. Our work provides a new definition for group fairness in federated learning
based on the notion of Bounded Group Loss (BGL), which can be easily applied
to common federated learning objectives. Based on our definition, we propose a
scalable algorithm that optimizes the empirical risk and global fairness constraints,
which we evaluate across common fairness and federated learning benchmarks.

67

Towards Safe Artificial General Intelligence

attributed to: Tom Everitt
posted by: shumaari

The field of artificial intelligence has recently experienced a number of breakthroughs thanks to progress in deep learning and reinforcement learning. Computer algorithms now outperform humans at Go, Jeopardy, image classification, and lip reading, and are becoming very competent at driving cars and interpreting natural language. The rapid development has led many to conjecture that artificial intelligence with greater-than-human ability on a wide range of tasks may not be far. This in turn raises concerns whether we know how to control such systems, in case we were to successfully build them...

68

A Roadmap for Robust End-to-End Alignment

attributed to: Lê Nguyên Hoang
posted by: KabirKumar

As algorithms are becoming more and more data-driven, the greatest lever we have left to make them robustly beneficial to mankind lies in the design of their objective functions. Robust alignment aims to address this design problem. Arguably, the growing importance of social media recommender systems makes it an urgent problem, for instance to adequately automate hate speech moderation. In this paper, we propose a preliminary research program for robust alignment. This roadmap aims at decomposing the end-to-end alignment problem into numerous more tractable subproblems...

69

Taking Principles Seriously: A Hybrid Approach to Value Alignment

attributed to: Tae Wan Kim, John Hooker, Thomas Donaldson (Carnegie Mellon University, USA; University of Pennsylvania, USA)
posted by: KabirKumar

An important step in the development of value alignment (VA) systems in AI is understanding how VA can reflect valid ethical principles. We propose that designers of VA systems incorporate ethics by utilizing a hybrid approach in which both ethical reasoning and empirical observation play a role. This, we argue, avoids committing the "naturalistic fallacy," which is an attempt to derive "ought" from "is," and it provides a more adequate form of ethical reasoning when the fallacy is not committed...

70

A General Language Assistant as a Laboratory for Alignment

attributed to: Anthropic (Full Author list in Full Plan- click title to view)
posted by: KabirKumar

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. ... (Full Abstract in Full Plan- click title to view)

0 Strengths and 8 Vulnerabilities
71

Aligning AI With Shared Human Values

attributed to: Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt
posted by: KabirKumar

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements... (Full Abstract in Full Plan- click title to view)

1 Strength and 0 Vulnerabilities
72

Avoiding Side Effects By Considering Future Tasks

attributed to: Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg
posted by: KabirKumar

Designing reward functions is difficult: the designer has to specify what to do (what it means to complete the task) as well as what not to do (side effects that should be avoided while completing the task). To alleviate the burden on the reward designer, we propose an algorithm to automatically generate an auxiliary reward function that penalizes side effects. This auxiliary objective rewards the ability to complete possible future tasks, which decreases if the agent causes side effects during the current task...(Full Abstract in Full Plan- click title to view)

73

Measuring and avoiding side effects using relative reachability

attributed to: Victoria Krakovna, Laurent Orseau, Miljan Martic, Shane Legg
posted by: KabirKumar

How can we design reinforcement learning agents that avoid causing unnecessary disruptions to their environment? We argue that current approaches to penalizing side effects can introduce bad incentives in tasks that require irreversible actions, and in environments that contain sources of change other than the agent. For example, some approaches give the agent an incentive to prevent any irreversible changes in the environment, including the actions of other agents. We introduce a general definition of side effects, based on relative reachability of states compared to a default state, that avoids these undesirable incentives...(Full Abstract in Full Plan- click title to view)

74

Learning Human Objectives by Evaluating Hypothetical Behavior

attributed to: Siddharth Reddy, Anca D. Dragan, Sergey Levine, Shane Legg, Jan Leike
posted by: KabirKumar

We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data... (Full Abstract in Full Plan- click plan title to view)

75

SafeLife 1.0: Exploring Side Effects in Complex Environments

attributed to: Carroll L. Wainwright, Peter Eckersley
posted by: KabirKumar

We present SafeLife, a publicly available reinforcement learning environment that tests the safety of reinforcement learning agents. It contains complex, dynamic, tunable, procedurally generated levels with many opportunities for unsafe behavior. Agents are graded both on their ability to maximize their explicit reward and on their ability to operate safely without unnecessary side effects. We train agents to maximize rewards using proximal policy optimization and score them on a suite of benchmark levels... (Full Abstract in Full Plan- click title to view)

76

(When) Is Truth-telling Favored in AI Debate?

attributed to: Vojtěch Kovařík (Future of Humanity Institute, University of Oxford), Ryan Carey (Artificial Intelligence Center, Czech Technical University)
posted by: KabirKumar

For some problems, humans may not be able to accurately judge the goodness of AI-proposed solutions. Irving et al. (2018) propose that in such cases, we may use a debate between two AI systems to amplify the problem-solving capabilities of a human judge. We introduce a mathematical framework that can model debates of this type and propose that the quality of debate designs should be measured by the accuracy of the most persuasive answer. We describe a simple instance of the debate framework called feature debate and analyze the degree to which such debates track the truth... (full abstract in full plan- click title to view)

77

Positive-Unlabeled Reward Learning

attributed to: Danfei Xu (Stanford), Misha Denil (DeepMind)
posted by: KabirKumar

Learning reward functions from data is a promising path towards achieving scalable Reinforcement Learning (RL) for robotics. However, a major challenge in training agents from learned reward models is that the agent can learn to exploit errors in the reward model to achieve high reward behaviors that do not correspond to the intended task. These reward delusions can lead to unintended and even dangerous behaviors...(full abstract in full plan)

78

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference

attributed to: Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca D. Dragan
posted by: KabirKumar

Our goal is for agents to optimize the right reward function, despite how difficult it is for us to specify what that is. Inverse Reinforcement Learning (IRL) enables us to infer reward functions from demonstrations, but it usually assumes that the expert is noisily optimal. Real people, on the other hand, often have systematic biases: risk-aversion, myopia, etc. One option is to try to characterize these biases and account for them explicitly during learning... (Full abstract in plan- click title to view)

79

Scaling shared model governance via model splitting

attributed to: Miljan Martic, Jan Leike, Andrew Trask, Matteo Hessel, Shane Legg, Pushmeet Kohli (DeepMind)
posted by: KabirKumar

Currently the only techniques for sharing governance of a deep learning model are homomorphic encryption and secure multiparty computation. Unfortunately, neither of these techniques is applicable to the training of large neural networks due to their large computational and communication overheads. As a scalable technique for shared model governance, we propose splitting a deep learning model between multiple parties... (Full abstract in plan- click title to view)

80

Building Ethically Bounded AI

attributed to: Francesca Rossi, Nicholas Mattei (IBM)
posted by: KabirKumar

The more AI agents are deployed in scenarios with possibly unexpected situations, the more they need to be flexible, adaptive, and creative in achieving the goal we have given them. Thus, a certain level of freedom to choose the best path to the goal is inherent in making AI robust and flexible enough. At the same time, however, the pervasive deployment of AI in our life, whether AI is autonomous or collaborating with humans, raises several ethical challenges. AI agents should be aware of and follow appropriate ethical principles and should thus exhibit properties such as fairness or other virtues... (Full abstract in plan; click title to view)

81

Guiding Policies with Language via Meta-Learning

attributed to: John D. Co-Reyes, Abhishek Gupta, Suvansh Sanjeev, Nick Altieri, Jacob Andreas, John DeNero, Pieter Abbeel, Sergey Levine
posted by: KabirKumar

Behavioral skills or policies for autonomous agents are conventionally learned from reward functions, via reinforcement learning, or from demonstrations, via imitation learning. However, both modes of task specification have their disadvantages: reward functions require manual engineering, while demonstrations require a human expert to be able to actually perform the task in order to generate the demonstration... (Full abstract in plan; click title to view)

82

Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings

attributed to: Tom Everitt, Pedro A. Ortega, Elizabeth Barnes, Shane Legg
posted by: KabirKumar

Agents are systems that optimize an objective function in an environment. Together, the goal and the environment induce secondary objectives, incentives. Modeling the agent-environment interaction using causal influence diagrams, we can answer two fundamental questions about an agent's incentives directly from the graph: (1) which nodes can the agent have an incentive to observe, and (2) which nodes can the agent have an incentive to control? The answers tell us which information and influence points need extra protection... (Full abstract in full plan; click plan title to view)

83

Integrative Biological Simulation, Neuropsychology, and AI Safety

attributed to: Gopal P. Sarma, Adam Safron, Nick J. Hay
posted by: KabirKumar

We describe a biologically-inspired research agenda with parallel tracks aimed at AI and AI safety. The bottom-up component consists of building a sequence of biophysically realistic simulations of simple organisms such as the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and the zebrafish Danio rerio to serve as platforms for research into AI algorithms and system architectures. The top-down component consists of an approach to value alignment that grounds AI goal structures in neuropsychology, broadly considered...(full abstract in full plan)

84

Constitutional AI: Harmlessness from AI Feedback

attributed to: Anthropic (full author list in full plan)
posted by: KabirKumar

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses.

1 Strength and 1 Vulnerability
85

What Would Jiminy Cricket Do? Towards Agents That Behave Morally

attributed to: Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt
posted by: KabirKumar

When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. By contrast, artificial agents are currently not endowed with a moral sense. As a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environments, it will become necessary to mitigate inherited biases from environments that teach immoral behavior.

0 Strengths and 1 Vulnerability
86

Truthful AI: Developing and governing AI that does not lie

attributed to: Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders
posted by: KabirKumar

In many contexts, lying -- the use of verbal falsehoods to deceive -- is harmful. While lying has traditionally been a human affair, AI systems that make sophisticated verbal statements are becoming increasingly prevalent. This raises the question of how we should limit the harm caused by AI "lies" (i.e. falsehoods that are actively selected for). Human truthfulness is governed by social norms and by laws (against defamation, perjury, and fraud). Differences between AI and humans present an opportunity to have more precise standards of truthfulness for AI, and to have these standards rise over time.

87

Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration

attributed to: Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush Varshney, Murray Campbell, Moninder Singh, Francesca Rossi
posted by: KabirKumar

Autonomous cyber-physical agents and systems play an increasingly large role in our lives. To ensure that agents behave in ways aligned with the values of the societies in which they operate, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. These constraints and norms can come from any number of sources including regulations, business process guidelines, laws, ethical principles, social norms, and moral values.

88

CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning

attributed to: Jiachen Yang, Alireza Nakhaei, David Isele, Kikuo Fujimura, Hongyuan Zha
posted by: KabirKumar

A variety of cooperative multi-agent control problems require agents to achieve individual goals while contributing to collective success. This multi-goal multi-agent setting poses difficulties for recent algorithms, which primarily target settings with a single global reward, due to two new challenges: efficient exploration for learning both individual goal attainment and cooperation for others' success, and credit-assignment for interactions between actions and goals of different agents...

89

Imitating Latent Policies from Observation

attributed to: Ashley D. Edwards, Himanshu Sahni, Yannick Schroecker, Charles L. Isbell
posted by: KabirKumar

In this paper, we describe a novel approach to imitation learning that infers latent policies directly from state observations. We introduce a method that characterizes the causal effects of latent actions on observations while simultaneously predicting their likelihood. We then outline an action alignment procedure that leverages a small amount of environment interactions to determine a mapping between the latent and real-world actions. We show that this corrected labeling can be used for imitating the observed behavior, even though no expert actions are given.

90

Embedded Agency

attributed to: Abram Demski, Scott Garrabrant
posted by: KabirKumar

Traditional models of rational action treat the agent as though it is cleanly separated from its environment, and can act on that environment from the outside. Such agents have a known functional relationship with their environment, can model their environment in every detail, and do not need to reason about themselves or their internal parts.
We provide an informal survey of obstacles to formalizing good reasoning for agents embedded in their environment.

91

Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective

attributed to: Tom Everitt, Marcus Hutter, Ramana Kumar, Victoria Krakovna
posted by: KabirKumar

Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding? Or will sufficiently capable RL agents always find ways to bypass their intended objectives by shortcutting their reward signal? This question impacts how far RL can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we study when an RL agent has an instrumental goal to tamper with its reward process, and describe design principles that prevent instrumental goals for two different types of reward tampering (reward function tampering and RF-input tampering).

92

Methods are currently lacking to prove artificial general intelligence (AGI) safety. An AGI ‘hard takeoff’ is possible, in which first generation AGI1 rapidly triggers a succession of more powerful AGIn that differ dramatically in their computational capabilities (AGIn << AGIn+1). No proof exists that AGI will benefit humans or of a sound value-alignment method. Numerous paths toward human extinction or subjugation have been identified. We suggest that probabilistic proof methods are the fundamental paradigm for proving safety and value-alignment between disparately powerful autonomous agents.

0 Strengths and 2 Vulnerabilities
93

Adaptive Mechanism Design: Learning to Promote Cooperation

attributed to: Tobias Baumann, Thore Graepel, John Shawe-Taylor
posted by: KabirKumar

In the future, artificial learning agents are likely to become increasingly widespread in our society. They will interact with both other learning agents and humans in a variety of complex settings including social dilemmas. We consider the problem of how an external agent can promote cooperation between artificial learners by distributing additional rewards and punishments based on observing the learners' actions. We propose a rule for automatically learning how to create the right incentives by considering the players' anticipated parameter updates.

94

Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims

attributed to: Miles Brundage, Shahar Avin, Jasmine Wang, Haydn Belfield, Gretchen Krueger, Gillian Hadfield, Heidy Khlaaf, Jingying Yang, Helen Toner, Ruth Fong, Tegan Maharaj, Pang Wei Koh, Sara Hooker, Jade Leung, Andrew Trask, Emma Bluemke and many more
posted by: KabirKumar

With the recent wave of progress in artificial intelligence (AI) has come a growing awareness of the large-scale impacts of AI systems, and recognition that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development. In order for AI developers to earn trust from system users, customers, civil society, governments, and other stakeholders that they are building AI responsibly, they will need to make verifiable claims to which they can be held accountable. Those outside of a given organization also need effective means of scrutinizing such claims.

1 Strength and 1 Vulnerability
95

Institutionalising Ethics in AI through Broader Impact Requirements

attributed to: Carina Prunkl, Carolyn Ashurst, Markus Anderljung, Helena Webb, Jan Leike, Allan Dafoe
posted by: KabirKumar

Turning principles into practice is one of the most pressing challenges of artificial intelligence (AI) governance. In this article, we reflect on a novel governance initiative by one of the world's largest AI conferences. In 2020, the Conference on Neural Information Processing Systems (NeurIPS) introduced a requirement for submitting authors to include a statement on the broader societal impacts of their research. Drawing insights from similar governance initiatives, including institutional review boards (IRBs) and impact requirements for funding applications, we investigate the risks, challenges and potential benefits of such an initiative...

96

Self-Imitation Learning

attributed to: Junhyuk Oh, Yijie Guo, Satinder Singh, Honglak Lee
posted by: KabirKumar

This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to reproduce the agent's past good decisions. This algorithm is designed to verify our hypothesis that exploiting past good experiences can indirectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard exploration Atari games and is competitive to the state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks.

97

Directed Policy Gradient for Safe Reinforcement Learning with Human Advice

attributed to: Hélène Plisnier, Denis Steckelmacher, Tim Brys, Diederik M. Roijers, Ann Nowé
posted by: KabirKumar

Many currently deployed Reinforcement Learning agents work in an environment shared with humans, be they co-workers, users or clients. It is desirable that these agents adjust to people's preferences, learn faster thanks to their help, and act safely around them. We argue that most current approaches that learn from human feedback are unsafe: rewarding or punishing the agent a-posteriori cannot immediately prevent it from wrong-doing. In this paper, we extend Policy Gradient to make it robust to external directives that would otherwise break the fundamentally on-policy nature of Policy Gradient.

98

Safe Reinforcement Learning via Probabilistic Shields

attributed to: Nils Jansen, Bettina Könighofer, Sebastian Junges, Alexandru C. Serban, Roderick Bloem
posted by: KabirKumar

This paper targets the efficient construction of a safety shield for decision making in scenarios that incorporate uncertainty. Markov decision processes (MDPs) are prominent models to capture such planning problems. Reinforcement learning (RL) is a machine learning technique to determine near-optimal policies in MDPs that may be unknown prior to exploring the model. However, during exploration, RL is prone to induce behavior that is undesirable or not allowed in safety- or mission-critical contexts. We introduce the concept of a probabilistic shield that enables decision-making to adhere to safety constraints with high probability.

99

An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning

attributed to: Dhruv Malik, Malayandi Palaniappan, Jaime F. Fisac, Dylan Hadfield-Menell, Stuart Russell, Anca D. Dragan
posted by: KabirKumar

Our goal is for AI systems to correctly identify and act according to their human user's objectives. Cooperative Inverse Reinforcement Learning (CIRL) formalizes this value alignment problem as a two-player game between a human and robot, in which only the human knows the parameters of the reward function: the robot needs to learn them as the interaction unfolds. Previous work showed that CIRL can be solved as a POMDP, but with an action space size exponential in the size of the reward parameter space.

100

Incomplete Contracting and AI Alignment

attributed to: Dylan Hadfield-Menell, Gillian Hadfield
posted by: KabirKumar

We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions. We first provide an overview of the incomplete contracting literature and explore parallels between this work and the problem of AI alignment. As we emphasize, misalignment between principal and agent is a core focus of economic analysis. We highlight some technical results from the economics literature on incomplete contracts that may provide insights for AI alignment researchers.

1 Strength and 3 Vulnerabilities
101

We propose the creation of a systematic effort to identify and replicate key findings in neuropsychology and allied fields related to understanding human values. Our aim is to ensure that research underpinning the value alignment problem of artificial intelligence has been sufficiently validated to play a role in the design of AI systems.

1 Strength and 0 Vulnerabilities
102

The Wisdom of Hindsight Makes Language Models Better Instruction Followers

attributed to: Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, Joseph E. Gonzalez
posted by: KabirKumar

Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback. The so-called algorithm, Reinforcement Learning with Human Feedback (RLHF) demonstrates impressive performance on the GPT series models. However, the underlying Reinforcement Learning (RL) algorithm is complex and requires an additional training pipeline for reward and value networks. In this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner...

103

Cooperative Inverse Reinforcement Learning

attributed to: Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
posted by: KabirKumar

For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans. We propose a formal definition of the value alignment problem as cooperative inverse reinforcement learning (CIRL). A CIRL problem is a cooperative, partial-information game with two agents, human and robot; both are rewarded according to the human's reward function, but the robot does not initially know what this is...

1 Strength and 3 Vulnerabilities
104

Alignment for Advanced Machine Learning Systems

attributed to: Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, Andrew Critch (Machine Intelligence Research Institute)
posted by: KabirKumar

We survey eight research areas organized around one question: As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators? We focus on two major technical obstacles to AI alignment: the challenge of specifying the right kind of objective functions, and the challenge of designing AI systems that avoid unintended consequences and undesirable behavior even in cases where the objective function does not line up perfectly with the intentions of the designers...

105

Shortest and not the steepest path will fix the inner-alignment problem

attributed to: Thane Ruthenis (https://www.alignmentforum.org/users/thane-ruthenis?from=post_header)
posted by: KabirKumar

Replacing 'stochastic gradient descent' (SGD) with something that takes the shortest and not the steepest path should just about fix the whole inner-alignment problem.

0 Strengths and 1 Vulnerability
106

Learning Safe Policies with Expert Guidance

attributed to: Jessie Huang, Fa Wu, Doina Precup, Yang Cai
posted by: KabirKumar

We propose a framework for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify. In order to do this, we rely on the existence of demonstrations from expert policies, and we provide a theoretical framework for the agent to optimize in the space of rewards consistent with its existing knowledge. We propose two methods to solve the resulting optimization: an exact ellipsoid-based method and a method in the spirit of the "follow-the-perturbed-leader" algorithm. Our experiments demonstrate the behavior of our algorithm in both discrete and continuous problems...

1 Strength and 2 Vulnerabilities
107

AI safety via debate

attributed to: Geoffrey Irving, Paul Christiano, Dario Amodei
posted by: KabirKumar

To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information...

0 Strengths and 5 Vulnerabilities
108

Pragmatic-Pedagogic Value Alignment

attributed to: Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, Anca D. Dragan
posted by: KabirKumar

As intelligent systems gain autonomy and capability, it becomes vital to ensure that their objectives match those of their human users; this is known as the value-alignment problem. In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users' objectives as they go. We argue that a meaningful solution to value alignment must combine multi-agent decision theory with rich mathematical models of human cognition, enabling robots to tap into people's natural collaborative capabilities...

109

Low Impact Artificial Intelligences

attributed to: Stuart Armstrong, Benjamin Levinstein
posted by: KabirKumar

There are many goals for an AI that could become dangerous if the AI becomes superintelligent or otherwise powerful. Much work on the AI control problem has been focused on constructing AI goals that are safe even for such AIs. This paper looks at an alternative approach: defining a general concept of 'low impact'. The aim is to ensure that a powerful AI which implements low impact will not modify the world extensively, even if it is given a simple or dangerous goal. The paper proposes various ways of defining and grounding low impact, and discusses methods for ensuring that the AI can still be allowed to have a (desired) impact despite the restriction.

0 Strengths and 1 Vulnerability
110

Ethical Artificial Intelligence

attributed to: Bill Hibbard
posted by: KabirKumar

This book-length article combines several peer reviewed papers and new material to analyze the issues of ethical artificial intelligence (AI). The behavior of future AI systems can be described by mathematical equations, which are adapted to analyze possible unintended AI behaviors and ways that AI designs can avoid them. This article makes the case for utility-maximizing agents and for avoiding infinite sets in agent definitions...

111

Towards Human-Compatible XAI: Explaining Data Differentials with Concept Induction over Background Knowledge

attributed to: Cara Widmer, Md Kamruzzaman Sarker, Srikanth Nadella, Joshua Fiechter, Ion Juvina, Brandon Minnery, Pascal Hitzler, Joshua Schwartz, Michael Raymer
posted by: KabirKumar

Concept induction, which is based on formal logical reasoning over description logics, has been used in ontology engineering in order to create ontology (TBox) axioms from the base data (ABox) graph. In this paper, we show that it can also be used to explain data differentials, for example in the context of Explainable AI (XAI), and we show that it can in fact be done in a way that is meaningful to a human observer. Our approach utilizes a large class hierarchy, curated from the Wikipedia category hierarchy, as background knowledge.

0 Strengths and 1 Vulnerability
112

Empowerment is (almost) All We Need

attributed to: Jacob Cannell
posted by: KabirKumar

One recent approach formalizes agents as systems that would adapt their policy if their actions influenced the world in a different way. Notice the close connection to empowerment, which suggests a related definition that agents are systems which maintain power potential over the future: having action output streams with high channel capacity to future world states. This all suggests that agency is a very general extropic concept and relatively easy to recognize.

0 Strengths and 7 Vulnerabilities
113

GATO Framework: Global Alignment Taxonomy Omnibus Framework

attributed to: David Shapiro and GATO Team
posted by: KabirKumar

The GATO Framework serves as a pioneering, multi-layered, and decentralized blueprint for addressing the crucial issues of AI alignment and the control problem. It is designed to circumvent potential cataclysms and actively construct a future utopia. By embedding axiomatic principles within AI systems and facilitating the formation of independent, globally distributed groups, the framework weaves a cooperative network, empowering each participant to drive towards a beneficial consensus. From model alignment to global consensus, GATO envisions a path where advanced technologies not only avoid harm but actively contribute to an unprecedented era of prosperity, understanding, and reduced suffering.

1 Strength and 3 Vulnerabilities
114

I propose a set of logically distinct conceptual components that are necessary and sufficient to 1) ensure that most known AGI scenarios will not harm humanity and 2) robustly align AGI values and goals with human values.
Methods. By systematically addressing each pathway category to malevolent AI, we can induce the methods/axioms required to redress the category.
Results and Discussion. Distributed ledger technology (DLT, blockchain) is integral to this proposal

0 Strengths and 4 Vulnerabilities
115

Using Mechanism Design and forms of Technical Governance to approach alignment from a different angle, trying to create stable equilibria that can scale as AI intelligence and proliferation escalate, with safety mechanisms and aligned objectives built into the greater network.

5 Strengths and 7 Vulnerabilities