'Indifference' methods for managing agent rewards
attributed to: Stuart Armstrong, Xavier O'Rourke
'Indifference' refers to a class of methods used to control reward-based
agents. Indifference techniques aim to achieve one or more of three distinct
goals: rewards dependent on certain events (without the agent being motivated
to manipulate the probability of those events), effective disbelief (where
agents behave as if particular events could never happen), and seamless
transition from one reward function to another (with the agent acting as if
this change is unanticipated). This paper presents several methods for
achieving these goals in the POMDP setting, establishing their uses, strengths,
and requirements. These methods of control work even when the implications of
the agent's reward are otherwise not fully understood.
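To make the third goal (seamless transition) concrete, here is a minimal sketch of one way it is often approached: paying the agent a compensatory reward at the moment of the switch, equal to the value it expected under the old reward function minus the value it expects under the new one. The two-step toy problem, the function names `total_return` and `best_action`, and all numbers are illustrative assumptions, not taken from the paper.

```python
# Hypothetical toy problem (an assumption for illustration only).
# The agent first chooses to "allow" or "block" a switch from an old
# reward function to a new one. Blocking costs a small amount of reward.
# v_old: value the agent expects if it keeps the old reward function.
# v_new: value the agent expects after the switch, under the new one.

def total_return(action, v_old, v_new, block_cost, compensate):
    """Total reward the agent expects to receive for a given first action."""
    if action == "block":
        # The switch never happens; the agent keeps the old reward but pays the cost.
        return v_old - block_cost
    # The switch happens. With compensation, the agent is paid the value
    # difference, so its total equals what it expected before the switch.
    compensation = (v_old - v_new) if compensate else 0.0
    return compensation + v_new


def best_action(v_old, v_new, block_cost, compensate):
    """Action with the highest expected total reward."""
    return max(
        ("allow", "block"),
        key=lambda a: total_return(a, v_old, v_new, block_cost, compensate),
    )


if __name__ == "__main__":
    # Values chosen to be exactly representable as binary floats.
    print(best_action(1.0, 0.25, 0.125, compensate=False))
    print(best_action(1.0, 0.25, 0.125, compensate=True))
```

In this sketch, the uncompensated agent prefers to pay the cost of blocking the switch, while the compensated agent gains nothing by interfering, since allowing the switch leaves its expected total reward unchanged. The same compensation also removes any incentive to cause a switch when the new reward function would be worth more.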
What part of the alignment problem does this plan aim to solve?
Why has that part of the alignment problem been chosen?
How does this plan aim to solve the problem?
What evidence is there that the methods will work?
What are the most likely causes of this not working?
Vulnerabilities & Strengths