Designing hierarchical reinforcement learning algorithms that exhibit safe
behaviour is not only vital for practical applications but also facilitates a
better understanding of an agent's decisions. We tackle this problem in the
options framework, a particular way to specify temporally abstract actions
which allow an agent to use sub-policies with start and end conditions. We
consider a behaviour safe if it avoids regions of the state space with high
uncertainty in the outcomes of actions. We propose an optimization objective
that learns safe options by encouraging the agent to visit states with higher
behavioural consistency. The proposed objective results in a trade-off between
maximizing the standard expected return and minimizing the effect of model
uncertainty in the return. We propose a policy gradient algorithm to optimize
the constrained objective function. We examine the quantitative and qualitative
behaviour of the proposed approach in a tabular grid-world, a continuous-state
puddle-world, and three games from the Arcade Learning Environment: Ms. Pacman,
Amidar, and Q*Bert. Our approach reduces the variance of the return, boosts
performance in environments with intrinsic variability in the reward
structure, and compares favorably with both primitive actions and
risk-neutral options.
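
The trade-off described above can be illustrated with a minimal sketch. It is not the paper's objective: as an assumption for illustration, the variance of sampled returns stands in for the uncertainty term, and the function name `regularized_objective` and the weight `psi` are hypothetical.

```python
from statistics import fmean, pvariance


def regularized_objective(returns, psi):
    """Score a behaviour as mean return minus a weighted variance penalty.

    Assumption: the variance of sampled returns is used as a stand-in for
    the uncertainty measure; `psi` (hypothetical) sets the trade-off.
    psi = 0 recovers the risk-neutral expected return.
    """
    return fmean(returns) - psi * pvariance(returns)


# Two hypothetical behaviours: one with a higher mean but erratic returns,
# one with a slightly lower mean but consistent returns.
risky_returns = [0.0, 20.0]
safe_returns = [8.0, 10.0]

risk_neutral_prefers_risky = (
    regularized_objective(risky_returns, psi=0.0)
    > regularized_objective(safe_returns, psi=0.0)
)
penalized_prefers_safe = (
    regularized_objective(safe_returns, psi=0.1)
    > regularized_objective(risky_returns, psi=0.1)
)
```

With `psi = 0` the erratic behaviour scores higher on mean return alone, while a positive `psi` shifts the preference toward the consistent behaviour. This is the kind of trade-off the abstract attributes to the proposed objective.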
Safe Option-Critic: Learning Safety in the Option-Critic Architecture
attributed to: Arushi Jain, Khimya Khetarpal, Doina Precup
Vulnerabilities & Strengths