Smoothing Policies and Safe Policy Gradients
attributed to: Matteo Papini, Matteo Pirotta, Marcello Restelli
Policy Gradient (PG) algorithms are among the best candidates for the
much-anticipated applications of reinforcement learning to real-world control
tasks, such as robotics. However, the trial-and-error nature of these methods
poses safety issues whenever the learning process itself must be performed on a
physical system or involves any form of human-computer interaction. In this
paper, we address a specific safety formulation, where both goals and dangers
are encoded in a scalar reward signal and the learning agent is constrained to
never worsen its performance, measured as the expected sum of rewards. By
studying actor-only policy gradient from a stochastic optimization perspective,
we establish improvement guarantees for a wide class of parametric policies,
generalizing existing results on Gaussian policies. This, together with novel
upper bounds on the variance of policy gradient estimators, allows us to
identify meta-parameter schedules that guarantee monotonic improvement with
high probability. The two key meta-parameters are the step size of the
parameter updates and the batch size of the gradient estimates. Through a
joint, adaptive selection of these meta-parameters, we obtain a policy gradient
algorithm with monotonic improvement guarantees.
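
To make the role of the two meta-parameters concrete, below is a minimal,
illustrative Python sketch of an actor-only policy gradient loop in which the
batch size and the step size are adapted jointly: the batch is grown until the
gradient estimate dominates its own standard error, and the step size is set
conservatively from a lower bound on the gradient magnitude and an assumed
smoothness constant. The toy one-step problem, the constants (SIGMA, TARGET,
L_SMOOTH, DELTA), and the specific adaptation rules are assumptions made for
illustration only; they are not the schedules derived in the paper.

    # Illustrative sketch (not the paper's exact algorithm): an actor-only policy
    # gradient loop on a toy one-step problem, where the batch size and the step
    # size are chosen jointly so that each update is unlikely to worsen the
    # expected return. All constants and adaptation rules below are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    SIGMA = 0.5      # fixed std of the Gaussian policy (assumption)
    TARGET = 2.0     # optimum of the toy reward (assumption)
    L_SMOOTH = 5.0   # assumed smoothness constant of the objective
    DELTA = 0.05     # target failure probability for the confidence test

    def sample_grads(theta, n):
        """Draw n one-step episodes from N(theta, SIGMA^2) and return the
        per-episode score-function (REINFORCE) estimates of dJ/dtheta."""
        actions = rng.normal(theta, SIGMA, size=n)
        rewards = -(actions - TARGET) ** 2
        return rewards * (actions - theta) / SIGMA ** 2

    # Sub-Gaussian-style confidence multiplier (assumption, not the paper's bound).
    z = np.sqrt(2.0 * np.log(2.0 / DELTA))

    theta, batch_size = -1.0, 100
    for it in range(30):
        grads = sample_grads(theta, batch_size)
        g_hat, stderr = grads.mean(), grads.std(ddof=1) / np.sqrt(batch_size)
        # Grow the batch until the gradient estimate dominates its standard error,
        # so the update direction is reliable with high probability.
        while abs(g_hat) < z * stderr and batch_size < 100_000:
            batch_size *= 2
            grads = sample_grads(theta, batch_size)
            g_hat, stderr = grads.mean(), grads.std(ddof=1) / np.sqrt(batch_size)
        # Conservative step size from a lower bound on the gradient magnitude and
        # the assumed smoothness constant; near the optimum the step shrinks to 0.
        g_low = max(abs(g_hat) - z * stderr, 0.0)
        alpha = g_low / (L_SMOOTH * max(abs(g_hat), 1e-8))
        theta += alpha * g_hat
        print(f"iter {it:2d}  batch {batch_size:6d}  theta {theta:+.3f}")

The intent is only to show where the two meta-parameters enter the loop; the
paper's guarantees come from policy-class-specific smoothness and variance
bounds rather than from these heuristic rules.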