is known. Having access to the analytic formulation of the policy, we can use gradient information to derive a new IRL algorithm that does not need to solve the forward model. It is worth stressing that, even when the expert's policy is known, behavioral cloning may not be the best solution to IRL problems. In particular, the reward is a more succinct and powerful piece of information than the policy itself. The reward function can be used to generalize the expert's behavior to state-space regions not covered by the samples or to new situations. For instance, the transition model can change and, as a consequence, the optimal policy may change as well. Furthermore, behavioral cloning cannot be applied whenever the expert demonstrates actions that cannot be executed by the learner (think of a humanoid robot that learns by observing a human expert). In all these cases, knowing the reward function, the optimal policy can be derived.

Standard policy gradient approaches require the policy to be stochastic in order to get rid of the knowledge of the transition model (Sutton et al. 1999). When the expert's policy is deterministic, the model must be available or the policy must be forced to be stochastic¹. For continuous state-action domains the latter approach can easily be implemented by adding zero-mean Gaussian noise, while the Gibbs model is suited for discrete actions because the stochasticity can be regulated by varying the temperature parameter.

Given a parametric (linear or non-linear) reward function R(s, a; ω), we can compute the associated policy gradient

    ∇θ J(π_θ^E, ω) = ∫_S dμ_θ^{π_E}(s) ∫_A ∇θ π^E(a; s, θ) Q^{π_E}(s, a; ω) da ds.

We assume to have an analytic description of the expert's policy, but only a limited set of demonstrations of the expert's behavior in the environment. Let D = {τ_i}_{i=1}^N denote the set of expert's trajectories. The gradient can then be estimated off-line using the N trajectories and any standard policy gradient algorithm: REINFORCE (Williams 1992), GPOMDP (Baxter and Bartlett 2001), natural gradient (Kakade 2002), or eNAC (Peters and Schaal 2008a).

If the policy performance J(π, ω) is differentiable w.r.t. the policy parameters θ and the expert π_θ^E is optimal w.r.t. a parametrization R(s, a; ω^E), the associated policy gradient is identically equal to zero: the expert's policy π_θ^E is clearly a stationary point of J(π, ω). The Gradient IRL (GIRL) algorithm aims at finding a stationary point of J(π_θ^E, ω) w.r.t. the reward parameter ω, that is, for any x, y ≥ 1,

    ω^A = arg min_ω C_x^y(π_θ^E, ω) = arg min_ω (1/y) ‖∇θ J(π_θ^E, ω)‖_x^y.    (1)

Lemma 1 (Convexity of C_x^y). Given a convex representation of the reward function R(s, a; ω) w.r.t. the reward parameters, the objective function C_x^y, with x, y ≥ 1, is convex.

This means that the optimization problem can be solved using any standard convex optimization approach. In this work we consider the case of the squared ℓ2-norm (x = 2 and y = 2) and a constrained gradient descent approach. We want to stress that the only requirement for the convexity of the optimization process is the convexity of the parametric class of rewards (which is free to be designed). We think that this class is large enough to cover several real problems. Note that no assumption is posed on the real (expert's) reward. In this scenario (convex objective function) there are no problems related to stationary points.

Meaning of GIRL. As mentioned before, when the expert is optimal w.r.t. the parametric reward, the minimum of C_x^y(π_θ^E, ω) is attained by the expert's weights ω^E. On the other hand, there are several reasons for which the expert may not be optimal: I) the expert is optimizing a reward function that cannot be represented by the chosen reward parametrization; II) the expert exhibits a suboptimal policy; III) the samples available to estimate the gradient are not sufficient to obtain a reliable estimate. When the expert's policy is not optimal for any reward function in the chosen reward class, the solution ω^A represents the minimum-norm gradient, i.e., the reward that induces the minimum change in the policy parameters. In other words, GIRL tries to provide the reward weights in the chosen space that best explain the behavior of the expert's policy. Note that the result ω^A is reliable as long as the norm of the gradient is small enough; otherwise the optimal policy associated to the given reward weights can be arbitrarily different from the expert's policy.²

Linear Reward Settings
In this section we reformulate the GIRL algorithm in the case of a linear parametrization of the reward function R(s, a; ω). This interpretation comes from a multi-objective optimization (MOO) view of the IRL problem. When the reward is linearly parametrized we have that

    J(π, ω) = Σ_{i=1}^q ω_i E_{s0 ∼ D, a_t ∼ π, s_t ∼ P} [ Σ_{t=0}^∞ γ^t φ_i(s_t, a_t) ] = Σ_{i=1}^q ω_i J_i(π),    (2)

where J(π) ∈ R^q are the unconditional feature expectations under policy π and basis functions Φ(s, a) ∈ R^q.
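With the linear parametrization of Equation (2), the policy gradient is linear in ω: ∇θ J(π_θ^E, ω) = M ω, where the columns of M ∈ R^{d×q} are the per-feature gradients ∇θ J_i. The GIRL objective with x = y = 2 then becomes a convex quadratic, here minimized over the unit simplex (as in the paper's experiments). A minimal numpy sketch, using a plain REINFORCE estimator without baseline; the function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def feature_gradients(trajs, grad_log_pi, phi, gamma, d, q):
    """Plain REINFORCE (no baseline) estimate of the per-feature
    gradients; returns M with M[:, i] ~ grad_theta J_i.

    trajs: list of trajectories, each a list of (state, action) pairs;
    grad_log_pi(s, a) -> (d,) score vector; phi(s, a) -> (q,) features.
    """
    M = np.zeros((d, q))
    for tau in trajs:
        score = sum(grad_log_pi(s, a) for s, a in tau)      # sum_t grad log pi
        feats = sum(gamma ** t * phi(s, a)
                    for t, (s, a) in enumerate(tau))        # sum_t gamma^t phi
        M += np.outer(score, feats)
    return M / len(trajs)

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def girl_linear(M, iters=5000, lr=0.01):
    """Minimize C(w) = 0.5 * ||M w||^2 over the simplex by
    projected gradient descent (constrained gradient descent)."""
    q = M.shape[1]
    w = np.full(q, 1.0 / q)
    H = M.T @ M                 # the objective is the convex quadratic w'Hw/2
    for _ in range(iters):
        w = project_simplex(w - lr * H @ w)
    return w
```

With exact gradients of a Pareto-optimal expert, `girl_linear` returns weights for which M ω ≈ 0, i.e., a stationary point in the sense of Equation (1).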
(Fliege and Svaiter 2000). It can be formulated as (Brown and Smith 2005): for any non-zero (convex) vector α ≥ 0,

    α ∈ Null(Dθ J(π)),    (3)

where Null denotes the (right) null space of the Jacobian Dθ J(π). The same idea was used by (Désidéri 2012) to define the concept of Pareto-stationary solution. This definition derives from the geometric interpretation of Equation (3), that is, a point is Pareto-stationary if there exists a convex combination of the individual gradients that generates an identically zero vector:

    Dθ J(π)ᵀ α = 0,    Σ_{i=1}^q α_i = 1,    α_i ≥ 0  ∀i ∈ {1, …, q}.

A solution that is Pareto optimal is also Pareto-stationary (Désidéri 2012). A simple and intuitive interpretation of the Pareto-stationarity condition can be provided in the case of two objectives. When the solution lies outside the frontier it is possible to identify a set of directions (the ascent cone) that simultaneously increase all the objectives. When the solution belongs to the Pareto frontier (i.e., ∃ (convex) α : Dθ J(π)ᵀ α = 0), the ascent cone is empty because any change in the parameters will cause the decrease of at least one objective. Geometrically, the gradient vectors related to the different reward features are coplanar and have contrasting directions.

Geometric Interpretation. We have seen the analytical interpretation of the reward weights. Now, if the expert is optimal w.r.t. some linear combination of the objectives, then she lies on the Pareto frontier. Geometrically, the reward weights are orthogonal to the hyperplane tangent to the Pareto frontier (identified by the individual gradients ∇θ J_i(π^E), i = 1, …, q). By exploiting the local information given by the individual gradients w.r.t. the expert parametrization Dθ J(π^E), we can compute the tangent hyperplane and the associated scalarization weights. Such a hyperplane can be identified by the q points associated to the Gram matrix of Dθ J(π^E) = [∇θ J_1(π^E), …, ∇θ J_q(π^E)]:

    G = Dθ J(π^E) Dθ J(π^E)ᵀ.

If G is full rank, the q points univocally identify the tangent hyperplane. Since the expert's weights are orthogonal to such a hyperplane, they are obtained by computing the null space of the matrix G: ω ∈ Null(G). Given the individual gradients, the complexity of obtaining the weights is O(q²d + q³). In the following, we will refer to this version of GIRL as PGIRL (Plane GIRL).

An Analysis of Linear Reward Settings
We initially state our analysis in the case of complete knowledge, where we can compute every single term in exact form. Later, we will deal with approximations.
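The PGIRL recipe (Gram matrix of the individual gradients, then its null space) is straightforward to sketch in code; a minimal numpy illustration, where the function name, the SVD tolerance, and the rescaling onto the simplex are illustrative assumptions:

```python
import numpy as np

def pgirl_weights(D, tol=1e-8):
    """PGIRL sketch: recover reward weights from estimated gradients.

    D: (q, d) matrix whose i-th row is the estimated gradient of J_i
    at the expert's parameters. Returns a vector in Null(G), with
    G = D D', rescaled to sum to one.
    """
    G = D @ D.T                              # (q, q) Gram matrix, O(q^2 d)
    _, s, Vt = np.linalg.svd(G)              # null space via SVD, O(q^3)
    null_rows = Vt[s <= tol * max(s[0], tol)]
    # With noisy gradients G is rarely exactly singular: fall back to the
    # right-singular vector of the smallest singular value.
    w = null_rows[0] if len(null_rows) else Vt[-1]
    if w.sum() < 0:                          # null space is defined up to scale
        w = -w
    return w / w.sum()                       # assumes a non-negative solution
```

The rescaling step enforces the Σ_i ω_i = 1 normalization of the Pareto-stationarity condition; as the text notes, when G is rank-deficient by more than one this single vector is not unique and an additional selection criterion is required.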
the objectives spanned by the chosen representation. Preprocessing of the gradient or Gram matrix cannot be carried out, since the rank is upper bounded by the number of policy parameters and not by the number of reward parameters. However, this problem can be overcome by considering additional constraints, or directly in the design phase, since both q and d are usually under the control of the user.

When the Gram matrix G is not full rank we do not have a unique solution. In this case the null space operator returns an orthonormal basis Z such that Dθ J(π^E) · Z is a null matrix. However, infinitely many vectors are spanned by the orthonormal basis. The constraint on the ℓ1-norm is not sufficient for the selection of the reward parameters; an additional criterion is required. Several approaches have been defined in the literature. For example, in order to select solution vectors that consider only a subset of reward features, it is possible to design a sparsification approach (in the experiments the ℓ1/2-norm has been used). A different approach is suggested by the field of reward shaping, where the goal is to find the reward function that maximizes an information criterion (Ng, Harada, and Russell 1999; Wiewiora, Cottrell, and Elkan 2003).

Approximate Case. When the state transition model is unknown, the gradients are estimated from trajectories using some model-free approach (Peters and Schaal 2008a). The estimation errors of the gradients imply an error in the expert weights estimated by PGIRL. In the following theorem we provide an upper bound on the ℓ2-norm of the difference between the expert weights ω^E and the weights ω^A estimated by PGIRL, given ℓ2-norm upper bounds on the gradient estimation errors.⁴

Theorem 2. Let ĝ_i denote the estimated value of ∇θ J_i(π^E). Under the assumption that the reward function maximized by the expert is a linear combination of basis functions Φ(s, a) with weights ω^E and that, for each i, ‖∇θ J_i(π^E) − ĝ_i‖₂ ≤ ε_i:

    ‖ω^E − ω^A‖₂ ≤ √( 2 ( 1 − √( 1 − (ε̄/ρ̄)² ) ) ),

where ε̄ = max_i ε_i and ρ̄ is the radius of the largest (q − 1)-dimensional ball inscribed in the convex hull of the points ĝ_i.

Notice that the above bound is valid only when ε̄ < ρ̄; otherwise the upper bound is the maximum distance between two vectors belonging to the unit simplex, that is √2. As expected, when the gradients are known exactly (i.e., ε_i = 0 for all i), the term ε̄ is zero and consequently the upper bound is zero. On the other hand, it is interesting to notice that a small value of the term ε̄ is not enough to guarantee that the weights computed by PGIRL are a good approximation of the expert ones. In fact, when ρ̄ = ε̄ the upper bound on the weight error reaches its maximum value. Intuitively, a small value of ρ̄ means that all the gradients are aligned along some direction, thus making the problem less robust to approximation errors.

Approximated Expert's Policy Parameters
When the expert's policy is unknown, to apply the same idea presented in the previous section we need to infer a parametric policy model from a set of trajectories {τ_i}_{i=1}^N of length M. This is a standard density estimation problem. Given a parametric policy model π(a; s, θ), a parametric density estimation problem can be defined as a maximum likelihood estimation (MLE) problem:

    θ_MLE = arg max_θ L(θ; D) = arg max_θ Π_{i=1}^{N·M} π(a_i; s_i, θ).

Related Work
In the last decade, many IRL algorithms have been proposed (see (Zhifei and Meng Joo 2012) for a recent survey), most of which require the MDP model and/or need to iteratively compute the optimal policy of the MDP obtained by considering intermediate reward functions. Since an accurate solution to the IRL problem may require many iterations, these IRL approaches are not practical when the optimal policy for the MDP cannot be efficiently computed. For this reason, some recent works have proposed model-free IRL approaches that do not require solving MDPs. The approach proposed by (Dvijotham and Todorov 2010) does not need to solve many MDPs, but it can be applied only to linearly solvable MDPs. In (Boularias, Kober, and Peters 2011) the authors proposed a model-free version of the Maximum Entropy IRL approach (Ziebart et al. 2008) that minimizes the relative entropy between the empirical distribution of the state-action trajectories demonstrated by the expert and their distribution under the learned policy. Even if this approach avoids the need to solve MDPs, it requires sampling trajectories according to a non-expert (explorative) policy. Classification-based approaches (SCIRL and CSI) (Klein et al. 2012; 2013) can produce near-optimal results when accurate estimates of the feature expectations can be computed, and heuristic versions have proved effective even when only a few demonstrations are available. While SCIRL is limited to linearly parametrized reward functions, CSI can deal with nonlinear functions. However, both algorithms require the expert to be deterministic and need heuristic approaches in order to learn with only the knowledge of the expert's trajectories. Without using heuristics, they require an additional data set for exploration of the system dynamics. Moreover, CSI does not aim to recover the unknown expert's reward, but to obtain a reward for which the expert is nearly optimal. Finally, the main limitation of these classification-based approaches is the assumption that the expert is optimal; the suboptimality scenario was not addressed.

In (Neu and Szepesvári 2007) the authors proposed an IRL approach that can work even with non-linear reward parametrizations, but it needs to estimate the optimal action-value function. Other related works are (Levine and Koltun 2012) and (Johnson, Aghasadeghi, and Bretl 2013). Both approaches optimize an objective function related to the gradient information, but they require knowledge of the dynamics.

⁴ The reader may refer to (Pirotta, Restelli, and Bascetta 2013) for error bounds on the gradient estimate with Gaussian policies.
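The bound of Theorem 2 is cheap to evaluate once ε̄ and ρ̄ are known; a small helper (the function name is hypothetical) that also encodes the validity condition ε̄ < ρ̄ and the √2 worst case:

```python
import math

def weight_error_bound(eps_bar, rho_bar):
    """Evaluate the Theorem 2 upper bound on the l2 weight error.

    eps_bar: largest per-gradient estimation error (max over i of eps_i);
    rho_bar: radius of the largest (q-1)-ball inscribed in the convex
    hull of the estimated gradients.
    """
    if eps_bar >= rho_bar:
        # Outside the validity region the bound degenerates to the
        # maximum distance between two points of the unit simplex.
        return math.sqrt(2.0)
    r = eps_bar / rho_bar
    return math.sqrt(2.0 * (1.0 - math.sqrt(1.0 - r * r)))
```

Note how the two limit behaviors discussed in the text fall out directly: exact gradients (ε̄ = 0) give a zero bound, while ε̄ = ρ̄ saturates it at √2.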
                     PGIRL                                      GIRL
Episodes   R        RB       G        GB       eNAC   |  R        RB       G         GB       eNAC
10         0.031    0.020    0.020    0.021    0.037  |  10.1     8.4      46.1      8.0      3.2
           ±0.011   ±0.0011  ±0.002   ±0.002   ±0.009 |  ±1.5     ±1.1     ±6.2      ±1.3     ±0.3
100        0.173    0.168    0.166    0.170    0.179  |  74.0     70.2     249.7     50.4     32.0
           ±0.007   ±0.001   ±0.003   ±0.003   ±0.003 |  ±19.8    ±9.8     ±73.6     ±3.3     ±6.5
1000       1.664    1.655    1.655    1.683    1.640  |  971.4    472.7    3613.0    498.6    257.4
           ±0.006   ±0.011   ±0.031   ±0.005   ±0.038 |  ±270.0   ±25.2    ±1128.2   ±32.5    ±14.7
Table 1: Average Computational Time (s) for the generation of the results presented in Figure 2b (5D LQR).
[Figure: Pareto frontier with the expert's point J(π^E, ω^E) and the tangent planes estimated from 10, 100, and 1000 episodes.]

The next set of experiments deals with the accuracy and time complexity of the proposed approaches (GIRL and PGIRL) with different gradient estimation methods (REINFORCE with and without baseline (RB, R), GPOMDP with and without baseline (GB, G), and eNAC). We selected several problem dimensions (2, 5, 10, 20). For each domain we selected 20 random expert's weights in the unit simplex and we generated
[Figure 2 panels: (a) 2D LQR, (b) 5D LQR, (c) 10D LQR, (d) 20D LQR; x-axis: Episodes (10 to 1000); curves: R, RB, GB.]
Figure 2: The L∞ norm between the expert's and agent's weights for different problem and data set dimensions. s0 = −10.
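The random expert's weights in the unit simplex used in these experiments can be drawn uniformly by normalizing i.i.d. exponential draws (equivalently, a flat Dirichlet distribution); a small helper with hypothetical naming:

```python
import numpy as np

def sample_simplex_weights(q, n, seed=None):
    """Draw n weight vectors uniformly at random from the unit simplex
    in R^q: normalized i.i.d. Exp(1) draws are uniform on the simplex
    (equivalently, Dirichlet(1, ..., 1))."""
    rng = np.random.default_rng(seed)
    x = rng.exponential(size=(n, q))
    return x / x.sum(axis=1, keepdims=True)
```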
[Figure 3 panels: (a) Optimal Policy (linear), (b) Polynomial 3-deg, (c) Gaussian RBFs; x-axis: Episodes (10 to 1000); y-axis: ‖ω^E − ω^A‖∞; curves: RB, GB, eNAC.]
Figure 3: The L∞ norm between the expert's and agent's weights is reported for different problem and data set dimensions.
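The policy models compared in Figure 3 are fitted to the demonstrations by maximum likelihood, as described in the Approximated Expert's Policy Parameters section. For a Gaussian policy with linear mean and fixed variance, the MLE reduces to ordinary least squares on the (state, action) pairs; a minimal sketch under those assumptions (function name illustrative):

```python
import numpy as np

def fit_linear_gaussian_policy(states, actions):
    """MLE of a linear-mean Gaussian policy pi(a|s) = N(theta' s, sigma^2).

    Maximizing the product of the Gaussian likelihoods over the
    N*M demonstrated pairs is equivalent to least squares in theta.
    """
    S = np.asarray(states, dtype=float)      # (N*M, d) state features
    a = np.asarray(actions, dtype=float)     # (N*M,)  demonstrated actions
    theta, *_ = np.linalg.lstsq(S, a, rcond=None)
    residuals = a - S @ theta
    sigma2 = residuals @ residuals / len(a)  # MLE (biased) variance estimate
    return theta, sigma2
```

The polynomial and RBF models of Figure 3 fit in the same template by replacing the raw state with polynomial or radial-basis features of the state.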
an intermediate value for increasing numbers of samples. By investigating the policies and the rewards recovered by PGIRL, we have noticed that it favors lower velocities when approaching the goal. This explains why it takes slightly more steps than CSI and SCIRL. On the other side, CSI and SCIRL exploit more samples than PGIRL because, as mentioned before, they build additional samples using heuristics.

Conclusions and Future Work
We presented GIRL, a novel inverse reinforcement learning approach that is able to recover the reward parameters by searching for the reward function that minimizes the policy gradient, avoiding the need to solve the "direct" MDP. While GIRL defines a minimum-norm problem for linear and non-linear reward parametrizations, PGIRL is an efficient implementation of GIRL for the case of linear reward functions. We showed that, knowing the accuracy of the gradient estimates, it is possible to derive guarantees on the error between the recovered and the expert's reward weights. Finally, the algorithm has been evaluated on the LQR domain and compared with the state of the art on the well-known mountain car domain. Tests show that GIRL represents a novel approach that efficiently and effectively solves IRL problems.

References
Baxter, J., and Bartlett, P. L. 2001. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research (JAIR) 15:319–350.
Boularias, A.; Kober, J.; and Peters, J. 2011. Relative entropy inverse reinforcement learning. In Proc. AISTATS, 182–189.
Brown, M., and Smith, R. E. 2005. Directed multi-objective optimization. International Journal of Computers, Systems, and Signals 6(1):3–17.
Désidéri, J.-A. 2012. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique 350(5–6):313–318.
Dvijotham, K., and Todorov, E. 2010. Inverse optimal control with linearly-solvable MDPs. In Proc. ICML, 335–342.
Fliege, J., and Svaiter, B. F. 2000. Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research 51(3):479–494.
Johnson, M.; Aghasadeghi, N.; and Bretl, T. 2013. Inverse optimal control for deterministic continuous-time nonlinear systems. In Proc. CDC, 2906–2913.
Kakade, S. M. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems 14, 1531–1538. MIT Press.
Klein, E.; Geist, M.; Piot, B.; and Pietquin, O. 2012. Inverse reinforcement learning through structured classification. In Advances in Neural Information Processing Systems, 1007–1015.
Klein, E.; Piot, B.; Geist, M.; and Pietquin, O. 2013. A cascaded supervised learning approach to inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases, 1–16. Springer.
Lagoudakis, M. G., and Parr, R. 2003. Least-squares policy iteration. Journal of Machine Learning Research 4:1107–1149.
Levine, S., and Koltun, V. 2012. Continuous inverse optimal control with locally optimal examples. In Proc. ICML, 41–48.
Nakos, G., and Joyner, D. 1998. Linear Algebra with Applications. PWS Publishing Company.
Neu, G., and Szepesvári, C. 2007. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proc. UAI, 295–302.
Ng, A. Y., and Russell, S. J. 2000. Algorithms for inverse reinforcement learning. In Proc. ICML, 663–670.
Ng, A. Y.; Harada, D.; and Russell, S. J. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proc. ICML, 278–287.
Peters, J., and Schaal, S. 2008a. Natural actor-critic. Neurocomputing 71(7–9):1180–1190.
Peters, J., and Schaal, S. 2008b. Reinforcement learning of motor skills with policy gradients. Neural Networks 21(4):682–697.
Pirotta, M.; Parisi, S.; and Restelli, M. 2015. Multiobjective reinforcement learning with continuous Pareto frontier approximation. In Proc. AAAI, 2928–2934.
Pirotta, M.; Restelli, M.; and Bascetta, L. 2013. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems 26, 1394–1402. Curran Associates, Inc.
Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. A. 2014. Deterministic policy gradient algorithms. In Proc. ICML, 387–395.
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, 1057–1063. The MIT Press.
Syed, U., and Schapire, R. E. 2007. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, 1449–1456.
Wiewiora, E.; Cottrell, G. W.; and Elkan, C. 2003. Principled methods for advising reinforcement learning agents. In Proc. ICML, 792–799.
Williams, R. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.
Zhifei, S., and Meng Joo, E. 2012. A survey of inverse reinforcement learning techniques. International Journal of Intelligent Computing and Cybernetics 5(3):293–311.
Ziebart, B. D.; Maas, A. L.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum entropy inverse reinforcement learning. In Proc. AAAI, 1433–1438.