Automatica
journal homepage: www.elsevier.com/locate/automatica
Brief paper
1. Introduction

The adaptive controller design for unknown linear systems has been intensively studied in the past literature (e.g. Ioannou and Sun (1996), Mareels and Polderman (1996) and Tao (2003)). A conventional way to design an adaptive optimal control law is to identify the system parameters first and then solve the related algebraic Riccati equation. However, adaptive systems designed this way are known to respond slowly to parameter variations from the plant.

Inspired by the learning behavior of biological systems, reinforcement learning (Sutton & Barto, 1998) and approximate/adaptive dynamic programming (ADP) (Werbos, 1974) theories have been broadly applied to solving optimal control problems for uncertain systems in recent years. See, for example, Lewis and Vrabie (2009) and Wang, Zhang, and Liu (2009) for two review papers, and Al-Tamimi, Lewis, and Abu-Khalaf (2007), Bhasin, Sharma, Patre, and Dixon (2011), Dierks and Jagannathan (2011), Doya (2000), Ferrari, Steck, and Chandramohan (2008), Jiang and Jiang (2011), Vamvoudakis and Lewis (2011), Vrabie, Pastravanu, Abu-Khalaf, and Lewis (2009), Werbos (1998, 2009), Xu, Jagannathan, and Lewis (2012) and Zhang, Wei, and Liu (2011) for some recently developed results.

Among the different ADP approaches for discrete-time systems, action-dependent heuristic dynamic programming (ADHDP) (Werbos, 1989), or Q-learning (Watkins, 1989), is an online iterative scheme that does not depend on the model to be controlled. Recently, the methodology has been extended and applied to many different areas, such as nonzero-sum games (Al-Tamimi et al., 2007), networked control systems (Xu et al., 2012), and optimal output feedback control designs (Lewis & Vamvoudakis, 2011), as well as multi-objective optimal control problems (Wei, Zhang, & Dai, 2009).

Due to the different structures of the algebraic Riccati equations for discrete-time and continuous-time systems, results developed in the discrete-time setting cannot be directly applied to continuous-time problems. In Baird (1994), the advantage updating technique for sampled continuous-time systems was proposed. In Murray, Cox, Lendaris, and Saeks (2002), two model-free algorithms were developed, but measurements of the state derivatives must be used. An exact method was developed in Vrabie et al. (2009), where neither knowledge of the internal system dynamics nor measurement of the state derivatives is necessary. However, a common feature of all the existing ADP-based results for continuous-time systems is that partial knowledge of the system dynamics is assumed to be exactly known.

✩ This work has been supported in part by the National Science Foundation grants DMS-090665 and ECCS-1101401. This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Shuzhi Sam Ge under the direction of Editor Miroslav Krstic.
E-mail addresses: yjiang06@students.poly.edu (Y. Jiang), zjiang@poly.edu (Z.-P. Jiang).
1 Tel.: +1 718 260 3976; fax: +1 718 260 3906.
0005-1098/$ – see front matter © 2012 Elsevier Ltd. All rights reserved.
doi:10.1016/j.automatica.2012.06.096

The primary objective of this paper is to remove this assumption of partial knowledge of the system dynamics, and thus to develop a truly model-free ADP algorithm. More specifically, we propose a novel computational adaptive optimal control methodology
2700 Y. Jiang, Z.-P. Jiang / Automatica 48 (2012) 2699–2704
that employs the approximate/adaptive dynamic programming technique to iteratively solve the algebraic Riccati equation using online information of the state and input, without requiring a priori knowledge of the system matrices. In addition, all iterations can be conducted by repeatedly using the same state and input information on some fixed time intervals. It should be noticed that our approach may serve as a computational tool to study ADP-related problems for continuous-time systems.

This paper is organized as follows. In Section 2, we briefly introduce the standard linear optimal control problem for continuous-time systems and the policy iteration technique. In Section 3, we develop our computational adaptive optimal control method and show its convergence. A practical online algorithm is also provided. In Section 4, we apply the proposed approach to the optimal controller design problem for a turbocharged diesel engine with exhaust gas recirculation. Concluding remarks as well as potential future extensions are contained in Section 5.

Notation. Throughout this paper, we use R and Z+ to denote the sets of real numbers and non-negative integers, respectively. Vertical bars ∥·∥ represent the Euclidean norm for vectors, or the induced matrix norm for matrices. We use ⊗ to indicate the Kronecker product, and vec(A) is defined to be the mn-vector formed by stacking the columns of A ∈ R^{n×m} on top of one another, i.e., vec(A) = [a_1^T a_2^T ⋯ a_m^T]^T, where a_i ∈ R^n are the columns of A. A control law is also called a policy. A feedback gain matrix K ∈ R^{m×n} is said to be stabilizing for the linear system ẋ = Ax + Bu if the matrix A − BK is Hurwitz.

2. Problem formulation and preliminaries

Consider a continuous-time linear system described by

ẋ = Ax + Bu   (1)

where x ∈ R^n is the system state, fully available for feedback control design; u ∈ R^m is the control input; and A ∈ R^{n×n} and B ∈ R^{n×m} are unknown constant matrices. In addition, the system is assumed to be stabilizable. The design objective is to find a linear optimal control law of the form

u = −Kx.   (2)

By the assumptions mentioned above, (4) has a unique symmetric positive definite solution P*. The optimal feedback gain matrix K* in (2) can thus be determined by

K* = R^{−1} B^T P*.   (5)

Since (4) is nonlinear in P, it is usually difficult to solve P* directly from (4), especially for large-size matrices. Nevertheless, many efficient algorithms have been developed to numerically approximate the solution of (4). One such algorithm was developed in Kleinman (1968) and is introduced in the following:

Theorem 1 (Kleinman, 1968). Let K_0 ∈ R^{m×n} be any stabilizing feedback gain matrix, and let P_k be the symmetric positive definite solution of the Lyapunov equation

(A − BK_k)^T P_k + P_k (A − BK_k) + Q + K_k^T R K_k = 0   (6)

where K_k, with k = 1, 2, …, is defined recursively by

K_k = R^{−1} B^T P_{k−1}.   (7)

Then, the following properties hold:
1. A − BK_k is Hurwitz,
2. P* ≤ P_{k+1} ≤ P_k,
3. lim_{k→∞} K_k = K*, lim_{k→∞} P_k = P*.

In Kleinman (1968), by iteratively solving the Lyapunov equation (6), which is linear in P_k, and updating K_k by (7), the solution to the nonlinear equation (4) is numerically approximated.

For the purpose of solving (6) without the knowledge of A, in Vrabie et al. (2009), (6) was implemented online by

x^T(t) P_k x(t) − x^T(t + δt) P_k x(t + δt) = ∫_t^{t+δt} (x^T Q x + u_k^T R u_k) dτ   (8)

where u_k = −K_k x is the control input of the system on the time interval [t, t + δt].

Since both x and u_k can be measured online, a symmetric solution P_k can be uniquely determined under some persistent excitation (PE) condition (Vrabie et al., 2009). However, as can be seen from (7), exact knowledge of the system matrix B is still required for the iterations. Also, to guarantee the PE condition, the state may need to be reset at each iteration step, but this may cause technical problems for the stability analysis of the closed-loop system (Vrabie et al., 2009). An alternative way is to add exploration noise (e.g. Al-Tamimi et al. (2007), Bradtke, Ydstie, and Barto (1994), Vamvoudakis and Lewis (2011) and Xu et al. (2012)) such that u_k = −K_k x + e, with e the exploration noise, is used as the true control input in (8). As a result, the P_k solved from (8) and the one solved from (6) are not exactly the same. In addition, each time the control policy is updated, information of the state and input must be re-collected for the next iteration. This may slow down the learning process, especially for high-dimensional systems.

where A_k = A − BK_k. Then, along the solutions of (9), by (6) and (7) it follows that

x(t + δt)^T P_k x(t + δt) − x(t)^T P_k x(t)
  = ∫_t^{t+δt} [x^T (A_k^T P_k + P_k A_k) x + 2(u + K_k x)^T B^T P_k x] dτ
  = −∫_t^{t+δt} x^T Q_k x dτ + 2 ∫_t^{t+δt} (u + K_k x)^T R K_{k+1} x dτ   (10)

where Q_k = Q + K_k^T R K_k.
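When A and B are known, Kleinman's iteration (6)-(7) is straightforward to implement and provides a useful baseline against which a model-free scheme can be checked. The following is a minimal sketch in Python with SciPy, not taken from the paper: the matrices A, B, Q, R are illustrative, and K0 = 0 is a valid stabilizing initial gain only because this particular A is already Hurwitz.

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

# Illustrative system (A is Hurwitz, so K0 = 0 is a stabilizing initial gain).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

K = np.zeros((1, 2))                # K0: stabilizing initial policy
for _ in range(10):
    Ak = A - B @ K
    # Policy evaluation: solve the Lyapunov equation (6) for Pk.
    Pk = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
    # Policy improvement: update the gain by (7).
    K = np.linalg.solve(R, B.T @ Pk)

# By Theorem 1, Pk decreases monotonically to the stabilizing ARE solution P*.
P_star = solve_continuous_are(A, B, Q, R)
err = np.linalg.norm(Pk - P_star)
```

Since the iteration is a Newton method for the Riccati equation, a handful of iterations typically suffices; the point of the paper is precisely to reproduce this convergence without access to A and B.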
Box I.
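The integral relation (8), which underlies both the method of Vrabie et al. (2009) and the identity (10), can be checked numerically for a known closed-loop system before any online data is used. The sketch below is illustrative (the matrices and the gain Kk are not from the paper): it integrates the closed-loop dynamics together with the running cost and compares the two sides of (8).

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_lyapunov

# Illustrative system and a stabilizing gain Kk.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
Kk = np.array([[1.0, 1.0]])
Ak = A - B @ Kk                     # closed loop under uk = -Kk x

# Pk from the Lyapunov equation (6).
Pk = solve_continuous_lyapunov(Ak.T, -(Q + Kk.T @ R @ Kk))

# Augment the state with the running cost c(t) = ∫ (x'Qx + uk'R uk) dτ.
def rhs(t, z):
    x = z[:2]
    u = -Kk @ x
    return np.concatenate([Ak @ x, [x @ Q @ x + u @ R @ u]])

x0 = np.array([1.0, -0.5])
dt = 0.7
sol = solve_ivp(rhs, (0.0, dt), np.concatenate([x0, [0.0]]),
                rtol=1e-10, atol=1e-12)
xT = sol.y[:2, -1]

lhs = x0 @ Pk @ x0 - xT @ Pk @ xT   # left-hand side of (8)
rhs_val = sol.y[2, -1]              # integral on the right-hand side of (8)
```

The two quantities agree up to the integration tolerance, confirming that (8) evaluates P_k from trajectory data alone, without differentiating the state.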
Fig. 4. Convergence of P_k and K_k to their optimal values P* and K* during the learning process.

iteration. In addition, the method in Vrabie et al. (2009) may need to reset the state at each iteration step, in order to satisfy the PE condition.

5. Conclusions and future work

A novel computational policy iteration approach for finding online adaptive optimal controllers for continuous-time linear systems with completely unknown system dynamics has been presented in this paper. This method solves the algebraic Riccati equation iteratively, using system state and input information collected online, without knowing the system matrices. A practical online algorithm was proposed and has been applied to the control design for a turbocharged diesel engine with unknown parameters. The methodology developed in this paper may serve as a computational tool to study the adaptive optimal control of uncertain nonlinear systems (Krstić, Kanellakopoulos, & Kokotović, 1995). Some related work has appeared in Vamvoudakis and Lewis (2010) and Vrabie and Lewis (2009, 2010), which are developed using neural networks (Ge, Lee, & Harris, 1998), and also in our recent work Jiang and Jiang (2011), which proposes a framework of robust ADP using the ISS small-gain theorem (Jiang, Teel, & Praly, 1994).

Acknowledgments

The authors wish to thank Dr. Paul Werbos, Dr. Iven Mareels, Dr. Lei Guo, the Associate Editor and the reviewers for their constructive comments.

Appendix. Proof of Lemma 6

This amounts to showing that the following linear equation

Θ_k X = 0   (A.1)

has only the trivial solution X = 0.

To this end, we prove by contradiction. Assume X = [Y_v^T  Z_v^T]^T ∈ R^{n(n+1)/2 + mn} is a nonzero solution of (A.1), where Y_v ∈ R^{n(n+1)/2} and Z_v ∈ R^{mn}. Then, a symmetric matrix Y ∈ R^{n×n} and a matrix Z ∈ R^{m×n} can be uniquely determined such that Ŷ = Y_v and vec(Z) = Z_v.

By (10), we have

Θ_k X = I_xx vec(M) + 2 I_xu vec(N)   (A.2)

where

M = A_k^T Y + Y A_k + K_k^T N + N^T K_k,   (A.3)
N = B^T Y − RZ.   (A.4)

Notice that since M is symmetric, we have

I_xx vec(M) = I_x̄ M̂   (A.5)

where I_x̄ ∈ R^{l × n(n+1)/2} is defined as

I_x̄ = [∫_{t_0}^{t_1} x̄ dτ, ∫_{t_1}^{t_2} x̄ dτ, …, ∫_{t_{l−1}}^{t_l} x̄ dτ]^T.   (A.6)

Then, (A.1) and (A.2) imply the following matrix form of linear equations

[I_x̄  2I_xu] [M̂^T  vec(N)^T]^T = 0.   (A.7)

Under the rank condition in (13), we know [I_x̄  2I_xu] has full column rank. Therefore, the only solution to (A.7) is M̂ = 0 and vec(N) = 0. As a result, we have M = 0 and N = 0.

Now, by (A.4) we know Z = R^{−1} B^T Y, and (A.3) is reduced to the following Lyapunov equation

A_k^T Y + Y A_k = 0.   (A.8)

Since A_k is Hurwitz for all k ∈ Z+, the only solution to (A.8) is Y = 0. Finally, by (A.4) we have Z = 0.

In summary, we have X = 0, which contradicts the assumption that X ≠ 0. Therefore, Θ_k must have full column rank for all k ∈ Z+. The proof is complete.

References

Al-Tamimi, A., Lewis, F. L., & Abu-Khalaf, M. (2007). Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica, 43(3), 473–481.
Baird, L. C., III (1994). Reinforcement learning in continuous time: advantage updating. In Proceedings of IEEE international conference on neural networks (pp. 2448–2453).
Bhasin, S., Sharma, N., Patre, P., & Dixon, W. E. (2011). Asymptotic tracking by a reinforcement learning-based adaptive critic controller. Journal of Control Theory and Applications, 9(3), 400–409.
Bradtke, S. J., Ydstie, B. E., & Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. In Proceedings of American control conference (pp. 3475–3479).
Dierks, T., & Jagannathan, S. (2011). Online optimal control of nonlinear discrete-time systems using approximate dynamic programming. Journal of Control Theory and Applications, 9(3), 361–369.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12, 219–245.
Ferrari, S., Steck, J. E., & Chandramohan, R. (2008). Adaptive feedback control by constrained approximate dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 38(4), 982–987.
Ge, S. S., Lee, T. H., & Harris, C. J. (1998). Adaptive neural network control of robotic manipulators. World Scientific Pub. Co. Inc.
Ioannou, P. A., & Sun, J. (1996). Robust adaptive control. Upper Saddle River, NJ: Prentice-Hall.
Jiang, Y., & Jiang, Z. P. (2011). Robust approximate dynamic programming and global stabilization with nonlinear dynamic uncertainties. In Proceedings of the joint IEEE conference on decision and control and European control conference (pp. 115–120).
Jiang, Z. P., Teel, A. R., & Praly, L. (1994). Small-gain theorem for ISS systems and applications. Mathematics of Control, Signals, and Systems, 7(2), 95–120.
Jung, M., Glover, K., & Christen, U. (2005). Comparison of uncertainty parameterisations for H∞ robust control of turbocharged diesel engines. Control Engineering Practice, 13, 15–25.
Kleinman, D. (1968). On an iterative technique for Riccati equation computations. IEEE Transactions on Automatic Control, 13(1), 114–115.
Krstić, M., Kanellakopoulos, I., & Kokotović, P. V. (1995). Nonlinear and adaptive control design. John Wiley.
Lewis, F. L., & Syrmos, V. L. (1995). Optimal control. Wiley.
Lewis, F. L., & Vamvoudakis, K. G. (2011). Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 41(1), 14–23.
Lewis, F. L., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Transactions on Circuits and Systems Magazine, 9(3), 32–50.
Mareels, I., & Polderman, J. W. (1996). Adaptive systems: an introduction. Birkhäuser.
Murray, J. J., Cox, C. J., Lendaris, G. G., & Saeks, R. (2002). Adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 32(2), 140–153.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction. MIT Press.
Tao, G. (2003). Adaptive control design and analysis. Wiley.
Vamvoudakis, K. G., & Lewis, F. L. (2010). Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46(5), 878–888.
Vamvoudakis, K. G., & Lewis, F. L. (2011). Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton–Jacobi equations. Automatica, 47(8), 1556–1569.
Vrabie, D., & Lewis, F. L. (2009). Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks, 22(3), 237–246.
Vrabie, D., & Lewis, F. (2010). Adaptive dynamic programming algorithm for finding online the equilibrium solution of the two-player zero-sum differential game. In Proceedings of IEEE joint conference on neural networks (pp. 1–8).
Vrabie, D., Pastravanu, O., Abu-Khalaf, M., & Lewis, F. L. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2), 477–484.
Wang, F.-Y., Zhang, H., & Liu, D. (2009). Adaptive dynamic programming: an introduction. IEEE Computational Intelligence Magazine, 4(2), 39–47.
Watkins, C. (1989). Learning from delayed rewards. Ph.D. Thesis. King's College of Cambridge.
Wei, Q., Zhang, H., & Dai, J. (2009). Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions. Neurocomputing, 72(7–9), 1839–1848.
Werbos, P. J. (1974). Beyond regression: new tools for prediction and analysis in the behavioural sciences. Ph.D. Thesis. Harvard University.
Werbos, P. J. (1989). Neural networks for control and system identification. In Proceedings of IEEE conference on decision and control (pp. 260–265).
Werbos, P. J. (1998). Stable adaptive control using new critic designs. Preprint. http://arxiv.org/abs/adap-org/9810001.
Werbos, P. J. (2009). Intelligence in the brain: a theory of how it works and how to build it. Neural Networks, 22(3), 200–212.
Xu, H., Jagannathan, S., & Lewis, F. L. (2012). Stochastic optimal control of unknown linear networked control system in the presence of random delays and packet losses. Automatica, 48(6), 1017–1030.
Zhang, H., Wei, Q., & Liu, D. (2011). An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games. Automatica, 47(1), 207–214.

Yu Jiang was born in Xi'an, China in 1984. He received his B.S. degree in mathematics and applied mathematics from Sun Yat-Sen University, Guangzhou, China, in 2006, and his M.S. degree in control theory and control engineering from South China University of Technology, Guangzhou, China, in 2009. Since fall 2009, he has been pursuing his Ph.D. degree and working as a research assistant at the Control and Networks Lab in the Polytechnic Institute of New York University. His research interests include approximate/adaptive dynamic programming, nonlinear control, and mathematical modeling of biological systems.

Zhong-Ping Jiang (M'94, SM'02, F'08) received the B.Sc. degree in mathematics from the University of Wuhan, Wuhan, China, in 1988, the M.Sc. degree in statistics from the University of Paris XI, France, in 1989, and the Ph.D. degree in automatic control and mathematics from the Ecole des Mines de Paris, France, in 1993.
Currently, he is a Full Professor of Electrical and Computer Engineering at the Polytechnic Institute of New York University (formerly called Polytechnic University), and an affiliated Changjiang Chair Professor at Beijing University. His main research interests include stability theory, robust and adaptive nonlinear control, adaptive dynamic programming and their applications to underactuated mechanical systems, communication networks, multi-agent systems, smart grid and systems physiology. He is coauthor of the book Stability and Stabilization of Nonlinear Systems (with Dr. I. Karafyllis, Springer 2011).
An IEEE Fellow, Dr. Jiang is a Subject Editor for the International Journal of Robust and Nonlinear Control and has served as an Associate Editor for several journals including Mathematics of Control, Signals and Systems (MCSS), Systems & Control Letters, IEEE Transactions on Automatic Control, European Journal of Control and J. Control Theory and Applications. Dr. Jiang is a recipient of the prestigious Queen Elizabeth II Fellowship Award from the Australian Research Council, the CAREER Award from the US National Science Foundation, and the Young Investigator Award from the NSF of China. He received the Best Theory Paper Award (with Y. Wang) at the 2008 World Congress on Intelligent Control and Automation, and, with T. Liu and D.J. Hill, the Guan Zhao Zhi Best Paper Award at the 2011 Chinese Control Conference.