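Let $\alpha(z_\eta) \in \mathbb{R}^3$ denote the virtual control of the $z_\eta$-subsystem (4). The infinite horizon value function is defined as

$$V_\eta(z_\eta) = \int_t^\infty r_\eta\big(z_\eta(s), \alpha(z_\eta)\big)\,ds \tag{5}$$

where $r_\eta(z, \alpha) = z_\eta^T z_\eta + \alpha^T \alpha$ is the immediate or local cost function.

Remark 2: The optimal problem for a dynamic system is to find an admissible control policy [20] such that the control objective is realized by expending the minimal cost. For example, for the $z_\eta$-subsystem (4), the optimal virtual control is designed to guarantee that the infinite horizon value function (5) is minimized.

View $\nu(t)$ as the optimal virtual control $\alpha^*(z_\eta)$, i.e., $\nu(t) \triangleq \alpha^*$. The optimal value function is then

$$V_\eta^*(z_\eta) = \min_{\alpha \in \Psi(\Omega_\eta)} \int_t^\infty r_\eta(z_\eta, \alpha)\,ds = \int_t^\infty r_\eta(z_\eta, \alpha^*)\,ds \tag{6}$$

where $\Psi(\Omega_\eta)$ denotes the set of admissible control policies over $\Omega_\eta$, and $\Omega_\eta \subset \mathbb{R}^3$ is a compact set.

The Hamiltonian function associated with the infinite horizon value function (6) is

$$H_\eta\!\left(z_\eta, \alpha, \frac{\partial V_\eta}{\partial z_\eta}\right) = r_\eta(z_\eta, \alpha) + \frac{\partial V_\eta^T}{\partial z_\eta}\,\dot z_\eta(t) \tag{7}$$

where $\partial V_\eta/\partial z_\eta$ denotes the gradient of $V_\eta$ with respect to $z_\eta$.

According to both (6) and (7), there is the following HJB equation:

$$H_\eta\!\left(z_\eta, \alpha^*, \frac{\partial V_\eta^*}{\partial z_\eta}\right) = z_\eta^T z_\eta + \alpha^{*T}\alpha^* + \frac{\partial V_\eta^{*T}}{\partial z_\eta}\big(\alpha^* - \dot\eta_d\big) = 0. \tag{8}$$

Assuming the solution of (8) exists and is unique, the optimal virtual control $\alpha^*$ can be obtained by solving $\partial H_\eta(z_\eta, \alpha^*, \partial V_\eta^*/\partial z_\eta)/\partial \alpha^* = 0$:

$$\alpha^*(t) = -\frac{1}{2}\frac{\partial V_\eta^*}{\partial z_\eta}. \tag{9}$$

Substituting (9) into (8) yields the following result:

$$\|z_\eta(t)\|^2 - \frac{\partial V_\eta^{*T}}{\partial z_\eta}\dot\eta_d(t) - \frac{1}{4}\frac{\partial V_\eta^{*T}}{\partial z_\eta}\frac{\partial V_\eta^*}{\partial z_\eta} = 0. \tag{10}$$

By substituting the solution of (10) into (9), the optimal virtual control can be obtained. However, solving the equation is very difficult or impossible because of its strong nonlinearities. In order to realize the control scheme, the online RL of actor-critic architecture is performed by employing adaptive NN approximation.

Rewrite the optimal value function (6) as

$$V_\eta^*(z_\eta) = \beta_\eta\|z_\eta(t)\|^2 + V_\eta^o(z_\eta) \tag{11}$$

where $\beta_\eta$ is a positive design constant and $V_\eta^o(z_\eta) = -\beta_\eta\|z_\eta(t)\|^2 + V_\eta^*(z_\eta)$.

Inserting (11) into (9), the optimal virtual control can be rewritten as

$$\alpha^* = -\frac{1}{2}\frac{\partial V_\eta^*}{\partial z_\eta} = -\beta_\eta z_\eta(t) - \frac{1}{2}\frac{\partial V_\eta^o}{\partial z_\eta}. \tag{12}$$

It is well known that NNs have excellent adaptive learning and function approximating abilities; they can approximate any continuous function to desired accuracy. Since the scalar value function $V_\eta^o$ is continuous for $z_\eta \in \Omega_\eta$, it can be approximated by NNs in the following form:

$$V_\eta^o(z_\eta) = W_\eta^{*T} S_\eta(z_\eta) + \varepsilon_\eta(z_\eta) \tag{13}$$

where $W_\eta^* \in \mathbb{R}^{n_\eta}$ is the ideal NN weight, $n_\eta$ is the neuron number, $S_\eta(z_\eta) \in \mathbb{R}^{n_\eta}$ is the basis function vector, and $\varepsilon_\eta(z_\eta) \in \mathbb{R}$ is the NN approximation error, which is required to be bounded together with its derivative (for more details see [20]).

Based on the ideal approximation (13), $V_\eta^*(z_\eta)$ and $\alpha^*$ can be re-expressed as

$$V_\eta^*(z_\eta(t)) = \beta_\eta\|z_\eta(t)\|^2 + W_\eta^{*T} S_\eta(z_\eta) + \varepsilon_\eta(z_\eta)$$

$$\alpha^* = -\beta_\eta z_\eta(t) - \frac{1}{2}\frac{\partial^T S_\eta}{\partial z_\eta} W_\eta^* - \frac{1}{2}\frac{\partial \varepsilon_\eta}{\partial z_\eta} \tag{14}$$

where $(\partial^T S_\eta/\partial z_\eta) \in \mathbb{R}^{3 \times n_\eta}$ and $(\partial \varepsilon_\eta/\partial z_\eta) \in \mathbb{R}^3$ are the gradients with respect to $z_\eta$.

Using the NN approximation (14), the HJB equation (8) can be rewritten as

$$H_\eta(z_\eta, \alpha^*, W_\eta^*) = -(\beta_\eta^2 - 1)\|z_\eta\|^2 - 2\beta_\eta z_\eta^T \dot\eta_d - W_\eta^{*T}\frac{\partial S_\eta}{\partial z_\eta^T}\big(\beta_\eta z_\eta + \dot\eta_d\big) - \frac{1}{4}\left\|\frac{\partial^T S_\eta}{\partial z_\eta} W_\eta^*\right\|^2 + \rho_\eta(t) = 0 \tag{15}$$

where $\rho_\eta(t) = (\partial \varepsilon_\eta/\partial z_\eta^T)\alpha^*(t) + (1/4)\|\partial \varepsilon_\eta/\partial z_\eta\|^2 - (\partial \varepsilon_\eta/\partial z_\eta^T)\dot\eta_d(t)$, which is bounded by a positive constant $\psi_\eta$, i.e., $|\rho_\eta(t)| \le \psi_\eta$.

The optimal virtual control (14) is unavailable because the ideal weight matrix $W_\eta^*$ is unknown. In order to achieve the control scheme, the RL algorithm is performed by constructing the following critic and actor NNs, which are utilized to evaluate the control performance and execute the virtual control, respectively:

$$\hat V_\eta^*(z_\eta) = \beta_\eta\|z_\eta(t)\|^2 + \hat W_{\eta c}^T(t) S_\eta(z_\eta) \tag{16}$$

$$\hat\alpha(z_\eta) = -\beta_\eta z_\eta(t) - \frac{1}{2}\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t) \tag{17}$$

where $\hat V_\eta^*$ denotes the estimation of $V_\eta^*$; $\hat W_{\eta c} \in \mathbb{R}^{n_\eta}$ and $\hat W_{\eta a} \in \mathbb{R}^{n_\eta}$ are the critic and actor NN weights, respectively.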
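As an illustration of how the critic estimate (16) and the virtual control (17) could be evaluated, the following is a minimal sketch rather than the authors' implementation. It assumes a Gaussian radial basis vector for $S_\eta$ (the basis type, the centers, the width, the neuron number, and the value of $\beta_\eta$ are all hypothetical choices made here for illustration), and all function names are invented for this sketch.

```python
import numpy as np

def rbf_basis(z, centers, width):
    """Basis vector S(z) in R^n; a Gaussian basis is an assumption of this sketch."""
    diff = z - centers                              # shape (n, 3)
    return np.exp(-np.sum(diff**2, axis=1) / width**2)

def rbf_jacobian(z, centers, width):
    """Jacobian dS/dz^T in R^{n x 3}; its transpose plays the role of d^T S/dz."""
    diff = z - centers
    s = rbf_basis(z, centers, width)
    return (-2.0 / width**2) * s[:, None] * diff

def critic_value(z, w_c, beta, centers, width):
    """Estimated value function, following the form of (16)."""
    return beta * (z @ z) + w_c @ rbf_basis(z, centers, width)

def actor_control(z, w_a, beta, centers, width):
    """Virtual control, following the form of (17)."""
    return -beta * z - 0.5 * rbf_jacobian(z, centers, width).T @ w_a

# Hypothetical setup: 8 neurons, random centers, zero initial weights.
rng = np.random.default_rng(0)
n_eta, beta_eta, width = 8, 3.0, 2.0
centers = rng.uniform(-1.0, 1.0, size=(n_eta, 3))
z_eta = np.array([0.3, 0.1, 0.2])                   # example tracking error
w_c = np.zeros(n_eta)
w_a = np.zeros(n_eta)

v_hat = critic_value(z_eta, w_c, beta_eta, centers, width)
alpha_hat = actor_control(z_eta, w_a, beta_eta, centers, width)
print(v_hat, alpha_hat)
```

With zero weights the sketch reduces to the proportional part $-\beta_\eta z_\eta$ of (17); the NN term only contributes once the actor weights have been adapted.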
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
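Adding (16) and (17) into (8), the approximated HJB equation is obtained as

$$H_\eta(z_\eta, \hat\alpha, \hat W_\eta) = \|z_\eta\|^2 + \left\|\beta_\eta z_\eta + \frac{1}{2}\frac{\partial^T S_\eta(z_\eta)}{\partial z_\eta}\hat W_{\eta a}(t)\right\|^2 - \left(2\beta_\eta z_\eta(t) + \frac{\partial^T S_\eta(z_\eta)}{\partial z_\eta}\hat W_{\eta c}(t)\right)^T \left(\beta_\eta z_\eta(t) + \frac{1}{2}\frac{\partial^T S_\eta(z_\eta)}{\partial z_\eta}\hat W_{\eta a}(t) + \dot\eta_d(t)\right). \tag{18}$$

From (15) and (18), the Bellman residual error is derived as

$$e_\eta(t) = H_\eta(z_\eta, \hat\alpha, \hat W_\eta) - H_\eta(z_\eta, \alpha^*, W_\eta^*) = H_\eta(z_\eta, \hat\alpha, \hat W_\eta). \tag{19}$$

By applying the gradient descent algorithm to the positive definite function

$$E_\eta(t) = \frac{1}{2}e_\eta^2(t) \tag{20}$$

the following critic NN updating law is obtained so that the Bellman residual error $e_\eta(t)$ is minimized:

$$\dot{\hat W}_{\eta c}(t) = -\frac{\gamma_{\eta c}}{1 + \|\sigma_\eta(t)\|^2}\, e_\eta(t)\frac{\partial e_\eta(t)}{\partial \hat W_{\eta c}(t)} = -\frac{\gamma_{\eta c}}{1 + \|\sigma_\eta\|^2}\,\sigma_\eta\left(\sigma_\eta^T \hat W_{\eta c}(t) - (\beta_\eta^2 - 1)\|z_\eta(t)\|^2 - 2\beta_\eta z_\eta^T(t)\dot\eta_d(t) + \frac{1}{4}\left\|\frac{\partial^T S_\eta(z_\eta)}{\partial z_\eta}\hat W_{\eta a}(t)\right\|^2\right) \tag{21}$$

where $\gamma_{\eta c} > 0$ is the learning rate and $\sigma_\eta(t) = -(\partial S_\eta(z_\eta)/\partial z_\eta^T)\big(\beta_\eta z_\eta(t) + (1/2)(\partial^T S_\eta/\partial z_\eta)\hat W_{\eta a}(t) + \dot\eta_d(t)\big)$.

The actor NN updating law is designed as follows:

$$\dot{\hat W}_{\eta a}(t) = \frac{1}{2}\frac{\partial S_\eta(z_\eta)}{\partial z_\eta^T} z_\eta(t) - \gamma_{\eta a}\frac{\partial S_\eta}{\partial z_\eta^T}\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t) + \frac{\gamma_{\eta c}}{4\big(1 + \|\sigma_\eta\|^2\big)}\frac{\partial S_\eta}{\partial z_\eta^T}\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t)\,\sigma_\eta^T \hat W_{\eta c}(t) \tag{22}$$

where $\gamma_{\eta a} > 0$ is the learning rate.

Assumption 1 ([35] Persistence of Excitation (PE)): The signal $\sigma_\eta(t)\sigma_\eta^T(t)$ is required to be persistently exciting over the interval $[t, t + t_\eta]$, i.e., there exist constants $\underline{k}_\eta > 0$, $\bar{k}_\eta > 0$, $t_\eta > 0$ such that, for all $t$,

$$\underline{k}_\eta I_3 \le \sigma_\eta(t)\sigma_\eta^T(t) \le \bar{k}_\eta I_3 \tag{23}$$

where $I_3 \in \mathbb{R}^{3 \times 3}$ is the identity matrix.

Remark 3: The PE assumption is also used in the next backstepping step. The signal $\sigma_\nu(t)\sigma_\nu^T(t)$, which is defined in the next backstepping step, is required to meet the PE condition over the interval $[t, t + t_\nu]$, $t_\nu > 0$, i.e., there exist constants $\underline{k}_\nu > 0$, $\bar{k}_\nu > 0$ such that $\underline{k}_\nu I_3 \le \sigma_\nu(t)\sigma_\nu^T(t) \le \bar{k}_\nu I_3$ for all $t$.

By introducing the error variable $z_\nu(t) = \nu(t) - \hat\alpha(z_\eta)$, the error dynamic (4) can be rewritten as

$$\dot z_\eta(t) = z_\nu(t) + \hat\alpha(z_\eta) - \dot\eta_d(t). \tag{24}$$

For the $z_\eta$-subsystem, the Lyapunov function candidate is designed as

$$L_\eta(t) = \frac{1}{2}\|z_\eta(t)\|^2 + \frac{1}{2}\tilde W_{\eta a}^T(t)\tilde W_{\eta a}(t) + \frac{1}{2}\tilde W_{\eta c}^T(t)\tilde W_{\eta c}(t) \tag{25}$$

where $\tilde W_{\eta c}(t) = \hat W_{\eta c}(t) - W_\eta^*$ and $\tilde W_{\eta a}(t) = \hat W_{\eta a}(t) - W_\eta^*$. Its time derivative along (21), (22), and (24) is

$$\dot L_\eta(t) = z_\eta^T(t)\big(z_\nu(t) + \hat\alpha(z_\eta) - \dot\eta_d(t)\big) + \tilde W_{\eta a}^T(t)\left(\frac{1}{2}\frac{\partial S_\eta(z_\eta)}{\partial z_\eta^T} z_\eta(t) - \gamma_{\eta a}\frac{\partial S_\eta}{\partial z_\eta^T}\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t) + \frac{\gamma_{\eta c}}{4\big(1 + \|\sigma_\eta\|^2\big)}\frac{\partial S_\eta(z_\eta)}{\partial z_\eta^T}\frac{\partial^T S_\eta(z_\eta)}{\partial z_\eta}\hat W_{\eta a}(t)\,\sigma_\eta^T(t)\hat W_{\eta c}(t)\right) - \frac{\gamma_{\eta c}}{1 + \|\sigma_\eta(t)\|^2}\tilde W_{\eta c}^T(t)\sigma_\eta\left(\sigma_\eta^T(t)\hat W_{\eta c}(t) - (\beta_\eta^2 - 1)\|z_\eta\|^2 - 2\beta_\eta z_\eta^T(t)\dot\eta_d(t) + \frac{1}{4}\left\|\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t)\right\|^2\right). \tag{26}$$

Substituting (17) into (26) yields

$$\dot L_\eta(t) = -\beta_\eta\|z_\eta\|^2 + z_\eta^T z_\nu - z_\eta^T\dot\eta_d - \frac{1}{2}z_\eta^T\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t) + \frac{1}{2}\tilde W_{\eta a}^T(t)\frac{\partial S_\eta(z_\eta)}{\partial z_\eta^T} z_\eta(t) - \gamma_{\eta a}\tilde W_{\eta a}^T(t)\frac{\partial S_\eta}{\partial z_\eta^T}\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t) + \frac{\gamma_{\eta c}}{4\big(1 + \|\sigma_\eta\|^2\big)}\tilde W_{\eta a}^T(t)\frac{\partial S_\eta(z_\eta)}{\partial z_\eta^T}\frac{\partial^T S_\eta(z_\eta)}{\partial z_\eta}\hat W_{\eta a}(t)\,\sigma_\eta^T\hat W_{\eta c}(t) - \frac{\gamma_{\eta c}}{1 + \|\sigma_\eta\|^2}\tilde W_{\eta c}^T(t)\sigma_\eta\left(\sigma_\eta^T\hat W_{\eta c}(t) - (\beta_\eta^2 - 1)\|z_\eta\|^2 - 2\beta_\eta z_\eta^T\dot\eta_d + \frac{1}{4}\left\|\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t)\right\|^2\right). \tag{27}$$

Using $\tilde W_{\eta a}(t) = \hat W_{\eta a}(t) - W_\eta^*$, there are the following results:

$$-\frac{1}{2}z_\eta^T\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a} + \frac{1}{2}z_\eta^T\frac{\partial^T S_\eta}{\partial z_\eta}\tilde W_{\eta a} = -\frac{1}{2}z_\eta^T\frac{\partial^T S_\eta}{\partial z_\eta}W_\eta^*$$

$$-\gamma_{\eta a}\tilde W_{\eta a}^T(t)\frac{\partial S_\eta}{\partial z_\eta^T}\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t) = -\frac{\gamma_{\eta a}}{2}\tilde W_{\eta a}^T(t)\frac{\partial S_\eta}{\partial z_\eta^T}\frac{\partial^T S_\eta}{\partial z_\eta}\tilde W_{\eta a}(t) - \frac{\gamma_{\eta a}}{2}\hat W_{\eta a}^T(t)\frac{\partial S_\eta}{\partial z_\eta^T}\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t) + \frac{\gamma_{\eta a}}{2}W_\eta^{*T}\frac{\partial S_\eta}{\partial z_\eta^T}\frac{\partial^T S_\eta}{\partial z_\eta}W_\eta^*.$$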
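The critic and actor updating laws (21) and (22) can be sketched in discrete time as follows. This is only an illustrative sketch under stated assumptions: the Jacobian `jac` (standing for $\partial S_\eta/\partial z_\eta^T$) is filled with random numbers instead of being computed from a real basis, the forward-Euler step, the step size, and all gain values are hypothetical, and the function names are invented here.

```python
import numpy as np

def sigma_eta(z, w_a, eta_d_dot, jac, beta):
    """sigma = -(dS/dz^T)(beta z + 0.5 (d^T S/dz) W_a + eta_d_dot), as defined after (21)."""
    return -jac @ (beta * z + 0.5 * jac.T @ w_a + eta_d_dot)

def bellman_residual(z, w_c, w_a, eta_d_dot, jac, beta):
    """Bellman residual e in the expanded form that appears inside (21)."""
    g = jac.T @ w_a
    s = sigma_eta(z, w_a, eta_d_dot, jac, beta)
    return (s @ w_c - (beta**2 - 1.0) * (z @ z)
            - 2.0 * beta * (z @ eta_d_dot) + 0.25 * (g @ g))

def weight_rates(z, w_c, w_a, eta_d_dot, jac, beta, gamma_c, gamma_a):
    """Right-hand sides of the critic law (21) and the actor law (22)."""
    s = sigma_eta(z, w_a, eta_d_dot, jac, beta)
    e = bellman_residual(z, w_c, w_a, eta_d_dot, jac, beta)
    norm = 1.0 + s @ s
    m = jac @ jac.T                                  # (dS/dz^T)(d^T S/dz), n x n
    dw_c = -gamma_c / norm * s * e
    dw_a = (0.5 * jac @ z - gamma_a * (m @ w_a)
            + gamma_c / (4.0 * norm) * (m @ w_a) * (s @ w_c))
    return dw_c, dw_a

# Hypothetical data for one update step.
rng = np.random.default_rng(1)
n = 6
jac = rng.normal(size=(n, 3))        # stand-in for dS/dz^T at the current error
z = np.array([0.3, 0.1, 0.2])
eta_d_dot = np.array([0.05, -0.02, 0.01])
w_c = rng.normal(size=n)
w_a = rng.normal(size=n)
beta, gamma_c, gamma_a, dt = 3.0, 2.0, 5.0, 1e-3

dw_c, dw_a = weight_rates(z, w_c, w_a, eta_d_dot, jac, beta, gamma_c, gamma_a)
w_c_next = w_c + dt * dw_c           # forward-Euler step (an assumption)
w_a_next = w_a + dt * dw_a
e = bellman_residual(z, w_c, w_a, eta_d_dot, jac, beta)
s = sigma_eta(z, w_a, eta_d_dot, jac, beta)
print(e, np.linalg.norm(dw_c), np.linalg.norm(dw_a))
```

The normalization by $1 + \|\sigma_\eta\|^2$ keeps the critic step bounded even when $\sigma_\eta$ is large, which is the role it plays in (21).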
Adding the above results to (27) has Substituting (30) into (29) yields
2 1
2 γηa T
1 T ∂ T Sη ∗ L̇η (t) ≤ zν (t)2 − (βη − 2)
zη
− W̃ (t)
L̇η (t) = −βη
zη
+ zTη zν − zTη η̇d − z W 2 ηa
2 η ∂zη η 2
∂Sη ∂ T Sη γηa T ∂Sη ∂ T Sη
γηa T ∂Sη ∂ T Sη γηa T × T W̃ηa (t) − Ŵηa (t) T
− W̃ηa (t) T W̃ηa (t) − Ŵ (t) ∂zη ∂zη 2 ∂zη ∂zη
2 ∂zη ∂zη 2 ηa
γηc ∂Sη ∂ T Sη
∂Sη ∂ T Sη γηa ∗T ∂Sη ∂ T Sη ∗ × Ŵηa (t) +
2 W̃ηa (t) ∂zT ∂z
T
× T Ŵηa (t) + W W 4 1 +
ση
η η
∂zη ∂zη 2 η ∂zTη ∂zη η
T γηc
γηc ∂Sη zη ∂ Sη zη × Ŵηa (t)σηTŴ (t) −
2
+
2 W̃ηa (t) ∂zT
T
Ŵηa (t) ηc
1 +
ση
4 1 +
σ
η ∂zη
η
γηc 1 T ∂Sη ∂ T Sη ∗
× σηTŴ (t) − × W̃ηc (t)ση × σηTW̃ (t) − Ŵηa
T
(t) T W
2 ηc 2 ∂zη ∂zη η
ηc
1 +
ση
1 ∗T ∂Sη ∂ T Sη ∗ 1 T ∂Sη ∂ T Sη
+ Wη W + Ŵ (t)
(t) − βη2 − 1
zη
− 2βη zTη η̇d ∂zTη ∂zη η 4 ηa ∂zTη ∂zη
2
× T
W̃ηc (t)ση σηTŴ 4
ηc
γηa
T
2 1
1
∂ Sη
× Ŵηa (t) − ρη (t) + η̇d 2 + 1 +
+
Ŵηa (t)
. (28) 2 2
4 ∂zη
∂Sη ∂ T Sη ∗
× Wη∗T W . (31)
Using
n Cauchy inequality that ( nk=1 ak bk )2 ≤ ∂zTη ∂zη η
2 n 2
a
k=1 k k=1 kb and Young’s inequality that Using the following facts:
ab ≤ (a2 /2) + (b2 /2), there are the following facts:
1 T ∂Sη ∂ T Sη ∗ 1 ∗T ∂Sη ∂ T Sη ∗
1
2 1 − Ŵηa (t) T W + W W
zTη (t)zν (t) ≤
zη (t)
+ zν (t)2 2 ∂zη ∂zη η 4 η ∂zTη ∂zη η
2 2
1
2 1 1 T ∂Sη ∂ T Sη 1 T ∂Sη ∂ T Sη
−zη (t)η̇d (t) ≤
zη (t)
+ η̇d (t)2
T + Ŵηa (t) T Ŵηa (t) = W̃ηa (t) T
2 2 4 ∂zη ∂zη 4 ∂zη ∂zη
1 T ∂ T Sη ∗
2 ∂Sη ∂ T Sη ∗ 1 ∂Sη ∂ T Sη
− zη (t) Wη ≤
zη (t)
+ Wη∗T T W . × Ŵηa (t) − Wη∗T T W̃ηa (t) (32)
2 ∂zη ∂zη ∂zη η 4 ∂zη ∂zη
γηc γηc
2 W̃ηc (t)ση (t)ρη (t) ≤
2 ρη (t)
T 2
Applying the above inequalities to (28) has
1 + ση (t)
2 1 + ση
1
2 γηa T
L̇η (t) ≤ zν (t)2 − (βη − 2)
zη (t)
− W̃ (t) γηc
2 2 ηa +
2 W̃ηc (t)ση ση W̃ηc (t)
T T
(33)
∂Sη ∂ Sη
T γηa T ∂Sη ∂ T Sη
2 1 + ση
× T W̃ηa (t) − Ŵηa (t) T
∂zη ∂zη 2 ∂zη ∂zη the inequality (31) can be rewritten as
γηc ∂Sη ∂ T Sη 1
2 γηa T
× Ŵηa (t) +
2 W̃ηa (t) ∂zT ∂z
T
L̇η (t) ≤ zν (t)2 − (βη − 2)
zη (t)
− W̃ (t)
4 1 +
ση
η η 2 2 ηa
γηc ∂Sη ∂ T Sη γηc
× Ŵηa (t)σηTŴ (t) −
2 × T W̃ηa (t) −
2 W̃ηc (t)
T
∂zη ∂zη 2 1 +
σ (t)
ηc
1 + ση (t)
η
2 γηa T ∂Sη ∂ T Sη
× W̃ηc (t)ση × σηTŴ (t) − βη2 − 1
zη (t)
T × ση σηTW̃ (t) − Ŵηa (t) T Ŵηa (t)
ηc ηc 2 ∂zη ∂zη
T
2 γηc ∂Sη ∂ T Sη
1
∂ Sη
+
2 W̃ T
(t) Ŵηa (t)σηTŴ (t)
− 2βη zη (t)η̇d +
T
Ŵηa
ηa
∂z T ∂z ηc
4 ∂z η 4 1 + ση η η
Based on the following condition: γηa γηc
2
× T
W̃ηc (t)ση σηTW̃ (t) − − T
Ŵηa (t)
ηc 2 2
γc1 ∂Sη ∂ T Sη
2 W̃ηa (t) ∂zT ∂z Ŵηa (t)σηŴηc (t)
T T
∂Sη ∂ T Sη γηa ∗T ∂Sη ∂ T Sη
4 1 + ση
η η × T Ŵηa (t) + 1 + Wη
∂zη ∂zη 2 ∂zTη ∂zη
γηc ∂Sη ∂ T Sη T γηc 1
−
2 W̃ηc (t)σηW̃ηa
T
Ŵ (t) × Wη∗ +
2 ρη + 2 η̇d .
2 2
(36)
∂zTη ∂zη ηa
T (t)
4 1 +
ση
2 1 +
σ
η
⎢ γ γ 2
∂Sη ∂ T Sη
⎢
Aη (t) = ⎣ 0 ηa ηc ∗T
2 − 2 − 32 Wη ση σηWη∗ ∂zTη ∂zη
1 T
γηc ∂Sη ∗T ∂ T Sη
+
2 W̃ηa (t) ∂zT Wη ση ∂z Ŵηa (t)
T
4 1 +
ση
η η 0 0
⎤
γηc ∂Sη ∂ T Sη 0
+
2 W̃ηc (t)σηWη∗T ∂zT ∂z W̃ηa (t)
T
0 ⎥
⎦
4 1 +
ση
η η γηc ∗T ∂Sη ∂ T Sη ∗
1
− 1
W η ∂zη ∂zη
W η ση ση
T
1+ση
2 2 32 T
γηa ∗T ∂Sη ∂ T Sη ∗ γηc
+ 1+ W +
ρη
2
γηa ∗T ∂Sη ∂ Sη ∗ T γηc
∂zTη ∂zη η 2 1 +
Wη
2
σ
2 Cη (t) = (1 + )Wη W + ρ2
η 2 ∂zTη ∂zη η 2(1 + ση 2 ) η
1 1
+ η̇d 2 . (35) + η̇d 2 .
2 2
According to Young’s inequality and Cauchy inequality, there Based on Assumption 1, the matrix Aη (t) can be made pos-
are the following results: itive definite via choosing the design parameters βη , γηa , γηc
γηc ∂Sη ∗T ∂ T Sη satisfying the following conditions:
2 W̃ T
ηa (t) W ση Ŵηa (t)
4 1 +
ση
∂zTη η ∂zη kη
βη > 2, γηa > γηc
2
+ Wη∗T Wη∗
1 T ∂Sη ∂ T Sη 16
≤ W̃ηa (t) T Wη∗T ση σηW
T
W̃ηa (t)
∗T ∂Sη ∂ Sη ∗
∗ 1 T
32 ∂zη η ∂z
η γηc > sup Wη W . (38)
16 t≥0 ∂zTη ∂zη η
γηc
2
∂Sη ∂ T Sη
+ T
Ŵηa (t) T Ŵηa (t)
2 ∂zη ∂zη Then (37) can become the following one:
γηc ∂Sη ∂ T Sη 1
2
2 W̃ηc (t)σηWη∗T ∂zT ∂z W̃ηa (t)
T zν (t)2 − aη
ξη (t)
+ cη
L̇η (t) ≤ (39)
4 1 +
ση
η η 2
where $a_\eta = \inf_{t \ge 0}\{\lambda_{\min}\{A_\eta(t)\}\}$ and $c_\eta = \sup_{t \ge 0}\{C_\eta(t)\}$.
γηc
2
∂Sη ∂ T Sη 1 Step 2: The actual control u is obtained in the step. From
≤ T
W̃ηa (t) W̃ηa (t) +
2
2 ∂zTη ∂zη 32 1 +
ση
the dynamic equation (3), the time derivative of error variable
zν (t) = ν(t) − α̂ is
∂Sη ∂ T Sη ∗ T
× W̃ηc
T
(t)σηWη∗T W σ W̃ηc (t). żν (t) = f (χ ) − α̂˙ + u.
∂zTη ∂zη η η (40)
Substituting the above inequalities into (35) has Define the optimal cost function as
2 ∞
1
L̇η (t) ≤ zν (t)2 − (βη − 2)
zη (t)
Vν∗ (zν ) = min rν (zν , u)ds
2 u∈ (ν ) t
γηc
2 ∞
γηa 1 ∗T ∂Sη ∂ T Sη
− − − Wη ση σηWη∗ W̃ηa (t) T W̃ηa (t) rν zν , u∗ ds
T T
2 2 32 ∂zη ∂zη
= (41)
t
1 γηc 1 ∗T ∂Sη ∂ T Sη ∗ where rν (zν , u) = zTν zν + uT u, ν is a compact set, u∗ is the
−
2 − Wη W
1 +
ση
2 32 ∂zTη ∂zη η optimal control. Then the HJB equation for zν -subsystem is
˙ 1 ∂ T Sν
× f (χ ) − α̂ −βν zν (t) − Ŵνa (t) .
where βν is a positive design constant, Vνo (zν ) = 2 ∂zν
−βν zν (t)2 + Vν∗ (zν ). Substituting (44) into (42), the optimal (52)
control can be rewritten as Similar with step 1, the critic NN weight updating
1 ∂Vνo (zν ) law is constructed by minimizing Bellman error eν (t) =
u∗ = −βν zν (t) − (45)
2 ∂zν Hν (zν , u, Ŵνc ). Define a positive definite function as Eν (t) =
Since the uncertaintied term [(∂Vνo (zν ))/(∂zν )] is continuous (1/2)e2ν (t), then the critic NN weight updating law is derived
and well defined in the compact set ν , it can be approximated based on the gradient descent algorithm
by NNs as ˙ (t) = − γνc ∂eν (t)
Ŵνc eν (t)
Vνo (zν ) = Wν∗T Sν (zν ) + εν (zν ) 1 + σν 2
∂ Ŵνc (t)
(46)
γνc
where Wν∗T ∈ Rnν is the ideal weight; Sν (zν ) ∈ Rnν is the basis =− σν σνT Ŵνc (t) − (βν2 − 1)z2ν (t)
1 + σν 2
function vector; εν (zν ) ∈ R is the approximation error.
Substituting (46) into (45) has + 2βν zTν (t) f (χ ) − α̂˙
where (∂εν /∂zν ) is bounded by a constant δν , i.e., where γνc > 0 is the learning rate, σν = (∂Sν /∂zTν )(f (χ ) −
(∂εν /∂zν ) ≤ δν . α̂˙ − βν zν (t) − (1/2)(∂ T Sν /∂zν )Ŵνa (t)).
Inserting (46) and (47) into (42) yields The weight updating law of actor NN is
˙ (t) = 1 ∂Sν z (t) − γ ∂Sν ∂ Sν Ŵ (t)
T
Hν zν , u∗ , Wν∗ = −(βν2 − 1)zν (t)2 + 2βν zTν (t) Ŵ
νa ν νa νa
∂Sν (zν ) 2 ∂zTν ∂zTν ∂zν
× f (χ ) − α̂˙ + Wν∗T
∂zT γνc ∂Sν ∂ T Sν
ν + Ŵνa (t)σνT Ŵνc (t) (54)
× f (χ ) − α̂˙ − βν zν 4 1 + σν 2 ∂zTν ∂zν
T
2 where γνa > 0 are the learning rate.
1
∂ Sν (zν ) ∗
−
Wν
+ ρν (t) = 0 (48) Consider the overall Lyapunov function candidate as
4 ∂zν follows:
where ρν (t) = (∂εν /∂zTν )u∗ + (∂εν /∂zTν )(f (x) − α̂) ˙ + 1 1 T
L(t) = Lη (t) + zTν (t)zν (t) + W̃ (t)W̃νa (t)
(1/4)(∂εν /∂zν )2 . Since all terms of ρν (t) are bounded, it 2 2 νa
can be bounded by a constant, i.e., |ρν (t)| ≤ ψν . 1 T
+ W̃νc (t)W̃νc (t) (55)
Because the ideal constant matrix Wν∗ is unknown, the 2
optimal control (47) is unavailable. For getting the available
where W̃νc (t) = Ŵνc (t) − Wν∗ and W̃νa (t) = Ŵνa (t) − Wν∗ are
control, the online critic-actor RL is employed to implement
the critic and actor NN approximation errors, respectively.
the optimizing scheme
The time derivative of L(t) along (40), (53), and (54) is
V̂ν∗ (zν ) = βν zν (t)2 + Ŵνc
T
(t)Sν (zν ) (49)
L̇(t) = L̇η (t) + zTν (t) f (χ ) − α̂˙ + u + W̃νa T
(t)
1 ∂ Sν (zν )
T
u = −βν zν (t) − Ŵνa (50) 1 ∂Sν ∂Sν ∂ T Sν
2 ∂zν × z (t) − γ Ŵνa (t)
ν νa
2 ∂zTν ∂zTν ∂zν
where V̂ν∗ (zν ) are the approximation of Vν∗ (zν ); Ŵνc
T (t) ∈ Rnν
and Ŵνa (t) ∈ R ν are the critic and actor NN weights,
T n γνc ∂Sν ∂ T Sν
+ Ŵνc (t)σν Ŵνc (t)
T
respectively. 4 1 + σν 2 ∂zTν ∂zν
γνc ⎡
− T
W̃νc (t)σν βν − 3 12 0
1 + σν 2 ⎢ γνa γνc
2
Aν (t) = ⎣ − − ∗T T ∗
32 Wν σν σν Wν
1
0 2 2
× σνT Ŵνc (t) − (βν2 − 1)z2ν (t) + 2βν zTν f (χ ) − α̂˙ 0 0
⎤
0
1 T ∂Sν ∂ T Sν 0 ⎥
+ Ŵνa (t) T Ŵνa (t) . (56) ⎦
4 ∂zν ∂zν γνc γνc
∗T ∂Sν
− ∗ ∂T S
32 Wν ∂zTν ∂zν Wν σν σν
1 ν T
1+σν 2 2
Applying the control (50)–(56), similar with the first step, there
γνa ∗T ∂Sν ∂ T Sν ∗ γνc 2 γνc 2
is the following one: Cν (t) = (1 + )Wν Wν + f (χ ) + ρ
2 ∂zν ∂zν
T 2 2 ν
γνa T ∂Sν
L̇(t) ≤ L̇η (t) − (βν − 3)zν (t)2 − W̃νa (t) T 1
2 ∂zν +
α̇2 .
2
∂ T Sν γνa T ∂Sν ∂ T Sν
× W̃νa (t) − Ŵ (t) Ŵνa (t) Based on PE assumption, Aν (t) can be made positive def-
∂zν 2 νa ∂zTν ∂zν inite by designing the parameters βν , γνa , and γνc to satisfy
γνc ∂Sν ∂ T Sν the following conditions:
− T
W̃νa (t) T Ŵνc σνT Ŵνc (t)
4 1 + σν 2 ∂zν ∂zν kν
γνc βν > 4, γνa > γνc
2
+ Wν∗T Wν∗
− W̃νcT
(t)σν 16
1 + σν 2
∗T ∂Sν ∂ Sν ∗
1 T
γνc > sup Wν W (61)
× σνT Ŵνc (t) − (βν2 − 1)z2ν (t) + 2βν zTν f (χ ) − α̂˙ 16 t≥0 ∂zTν ∂zν ν
then (60) can become the following one:
1 T ∂Sν ∂ T Sν
2
+ Ŵνa (t) T Ŵνa (t) L̇(t) < −aη
ξη (t)
− aν ξν (t)2 + cη + cν
4 ∂zν ∂zν (62)
γνa ∗T ∂Sν ∂ T Sν ∗ where aν = inft≥0 {λmin {Aν (t)}}, cν = supt≥0 {Cν (t)}.
+ 1+ Wν W
2 ∂zTν ∂zν ν The main results are concluded in the following theorem.
1 1 Theorem 1: Consider the surface vessel (1) with bounded
+ f 2 (χ ) + α̇2 . (57) initial condition and reference signals. If the OB control
2 2
Rewrite (48) to the following one: utilizes the weight updating laws (21), (22) for the virtual
control (17), and (53), (54) for the actual control (50), and
−(βν2 − 1)zν (t)2 + 2βν zTν (t) f (χ ) − α̂˙ the design parameters satisfy (38), (61), and PE conditions
(Assumptions 1) are satisfied, then:
∂Sν (zν ) ∂ T Sν (zν ) ∗
= −σνT Wν∗ − Ŵνa
T
(t) Wν 1) all error signals of the optimized control are SGUUB;
∂zTν ∂zν 2) the surface vessel can track the reference trajectory to
2
1
∂ T Sν (zν ) ∗
desired accuracy.
+
W
ν
− ρν (t). (58)
4
∂zν Proof: See the Appendix.
Similar to the first step, applying (58) to (57) yields
IV. SIMULATION EXAMPLES
L̇(t) ≤ L̇η (t) − (βν − 3)zν (t)2 The simulation is carried out by a mode ship of 1:75 scale-
γνa γνc
2 1 ∗T T ∗ ∂Sν ∂ T Sν down replica. The mass of the model ship is m = 21 kg, its
− − − Wν σν σν Wν W̃νa T
(t) T W̃νa (t)
2 2 32 ∂zν ∂zν length and width are 1.2 and 0.3 m, respectively. The inertia,
γνa γ2 T ∂Sν ∂ Sν
T
M =⎣0 19 0.72⎦
− − νc Ŵνa Ŵνa
2 2 ∂zν ∂zν
T 0 0.72 2.7
⎡ ⎤
γνa ∗T ∂Sν ∂ T Sν ∗ 1 2 0 0 −19vy − 0.72vz
+ 1+ Wν W + f (χ)
2 ∂zTν ∂zν ν 2 C=⎣ 0 0 20vx ⎦
1 γ νc 2 19vy + 0.72vz −20vx 0
+ α̇2 + ρ . (59) ⎡
2 2 ν 0.72 + 1.3|vx | + 5.8v2x 0
Using the previous results, (59) is rewritten to compact form D=⎣ 0 0.86 + 36vy + 3|vz |
as 0 −0.1 − 5vy + 3|vz |
2 ⎤
L̇(t) ≤ −aη
ξη (t)
+ cη − ξνT (t)Aν (t)ξν (t) + Cν 0
γνa γνc2
−0.1 −2v y + 2|vz |⎦.
T ∂Sν ∂ Sν
T
− − Ŵνa Ŵνa (60) 6 + 4vy + 4|vz |
2 2 ∂zTν ∂zν
For simplicity, the restoring force vector g(η) is assumed to
where
be 0. The initial states of position and velocity are η(0) =
ξν (t) = [zTν (t), W̃νc
T
(t), W̃νa
T
(t)]T [0.3, 0.1, 0.2]T and v(0) = [0.1, 0.2, 0.3]T .
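The Coriolis matrix $C$ listed above can be written as a small function of the velocity vector. The sketch below is illustrative only: the function name is invented here, the velocity components are labeled $v_x$, $v_y$, $v_z$ as in the matrix entries, and the evaluation point is the initial velocity $v(0) = [0.1, 0.2, 0.3]^T$ given in the text. It also checks the skew-symmetry that the listed entries exhibit.

```python
import numpy as np

def coriolis_matrix(v):
    """Coriolis matrix C(v) with the entries listed in the simulation section."""
    vx, vy, vz = v
    c13 = -19.0 * vy - 0.72 * vz
    c23 = 20.0 * vx
    return np.array([
        [0.0,  0.0,  c13],
        [0.0,  0.0,  c23],
        [-c13, -c23, 0.0],
    ])

v0 = np.array([0.1, 0.2, 0.3])       # initial velocity from the text
C0 = coriolis_matrix(v0)
print(C0)

# Skew-symmetry implies v^T C(v) v = 0 for any velocity v.
print(np.allclose(C0 + C0.T, 0.0), v0 @ C0 @ v0)
```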
Fig. 7. Critic and actor weights of the second step.

Fig. 8. Similar control performance.

Fig. 9. Cost functions of two control methods.

is achieved, and Fig. 9 shows the control costs of the two control methods. Obviously, the proposed control method has a lower cost under the same control performance.

V. CONCLUSION

Based on the new optimizing technique OB [20], an optimized tracking control for surface vessel is developed. In the optimized control, the NN-based RL strategy of actor-critic architecture is employed, where the critic NN is used to evaluate the control performance and the actor NN is used to carry out the control behavior. The overall control for the vessel system is optimized by designing both virtual and actual controls to be the optimized solutions of the corresponding subsystems. Based on the Lyapunov analysis, it is proven that the proposed optimal algorithm can achieve the control objective. The effectiveness is further demonstrated by simulation results.

APPENDIX
PROOF OF THEOREM 1

The following lemma is used in the proof.

Lemma 1 [36]: Let $G(t) \in \mathbb{R}$ be a continuous positive function with bounded initial value $G(0)$. If $\dot G(t) \le -aG(t) + c$ holds, where $a$ and $c$ are two positive constants, then the following inequality holds:

$$G(t) \le e^{-at}G(0) + \frac{c}{a}\big(1 - e^{-at}\big). \tag{63}$$

Proof of Theorem 1:

1) Since $L(t)$ is the sum of the quadratic terms $(1/2)\|\xi_\eta(t)\|^2$ and $(1/2)\|\xi_\nu(t)\|^2$, taking $a = \min\{a_\eta, a_\nu\}$ and $c = c_\eta + c_\nu$, (62) can be rewritten as

$$\dot L(t) < -aL(t) + c. \tag{64}$$

According to Lemma 1, the following inequality is obtained:

$$L(t) < e^{-at}L(0) + \frac{c}{a}\big(1 - e^{-at}\big). \tag{65}$$

The above inequality implies that all error signals, $z_\eta(t)$, $z_\nu(t)$, $\tilde W_{\eta a}(t)$, $\tilde W_{\eta c}(t)$, $\tilde W_{\nu a}(t)$, $\tilde W_{\nu c}(t)$, are SGUUB.

2) Let $L_z(t) = (1/2)z_\eta^T(t)z_\eta(t) + (1/2)z_\nu^T(t)z_\nu(t)$. Its time derivative along (24) and (40) is

$$\dot L_z(t) = z_\eta^T(t)\big(z_\nu(t) + \hat\alpha - \dot\eta_d(t)\big) + z_\nu^T(t)\big(f(\chi) - \dot{\hat\alpha} + u\big). \tag{66}$$

Substituting (17) and (50) into (66) gives

$$\dot L_z(t) = -\beta_\eta\|z_\eta(t)\|^2 - \frac{1}{2}z_\eta^T(t)\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t) + z_\eta^T(t)z_\nu(t) - z_\eta^T(t)\dot\eta_d(t) - \beta_\nu\|z_\nu(t)\|^2 - \frac{1}{2}z_\nu^T(t)\frac{\partial^T S_\nu}{\partial z_\nu}\hat W_{\nu a}(t) + z_\nu^T(t)f(\chi(t)) - z_\nu^T(t)\dot{\hat\alpha}. \tag{67}$$

Based on the following facts:

$$z_\eta^T(t)z_\nu(t) \le \frac{1}{2}\|z_\eta(t)\|^2 + \frac{1}{2}\|z_\nu(t)\|^2$$

$$-z_\eta^T(t)\dot\eta_d(t) \le \frac{1}{2}\|z_\eta(t)\|^2 + \frac{1}{2}\|\dot\eta_d(t)\|^2$$

$$-\frac{1}{2}z_\eta^T(t)\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t) \le \|z_\eta(t)\|^2 + \left\|\frac{\partial^T S_\eta}{\partial z_\eta}\hat W_{\eta a}(t)\right\|^2$$

$$z_\nu^T(t)f(\chi(t)) \le \frac{1}{2}\|z_\nu(t)\|^2 + \frac{1}{2}\|f(\chi)\|^2$$

$$-z_\nu^T(t)\dot{\hat\alpha}(t) \le \frac{1}{2}\|z_\nu(t)\|^2 + \frac{1}{2}\|\dot{\hat\alpha}(t)\|^2$$

$$-\frac{1}{2}z_\nu^T(t)\frac{\partial^T S_\nu}{\partial z_\nu}\hat W_{\nu a}(t) \le \|z_\nu(t)\|^2 + \left\|\frac{\partial^T S_\nu}{\partial z_\nu}\hat W_{\nu a}(t)\right\|^2.$$
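Lemma 1, on which both parts of the proof rest, can be illustrated numerically. The sketch below is not part of the proof: it integrates one hypothetical scalar signal that satisfies $\dot G(t) \le -aG(t) + c$ (the constants, the nonnegative perturbation term, the initial value, and the forward-Euler step are all assumptions chosen for illustration) and compares it with the right-hand side of (63).

```python
import numpy as np

# Hypothetical constants and initial value (G(0) > c / a is chosen here).
a, c, g0 = 1.5, 0.6, 2.0
dt, steps = 1e-3, 5000
t = dt * np.arange(steps + 1)

# Integrate G' = -a G + c - d(t) with d(t) = 0.2 sin^2(t) >= 0,
# so that G'(t) <= -a G(t) + c holds along the trajectory.
g = np.empty(steps + 1)
g[0] = g0
for k in range(steps):
    g_dot = -a * g[k] + c - 0.2 * np.sin(t[k]) ** 2
    g[k + 1] = g[k] + dt * g_dot

# Right-hand side of (63).
bound = np.exp(-a * t) * g0 + (c / a) * (1.0 - np.exp(-a * t))

print(float(np.max(g - bound)))      # largest gap between trajectory and bound
print(g[-1], c / a)                  # final value versus the ultimate bound c / a
```

The bound decays from $G(0)$ toward $c/a$, which is why a larger $a$ or a smaller $c$ tightens the ultimate bound on the error signals.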
Equation (67) can be rewritten as

$$\dot L_z(t) \le -(\beta_\eta - 2)\|z_\eta(t)\|^2 - (\beta_\nu - 3)\|z_\nu(t)\|^2 + P(t) \tag{68}$$

where $P(t) = (1/2)\|\dot\eta_d(t)\|^2 + (1/2)\|\dot{\hat\alpha}\|^2 + (1/2)\|f(\chi)\|^2 + \|(\partial^T S_\eta/\partial z_\eta)\hat W_{\eta a}(t)\|^2 + \|(\partial^T S_\nu/\partial z_\nu)\hat W_{\nu a}(t)\|^2$. Because $\tilde W_{\eta a}(t)$ and $\tilde W_{\nu a}(t)$ are SGUUB, which is proven in part 1, it is concluded that $P(t)$ is bounded by a constant $\varpi$, i.e., $P(t) < \varpi$. Further, the following fact holds:

$$\dot L_z(t) < -\beta L_z(t) + \varpi$$

where $\beta = \min\{\beta_\eta - 2, \beta_\nu - 3\}$. Applying Lemma 1 gives

$$L_z(t) < e^{-\beta t}L_z(0) + \frac{\varpi}{\beta}\big(1 - e^{-\beta t}\big),$$

which implies that the tracking errors can reach the desired accuracy by making $\beta$ large enough; as a result, the surface vessel can track the predefined trajectory to desired accuracy.

ACKNOWLEDGMENT

The authors would like to thank the National Research Foundation, Keppel Corporation, and the National University of Singapore for supporting this paper done in the Keppel-NUS Corporate Laboratory. The conclusions put forward reflect the views of the authors alone, and not necessarily those of the institutions within the Corporate Laboratory. The WBS number of this project is R-261-507-004-281.

REFERENCES

[1] G. Wen, S. S. Ge, F. Tu, and Y. S. Choo, “Artificial potential-based adaptive H∞ synchronized tracking control for accommodation vessel,” IEEE Trans. Ind. Electron., vol. 64, no. 7, pp. 5640–5647, Jul. 2017.
[2] T. Zhang, S. S. Ge, and C. C. Hang, “Adaptive neural network control for strict-feedback nonlinear systems using backstepping design,” Automatica, vol. 36, no. 12, pp. 1835–1846, 2000.
[3] Y. Yang, G. Feng, and J. Ren, “A combined backstepping and small-gain approach to robust adaptive fuzzy control for strict-feedback nonlinear systems,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 34, no. 3, pp. 406–420, May 2004.
[4] S. Tong, Y. Li, Y. Li, and Y. Liu, “Observer-based adaptive fuzzy backstepping control for a class of stochastic nonlinear strict-feedback systems,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 6, pp. 1693–1704, Dec. 2011.
[5] Z.-P. Jiang, “Global tracking control of underactuated ships by Lyapunov’s direct method,” Automatica, vol. 38, no. 2, pp. 301–309, 2002.
[6] K. Do, Z.-P. Jiang, and J. Pan, “Universal controllers for stabilization and tracking of underactuated ships,” Syst. Control Lett., vol. 47, no. 4, pp. 299–317, 2002.
[7] K. P. Tee and S. S. Ge, “Control of fully actuated ocean surface vessels using a class of feedforward approximators,” IEEE Trans. Control Syst. Technol., vol. 14, no. 4, pp. 750–756, Jul. 2006.
[8] K. D. Do and J. Pan, “Global tracking control of underactuated ships with nonzero off-diagonal terms in their system matrices,” Automatica, vol. 41, no. 1, pp. 87–95, 2005.
[9] M. Chen, S. S. Ge, and Y. S. Choo, “Neural network tracking control of ocean surface vessels with input saturation,” in Proc. IEEE Int. Conf. Autom. Logistics (ICAL), Shenyang, China, 2009, pp. 85–89.
[10] S. Tong, K. Sun, and S. Sui, “Observer-based adaptive fuzzy decentralized optimal control design for strict-feedback nonlinear large-scale systems,” IEEE Trans. Fuzzy Syst., vol. 26, no. 2, pp. 569–584, Apr. 2018.
[11] Y. Li, K. Sun, and S. Tong, “Observer-based adaptive fuzzy fault-tolerant optimal control for SISO nonlinear systems,” IEEE Trans. Cybern., to be published, doi: 10.1109/TCYB.2017.2785801.
[12] Y. Li, K. Sun, and S. Tong, “Adaptive fuzzy robust fault-tolerant optimal control for nonlinear large-scale systems,” IEEE Trans. Fuzzy Syst., to be published, doi: 10.1109/TFUZZ.2017.2787128.
[13] F. L. Lewis, D. L. Vrabie, and V. L. Syrmos, Optimal Control, 3rd ed. New York, NY, USA: Wiley, 2012.
[14] S. Tong, Y. Li, and S. Sui, “Adaptive fuzzy tracking control design for SISO uncertain nonstrict feedback nonlinear systems,” IEEE Trans. Fuzzy Syst., vol. 24, no. 6, pp. 1441–1454, Dec. 2016.
[15] H. Modares, F. L. Lewis, and M. B. Naghibi-Sistani, “Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 10, pp. 1513–1525, Oct. 2013.
[16] D. Liu, D. Wang, and H. Li, “Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online learning optimal control approach,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 418–428, Feb. 2014.
[17] J.-H. Park, S.-H. Kim, and C.-J. Moon, “Adaptive neural control for strict-feedback nonlinear systems without backstepping,” IEEE Trans. Neural Netw., vol. 20, no. 7, pp. 1204–1209, Jul. 2009.
[18] J. Q. Gong and B. Yao, “Neural network adaptive robust control of nonlinear systems in semi-strict feedback form,” Automatica, vol. 37, no. 8, pp. 1149–1160, 2001.
[19] G. Arslan and T. Başar, “Disturbance attenuating controller design for strict-feedback systems with structurally unknown dynamics,” Automatica, vol. 37, no. 8, pp. 1175–1188, 2001.
[20] G. Wen, S. S. Ge, and F. Tu, “Optimized backstepping for tracking control of strict-feedback systems,” IEEE Trans. Neural Netw. Learn. Syst., to be published, doi: 10.1109/TNNLS.2018.2803726.
[21] D. Wang and D. Liu, “Neural robust stabilization via event-triggering mechanism and adaptive learning technique,” Neural Netw. Official J. Int. Neural Netw. Soc., vol. 102, pp. 27–35, Jun. 2018.
[22] Y. J. Liu, G. X. Wen, and S. C. Tong, “Direct adaptive NN control for a class of discrete-time nonlinear strict-feedback systems,” Neurocomputing, vol. 73, nos. 13–15, pp. 2498–2505, 2010.
[23] D. Wang, D. Liu, C. Mu, and Y. Zhang, “Neural network learning and robust stabilization of nonlinear systems with dynamic uncertainties,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 1342–1351, Apr. 2017.
[24] G. Wen, C. L. P. Chen, Y.-J. Liu, and L. Zhi, “Neural network-based adaptive leader-following consensus control for a class of nonlinear multiagent state-delay systems,” IEEE Trans. Cybern., vol. 47, no. 8, pp. 2151–2160, Aug. 2017.
[25] Y. Guo, “Globally robust stability analysis for stochastic Cohen–Grossberg neural networks with impulse control and time-varying delays,” Ukrainian Math. J., vol. 69, no. 8, pp. 1220–1233, 2018.
[26] B. Xu, Z. Shi, and C. Yang, “Composite fuzzy control of a class of uncertain nonlinear systems with disturbance observer,” Nonlin. Dyn., vol. 80, nos. 1–2, pp. 341–351, 2015.
[27] L. Zhang, Z. Ning, and Z. Wang, “Distributed filtering for fuzzy time-delay systems with packet dropouts and redundant channels,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 46, no. 4, pp. 559–572, Apr. 2016.
[28] Y. Li, S. Tong, and T. Li, “Adaptive fuzzy output feedback dynamic surface control of interconnected nonlinear pure-feedback systems,” IEEE Trans. Cybern., vol. 45, no. 1, pp. 138–149, Jan. 2015.
[29] Y. Li, S. Tong, and T. Li, “Hybrid fuzzy adaptive output feedback control design for uncertain MIMO nonlinear systems with time-varying delays and input saturation,” IEEE Trans. Fuzzy Syst., vol. 24, no. 4, pp. 841–853, Aug. 2016.
[30] D. Wang, H. He, and D. Liu, “Adaptive critic nonlinear robust control: A survey,” IEEE Trans. Cybern., vol. 47, no. 10, pp. 3429–3451, Oct. 2017.
[31] D. Wang, C. Li, D. Liu, and C. Mu, “Data-based robust optimal control of continuous-time affine nonlinear systems with matched uncertainties,” Inf. Sci., vol. 366, pp. 121–133, Oct. 2016.
[32] G. Wen, C. L. P. Chen, J. Feng, and N. Zhou, “Optimized multi-agent formation control based on identifier-actor-critic reinforcement learning algorithm,” IEEE Trans. Fuzzy Syst., to be published, doi: 10.1109/TFUZZ.2017.2787561.
[33] D. Wang, H. He, X. Zhong, and D. Liu, “Event-driven nonlinear discounted optimal regulation involving a power system application,” IEEE Trans. Ind. Electron., vol. 64, no. 10, pp. 8177–8186, Oct. 2017.
[34] S. Bhasin et al., “A novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear systems,” Automatica, vol. 49, no. 1, pp. 82–92, 2013.
[35] K. G. Vamvoudakis and F. L. Lewis, “Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, 2010.
[36] G.-X. Wen, C. L. P. Chen, Y.-J. Liu, and Z. Liu, “Neural-network-based adaptive leader-following consensus control for second-order non-linear multi-agent systems,” IET Control Theory Appl., vol. 9, no. 13, pp. 1927–1934, Aug. 2015.

Guoxing Wen received the M.S. degree in applied mathematics from the Liaoning University of Technology, Jinzhou, China, in 2011, and the Ph.D. degree in computer and information science from Macau University, Macau, China, in 2014.
He was a Research Fellow with the Department of Electrical and Computer Engineering, Faculty of Engineering, National University of Singapore, Singapore, from 2015 to 2016. He is currently a Lecturer with the Department of Mathematics, Binzhou University, Binzhou, China. His current research interests include adaptive neural network control, optimal control, and multiagent control.

C. L. Philip Chen (S’88–M’88–SM’94–F’07) received the M.S. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 1985, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA, in 1988.
He is currently a Chair Professor with the Department of Computer and Information Science and the Dean of the Faculty of Science and Technology, University of Macau, Macau, China. His current research interests include computational intelligence, systems, and cybernetics.

Fangwen Tu received the B.E. degree from the Department of Electrical Engineering, Dalian University of Technology, Dalian, China, in 2012. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore.
His current research interests include intelligent control, machine learning, data mining, and computer vision.