Article info

Article history:
Received 12 February 2016
Received in revised form 4 May 2016
Accepted 17 June 2016
Available online 5 July 2016

Keywords:
Reinforcement learning
Q-learning
Fuzzy Markov games
Lyapunov theory
Inverted pendulum
Two-link robotic manipulator
SCARA

Abstract

In this paper we propose a Lyapunov theory based Markov game fuzzy controller which is both safe and stable. We optimize a reinforcement learning (RL) based controller using Markov games, while simultaneously hybridizing it with a Lyapunov theory based control for stability. The proposed technique results in an RL based, game theoretic, adaptive, self-learning, optimal fuzzy controller that is both robust and has guaranteed stability; the controller is an annealed hybrid of fuzzy Markov games and Lyapunov theory based control. Fuzzy systems are employed as generic function approximators to scale the proposed approach to continuous state-action domains. We test the proposed controller on three benchmark non-linear control problems: (i) the inverted pendulum, (ii) trajectory tracking of a standard two-link robotic manipulator, and (iii) tracking control of a two-link selective compliance assembly robotic arm (SCARA). Simulation results and comparative evaluation against a baseline fuzzy Markov game based control showcase the superiority and effectiveness of the proposed approach.

© 2016 Elsevier Ltd. All rights reserved.
1. Introduction
The reinforcement learning (RL) paradigm centers on Markov Decision Processes (MDPs) as the underlying model for adaptive optimal control of non-linear systems (Busoniu et al., 2010; Wiering and van Otterlo, 2012). A critical assumption in MDP based RL is that of a stationary environment. Imposing such a restrictive assumption, however, may not be feasible, especially when the controller has to deal with disturbances and parametric variations.
Notwithstanding this limitation, RL has been used successfully for controlling a wide variety of non-linear systems. For example, Kobayashi et al. (2009) employ a meta-learning method based on temporal difference error for inverted pendulum control (IPC); Kumar et al. (2012) present a self-tuning fuzzy Q controller for IPC; Ju et al. (2014) propose a kernel based approximate dynamic programming approach for inverted pendulum control; and Liu et al. (2014) propose an experience replay least squares policy iteration procedure for efficient utilization of experiential information. Quite a few variants of the inverted pendulum problem can be found in the literature; in our work, we use the standard version, wherein the pivot point is mounted on a cart that can move horizontally.
Another domain where RL has been applied successfully is robotic manipulator control.
where $r_{k+j}$ is the cost incurred $j$ steps into the future and $\gamma$ is the discount factor; $0 \leq \gamma < 1$.
In Q-learning (Wiering and van Otterlo, 2012), the Q-value defines the quality of a state-action pair: it is the total expected discounted cost incurred by a policy that takes action $a_k \in A(s_k)$ in state $s_k \in S$ and follows the optimal policy in the subsequent states. Q-values implicitly contain information regarding the transition probabilities. For the state-action pair $(s_k, a_k)$, the Q-value is defined as:

$$Q(s_k, a_k) = r_k + \gamma \sum_{s_{k+1} \in S} p(s_k, a_k, s_{k+1})\, V(s_{k+1}) \tag{1}$$
The Q-values are learned iteratively, without a model of the transition probabilities, through the update:

$$Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \alpha \left\{ r_k + \gamma \min_{a \in A(s_{k+1})} Q(s_{k+1}, a) - Q(s_k, a_k) \right\} \tag{2}$$

where $\alpha$ is the learning rate.
Each rule $R_i$ of the fuzzy inference system (FIS) stores a quality parameter for every controller-disturber action pair:

$$R_i:\;\; q(i, a_i, d_i), \quad a_i \in A,\; d_i \in D \tag{4}$$

$R_i$ being the $i$th rule of the rule-base, $A = \{a_1, a_2, \ldots, a_m\}$ is the controller action set while $D = \{d_1, d_2, \ldots, d_o\}$ is the action set for the disturber. Parameter $q(i, a_i, d_i)$ represents the quality of the controller-disturber action pair for each rule $R_i$. For each rule $R_i$, a game matrix (e.g., for $|A| = 3$, $|D| = 3$) is constructed (Fig. 1).

Solving each rule's game matrix by linear programming yields the local Minimax action $a_i^*(s_k)$ and the corresponding game value. The global optimal policy combines the local Minimax policies as:

$$a^*(s_k) = \frac{\sum_{i=1}^{N} a_i^*(s_k)\,\phi_i(s_k)}{\sum_{i=1}^{N} \phi_i(s_k)} \tag{5}$$

For learning, the local action under each rule is chosen pseudo-stochastically:

$$a_i(s_k) = \begin{cases} a_i^r(s_k) & \text{with probability } \epsilon \\ a_i^*(s_k) & \text{otherwise} \end{cases} \tag{6}$$

where $a_i^r(s_k)$ is a random action from $A$. We have used pseudo-stochastic exploration wherein the exploration is gradually reduced ($\epsilon$ parameter) to 0.02.

The FIS antecedent part generates the rule firing strength vector

$$\Phi(s_k) = \{\phi_1(s_k)\;\; \phi_2(s_k) \ldots \phi_i(s_k) \ldots \phi_N(s_k)\} \tag{7}$$

and the global (continuous) control action is the firing-strength-weighted combination of the local exploratory actions:

$$a^{mgc}(s_k) = \frac{\sum_{i=1}^{N} a_i(s_k)\,\phi_i(s_k)}{\sum_{i=1}^{N} \phi_i(s_k)} \tag{8}$$

The Minimax value of rule $R_i$'s game matrix at the next state gives the local value $V_i(s_{k+1})$:

$$V_i(s_{k+1}) = \min_{\pi_i} \max_{d_i \in D} \sum_{a_i \in A} \pi_i(a_i)\, q(i, a_i, d_i) \tag{9}$$

and the global value of the next state combines the local values as:

$$V(s_{k+1}) = \frac{\sum_{i=1}^{N} V_i(s_{k+1})\,\phi_i(s_{k+1})}{\sum_{i=1}^{N} \phi_i(s_{k+1})} \tag{10}$$

Actions $a_i(s_k)$ are obtained stochastically to form the global Q-value. Q-values are inferred from the quality of the local Minimax discrete-actions:

$$Q(s_k) = \frac{\sum_{i=1}^{N} q(i, a_i, d_i)\,\phi_i(s_k)}{\sum_{i=1}^{N} \phi_i(s_k)} \tag{11}$$

Analogously, the global disturber action is:

$$d(s_k) = \frac{\sum_{i=1}^{N} d_i(s_k)\,\phi_i(s_k)}{\sum_{i=1}^{N} \phi_i(s_k)} \tag{12}$$
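As an illustration of (5)-(12), the sketch below solves one rule's game matrix as a standard matrix game by linear programming and forms global quantities as firing-strength-weighted averages; the scipy-based LP formulation and all sizes are our assumptions, not the paper's implementation.

import numpy as np
from scipy.optimize import linprog

def solve_rule_game(q_mat):
    # Minimax solution of one rule's |A| x |D| game matrix of costs: the
    # controller mixes over rows (pi) to minimize the worst case over
    # disturber columns. LP variables are [pi_1..pi_m, v]; returns (pi, V_i).
    m, o = q_mat.shape
    c = np.r_[np.zeros(m), 1.0]                    # minimize v
    A_ub = np.c_[q_mat.T, -np.ones(o)]             # sum_a pi_a q[a, d] <= v for all d
    b_ub = np.zeros(o)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)   # sum_a pi_a = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[:m], res.x[m]

def weighted_global(phi, local_values):
    # Firing-strength-weighted combination used in (5), (8), (10)-(12).
    phi = np.asarray(phi, dtype=float)
    return float(np.dot(phi, local_values)) / phi.sum()

The disturber side is symmetric: it mixes over columns to maximize the minimum over rows.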
2.2.3. q update
For controlling the system, we use the continuous action $a^{mgc}$, but for learning, we use the discrete action $a_i$ chosen stochastically under each rule $R_i$. The temporal difference (TD) error is calculated as:

$$\Delta Q = r_k + \gamma\, V(s_{k+1}) - Q(s_k) \tag{13}$$

Fig. 1. Game matrix.
The q-value of each fired rule is then updated in proportion to its normalized firing strength:

$$q(i, a_i, d_i) \leftarrow q(i, a_i, d_i) + \alpha\, \Delta Q\, \frac{\phi_i(s_k)}{\sum_{i=1}^{N} \phi_i(s_k)} \tag{14}$$
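A compact sketch of the learning step (13)-(14); the indexing of q and the constant learning rate alpha are illustrative assumptions.

import numpy as np

def q_rule_update(q, fired, phi, r_k, V_next, Q_sk, gamma=0.95, alpha=0.1):
    # fired maps rule index i -> (a_i, d_i) chosen under that rule.
    dQ = r_k + gamma * V_next - Q_sk          # TD error, Eq. (13)
    w = np.asarray(phi, dtype=float)
    w /= w.sum()                              # normalized firing strengths
    for i, (a_i, d_i) in fired.items():
        q[i, a_i, d_i] += alpha * dQ * w[i]   # credit by firing strength, Eq. (14)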
3. Lyapunov theory based Markov game fuzzy control

The Lyapunov theory based controller treats the accumulated cost as a candidate function,

$$C(r_k) = \sum_{k=0}^{\infty} r_k \tag{15}$$

and, at every state, picks a control action $a^{lyap}(s_k)$ that makes the candidate function decrease along the system trajectory. For the inverted pendulum, the Lyapunov action is constructed from the pendulum angle $\theta_k$ and angular velocity $\dot{\theta}_k$ (16); for the two-link manipulator, analogous actions $a^{lyap}(s_{1k})$ and $a^{lyap}(s_{2k})$ are constructed from the link-wise states $(\theta_{1k}, \dot{\theta}_{1k})$ and $(\theta_{2k}, \dot{\theta}_{2k})$ (17), (18).

Under the hybrid control action $a^{lyap\_mg}(s_k)$, the inverted pendulum transitions to the next state $s_{k+1}$ and generates a cost $r_k^{lyap\_mg}$ that remains small while the pole angle satisfies $|\theta_k| < 180°$ and equals the failure cost 100 once $|\theta_k|$ exceeds $180°$ (19). The corresponding TD error is:

$$\Delta Q^{lyap\_mg} = r_k^{lyap\_mg} + \gamma\, V(s_{k+1}) - Q(s_k) \tag{20}$$

We use the cost $r_k^{lyap\_mg}$ (19) to update the candidate function $C(r_k^{lyap\_mg})$; the decrease condition on the candidate function, expressed through the term $\dot{\theta}_k + \cos\theta_{1k}\sin\theta_{1k}$ with the factor $\Delta(\theta_k) = 1 - \cos^2\theta_k$ (21), certifies stability of the generated actions. The weighting between the Lyapunov action and the learned Markov game action is annealed through the visit-count factor $n_k(i, a_i, d_i)/(n_o + n_k(i, a_i, d_i))$, so that the safe Lyapunov action dominates early learning and the optimized game theoretic action takes over as experience accumulates.
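One plausible reading of the annealing just described is sketched here; the convex-blend form and the value of n_o are our assumptions drawn from the visit-count factor, not the paper's exact schedule.

def hybrid_action(a_lyap, a_mgc, n_visits, n_o=50.0):
    # Annealed hybrid: with few visits the safe Lyapunov action dominates;
    # as experience accumulates, weight shifts to the learned game action.
    lam = n_visits / (n_o + n_visits)   # anneals from 0 toward 1
    return (1.0 - lam) * a_lyap + lam * a_mgc

Early in learning the controller thus behaves like the (stable) Lyapunov controller, and asymptotically like the optimized Markov game controller.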
The Lyapunov theory based Markov game fuzzy controller is detailed in Fig. 2. The figure shows the various modules that take part in generating a safe and stable control policy.

We now briefly describe how the proposed controller arrives at a game theoretic stable policy. The FIS antecedent part (3) receives the current state, control action and disturber action to generate a rule firing strength vector $\Phi(s_k) = \{\phi_1(s_k)\;\phi_2(s_k)\ldots\phi_i(s_k)\ldots\phi_N(s_k)\}$. The FIS consequent part in (3) generates rule specific q-values. These q-values are used to form a game matrix, which is solved using linear programming to generate an optimal policy $a^*(s_k)$ (5) and an optimal value $V(s_{k+1})$ (10).
The inverted pendulum dynamics follow the standard cart-pole model:

$$\ddot{\theta} = \frac{g\sin\theta + \cos\theta \left( \dfrac{-F - m_p l\, \dot{\theta}^2 \sin\theta}{m_c + m_p} \right)}{l \left( \dfrac{4}{3} - \dfrac{m_p \cos^2\theta}{m_c + m_p} \right)} \tag{22}$$

Gaussian fuzzy sets are defined over each input variable:

$$\mu_{l_p}(x_j) = e^{-\frac{(x_j - \bar{x}_j^{\,l_p})^2}{2\sigma_j^2}}; \quad l_p = 1, 2, 3; \;\; j = 1, 2 \tag{23}$$

where $l_p$ is the fuzzy label, and $\bar{x}_j^{\,l_p}$ and $\sigma_j$ are the center and width of the corresponding Gaussian membership function.
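For reference, a forward-Euler step of (22); the mass, length and time-step values are common cart-pole benchmark choices assumed here, not necessarily the paper's.

import numpy as np

g, m_c, m_p, l, dt = 9.8, 1.0, 0.1, 0.5, 0.02   # assumed benchmark values

def pendulum_step(theta, theta_dot, F):
    # Advance (theta, theta_dot) one Euler step under cart force F, Eq. (22).
    total = m_c + m_p
    num = g * np.sin(theta) + np.cos(theta) * (
        -F - m_p * l * theta_dot**2 * np.sin(theta)) / total
    den = l * (4.0 / 3.0 - m_p * np.cos(theta)**2 / total)
    theta_ddot = num / den
    return theta + dt * theta_dot, theta_dot + dt * theta_ddot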
The manipulator dynamics take the general form:

$$M(\theta)\ddot{\theta} + C(\theta, \dot{\theta})\dot{\theta} + g(\theta) + f(\dot{\theta}) + u_d = u \tag{24}$$

where $M(\theta)$ is the inertia matrix, $C(\theta, \dot{\theta})$ collects the Coriolis and centrifugal terms, $g(\theta)$ is the gravity vector, $f(\dot{\theta})$ the friction, $u_d$ the disturbance torque, and $u$ the control torque.
$$\mu_{l_p}\big(x_j(n)\big) = e^{-\frac{\left(x_j(n) - \bar{x}_j^{\,l_p}(n)\right)^2}{2\sigma_j^2(n)}}; \quad l_p = 1, 2, 3; \;\; j = 1, 2; \;\; n = 1, 2 \tag{25}$$

where $l_p$ is the (fuzzy) label for the $j$th input variable of the link in joint $n$, the inputs being $(x_1(1) = \theta(1),\; x_1(2) = \theta(2),\; x_2(1) = \dot{\theta}(1),\; x_2(2) = \dot{\theta}(2))$. The centers and widths corresponding to these fuzzy variables are …
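The product-of-Gaussians rule firing implied by (23) and (25) can be computed as below; the three labels per variable and the center/width values are illustrative assumptions.

import numpy as np

centers = np.array([-0.5, 0.0, 0.5])   # three labels per input (assumed)
sigma = 0.3                             # common width (assumed)

def memberships(x):
    # Gaussian membership of scalar input x in each label, Eqs. (23)/(25).
    return np.exp(-(x - centers) ** 2 / (2.0 * sigma ** 2))

def firing_strengths(x1, x2):
    # phi_i for the 3 x 3 = 9 rules over two inputs (product t-norm).
    return np.outer(memberships(x1), memberships(x2)).ravel()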
4. Simulation results
We compare the proposed controller against a baseline game theory based RL control, namely Markov game fuzzy RL control. The proposed Lyapunov theory based Markov game fuzzy control shares all shortcomings and advantages of the corresponding Markov game fuzzy control, both being RL based approaches. This provides a fair ground for comparison, the more so because our primary claim is that the inclusion of a Lyapunov theory based action generation mechanism makes the fuzzy Markov game controller stable. For a comparison of fuzzy Markov game control against conventional and other RL based methods, we refer the reader to Sharma and Gopal (2008).
Fig. 7. Two-link controller comparison: tracking error for link 1.
Fig. 11. Two-link comparison without payload variations: tracking error for link 1.
Fig. 12. Two-link comparison without payload variations: tracking error for link 2.
Fig. 13. Two-link comparison without payload variations: torque exerted at link 1.
Fig. 14. Two-link comparison without payload variations: torque exerted at link 2.
The simulation model (Fig. 19) for the two-link robot arm has been taken from Sharma and Gopal (2008). A two-link robot arm has all the nonlinear effects common to general robot manipulators and is easy to simulate.
Manipulator dynamics

$$\begin{bmatrix} \alpha + \beta + 2\eta\cos\theta_2 & \beta + \eta\cos\theta_2 \\ \beta + \eta\cos\theta_2 & \beta \end{bmatrix} \begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} + \begin{bmatrix} -\eta\,(2\dot{\theta}_1\dot{\theta}_2 + \dot{\theta}_2^{\,2})\sin\theta_2 \\ \eta\,\dot{\theta}_1^{\,2}\sin\theta_2 \end{bmatrix} + \begin{bmatrix} \alpha e_1\cos\theta_1 + \eta e_1\cos(\theta_1 + \theta_2) \\ \eta e_1\cos(\theta_1 + \theta_2) \end{bmatrix} = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix}$$

where $\alpha = (m_1 + m_2)a_1^2$, $\beta = m_2 a_2^2$, $\eta = m_2 a_1 a_2$, and $e_1 = g/a_1$.
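A minimal simulation step for these dynamics, inverting the inertia matrix at each step; the unit masses/lengths and time step are placeholder assumptions.

import numpy as np

m1, m2, a1, a2, g, dt = 1.0, 1.0, 1.0, 1.0, 9.8, 0.01     # assumed values
al, be, et = (m1 + m2) * a1**2, m2 * a2**2, m2 * a1 * a2  # alpha, beta, eta
e1 = g / a1

def arm_step(th, thd, tau):
    # th, thd, tau: length-2 arrays of joint angles, velocities, torques.
    c2, s2 = np.cos(th[1]), np.sin(th[1])
    M = np.array([[al + be + 2 * et * c2, be + et * c2],
                  [be + et * c2, be]])
    h = np.array([-et * (2 * thd[0] * thd[1] + thd[1]**2) * s2,
                  et * thd[0]**2 * s2])                    # Coriolis terms
    grav = np.array([al * e1 * np.cos(th[0]) + et * e1 * np.cos(th[0] + th[1]),
                     et * e1 * np.cos(th[0] + th[1])])     # gravity terms
    thdd = np.linalg.solve(M, tau - h - grav)
    return th + dt * thd, thd + dt * thdd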
Desired trajectories

The desired trajectories for the joint angles $\theta_1 = \theta(1)$ and $\theta_2 = \theta(2)$ are set as in Sharma and Gopal (2008): $\theta_{desired}(1) = 0.5 + 0.2(\sin t$ …

The SCARA arm dynamics take the same two-link form, parameterized through $p_1, \ldots, p_5$ (no gravity terms appear, the SCARA links moving in the horizontal plane):

$$\begin{bmatrix} p_1 + p_2 c_2 & p_3 + 0.5\,p_2 c_2 \\ p_3 + 0.5\,p_2 c_2 & p_3 \end{bmatrix} \begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} + \begin{bmatrix} p_4 - p_2 s_2 \dot{\theta}_2 & -0.5\,p_2 s_2 \dot{\theta}_2 \\ 0.5\,p_2 s_2 \dot{\theta}_1 & p_5 \end{bmatrix} \begin{bmatrix} \dot{\theta}_1 \\ \dot{\theta}_2 \end{bmatrix} = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix}$$

where $c_2 = \cos\theta_2$ and $s_2 = \sin\theta_2$.

References
Aguilar-Ibanez, C., 2008. A constructive Lyapunov function for controlling the inverted pendulum. In: Proceedings of the American Control Conference, Westin Seattle Hotel, Seattle, Washington, USA, June 11–13.
Busoniu, L., Babuska, R., De Schutter, B., Ernst, D., 2010. Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press.
Ju, X., Lian, C., Zuo, L., He, H., 2014. Kernel-based approximate dynamic programming for real-time online learning control: an experimental study. IEEE Trans. Control Syst. Technol. 22 (1).
Katic, D., Vukobratovic, M., 2013. Intelligent Control of Robotic Systems. Springer Science.
Kobayashi, K., Mizoue, H., Kuremoto, T., Obayashi, M., 2009. A meta-learning method based on temporal difference error. In: ICONIP 2009, Part 1, LNCS, vol. 5863, pp. 530–537.
Kumar, R., Nigam, M.J., Sharma, S., Bhavsar, P., 2012. Temporal difference based tuning of fuzzy logic controller through reinforcement learning to control an inverted pendulum. Int. J. Intell. Syst. Appl. 9, 15–21.
Levine, J., 2009. Analysis and Control of Nonlinear Systems. Springer-Verlag, London.
Lin, Chuan-Kai, 2009. H∞ reinforcement learning control of robot manipulators using fuzzy wavelet networks. Fuzzy Sets Syst. 160, 1765–1786.
Liu, Q., Zhou, X., Zhu, F., Fu, Q., Fu, Y., 2014. Experience replay for least squares policy iteration. IEEE/CAA J. Autom. Sin. 1 (3).
Sharma, Rajneesh, Gopal, M., 2008. A Markov game adaptive fuzzy controller for robot manipulators. IEEE Trans. Fuzzy Syst. 16 (1), 171–186.
Sharma, Rajneesh, 2015. Game theoretic Lyapunov fuzzy control for inverted pendulum. In: Proceedings of the 4th IEEE International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), Amity University, Noida, Uttar Pradesh, India, pp. 1–6.
Tang, Li, Liu, Yan-Jun, Tong, Shaocheng, 2014. Adaptive neural control using reinforcement learning for a class of robot manipulator. Neural Comput. Appl. 25, 135–141.
Vrabie, D., Vamvoudakis, K.G., Lewis, F.L., 2013. Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. IET.
Wiering, M., van Otterlo, M. (Eds.), 2012. Reinforcement Learning: State-of-the-Art. Adaptation, Learning and Optimization, vol. 12. Springer, Berlin.