
Proceedings of the 1999 IEEE International Symposium on Intelligent Control / Intelligent Systems and Semiotics

Cambridge, MA, September 15-17, 1999

A neural network with minimal structure


for MagLev system modeling and control
Mostafa Lairi and Gérard Bloch
Centre de Recherche en Automatique de Nancy (CRAN-CNRS ESA 7039)
ESSTIN, Rue Jean Lamour, 54500 Vandoeuvre, France
E-mail: bloch@cran.esstin.u-nancy.fr

Abstract

The paper is concerned with the determination of a minimal structure of a one hidden layer perceptron for system identification and control. Structural identification is a key issue in neural modeling. Decreasing the size of a neural network is a way to avoid overfitting and bad generalization, and moreover leads to simpler models, which are required for real time applications, particularly in control. A learning algorithm and a pruning method, both based on a criterion robust to outliers, are presented. Their performances are illustrated on a real example, the inverse model identification of a MagLev system, which is nonlinear, dynamical and fast. This inverse model is used in a feedforward neural control scheme. Very satisfactory approximation performances are obtained for a network with very few parameters.

Keywords

one hidden layer perceptron, structural identification, pruning, robust criterion, magnetic levitation.

1. Introduction

Artificial neural networks have been the focus of a great deal of attention during the last decade, due to their capabilities to solve nonlinear problems by learning. Backpropagation or derived algorithms have been particularly applied for process modeling and identification [5, 16] and for control [1, 9, 14].

In order to diminish the influence of outliers (i.e. to provide a more reliable and robust estimate of the unknown parameter vector), an M-estimator can be used. This approach, stemming from robust statistics, has been introduced for neural network learning by several authors [2, 3, 6, 7, 13, 17]. It must be noted that neural networks are not at all intrinsically insensitive to outliers, and ignoring the effect of outliers, particularly for industrial data which are frequently corrupted by spiky noise, can lead to biased estimators but also to overparametrized structures and poor generalization as a consequence of overfitting.

Even if in some cases perceptrons with several hidden layers carry out more precise mappings, most of the time perceptrons with one hidden layer and linear activation at the output are sufficient, and the universal approximation ability of such a structure has been proved. The search for the appropriate structure of a perceptron for modeling a particular system is one of the key issues in neural modeling. Apart from the fact that decreasing the size of the network leads to simpler models, which are required for real time control with short sampling period, constraining the model to be "simple" in some sense is a way to avoid overfitting, which often causes bad generalization [10]. Several approaches can be used to constrain the number of network parameters or the parameters themselves, including the use of regularized learning criteria (e.g. "weight decay") or the pruning of useless parameters after learning. This paper focuses on pruning methods, particularly the Optimal Brain Surgeon (OBS) method [8], insofar as, in practice, tuning the balance parameter for regularized criteria is not an easy task, which is most of the time done by hand.

In the second part, the form of a one hidden layer perceptron with linear activation function at the output and the general Levenberg-Marquardt estimation algorithm are briefly recalled. In the third part, a robust weighted learning criterion is described, while the fourth part recalls the principle of the OBS procedure and the weak modifications necessary to incorporate the robust criterion for weight elimination. The last part illustrates the proposed robust algorithms for parameter estimation and pruning on a real magnetic levitation system (MagLev), which is nonlinear and unstable. The goal is to build a neural inverse model of the MagLev system to be included in a feedforward control strategy [11, 12]. Results obtained with the standard Levenberg-Marquardt parameter estimation algorithm followed by OBS pruning, and with outlier-robust learning and pruning, are compared, showing that a network with very few parameters and very satisfactory approximation performances can be obtained with the second approach.



Such a network can then be easily incorporated in a real time application, with short sampling time.

2. Neural model and general learning rule

The one hidden layer perceptron with linear activation function at the output is considered here. Its form is given, for a single output, by:

ŷ = Σ_{i=1..n_h} w²_i g( Σ_{h=1..n_i} w¹_{ih} x_h + b¹_i ) + b²    (1)

where x_h, h = 1,...,n_i, are the inputs of the network, w¹_{ih} and b¹_i, i = 1,...,n_h, h = 1,...,n_i, are the weights and biases of the hidden layer, the activation function g is the hyperbolic tangent, and w²_i, i = 1,...,n_h, and b² are the weights and bias of the output neuron. All the weights and biases of the network can be grouped in the parameter vector θ, and the inputs x_h in the regression vector φ(x) = [x_1 x_2 ... x_{n_i}]ᵀ. So, for the case k, the output predicted by the network can be written: ŷ(k,θ) = NN(φ(x(k)), θ).

To estimate the parameter vector θ from data, the prediction error:

ε(k,θ) = y(k) − ŷ(k,θ)    (2)

with y(k) the desired output, is formed and incorporated in a criterion to be minimized. A general form for this criterion, which leads to an M-estimator, is given, for n samples, by:

V(θ) = (1/n) Σ_{k=1..n} L(ε(k,θ))    (3)

where L(.) is a scalar case cost function. The minimization of the criterion (3) can therefore be carried out using the Gauss-Newton algorithm:

θ^(i+1) = θ^(i) − [H(θ^(i))]⁻¹ V'(θ^(i))    (4)

In (4), the gradient V'(θ) of the criterion (3) with respect to θ is given by:

V'(θ) = −(1/n) Σ_{k=1..n} ψ(k,θ) L'(ε(k,θ))    (5)

where ψ(k,θ) is the gradient of ŷ(k,θ) with respect to θ and L'(ε(k,θ)), the score or influence function, is the first derivative of L(ε(k,θ)) with respect to ε. The second derivative H(θ) of the criterion (3) with respect to θ, known as the Hessian matrix, is obtained by differentiating (5) and can be approximated, with the Levenberg-Marquardt update rule, by:

H(θ) ≈ (1/n) Σ_{k=1..n} ψ(k,θ) L''(ε(k,θ)) ψᵀ(k,θ) + μ I    (6)

where L''(ε(k,θ)) is the second derivative of L(ε(k,θ)) with respect to ε, I is the identity matrix and μ a small non-negative scalar, adjusted during learning.

The criterion most frequently used for parameter estimation is the classical ordinary least-squares criterion (L2 norm), with a quadratic case cost function:

L(ε(k,θ)) = (1/2) ε²(k,θ)    (7)

whose derivatives are simply:

L'(ε(k,θ)) = ε(k,θ) ,  L''(ε(k,θ)) = 1    (8)

Such a criterion receives the largest contributions from the points which have the largest errors, and the solution can be dominated by a very small number of points which can be gross errors or outliers.

3. An outlier-robust learning rule

The batch learning method presented here is detailed in [4, 17]. This Iteratively Reweighted Least Squares (IRLS) method starts, following Huber, from a distribution of the noise e contaminated by outliers, expressed as a mixture of two probability density functions. The first one corresponds to the basic distribution of the measurement noise, for example Gaussian with variance σ0², and the second one, corresponding to outliers, is arbitrary symmetric long-tailed, for example also Gaussian, but with variance σ1² such that σ0² << σ1²:

e(k) ~ (1−p) N(0, σ0²) + p N(0, σ1²)

where p is the probability of occurrence of a large error. In practice, neither the probability p nor the two variances σ0² and σ1² are known, and the preceding model is replaced by:

ε(k) ~ (1−δ(k)) N(0, σ0²) + δ(k) N(0, σ1²)    (9)

where ε(k) is the prediction error, given by (2), δ(k) = 0 for |ε(k)| ≤ M and δ(k) = 1 for |ε(k)| > M, and M is a bound which can be taken as 3σ0.

The unknown variances σ0² and σ1² are estimated as follows. At each iteration i of the algorithm (4), the prediction error sequence ε is calculated by (2), and the variances are recursively estimated by:
for |ε(k)| ≤ 3σ0(k−1):
  σ0²(k) = σ0²(k−1) + (1/(k − τ(k))) (ε²(k) − σ0²(k−1))    (10a)
otherwise:
  σ0²(k) = σ0²(k−1)

and

for |ε(k)| > 3σ0(k−1):
  σ1²(k) = σ1²(k−1) + (1/τ(k)) (ε²(k) − σ1²(k−1))    (10b)
otherwise:
  σ1²(k) = σ1²(k−1)

with τ(0) = 0 at each iteration, and τ(k+1) = τ(k) + 1 whenever |ε(k)| > 3σ0(k−1). σ1²(0) can be chosen equal to σ0²(0), and σ0²(0) equal to the classically calculated variance of the prediction errors at the first iteration. Note that τ(n) is the estimate of the number of outliers.

The variance σε²(k) of ε(k) is finally given by:

σε²(k) = (1−δ(k)) σ0²(n) + δ(k) σ1²(n)    (11)

leading to the weighted robust norm:

L(ε(k)) = ε²(k) / (2 σε²(k))    (12)

Algorithm (4) can then be employed with (12) as case cost function in criterion (3), with its first derivative:

L'(ε(k)) = ε(k) / σε²(k)    (13)

in the criterion gradient (5), and with its second derivative:

L''(ε(k)) = 1 / σε²(k)    (14)

in the approximate Hessian (6).

4. Robust Pruning

After estimation of the parameter vector, it can be useful to determine the minimal structure of the network. Removing the useless parameters obviously leads to a simpler model, but moreover diminishes the overfitting of the model to the data, i.e. the learning of the noise at the same time as the unknown underlying model of the system, and can thus improve the generalization abilities of the model. The pruning algorithm which is probably the most classical is the Optimal Brain Surgeon (OBS), proposed by Hassibi and Stork [8]. This algorithm minimizes the sensitivity of the error criterion subject to the constraint of nullity of a weight, which expresses the deletion of this weight. The sensitivity δV(θ) of the criterion V(θ) is approximated by a Taylor expansion around θ to order two:

δV(θ) = V'(θ)ᵀ δθ + (1/2) δθᵀ H δθ    (15)

The gradient V'(θ) being zero after convergence, the first term in (15) vanishes, leading to:

δV(θ) ≈ (1/2) δθᵀ H δθ    (16)

which involves only the Hessian H. Noting e_q the canonical vector selecting the qth component of θ (e_qᵀ = [0 ... 0 1 0 ... 0]), the deletion of the weight θ_q (i.e. e_qᵀ(δθ + θ) = 0) must lead to a minimal increase of the criterion. The following Lagrangian can thus be written:

Λ(δθ) = (1/2) δθᵀ H δθ + λ e_qᵀ(δθ + θ)    (17)

and minimized, leading to:

δθ = − (θ_q / [H⁻¹]_qq) H⁻¹ e_q    (18)

where [H⁻¹]_qq is the qth diagonal term of H⁻¹. Substituting (18) into (16), the increase of the criterion for deleting θ_q is θ_q² / (2 [H⁻¹]_qq); the weight to be deleted is the one which minimizes this quantity. Equation (18) allows to force the qth weight to zero and to update the remaining weights, without retraining. It can nevertheless be useful to retrain the network after each pruning of a weight, in order to compensate for the approximation introduced by the Taylor expansion (15).

If the true quadratic criterion is used, the corresponding approximate Hessian is given by:

H(θ) ≈ (1/n) Σ_{k=1..n} ψ(k,θ) ψᵀ(k,θ)    (19)

For the robust weighted criterion, the Hessian becomes:

H(θ) ≈ (1/n) Σ_{k=1..n} ψ(k,θ) ψᵀ(k,θ) / σε²(k)    (20)

where σε²(k) is given by (11). As presented in [18] on simulation examples, the use of the weighted robust criterion for initial learning as well as for pruning greatly improves the estimation of the parameters, particularly the associated generalization capabilities, and the model structure selection.
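As an illustration of the procedure above, the following NumPy sketch (assumed variable names, not the authors' implementation) computes the robust variances of (10a)-(10b), the per-sample variances of (11), the robust Hessian of (20) and the OBS deletion step of (18). Here psi is assumed to be the n x d matrix whose k-th row is ψ(k,θ), as obtained by backpropagation, theta the current parameter vector, and the damping value mu an arbitrary choice.

```python
import numpy as np

def robust_variances(eps):
    """Recursive estimates of sigma0^2 and sigma1^2, cf. (10a)-(10b),
    and the resulting per-sample variance sigma_eps^2(k) of (11)."""
    n = len(eps)
    s0 = float(np.var(eps))   # sigma0^2(0): classical variance of the errors
    s1 = s0                   # sigma1^2(0) chosen equal to sigma0^2(0)
    tau = 0                   # running estimate of the number of outliers
    outlier = np.zeros(n, dtype=bool)
    for k in range(n):
        if abs(eps[k]) <= 3.0 * np.sqrt(s0):
            s0 += (eps[k] ** 2 - s0) / (k + 1 - tau)   # (10a)
        else:
            tau += 1
            outlier[k] = True
            s1 += (eps[k] ** 2 - s1) / tau             # (10b)
    return np.where(outlier, s1, s0)                   # sigma_eps^2(k), cf. (11)

def robust_hessian(psi, sig2, mu=1e-3):
    """Approximate robust Hessian (20), with Levenberg-Marquardt damping mu."""
    n, d = psi.shape
    return (psi.T / sig2) @ psi / n + mu * np.eye(d)

def obs_step(theta, H):
    """Saliency of each weight and the OBS update (18) deleting the least useful one."""
    Hinv = np.linalg.inv(H)
    saliency = 0.5 * theta ** 2 / np.diag(Hinv)        # increase (16) of the criterion
    q = int(np.argmin(saliency))                       # weight with minimal saliency
    dtheta = -(theta[q] / Hinv[q, q]) * Hinv[:, q]     # forces theta_q to zero
    return q, theta + dtheta
```

The standard OBS of [8] is recovered by replacing sig2 with a vector of ones, i.e. by using the unweighted Hessian (19); in both cases a short retraining after each deletion compensates for the second-order approximation (15).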
5. Magnetic levitation example

Outlier-robust learning and structure determination presented above are applied to model the inverse behavior of a magnetic levitation system (MagLev). A precise description of the MagLev system, which allows to balance a metallic ball without any support, by the use of an electromagnet, and to steer the object along a desired vertical trajectory, can be found in [11, 12]. The system input is the voltage U applied to the coil and the output is the voltage Vx corresponding to the position X of the ball (cf. figure 1).

Figure 1: Principle diagram of magnetic levitation

A neural inverse model of the system is used with a PID in a feedforward control strategy (cf. figure 2).

Figure 2: Principle of neural feedforward control

Based on the structure of a direct linear model of the system, the expression of the output Vx at time k+1 can be written:

Vx(k+1) = f(Vx(k), Vx(k−1), U(k), U(k−1))    (21)

where f is a nonlinear function modeling the system behavior. The inverse model can thus be expressed as:

U(k) = f⁻¹(Vx(k+1), Vx(k), Vx(k−1), U(k−1))    (22)

So, taking φ(k) = [Vx(k+1) Vx(k) Vx(k−1) U(k−1)] as regression vector, the neural network estimates the inverse model: Û(k) = NN(φ(k), θ).

To collect data for estimation and validation of the neural model, the experiment is carried out in closed loop, the system being unstable in open loop. A white noise is added to the control signal U, given by the PID controller, to enrich the excitation signal, as presented in figure 3. In this figure, "ML system" is the block which represents the real system (real time interfaced).

Figure 3: Data collection in closed loop

The reference voltage is chosen with two goals: to control the system around the operating point (−3 volts) and to explore an operating range as large as possible. From 1500 couples (Vx(k), U(k)), presented in figure 4, the network parameters are estimated in two ways.

Figure 4: Learning set with 1500 data. (A) U in volts. (B) Vx in volts

The first one is the standard Levenberg-Marquardt (LM) algorithm for initial learning and the OBS algorithm for pruning, as implemented in [15]. The initial fully connected model comprises 10 hidden hyperbolic tangent units, i.e. 61 parameters. Figure 5A plots the evolution, during pruning, of the training error "x", of the test error "o" and of the Akaike's FPE estimate "+", with respect to the number of parameters. Starting from the initial learning (with 61 parameters), the plot should be read from right to left. Figure 5B gives the final architecture of the pruned network.
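For reference, the Akaike FPE used for this architecture selection weighs the training error by a parameter-counting penalty. A minimal sketch of the standard formula is given below; the exact computation in the toolbox [15] may differ slightly, for instance in the handling of regularization.

```python
def fpe(mse_train, n_samples, n_params):
    """Akaike's Final Prediction Error: training MSE inflated by a complexity
    penalty, evaluated for each candidate network along the pruning sequence."""
    return mse_train * (n_samples + n_params) / (n_samples - n_params)
```

During pruning, the candidate architecture with the smallest FPE is retained, as read off figures 5A and 6A.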
At each parameter deletion, retraining for 50 iterations maximum is performed. According to the FPE (Final Prediction Error), the optimal network architecture was the one comprising 9 hidden neurons and 38 parameters.

Figure 5: (A) Training error, test error and FPE with respect to the number of parameters. (B) Architecture of the pruned network

In [18], simulation examples show the importance of the quality of the model obtained after initial learning. Nevertheless, in order to focus on the pruning process, the second way uses the modified pruning presented in section 4, starting from the initial structure of 10 hidden neurons, with 61 parameters estimated as before with the standard LM algorithm.

Figure 6A plots, during robust pruning, the evolution of the training error "x", of the test error "o" and of the Akaike's FPE estimate "+" with respect to the number of parameters. According to the FPE, the optimal network architecture, presented in figure 6B, is the one comprising 6 hidden neurons with 19 parameters.

Figure 6: (A) Training error, test error and FPE with respect to the number of parameters. (B) Architecture of the pruned network

Comparisons are given in table 1. The "robust OBS" algorithm (ROBS) gives performances slightly better than the standard least-squares OBS in terms of mean square error on the learning set (MSLE) and on the test set (MSTE), but with a number of parameters divided by 2. Moreover, the robust pruning process is performed quicker, the number of iterations necessary for retraining at each weight deletion being clearly lower, with a stopping criterion on the gradient norm. Note, from figure 6A, that the network with only 11 parameters could finally be selected without significant worsening of the approximation performances.

Table 1: Comparison of standard OBS and robust OBS (ROBS)

                Before OBS    After OBS    After ROBS
# parameters        61            38           19
MSLE            3.02 10⁻³     3.95 10⁻³     3.9 10⁻³
MSTE            3.9 10⁻³      4.3 10⁻³      4.0 10⁻³
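For completeness, the way the 1500 logged couples (Vx(k), U(k)) of figure 4 are arranged into the regression vectors of (22) can be sketched as follows; array and function names are assumptions for illustration, not those of the toolbox [15].

```python
import numpy as np

def inverse_model_data(Vx, U):
    """Build regressors phi(k) = [Vx(k+1) Vx(k) Vx(k-1) U(k-1)] and targets U(k)."""
    n = len(Vx)
    phi = np.column_stack([Vx[2:n],      # Vx(k+1)
                           Vx[1:n-1],    # Vx(k)
                           Vx[0:n-2],    # Vx(k-1)
                           U[0:n-2]])    # U(k-1)
    target = U[1:n-1]                    # U(k), the desired network output
    return phi, target
```

At run time, the reference trajectory would typically stand in for Vx(k+1), the network output then providing the feedforward term added to the PID action of figure 2.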
In figure 7, the reference trajectory and the corresponding output Vx are plotted. The proposed strategy has been carried out in real time using the Matlab-Simulink and RTW (Real Time Workshop) software environment, with a sampling period of 3 ms.

Figure 7: Reference trajectory and output of the MagLev system

6. Conclusion

The use of artificial neural networks for control is motivated by their universal approximation capabilities. The feedforward one hidden layer perceptron with linear activation at the output gives a simple, but sufficiently general, structure for nonlinear modeling. In this paper, the usefulness of deleting spurious parameters and of an outlier-robust criterion for pruning is shown. The model size reduction is applied to the identification of the inverse model of a MagLev system for feedforward control. Obtaining a model with very few parameters and good approximation capabilities allows to envisage adaptive parameter estimation algorithms for systems requiring a short sampling period, which, without minimal structure neural models, cannot be implemented in real time in a standard environment.

7. References

[1] M. Agarwal, "A systematic classification of neural-network-based control", IEEE Control Systems Mag., Vol. 17, 1997, pp. 78-84.
[2] G. Bloch, P. Thomas, M. Ouladsine and M. Lairi, "On several outlier-robust training rules of neural networks for identification of nonlinear systems", 8th Int. Conf. on Neural Networks and their Applications NEURAP'96, Marseille, France, March 1996, pp. 13-19.
[3] G. Bloch, P. Thomas, and D. Theilliol, "Accommodation to outliers in identification of non linear SISO systems with neural networks", Neurocomputing, Vol. 14, no. 1, 1997, pp. 85-99.
[4] G. Bloch, F. Sirou, V. Eustache and P. Fatrez, "Neural intelligent control of a steel plant", IEEE Trans. on Neural Networks, Vol. 8, no. 4, July 1997, pp. 910-918.
[5] S. Chen, S.A. Billings, and P.M. Grant, "Non-linear system identification using neural networks", Int. J. Control, Vol. 51, no. 6, 1990, pp. 1191-1214.
[6] D.S. Chen and R.C. Jain, "A robust back propagation learning algorithm for function approximation", Third Int. Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, FL, January 1991, pp. 218-239.
[7] S.J. Hanson and D.J. Burr, "Minkowski-r back-propagation: learning in connectionist models with non-Euclidean error signals", in Neural Information Processing Systems, D.Z. Anderson (Ed.), American Institute of Physics, New York, 1988, pp. 348-357.
[8] B. Hassibi and D.G. Stork, "Second order derivatives for network pruning: optimal brain surgeon", in Advances in Neural Information Processing Systems, S.J. Hanson, J.D. Cowan and C.L. Giles (Eds.), Morgan Kaufmann, San Mateo, CA, Vol. 5, 1993, pp. 164-171.
[9] K. Hunt, D. Sbarbaro, R. Żbikowski, and P.J. Gawthrop, "Neural networks for control systems - a survey", Automatica, Vol. 28, 1992, pp. 1083-1112.
[10] C. Jutten and O. Fambon, "Pruning methods: a review", Proc. European Symp. on Artificial Neural Networks ESANN'95, Brussels, April 1995, pp. 129-140.
[11] M. Lairi, "Identification et commande neuronales de systèmes non linéaires : application à un système de sustentation magnétique", Thèse de Doctorat de l'Université Henri Poincaré - Nancy 1, spécialité Automatique, 1998.
[12] M. Lairi, G. Bloch, and G. Millerioux, "Real time feedforward neural control of a MagLev system", Neural Processing Letters, Kluwer Academic Publishers, 1999, submitted.
[13] K. Liano, "Robust error measure for supervised neural network learning with outliers", IEEE Trans. on Neural Networks, Vol. 7, no. 1, 1996, pp. 246-250.
[14] K.S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks", IEEE Trans. on Neural Networks, Vol. 1, 1990, pp. 4-27.
[15] M. Norgaard, "Neural Network Based System Identification Toolbox", Tech. Report 95-E-773, Institute of Automation, Technical University of Denmark, 1995.
[16] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.Y. Glorennec, H. Hjalmarsson, and A. Juditsky, "Nonlinear black-box modeling in system identification: a unified overview", Automatica, Vol. 31, no. 12, 1995, pp. 1691-1724.
[17] P. Thomas and G. Bloch, "From batch to recursive outlier-robust identification of non linear dynamic systems with neural networks", Proc. IEEE Int. Conf. on Neural Networks ICNN'96, Vol. 1, Washington, DC, June 1996, pp. 178-183.
[18] P. Thomas and G. Bloch, "Robust pruning for multilayer perceptrons", Proc. IMACS/IEEE Multiconference on Computational Engineering in Systems Applications CESA'98, P. Borne, M. Ksouri and A. El Kamel (Eds.), Vol. 4, Nabeul-Hammamet, Tunisia, April 1998, pp. 17-22.

