Abstract— In this paper, we use a deep learning method, the restricted Boltzmann machine, for nonlinear system identification. The neural model has a deep architecture and is generated by a random search method. The initial weights of this deep neural model are obtained from the restricted Boltzmann machines. To identify nonlinear systems, we propose special unsupervised learning methods with the input data. The normal supervised learning is used to train the weights with the output data. The modified algorithm is validated by modeling two benchmark systems.

Erick de la Rosa and Wen Yu are with the Departamento de Control Automatico, CINVESTAV-IPN (National Polytechnic Institute), Mexico City, Mexico.

I. INTRODUCTION

The multi-layer perceptron (MLP) can approximate any nonlinear function to any prescribed accuracy if sufficient hidden neurons are provided [1]. The structure identification is often tackled by trial-and-error approaches like fuzzy grids [2] and data mining methods [3]. In this paper, we increase the number of hidden layers and decrease the number of hidden units, keeping the model complexity while improving its modeling capacity.

Since the identification error space is unknown, the neural model can easily settle into a local minimum if the initial weights are not suitable [4]. There are some techniques to overcome local minima, such as noise-shaping modification [5] and combining nonlinear clustering [6]. However, these methods do not solve the key problem: wrong initial weights. In [7], the initial weights are calculated by sensitivity ratio analysis, and in [8] they are obtained by finding the support vectors of the input data. In this paper, we use deep learning techniques to find good initial weights.

A deep neural network (DNN) has the same structure as an MLP. There is no method that finds the best structure of a DNN; [9], [10], and [11] propose some effective algorithms to find a suitable one: 1) grid search, which tests all possible combinations of the structure parameters; 2) random search, which evaluates only a few combinations chosen by sampling rules, obtaining performance similar to that of grid search [11].

The restricted Boltzmann machine (RBM) is an unsupervised learning method. The results of [12] show that unsupervised pre-training can drive a DNN away from local minima in pattern recognition. Can similar results be obtained for system identification? To the best of our knowledge, there are no publications that apply deep learning to nonlinear system identification. RBMs cannot be applied to system identification directly, because the input values are not binary as in classification problems [13]. In this paper, we first construct a DNN, then design unsupervised learning methods with RBMs and use the trained weights as the initial weights of the DNN. Finally, a gradient-based supervised learning method is applied. Two benchmark examples are used to show the effectiveness of the DNN.

II. NONLINEAR SYSTEM IDENTIFICATION WITH DEEP NEURAL NETWORKS

Consider the following unknown discrete nonlinear system

x̄(k+1) = f[x̄(k), u(k)],   y(k) = g[x̄(k)]   (1)

where u(k) ∈ R^u is the input vector, x̄(k) ∈ R^n is an internal state vector, and y(k) ∈ R^m is the output vector. f and g are nonlinear smooth functions, f, g ∈ C^∞. Denote Y(k) = [y(k), y(k+1), ···, y(k+n−1)]^T and U(k) = [u(k), u(k+1), ···, u(k+n−2)]^T. If ∂Y/∂x̄ is non-singular at x̄ = 0, u = 0, this leads to the model (2)

y(k) = Φ[x(k)]   (2)

x(k) = [y(k−1), y(k−2), ···, u(k), u(k−1), ···]^T

Φ(·) is an unknown nonlinear difference equation representing the plant dynamics, u(k) and y(k) are the measurable scalar input and output, and d is the time delay. The nonlinear system (2) is a NARMA model. We can also regard the input of the nonlinear system as x(k) = [x_1, ···, x_n]^T ∈ R^n and the output as y(k) ∈ R. We use the MLP to identify the unknown system (2):

ŷ(k) = φ_l{ W_l φ_{l−1}( ··· φ_2( W_2 φ_1[ W_1 x(k) + b_1 ] + b_2 ) ··· ) + b_l }   (3)

where ŷ(k) ∈ R is the output of the neural model, W_1 ∈ R^{n_1×n}, b_1 ∈ R^{n_1}, W_2 ∈ R^{n_2×n_1}, b_2 ∈ R^{n_2}, ···, W_l ∈ R^{1×n_{l−1}}; n_i (i = 1, ···, l) are the node numbers of each layer, φ_i (i = 1, ···, l) are activation vector functions, and l is the layer number. We use a linear function for the output layer φ_l. The other layers use sigmoid functions of the form φ(z_j) = a_j / (1 + e^(−β_j z_j)) − γ_j, where i = 1, ···, l−1, j = 1, ···, n_i, the a_j, β_j, γ_j are prior defined positive constants, and z_j is the input to the sigmoid function.

From the Stone–Weierstrass theorem, we know that if the node number of one hidden layer is large enough, the neural model can approximate the nonlinear function Φ to any degree of accuracy for all x(k). Instead of increasing the node numbers n_i, in this paper we increase the layer number l; the deep neural structure of this paper has l ≥ 2. The goal of the neural identification is to find a suitable structure (layer number l, node numbers n_i of each layer) and weights (W_1, b_1, ···, W_l, b_l) such that the identification error e(k) = ŷ(k) − y(k) is minimized. To find a suitable structure of the neural model, we use the random search method [14]. The next job is to find suitable weights. The gradient descent method can arrive at a local minimum, depending completely on the initial weights W_1(0), ···, W_l(0). In the following sections, we will show how to use RBMs to find the initial weights and how to identify the nonlinear systems, as suggested in Fig. 1.
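The layered model (3) can be sketched as a short forward pass. The following is a minimal illustrative sketch, not the authors' implementation: the layer sizes, the random weights, and the sigmoid constants (a = β = 1, γ = 0) are assumptions for demonstration only.

```python
import numpy as np

def sigmoid_layer(z, a=1.0, beta=1.0, gamma=0.0):
    # phi(z) = a / (1 + exp(-beta * z)) - gamma, applied element-wise
    return a / (1.0 + np.exp(-beta * z)) - gamma

def deep_model(x, weights, biases):
    """Forward pass of model (3): sigmoid hidden layers, linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid_layer(W @ h + b)          # hidden layers use sigmoids
    return weights[-1] @ h + biases[-1]       # output layer is linear

# Illustrative structure: n = 3 inputs, four hidden layers of six nodes, scalar output
rng = np.random.default_rng(0)
sizes = [3, 6, 6, 6, 6, 1]
weights = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y_hat = deep_model(rng.standard_normal(3), weights, biases)
```

The hidden-layer sizes (6, 6, 6, 6) mirror the structure selected by random search in Section IV.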
Fig. 1. The identification scheme: the deep learning models are trained by unsupervised learning with the input x(k) to give the initial weights w_1(0), ···, w_p(0); the deep neural network is then trained by supervised learning on the error e(k) to produce ŷ(k).

Fig. 2. RBM Model 1: visible units x_1, ···, x_n and hidden units h_1, ···, h_{l_1}, connected by the weights W_1, b_1 (the reconstruction uses W_1^T, c_1).
III. RESTRICTED BOLTZMANN MACHINES FOR SYSTEM IDENTIFICATION

The RBM is a deep learning method which trains the weights under a probability distribution using a set of input data. We describe next how to use it for system identification. The first model of the RBM is shown in Fig. 2. The hidden representation of Model 1 is h_1(k) = φ[W_1 x(k) + b_1], and h_1(k) is the input of RBM Model 2. The weights of the RBM are updated by

W(k+1) = W(k) − η_3 ∂(−log P(x)) / ∂W   (7)

In order to calculate the gradient, the concept of the free energy z(x) is introduced by (8):

z(x) = −Σ_i c_i x_i − Σ_j log(1 + e^(b_j + W_j x))   (8)

Obviously, if x(k) ∈ (−∞, ∞), the integral term does not converge. Denote α_i = c_i + (W^T h)_i, where c_i is the bias applied to the visible units, see Fig. 2. If the x(k) are non-negative, x(k) ∈ [0, ∞), (9) becomes

p(x_i | h) = e^(α_i x_i) / ∫_0^∞ e^(α_i s) ds   (10)
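As a concrete illustration of (8), the free energy of a small RBM can be computed as follows. This is a sketch only: the weight matrix and the biases are arbitrary assumed values, not trained ones.

```python
import numpy as np

def free_energy(x, W, b, c):
    # Eq. (8): z(x) = - sum_i c_i * x_i - sum_j log(1 + exp(b_j + W_j . x))
    return -c @ x - np.sum(np.log1p(np.exp(b + W @ x)))

# Small illustrative RBM: 3 visible units, 2 hidden units
W = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.4, -0.6]])
b = np.array([0.0, 0.1])        # hidden biases
c = np.array([0.2, -0.1, 0.0])  # visible biases

z = free_energy(np.array([1.0, 0.0, 1.0]), W, b, c)
```

`log1p(exp(.))` is used instead of `log(1 + exp(.))` for numerical stability when the pre-activation is small.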
(10) has a finite value if α_i < 0 ∀i. The conditional probability distribution is p(x_i | h) = −α_i e^(α_i x_i), α_i < 0. The visible units which have this probability distribution are called exponential units. In order to perform the sampling process, we need to calculate the cumulative probability distribution P(x_i | h) = 1 − e^(α_i x_i), so P(x_i | h) always increases. The sampling process is possible via the inverse function P^(−1) of the cumulative probability. If r(k) is a value drawn from a uniform distribution, then we can associate it with the cumulative density value in (11); the expected value according to the distribution p(x_i | h) is E[x_i] = −1/α_i:

x_i(k) = ln(1 − r(k)) / α_i   (11)

2) The visible and hidden units are in the interval [0, 1]. The positive range [0, ∞) is very big, and sometimes the calculation accuracy is affected. If the input to the visible units is restricted to a special zone, normalization methods can be applied to force the interval into [0, 1]. In this case, the probability distribution computation (9) is also bounded, and the cumulative probability is given by (12):

P(x_i | h) = ∫_0^{x_i} e^(α_i s) ds / ∫_0^1 e^(α_i s) ds = (e^(α_i x_i) − 1) / (e^(α_i) − 1)   (12)

With a sampled unit r(k), the new value x_i(k) with respect to P is given by (13), and the expected value of the distribution is E[x_i] = 1/(1 − e^(−α_i)) − 1/α_i:

x_i(k) = log[1 + r(k)(e^(α_i) − 1)] / α_i   (13)

3) The visible and hidden units are in the interval [−m, m]. If the input set x(k) can be positive and negative, we define an interval [−m, m] for each visible unit. In this case the conditional probability is a truncated exponential; (9) becomes

p(x_i | h) = α_i e^(α_i x_i) / (e^(α_i m) − e^(−α_i m))   (14)

The sampling process, using the inverse function of the corresponding P with r(k) from the uniform distribution, and the expected value for this distribution are

x_i(k) = log[e^(−α_i m) + r(k)(e^(α_i m) − e^(−α_i m))] / α_i   (15)

E[x_i] = m (e^(α_i m) + e^(−α_i m)) / (e^(α_i m) − e^(−α_i m)) − 1/α_i

After the RBM training with the input data, the initial weights of the neural model are obtained. For the supervised learning, the weights are updated with the gradient of the squared identification error:

W_i(k+1) = W_i(k) − η_2 ∂e²(k)/∂W_i,   i = 1, ···, l   (16)

IV. SIMULATIONS

A. First order nonlinear system

The first order nonlinear system presented in [17] is

y(k+1) = y(k) / (1 + y²(k)) + u³(k)   (17)

where u(k) is a periodic input which has different forms in the training and the testing processes:

u(k) = a sin(πk/50) + b sin(πk/20)   (18)

In the training stage of the deep learning models, a = b = 1. In the training stage of the neural model (3), a = 1.1, b = 0.9. In the testing stage, a = 0.9, b = 1.1. The unknown nonlinear system (17) has the form of (2), with x̄(k) = [y(k), u(k)]^T, n = 2, m = 1.

We use 2000 data to train the deep learning model, 2000 data for the neural model (3), and 200 data to test it, so N = 2000. The structure parameters of the neural model, the layer number l and the node number n_i of each layer (i = 1, ···, l), are obtained by the random search method [11]. The search ranges of the parameters are 10 ≥ l ≥ 3 and 6 ≥ n_i ≥ 2; for the RBM update (7), 0.1 ≥ η_3 ≥ 0.01. The random search shows the best performance with four hidden layers (l = 5) and n_i = 6 (i = 1, 2, 3, 4). We use this as our neural model: four hidden layers, each hidden layer with six nodes.

We compare the following four models: the deep neural model with RBM (DN_RB), the deep neural model with the autoencoder method [18] (DN_AM), the multilayer perceptron (MLP), and the deep neural model with backpropagation (DN_BP). The MLP model has the same structure as in [17]: two hidden layers (20, 10). The DN_BP and DN_AM models have the same structure as the deep neural model (6, 6, 6, 6); however, the initial weights of DN_BP are random.

The DN_RB model uses four types of visible units to calculate the conditional probability transformation: binary, the interval [0, ∞), the interval [0, 1], and the interval [−8, 8]. For the binary case, x̄(k) is treated as a probability, not as a real value. For the interval [0, 1], x̄(k) is normalized as

x(k) = (x̄(k) − min{x̄(k)}) / (max{x̄(k)} − min{x̄(k)})   (19)

When N = 2000, the upper bound of x̄(k) is 8. For the [0, ∞) case, x(k) = x̄(k) + 8.

The testing results of the four models are shown in Fig. 3.

Fig. 3. First order system: testing outputs (vs. iterations) of the system and the DN_RB, DN_AM, MLP and DN_BP models.
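The inverse-CDF sampling rules (11) and (13) for the visible units can be sketched as follows. This is an illustrative implementation only; the value α = −2 is an assumed example, not a trained parameter.

```python
import numpy as np

def sample_exponential_unit(alpha, r):
    # Eq. (11): exponential visible unit on [0, inf), alpha < 0,
    # with r drawn from a uniform distribution on [0, 1)
    return np.log(1.0 - r) / alpha

def sample_unit_interval(alpha, r):
    # Eq. (13): visible unit restricted to [0, 1]
    return np.log(1.0 + r * (np.exp(alpha) - 1.0)) / alpha

rng = np.random.default_rng(1)
r = rng.random(10000)
x_exp = sample_exponential_unit(-2.0, r)   # sample mean approaches E[x] = -1/alpha = 0.5
x_unit = sample_unit_interval(-2.0, r)     # all samples stay inside [0, 1]
```

Both rules map a uniform sample r(k) through the inverse of the corresponding cumulative distribution, as described in Section III.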
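For reference, the benchmark data of (17)-(18) and the min-max normalization (19) can be generated as in this sketch. The amplitudes a = b = 1 follow the unsupervised training stage described in the text; the zero initial condition y(0) = 0 is an assumption.

```python
import numpy as np

def generate_data(N, a, b, y0=0.0):
    # System (17) driven by the periodic input (18)
    y = np.zeros(N + 1)
    y[0] = y0
    k = np.arange(N)
    u = a * np.sin(np.pi * k / 50) + b * np.sin(np.pi * k / 20)
    for i in range(N):
        y[i + 1] = y[i] / (1.0 + y[i] ** 2) + u[i] ** 3
    return u, y[1:]

def normalize(xbar):
    # Eq. (19): min-max normalization of the visible units into [0, 1]
    return (xbar - xbar.min()) / (xbar.max() - xbar.min())

u, y = generate_data(2000, a=1.0, b=1.0)   # unsupervised training data
x01 = normalize(y)                          # visible units for the [0, 1] RBM case
```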
Table 1 shows the average errors of the different models in the testing phase. This table also gives the comparison of the four types of visible units for the RBM. We can see that the best performance is obtained by normalizing the visible units into [0, 1]. It is somewhat strange that the binary normalization has better performance than the non-negative [0, ∞) case.

Table 1. The average error in the testing phase (×10^−3). FOS (first order system), SOS (second order system).

Model | DN_AM | MLP    | DN_BP  | Binary | [0, ∞) | [0, 1] | [−8, 8]
FOS   | 7.477 | 16.158 | 6.652  | 5.972  | 31.546 | 5.915  | 6.862
SOS   | 8.569 | 9.123  | 21.345 | 7.738  | 14.124 | 7.234  | 9.456

Fig. 4. Second order system results. Left) Training. Right) Testing.

Fig. 4 (right) and Table 1 show the testing results of the different models. We obtain similar conclusions as in the first example.

[14] ... International Conference on Machine Learning, pp. 160-167, ACM, 2008.
[15] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science, vol. 9, pp. 147-169, 1985.
[16] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, "The difficulty of training deep architectures and the effect of unsupervised pre-training," 12th International Conference on Artificial Intelligence and Statistics (AISTATS'09), pp. 153-160, 2009.
[17] K. S. Narendra and K. Parthasarathy, "Gradient methods for the optimization of dynamical systems containing neural networks," IEEE Transactions on Neural Networks, pp. 252-262, March 1991.
[18] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," 25th International Conference on Machine Learning, pp. 1096-1103, ACM, 2008.