
2015 IEEE 14th International Conference on Machine Learning and Applications

Restricted Boltzmann machine for nonlinear system modeling


Erick De la Rosa, Wen Yu

Abstract— In this paper, we use a deep learning method, the restricted Boltzmann machine, for nonlinear system identification. The neural model has a deep architecture and is generated by a random search method. The initial weights of this deep neural model are obtained from the restricted Boltzmann machines. To identify nonlinear systems, we propose special unsupervised learning methods that use the input data. Normal supervised learning is then used to train the weights with the output data. The modified algorithm is validated by modeling two benchmark systems.

I. INTRODUCTION

The multi-layer perceptron (MLP) can approximate any nonlinear function to any prescribed accuracy if sufficient hidden neurons are provided [1]. Structure identification is often tackled by trial-and-error approaches such as the fuzzy grid [2] and data mining methods [3]. In this paper, we increase the number of hidden layers and decrease the number of hidden units, keeping the model complexity while improving the modeling capacity. Since the identification error space is unknown, the neural model can easily settle into a local minimum if the initial weights are not suitable [4]. There are several techniques to overcome local minima, such as noise-shaping modification [5] and the combination of nonlinear clustering [6]. However, these methods do not solve the key problem: wrong initial weights. In [7], the initial weights are calculated by sensitivity ratio analysis, and in [8] they are obtained by finding the support vectors of the input data. In this paper, we use deep learning techniques to find the best initial weights.

A deep neural network (DNN) has the same structure as an MLP. There are no general methods to find the best structure of a DNN. [9], [10], and [11] propose some effective algorithms to find a suitable structure for DNNs: 1) grid search, which tests all possible combinations of the structure parameters; 2) random search, which uses only a few combinations chosen by sampling rules and obtains performance similar to grid search [11].

Restricted Boltzmann machines (RBM) are unsupervised learning methods. The results of [12] show that unsupervised pre-training can drive the DNN away from local minima for pattern recognition. Can similar results be obtained for system identification? To the best of our knowledge, there are no publications that apply deep learning to nonlinear system identification. RBMs cannot be applied to system identification directly, because the input values are not binary as in classification problems [13]. In this paper, we first construct a DNN, then we design unsupervised learning methods with RBMs. We then use the trained weights as the initial weights of the DNN. Finally, a gradient-based supervised learning method is applied. Two benchmark examples are used to show the effectiveness of the DNN.

Erick De la Rosa and Wen Yu are with the Departamento de Control Automatico, CINVESTAV-IPN (National Polytechnic Institute), Mexico City, Mexico.

II. NONLINEAR SYSTEM IDENTIFICATION WITH DEEP NEURAL NETWORKS

Consider the following unknown discrete-time nonlinear system

$$\bar{x}(k+1) = f\left[\bar{x}(k), u(k)\right], \qquad y(k) = g\left[\bar{x}(k)\right] \qquad (1)$$

where $u(k) \in \Re^{m}$ is the input vector, $\bar{x}(k) \in \Re^{n}$ is an internal state vector, and $y(k) \in \Re^{l}$ is the output vector. $f$ and $g$ are nonlinear smooth functions, $f, g \in C^{\infty}$. Denote $Y(k) = \left[y^{T}(k), y^{T}(k+1), \cdots, y^{T}(k+n-1)\right]^{T}$ and $U(k) = \left[u^{T}(k), u^{T}(k+1), \cdots, u^{T}(k+n-2)\right]^{T}$. If $\partial Y/\partial \bar{x}$ is non-singular at $\bar{x} = 0$, $U = 0$, this leads to the model (2)

$$y(k) = \Phi\left[z(k)\right] \qquad (2)$$

$$z(k) = \left[y^{T}(k-1), y^{T}(k-2), \cdots, u^{T}(k), u^{T}(k-1), \cdots\right]^{T}$$

$\Phi(\cdot)$ is an unknown nonlinear difference equation representing the plant dynamics, $u(k)$ and $y(k)$ are the measurable scalar input and output, and $d$ is the time delay. The nonlinear system (2) is a NARMA model. We can also regard the input of the nonlinear system as $x(k) = [x_{1}, \cdots, x_{n}]^{T} \in \Re^{n}$ and the output as $y(k) \in \Re^{m}$. We use the MLP to identify the unknown system (2)

$$\hat{y}(k) = W_{p}\,\phi_{p-1}\left\{\cdots \phi_{2}\left(W_{2}\,\phi_{1}\left[W_{1}x(k) + b_{1}\right] + b_{2}\right)\cdots\right\} + b_{p} \qquad (3)$$

where $\hat{y}(k) \in \Re^{m}$ is the output of the neural model, $W_{1} \in \Re^{l_{1}\times n}$, $b_{1} \in \Re^{l_{1}}$, $W_{2} \in \Re^{l_{2}\times l_{1}}$, $b_{2} \in \Re^{l_{2}}$, $W_{p} \in \Re^{m\times l_{p-1}}$, $b_{p} \in \Re^{m}$, $l_{j}$ ($j = 1, \cdots, p$) are the node numbers in each layer, $\phi_{j}$ ($j = 1, \cdots, p$) are activation vector functions, and $p$ is the layer number. We use a linear function for the output layer $\phi_{p}$. The other layers use sigmoid functions of the form $\phi_{ij}(s) = \alpha_{ij}/\left(1 + e^{-\beta_{ij}s}\right) - \gamma_{ij}$, where $j = 1, \cdots, p-1$, $i = 1, \cdots, l_{j}$; $\alpha$, $\beta$, and $\gamma$ are prior-defined positive constants, and $s$ is the input to the sigmoid function.

From the Stone-Weierstrass theorem, we know that if the node number of one hidden layer is large enough, the neural model can approximate the nonlinear function $\Phi$ to any degree of accuracy for all $z(k)$. Instead of increasing the node number $l_{j}$, we increase the layer number $p$ in this paper. The deep neural structure of this paper has $p \geq 2$. The goal of the neural identification is to find a suitable structure (layer number $p$, node number $l_{j}$ in each layer) and weights $(W_{1}, \cdots, W_{p})$ such that the identification error $e(k) = \hat{y}(k) - y(k)$ is minimized. To find a suitable structure of the neural model, we use the random search method [14]. The next job is to find suitable weights $(W_{1}, \cdots, W_{p})$. The gradient descent method can always arrive at a local minimum, which depends completely on the initial weights $W_{1}, \cdots, W_{p}$. In the following sections, we will show how to use RBMs to find the initial weights $W_{1}(0), \cdots, W_{p}(0)$, and how to identify the nonlinear systems as suggested in Fig. 1.
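As a concrete reading of the deep neural model (3), the following is a minimal sketch of the forward pass with sigmoid hidden layers, a linear output layer, and the identification error $e(k)$. The layer sizes, the sigmoid constants, and the function names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sigmoid_layer(s, alpha=1.0, beta=1.0, gamma=0.0):
    # phi(s) = alpha / (1 + exp(-beta*s)) - gamma, applied element-wise
    return alpha / (1.0 + np.exp(-beta * s)) - gamma

def dnn_forward(x, weights, biases):
    """Deep neural model (3): p-1 sigmoid hidden layers, linear output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):      # phi_1, ..., phi_{p-1}
        a = sigmoid_layer(W @ a + b)
    return weights[-1] @ a + biases[-1]              # linear output layer

# illustrative structure: 4 regressor inputs, four hidden layers of six nodes, scalar output
sizes = [4, 6, 6, 6, 6, 1]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.1, (sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]

z_k = rng.uniform(-1.0, 1.0, sizes[0])   # regressor z(k) built from past outputs/inputs
y_k = 0.3                                # measured output y(k), placeholder value
e_k = dnn_forward(z_k, weights, biases) - y_k   # identification error e(k)
```

In Sections III and IV, the randomly drawn weights above are replaced by the RBM pre-trained initial weights $W_{1}(0), \cdots, W_{p}(0)$.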

Fig. 1. The identification structure using the deep learning techniques. (The nonlinear system $\Phi$ maps $x(k)$ to $y(k)$; the deep neural network produces $\hat{y}(k)$ and the identification error $e(k)$; the deep learning models, trained by unsupervised learning, provide the initial weights $w_{1}(0), \cdots, w_{p}(0)$ for the supervised learning.)

Fig. 2. The model of the restricted Boltzmann machine. (Model 1: visible units $x_{1}, \cdots, x_{n}$ and hidden units $h_{1}, \cdots, h_{l_{1}}$, connected by $W_{1}, b_{1}$ in the recognition direction and $W_{1}^{T}, c_{1}$ in the reconstruction direction.)
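Read together, Fig. 1 and Fig. 2 describe a two-stage procedure: greedy, layer-by-layer RBM training on the input data to obtain the initial weights, followed by supervised learning on the output data. The sketch below only illustrates this flow; the helper names train_rbm and supervised_update stand for the algorithms of Section III and are placeholders, not the authors' implementation.

```python
def identify_with_pretraining(X, Y, hidden_sizes, train_rbm, supervised_update):
    """Schematic of Fig. 1: unsupervised pre-training, then supervised learning.

    X: sequence of input/regressor vectors x(k); Y: measured outputs y(k).
    train_rbm and supervised_update are assumed to implement Section III.
    """
    # Stage 1 (unsupervised): stack RBMs; each hidden representation feeds the next RBM
    weights, biases, data = [], [], X
    for n_hidden in hidden_sizes:
        W, b, data = train_rbm(data, n_hidden)
        weights.append(W)
        biases.append(b)

    # Stage 2 (supervised): use the RBM weights as W_j(0) and refine them with the output data
    for x_k, y_k in zip(X, Y):
        weights, biases = supervised_update(weights, biases, x_k, y_k)
    return weights, biases
```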

III. RESTRICTED BOLTZMANN MACHINES FOR SYSTEM IDENTIFICATION

The RBM is a deep learning method which trains the weights under a probability distribution by using a set of input data. We describe next how to use it for system identification.

A. Restricted Boltzmann machine training

An RBM is an energy-based model, which is defined by a probability distribution. This probability distribution depends on the current configuration (or energy) of the model. The goal of the training is to maximize the likelihood of the model,

$$P(x) = \sum_{h} P(x, h) = \sum_{h} \frac{e^{-E(x,h)}}{Z} \qquad (4)$$

where $x$ is the input, $h$ is the hidden unit, and $Z$ is a partition function defined as $Z = \sum_{x,h} e^{-E(x,h)}$. $P(x)$ is the probability distribution of $x$, and $E(x,h)$ is the energy function defined as

$$E(x, h) = -c^{T}x - b^{T}h - h^{T}Wx \qquad (5)$$

where $W$ is the weight matrix of the hidden layer, and $c$ and $b$ are the visible and hidden biases, respectively. Since the observation of all states of $P(x)$ is not tractable, this may require the introduction of some unobserved variables $h$ to enhance the reconstruction capabilities. The nonlinear activation functions become conditional probability transformations. In some simple cases, such as classification and dimensionality reduction, the $x$ are binary values, i.e., $x_{j} \in \{0, 1\}$. If $W_{i}$ ($i = 1, \cdots, l$) denote the rows of $W$ and $W_{\cdot j}$ ($j = 1, \cdots, n$) its columns, the probabilistic version of the deep learning model becomes

$$P(h \mid x) = \prod_{i=1,\cdots,l} P(h_{i} \mid x) = \prod_{i=1,\cdots,l} \phi\left[W_{i}x(k) + b_{i}\right]$$
$$P(x \mid h) = \prod_{j=1,\cdots,n} P(x_{j} \mid h) = \prod_{j=1,\cdots,n} \phi\left[W_{\cdot j}^{T}h(k) + c_{j}\right] \qquad (6)$$

For each sub-model, the transformation (6) is repeated several times. This is shown in Fig. 2, which depicts the first model of the RBM. The hidden representation of Model 1 is $h_{1}(k) = \phi\left[W_{1}x(k) + b_{1}\right]$, which is the input of RBM Model 2. The weights of the RBM are updated by

$$W(t+1) = W(t) - \eta_{3}\,\frac{\partial \log P(x)}{\partial W(t)} \qquad (7)$$

In order to calculate the gradient, the concept of free energy $z(x)$ is introduced by (8).

$$z(x) = -c^{T}x - \sum_{i=1}^{l} \log \sum_{h_{i}} e^{h_{i}\left(b_{i} + W_{i}x\right)} \qquad (8)$$

This allows us to calculate the gradient of $P(x)$ as

$$\frac{\partial \log P(x)}{\partial W(t)} = \sum_{\tilde{x}} P(\tilde{x})\,\frac{\partial z(\tilde{x})}{\partial W(t)} - \frac{\partial z(x)}{\partial W(t)}$$

It is difficult to determine an analytical expression for the term $\sum_{\tilde{x}} P(\tilde{x})\,\partial z(\tilde{x})/\partial W(t)$, because it involves the calculation of $E\left[\partial z(\tilde{x})/\partial W(t)\right]$, which is the mathematical expectation over all possible combinations of the reconstructed input $\tilde{x}(k)$ under the distribution $P(\tilde{x})$. In this paper, we estimate it with a set $\mathcal{N}$ which includes a finite number of samples. The sample number is $N$, turning the sum into $\sum_{\tilde{x}} P(\tilde{x})\,\partial z(\tilde{x})/\partial W(t) \approx \frac{1}{N}\sum_{\tilde{x}\in\mathcal{N}} \partial z(\tilde{x})/\partial W(t)$. Here the samples $\tilde{x}$ are obtained with the Monte Carlo algorithm with contrastive divergence. Markov chain Monte Carlo works especially well for RBM models [15], although the approximation capability improves when the number of hidden units increases [16]. In each running period, the samples $\tilde{x}$ and the input $x(k)$ are used to update the weights $W$, and a new example $x(k+1)$ is presented to the model.

B. Conditional probability for non-binary values

For nonlinear system identification, the visible units cannot be binary values. We classify the input $x(k)$ into three types: 1) the interval $[0, \infty)$ for unbounded positive input; 2) $[0, 1]$ for normalized input; 3) $[-\varepsilon, \varepsilon]$ for bounded input.

1) The visible units are in the interval $[0, \infty)$

The conditional probability for the energy function of (4) is

$$P(x_{j} \mid h) = \frac{e^{x_{j}\left(c_{j} + W_{\cdot j}^{T}h\right)}}{\int_{x_{j}} e^{x_{j}\left(c_{j} + W_{\cdot j}^{T}h\right)}\,dx_{j}} \qquad (9)$$

Obviously, if $x_{j}(k) \in (-\infty, \infty)$, the integral term does not converge. Denote $a_{j} = c_{j} + W_{\cdot j}^{T}h$, the term applied to the visible units, see Fig. 2. If the $x_{j}(k)$ are non-negative, $x_{j}(k) \in [0, \infty)$, (9) becomes

$$P(x_{j} \mid h) = \frac{e^{a_{j}x_{j}}}{\int_{0}^{\infty} e^{a_{j}x_{j}}\,dx_{j}} = \frac{e^{a_{j}x_{j}}}{\frac{1}{a_{j}}\,e^{a_{j}x_{j}}\Big|_{0}^{\infty}} \qquad (10)$$

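To make the update (7) with the contrastive divergence approximation concrete, here is a minimal CD-1 sketch for the binary case of (6); the non-binary visible units of Section III-B would replace the visible sampling step. The learning rate name eta3 follows the text, but the single-sample update and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np
rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cd1_update(x, W, b, c, eta3=0.05):
    """One contrastive-divergence step approximating (7) for binary units.

    x: visible vector x(k); W: weight matrix; b: hidden bias; c: visible bias.
    """
    # positive phase: P(h | x) from (6)
    p_h = sigmoid(W @ x + b)
    h = (rng.random(p_h.shape) < p_h).astype(float)

    # negative phase: reconstruct x_tilde ~ P(x | h), then P(h | x_tilde)
    p_x_tilde = sigmoid(W.T @ h + c)
    x_tilde = (rng.random(p_x_tilde.shape) < p_x_tilde).astype(float)
    p_h_tilde = sigmoid(W @ x_tilde + b)

    # data term minus the one-sample estimate of the expectation term
    W += eta3 * (np.outer(p_h, x) - np.outer(p_h_tilde, x_tilde))
    b += eta3 * (p_h - p_h_tilde)
    c += eta3 * (x - x_tilde)
    return W, b, c
```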
(10) has a finite value if $a_{j}(k) < 0$ for all $j$. The conditional probability distribution is $P(x_{j}\mid h) = -a_{j}\,e^{a_{j}x_{j}}$, $a_{j} < 0$. The visible units which have this probability distribution are called exponential units. In order to perform the sampling process, we need to calculate the cumulative probability distribution $F(x_{j}\mid h) = 1 - e^{a_{j}x_{j}}$, so $F(x_{j}\mid h)$ always increases. The sampling process is possible through the inverse function of the cumulative probability, $F^{-1}$. If $s_{j}(k)$ is a value obtained by sampling from a uniform distribution, then we can associate it with the cumulative density value $F$ in (11); the expected value according to the distribution $P(x_{j}\mid h)$ is $E[x_{j}] = -1/a_{j}$.

$$x_{j}(k) = \frac{\ln\left(1 - s_{j}(k)\right)}{a_{j}} \qquad (11)$$

2) The visible and hidden units are in the interval $[0, 1]$

The positive range $[0, \infty)$ is very large, and sometimes the calculation accuracy is affected. If the input to the visible units is restricted to a special zone, normalization methods can be applied to force the input into the interval $[0, 1]$. In this case, the probability distribution computation (9) is also bounded, and the cumulative probability is given by (12).

$$F(x_{j}\mid h) = \int_{0}^{x_{j}} \frac{a_{j}\,e^{a_{j}\tau}}{e^{a_{j}} - 1}\,d\tau = \frac{e^{a_{j}x_{j}} - 1}{e^{a_{j}} - 1} \qquad (12)$$

With a sample unit $s_{j}(k)$, the new value $x_{j}(k)$ with respect to $F$ is given by (13), and the expected value of the distribution is $E[x_{j}] = 1/\left(1 - e^{-a_{j}}\right) - 1/a_{j}$.

$$x_{j}(k) = \frac{\log\left[1 + s_{j}(k)\left(e^{a_{j}} - 1\right)\right]}{a_{j}} \qquad (13)$$

3) The visible and hidden units are in the interval $[-\varepsilon, \varepsilon]$

If the input set $x(k)$ can be positive and negative, we define an interval $[-\varepsilon, \varepsilon]$ for each visible unit. In this case the conditional probability is a truncated exponential, and (9) becomes

$$P(x_{j}\mid h) = \frac{e^{a_{j}x_{j}}}{\frac{1}{a_{j}}\,e^{a_{j}x_{j}}\Big|_{-\varepsilon}^{\varepsilon}} = \frac{a_{j}\,e^{a_{j}x_{j}}}{e^{a_{j}\varepsilon} - e^{-a_{j}\varepsilon}} \qquad (14)$$

The sampling process, using the inverse function of the corresponding $F$ with $s_{j}(k)$ drawn from the uniform distribution, and the expected value for this distribution, are

$$x_{j}(k) = \frac{\log\left[e^{-a_{j}\varepsilon} + s_{j}(k)\left(e^{a_{j}\varepsilon} - e^{-a_{j}\varepsilon}\right)\right]}{a_{j}}, \qquad E[x_{j}] = \varepsilon\,\frac{e^{a_{j}\varepsilon} + e^{-a_{j}\varepsilon}}{e^{a_{j}\varepsilon} - e^{-a_{j}\varepsilon}} - \frac{1}{a_{j}} \qquad (15)$$

After the RBM training with the input data, the initial weights of the neural model are obtained. For the supervised learning, the weights $W_{j}$ are updated by

$$W_{j}(t+1) = W_{j}(t) - \eta_{2}\,\frac{\partial e^{2}(k)}{\partial W_{j}(t)}, \qquad j = 1, \cdots, p \qquad (16)$$

IV. SIMULATIONS

In this section, we use two benchmark examples proposed in [17] to show the effectiveness of our deep learning methods for nonlinear system identification.

A. First order nonlinear system

The first order nonlinear system presented by [17] is

$$y(k+1) = \frac{y(k)}{1 + y^{2}(k)} + u^{3}(k) \qquad (17)$$

where $u(k)$ is a periodic input, which has different forms in the training and the testing processes

$$u(k) = A\sin\left(\frac{\pi k}{50}\right) + B\sin\left(\frac{\pi k}{20}\right) \qquad (18)$$

In the training stage, $A = B = 1$. In the training stage of the neural model (3), $A = 1.1$, $B = 0.9$. In the testing stage, $A = 0.9$, $B = 1.1$. The unknown nonlinear system (17) has the form of (2), with $\bar{x}(k) = [y(k), u(k)]^{T}$, $n = 2$, $m = 1$.

We use 2000 data to train the deep learning model. We use 2000 data for the neural model (3) and 200 data to test it, so $N = 2000$. The structure parameters of the neural model, the layer number $p$ and the node number of each layer $l_{j}$ ($j = 1, \cdots, p$), are obtained by the random search method [11]. The search ranges of the parameters are $10 \geq l_{j} \geq 3$, $6 \geq p \geq 2$, and, for the RBM (7), $0.1 \geq \eta_{3} \geq 0.01$. The random search shows better performance with 4 hidden layers ($p = 5$) and $l_{j} = 6$ ($j = 1, 2, 3, 4$). We use this as our neural model: four hidden layers, each hidden layer with six nodes.

We compare the following four models: the deep neural model with RBM (DN_RB), the deep neural model with the autoencoder method [18] (DN_AM), the multilayer perceptron (MLP), and the deep neural model with backpropagation (DN_BP). The MLP model has the same structure as in [17]; it has two hidden layers (20, 10). The DN_BP and DN_AM models have the same structure as the deep neural model (6, 6, 6, 6). However, the initial weights of DN_BP are random.

The DN_RB model uses four types of visible units to calculate the conditional probability transformation: binary, the $[0, \infty)$ interval, the $[0, 1]$ interval, and the $[-8, 8]$ interval. For the binary case, $\bar{x}(k)$ is treated as a probability, not as a real value. For the interval $[0, 1]$, $\bar{x}(k)$ is normalized as

$$x(k) = \frac{\bar{x}(k) - \min\{\bar{x}(k)\}}{\max\{\bar{x}(k)\} - \min\{\bar{x}(k)\}} \qquad (19)$$

When $N = 2000$, the upper bound of $\bar{x}(k)$ is 8. For the $[0, \infty)$ case, $x(k) = \bar{x}(k) + 8$.

The testing results of the four models are shown in Fig. 3.

Fig. 3. First order system results. Left) Training. Right) Testing. (Outputs of the system, DN_RB, DN_AM, MLP, and DN_BP.)

In the testing phase, we define the average error as $E = \frac{1}{M}\sum_{k=1}^{M}\left\|\hat{y}(k) - y(k)\right\|^{2}$. In this example, $M = 200$.

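For reference, the first-order benchmark (17) driven by the excitation (18), together with the testing-phase average error, can be simulated as in the sketch below. The amplitudes and data lengths follow the text, the equation forms follow the reconstruction above, and the function names are illustrative assumptions.

```python
import numpy as np

def simulate_first_order(N, A, B, y0=0.0):
    """First order benchmark (17) driven by the periodic input (18)."""
    k = np.arange(N)
    u = A * np.sin(np.pi * k / 50.0) + B * np.sin(np.pi * k / 20.0)   # eq. (18)
    y = np.zeros(N + 1)
    y[0] = y0
    for i in range(N):
        y[i + 1] = y[i] / (1.0 + y[i] ** 2) + u[i] ** 3               # eq. (17)
    return u, y[1:]

# training data (A = 1.1, B = 0.9) and testing data (A = 0.9, B = 1.1), per the text
u_tr, y_tr = simulate_first_order(2000, A=1.1, B=0.9)
u_te, y_te = simulate_first_order(200, A=0.9, B=1.1)

def average_error(y_hat, y):
    # testing-phase average error: (1/M) * sum_k ||y_hat(k) - y(k)||^2
    return np.mean((y_hat - y) ** 2)
```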
Table 1. The average error in the testing phase ($\times 10^{-3}$). FOS (first order system), SOS (second order system). The columns Binary, $[0, \infty)$, $[0, 1]$, and $[-8, 8]$ correspond to the DN_RB model with the respective type of visible units.

Model | DN_AM | MLP    | DN_BP  | Binary | [0, inf) | [0, 1] | [-8, 8]
FOS   | 7.477 | 16.158 | 6.652  | 5.972  | 31.546   | 5.915  | 6.862
SOS   | 8.569 | 9.123  | 21.345 | 7.738  | 14.124   | 7.234  | 9.456

Table 1 shows the average errors of the different models in the testing phase. This table also gives the comparison of the four types of visible units for the RBM. We can see that the best performance is obtained by normalizing the visible units to $[0, 1]$. It is somewhat surprising that the binary normalization has better performance than the non-normalization methods, $[0, \infty)$ and $[-8, 8]$. We believe that the normalization method is more suitable for the conditional probability transformation of the RBM, and that the weights are not influenced so much by the exact values of the visible units.

For this example, we have the following conclusions:
1) The deep neural model of this paper has better modeling accuracy than the normal neural model as in [17]. The deep neural model has fewer hidden nodes (24) than the MLP (30).
2) The RBM (DN_RB) has better capacity to avoid local minima than the autoencoder method (DN_AM).
3) Comparing DN_RB and DN_AM with DN_BP, the deep learning techniques are effective in nonlinear modeling.
4) After the RBM application, the weights converge faster.

B. Second order nonlinear system

The second order nonlinear system presented by [17] is

$$y(k+1) = \frac{y(k) + y(k-1)}{1 + y^{2}(k) + y^{2}(k-1)} + u^{3}(k) - y(k) \qquad (20)$$

where $u(k)$ is a periodic input as described by (18). In the training stage of the DNN (3), $A = 0.8$, $B = 1.1$; in the testing stage, $A = 0.9$, $B = 0.8$. For the unknown nonlinear system (20), $\bar{x}(k) = [y(k), y(k-1), u(k)]^{T}$, $n = 3$, $m = 1$. We use 2000 data to train the model and 400 data to test it. The search ranges of the parameters are the same as for the first order nonlinear system (17). A local minimum is found with four hidden layers ($p = 5$) and $l_{j} = 3$.

We also use the four models DN_RB, DN_AM, MLP, and DN_BP to identify the nonlinear system (20). The visible units of DN_RB have the same four types, and the MLP model has two hidden layers (20, 10). The training times of the deep learning models DN_RB and DN_AM are 11 seconds and 6 seconds with an Intel i5 CPU @ 2.4 GHz; the DN_RB model has more computational complexity. The training results of the deep neural model (3) are shown in Fig. 4 (left). The deep learning methods achieve good performance after 60 training samples without the presence of overfitting.

Fig. 4. Second order system results. Left) Training. Right) Testing. (Outputs of the system, DN_RB, DN_AM, MLP, and DN_BP.)

Fig. 4 (right) and Table 1 show the testing results of the different models. We obtain similar conclusions as in the first example.

V. CONCLUSIONS

In this paper, the neural model for nonlinear system identification has a deep structure. Unsupervised learning methods with the input data are used to obtain the initial weights of the neural model. Supervised learning techniques with the output data are applied to train the weights. We modify the deep learning algorithms of the restricted Boltzmann machines such that they are suitable for nonlinear system identification. Future work will develop on-line versions of these algorithms.

REFERENCES

[1] G. Cybenko, Approximation by superposition of sigmoidal activation function, Math. Control, Signals and Systems, Vol. 2, 303-314, 1989.
[2] J. S. Jang, ANFIS: Adaptive-network-based fuzzy inference system, IEEE Transactions on Systems, Man and Cybernetics, Vol. 23, 665-685, 1993.
[3] W. Yu, K. Li, X. Li, Automated nonlinear system modeling with multiple neural networks, International Journal of Systems Science, Vol. 42, No. 10, 1683-1695, 2011.
[4] S. Jagannathan and F. L. Lewis, Identification of nonlinear dynamical systems using multilayered neural networks, Automatica, Vol. 32, No. 12, 1707-1712, 1996.
[5] S. Chakrabartty, R. K. Shaga, K. Aono, Noise-shaping gradient descent-based online adaptation algorithms for digital calibration of analog circuits, IEEE Transactions on Neural Networks and Learning Systems, Vol. 24, No. 4, 554-565, 2013.
[6] Yang Liu, Yan Liu, K. Chan, K. A. Hua, Hybrid manifold embedding, IEEE Transactions on Neural Networks and Learning Systems, Vol. 25, No. 12, 2295-2302, 2014.
[7] Q. Song, Robust initialization of a Jordan network with recurrent constrained learning, IEEE Transactions on Neural Networks, Vol. 22, No. 12, 2460-2473, 2011.
[8] W. Yu, X. Li, Automated nonlinear system modeling with multiple fuzzy neural networks and kernel smoothing, International Journal of Neural Systems, Vol. 20, No. 5, 429-435, 2010.
[9] J. Bergstra, Y. Bengio, Algorithms for hyper-parameter optimization, Journal of Machine Learning Research, pp. 201-210, 2012.
[10] S. Lawrence, C. Lee Giles, and Ah Chung Tsoi, What size neural network gives optimal generalization?, Technical Report, University of Maryland, 1996.
[11] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, Journal of Machine Learning Research, pp. 281-305, 2011.
[12] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, Why does unsupervised pre-training help deep learning?, Journal of Machine Learning Research, Vol. 11, 625-660, 2010.
[13] G. E. Hinton, S. Osindero, and Y. Teh, A fast learning algorithm for deep belief nets, Neural Computation, Vol. 18, pp. 1527-1554, 2006.
[14] R. Collobert and J. Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, 25th International Conference on Machine Learning, pp. 160-167, ACM, 2008.
[15] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science, Vol. 9, pp. 147-169, 1985.
[16] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, The difficulty of training deep architectures and the effect of unsupervised pretraining, 12th International Conference on Artificial Intelligence and Statistics (AISTATS'09), pp. 153-160, 2009.
[17] K. S. Narendra and K. Parthasarathy, Gradient methods for optimization of dynamical systems containing neural networks, IEEE Transactions on Neural Networks, pp. 252-262, March 1991.
[18] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, 25th International Conference on Machine Learning, pp. 1096-1103, ACM, 2008.
