
© Journal of the Technical University - Sofia, Plovdiv branch, Bulgaria
"Fundamental Sciences and Applications", Vol. 24, 2018, ISSN 1310-8271

ANALYSIS OF THE MEAN FIELD APPROXIMATION FOR TRAINING THE DEEP BOLTZMANN MACHINE

TODOR TODOROV, GEORGI TSANEV

Abstract: The developers of deep machine learning are essentially inspired by the activity of the human brain. The main goal in this area is to obtain a finite software model of the human recognition process. Deep learning architectures rely on the mean field approximation. Many authors assume a priori that the mean field approximation problem has a solution for every random initial guess. Some of them apply only a few steps of the fixed-point iteration in order to establish whether the neurons in the hidden layers are active or not. The present paper describes the domain of applicability of the mean field approximation for training a multilayer Boltzmann machine. We give a rigorous proof that the mean field approximation is not convergent for all random initial guesses; the convergence depends strongly on the norms of the weight matrices. The new results are supported by computer-implemented examples.
Key words: Deep Boltzmann Machine, mean field approximation, gradient iterative methods.

1. Introduction

The contemporary voice control of machines is related to deep belief learning. The Deep Boltzmann Machine became practically usable after R. Salakhutdinov and G. E. Hinton [1] developed the mean field approximation algorithm for establishing which neurons in the hidden layers are active. The mean field approximation requires the visible neurons to be fixed to the training data while a fixed-point iteration is performed. R. Salakhutdinov and G. E. Hinton extended their variational approach in subsequent publications [2,3,4]. After the pioneering paper [1], many researchers [5,6,7,8] have applied the mean field approximation when investigating Deep Boltzmann Machines. Since such procedures are executed many times while training a neural network, the practical applicability of a Deep Boltzmann Machine strongly depends on the convergence rate of the fixed-point iteration. Most authors (see for instance [3] and [6]) have used the fixed-point iteration in order to obtain the probability that a neuron in a hidden layer is active. Unfortunately, this method has a very low rate of convergence and behaves badly when the weight matrices are non-positive definite or ill-conditioned. The two-point step size gradient method was introduced by J. Barzilai and J. M. Borwein [9] back in 1988. The method quickly became popular [10] because of its advantages: no line search, easy implementation, and only a slight dependence on the initial guess.

The contributions of the present paper are as follows. The paper is devoted to a numerical algorithm for solving the mean field approximation problem. The weak formulation of the original problem is transformed into an unconstrained minimization problem. Sufficient conditions for existence and uniqueness are rigorously proved, together with a strict proof that the weak solution can be used in the process of deep learning. The minimization problem is solved by the two-point step size gradient method. We emphasize that the choice of a steplength for the Barzilai-Borwein method is crucial for the rate of convergence. Some steplengths assure convergence, but at a very slow rate, while an inappropriate choice of the steplength causes divergence of the sequence of approximate solutions. The successful steplength for the two-point step size gradient method depends on the objective functional. The initial version of the Barzilai-Borwein method [9] was developed for quadratics in the two-dimensional case. Similar results were later obtained by M. Raydan [11] in the n-dimensional case. T. D. Todorov [12] found a new steplength for quartics, which gives much better results than the classical steplength proposed by J. Barzilai and J. Borwein [9].

The number of necessary iterations for satisfying the stop criterion strongly depends on the variable steplength. Two original steplengths for solving the mean field approximation problem are obtained in the present investigation. The new steplengths are compared with the classical one, and the fixed-point iteration is compared with the Barzilai-Borwein method. The time for training a Deep Boltzmann Machine strongly depends on the initial guesses for the weight matrices. Lower norms of the weight matrices assure a higher rate of convergence when solving the mean field approximation problem and hence decrease the time needed for training.

The rest of the paper is organized as follows. The problem of interest is defined in Section 2. The weak formulation and the associated unconstrained minimization problem are considered in Section 3. Sufficient conditions for existence and uniqueness of the weak solution are found in the same section. An iterative method for solving the associated unconstrained minimization problem is investigated in Section 4, where a detailed proof of the convergence of the Barzilai-Borwein method is presented.

2. Setting the problem

We begin with some basic definitions and notation. The vector space $\mathbb{R}^n$ with the standard basis $e_1, e_2, \ldots, e_n$ is provided with the Euclidean norm $\|\cdot\|$ and the corresponding scalar product. The vectors $e_i$, $i = 1, 2, \ldots, n$, are the columns of the corresponding identity matrix and $\hat{e} = \sum_{i=1}^{n} e_i$. The norm in $C^k(\Omega)$, $k \in \mathbb{N}$, is denoted by $|||\cdot|||_{k,\Omega}$, and the norm in $L^k(\Omega)$ by $\|\cdot\|_{k,\Omega}$. Let $F$ be a $k$-differentiable map from $\mathbb{R}^n$ to $\mathbb{R}^n$. Then the norm of the $k$-th Fréchet derivative $D^k F(x)$ is given by

$$\|D^k F(x)\| = \sup_{\|\xi_i\| \le 1,\ 1 \le i \le k} \|D^k F(x)(\xi_1, \xi_2, \ldots, \xi_k)\|.$$

A deep Boltzmann machine with $p$ hidden layers and no more than $s$ hidden units in each layer is the object of interest in the present paper. We assume that all hidden layers are connected, excluding horizontal connections within a fixed layer. In the first stage of our investigation we suppose that all hidden units in all layers have the same length $m$, Figure 1. The latter means that all matrices $W_{ij}$ are square. Usually the weight matrices are rectangular, Figure 2; this case is considered further on. We compile the following mean field approximation problem:

$$\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_r \end{pmatrix} = \mathrm{Sigm}\left( \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_r \end{pmatrix} + \begin{pmatrix} O & W_{12} & \cdots & W_{1r} \\ W_{21} & O & \cdots & W_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ W_{r1} & W_{r2} & \cdots & O \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_r \end{pmatrix} \right), \qquad (1)$$

where $r$ is the number of all hidden units in the neural network and $\mathrm{Sigm}(v)$ is the multivariate sigmoid. Some blocks $W_{ij}$ are zero matrices, since there are no connections within a layer and usually the number of hidden units is not the same in the different layers. Concatenating the vectors $v_i$, $i = 1, 2, \ldots, r$, we obtain an $n$-dimensional vector $v = (v_1, v_2, \ldots, v_n)$. Then problem (1) takes the following compact form:

$$\text{Find } u \in V \text{ such that } u = \mathrm{Sigm}(a + Wu), \quad a \in \mathbb{R}^n. \qquad (2)$$

Here $W$ is an $n \times n$ block matrix and

$$V = \{ v \in \mathbb{R}^n \mid 0 < v_i < 1,\ i = 1, 2, \ldots, n \}.$$

Fig. 1. Neural network where all the hidden units in all layers have the same length.
Fig. 2. Neural network with various lengths of the hidden units.
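For illustration, problem (2) and the plain fixed-point iteration mentioned in the introduction can be sketched in a few lines of NumPy. The random weight matrix, the vector $a$, the initial guess and the stopping tolerance below are illustrative assumptions and not data taken from the paper.

```python
import numpy as np

def sigm(x):
    """Componentwise logistic sigmoid (the multivariate sigmoid Sigm)."""
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_fixed_point(W, a, u0, tol=1e-13, max_iter=10000):
    """Plain fixed-point iteration u <- Sigm(a + W u) for problem (2).

    Returns the approximate solution and the iteration count; convergence
    is not guaranteed for an arbitrary weight matrix W."""
    u = u0.copy()
    for k in range(1, max_iter + 1):
        u_new = sigm(a + W @ u)
        if np.linalg.norm(u_new - u) < tol:
            return u_new, k
        u = u_new
    return u, max_iter

# Illustrative data: a random weight matrix with zero diagonal blocks,
# scaled so that its spectral norm is 2 (< 4, the bound used later on).
rng = np.random.default_rng(0)
n = 100
W = rng.standard_normal((n, n))
np.fill_diagonal(W, 0.0)                     # no connections within a unit
W = 2.0 * W / np.linalg.norm(W, 2)           # spectral norm equal to 2
a = rng.standard_normal(n)
u, iters = mean_field_fixed_point(W, a, u0=rng.uniform(0, 1, n))
print(iters, np.linalg.norm(u - sigm(a + W @ u)))
```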
3. The weak formulation and the associated unconstrained minimization problem

Let $\tau_h$ be a uniform triangulation of the interval $\Omega = [0,1]$ with $n$ linear finite elements. The associated finite element space $V_h$ is spanned by the nodal basis functions $\varphi_i$, $i = 1, 2, \ldots, n$. The space $V_h$ is provided with the inner scalar product

$$(u, v) = \int_\Omega uv \, dx.$$

A linear operator $T: \mathbb{R}^n \to V_h$ defined by

$$Tv = v \cdot \varphi, \qquad \varphi = (\varphi_1, \varphi_2, \ldots, \varphi_n),$$

generates a subset $\hat{V}_h = \{ Tv \mid v \in V \}$ of $V_h$. The weak problem is compiled as follows:

$$\text{Find } u \in \hat{V}_h \text{ such that } (u, v) - (s(u), v) = 0, \quad \forall v \in \hat{V}_h, \qquad (3)$$

where $s(v) = T\,\mathrm{Sigm}(a + W\bar{v})$ and $\bar{v} = T^{-1} v$. The weak formulation is a very important part of our investigation. We look for a finite dimensional Hilbert space of continuous functions and a bijection between $V$ and $\hat{V}_h$. Note that the Gramian matrix of the basis functions is positive definite. As a particular example in our considerations we choose the finite element space $\hat{V}_h$; in this case the Gramian matrix is actually the mass matrix. The interval $[0,1]$ is not essential for the proof of the main result. Any other finite closed interval could be used, but we work with the canonical one-dimensional simplex for the sake of simplicity. The choice of piecewise linear trial functions is made for the same reason. The operator $T$ is the wanted bijection.
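For concreteness, a minimal sketch of the mass (Gramian) matrix of piecewise linear hat functions on a uniform mesh of $[0,1]$ is given below; the number of nodes and the indexing of the basis functions are assumptions made only for the illustration.

```python
import numpy as np

def p1_mass_matrix(n):
    """Mass matrix M_ij = integral over [0,1] of phi_i * phi_j for the n nodal
    hat functions of a uniform mesh (a standard P1 assembly sketch)."""
    h = 1.0 / (n - 1)
    M = np.zeros((n, n))
    for i in range(n - 1):                    # loop over elements [x_i, x_{i+1}]
        M[i:i + 2, i:i + 2] += (h / 6.0) * np.array([[2.0, 1.0],
                                                     [1.0, 2.0]])
    return M

M = p1_mass_matrix(8)
print(np.all(np.linalg.eigvalsh(M) > 0))      # the mass matrix is SPD
```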
Lemma 1. All solutions of the weak problem are also solutions of equation (2).
Proof. The zero vector does not belong to the set $V$ and the mass matrix $M$ is symmetric and positive definite. Therefore the assertion of the Lemma results from the following equivalences:

$$(u, v) - (s(u), v) = 0 \iff \bar{u}^T M \bar{v} - \mathrm{Sigm}(a + W\bar{u})^T M \bar{v} = 0 \iff \left( \bar{u} - \mathrm{Sigm}(a + W\bar{u}) \right)^T M \bar{v} = 0,$$

which holds for every $\bar{v}$, hence $\bar{u} = \mathrm{Sigm}(a + W\bar{u})$.

Definition 1. The square matrix $W$ is said to be $V$-positive definite if $v^T W v > 0$ for all $v \in V$.

The main goal of the present investigation is to recast problem (2) as a minimization problem. Therefore we define the objective functional

$$J(v) = \frac{1}{2}(v, v) - \left( \int_0^v s(t)\, dt,\ 1 \right)$$

and associate the weak form (3) with the following minimization problem:

$$\arg\min_{v \in \hat{V}_h} J(v). \qquad (4)$$

Theorem 1 establishes existence and uniqueness of the solution of the minimization problem (4). To prove this theorem we need the following notation: $\mathrm{sigm}(x)$ is the well-known logistic function; $\mathrm{diag}(\alpha_i,\ i = 1, 2, \ldots, n)$ is the $n \times n$ diagonal matrix with $\alpha_i$ on the main diagonal; $I$ is the $n \times n$ identity matrix;

$$\Lambda(v) = \mathrm{diag}\big(\mathrm{Sigm}(a + Wv)\big)\, \mathrm{diag}\big(\hat{e} - \mathrm{Sigm}(a + Wv)\big).$$

Theorem 1. If the weight matrix $W$ is $V$-positive definite with spectral norm $\|W\| < 4$, then problem (4) has a unique solution.
Proof. The functional $J(v)$ is continuous, so we have to prove that $J(v)$ is bounded below, coercive and convex. We estimate the second term in the objective functional using the Hölder inequality,

$$\left( \int_0^v s(t)\, dt,\ 1 \right) \le 2\, \|s\|_{0,\Omega}\, \|v\|_{2,\Omega},$$

and obtain

$$J(v) \ge \frac{1}{2}\|v\|_{2,\Omega}^2 - 2\|s\|_{0,\Omega}\|v\|_{2,\Omega} = \left( \frac{1}{2}\|v\|_{2,\Omega} - 2\|s\|_{0,\Omega} \right)\|v\|_{2,\Omega}, \quad \forall v \in \hat{V}_h.$$

The latter inequality shows that $J(v)$ is bounded below and coercive. It remains to prove that $J(v)$ is convex. For this purpose we calculate the Fréchet derivatives of $J(v)$:

$$DJ(u)v = (u, v) - (s(u), v),$$

$$D^2 J(u)(v, v) = (v, v) - (Ds(u)v,\ v) = (v, v) - (\Lambda(\bar{u}) W \bar{v} \cdot \varphi,\ v) = \bar{v}^T M \bar{v} - (\Lambda(\bar{u}) W \bar{v})^T M \bar{v} = \bar{v}^T (I - \Lambda(\bar{u}) W)^T M \bar{v}.$$

Since $\|\Lambda(\bar{u})\| \le \frac{1}{4}$ and $\|W\| < 4$, the second derivative satisfies $D^2 J(u)(v, v) > 0$ and the objective functional is strongly convex.

Corollary 1. If the conditions of Theorem 1 hold, then problem (3) has a unique solution.
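The convexity argument of Theorem 1 can also be checked numerically. The sketch below evaluates the matrix representation of $D^2 J$ for a random weight matrix with spectral norm below 4 and verifies that the symmetric part of $(I - \Lambda(u)W)^T M$ is positive definite; the identity matrix used in place of the mass matrix and the random data are assumptions made only for this check.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def second_derivative_matrix(W, a, u, M):
    """Matrix form of D^2 J(u)(v, v) = v^T (I - Lambda(u) W)^T M v, where
    Lambda(u) = diag(Sigm(a + W u)) diag(1 - Sigm(a + W u))."""
    s = sigm(a + W @ u)
    Lam = np.diag(s * (1.0 - s))              # diagonal entries lie in (0, 1/4]
    return (np.eye(len(u)) - Lam @ W).T @ M

rng = np.random.default_rng(2)
n = 50
W = rng.standard_normal((n, n))
np.fill_diagonal(W, 0.0)
W = 3.5 * W / np.linalg.norm(W, 2)            # spectral norm 3.5 < 4
a, u = rng.standard_normal(n), rng.uniform(0, 1, n)
H = second_derivative_matrix(W, a, u, np.eye(n))
print(np.all(np.linalg.eigvalsh(0.5 * (H + H.T)) > 0))   # expected: True
```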

Proof. Problem (3) is equivalent to $DJ(u)v = 0$, which means that the solution $u$ of (3) is the unique stationary point of the functional $J(v)$.

4. An iterative method for solving the unconstrained minimization problem

The object of investigation in this section is a two-point gradient method provided with various steplengths. We define the Barzilai-Borwein method

$$(u_{k+1}, v) = (u_k, v) - \frac{1}{\lambda_k}\, DJ(u_k)v, \quad k \ge 1, \qquad (5)$$

with a steplength $\lambda_k$ for the unconstrained minimization problem (4). The choice of the steplength is very important with respect to the rate of convergence and the overall computational work. In the present investigation we propose the following original steplengths:

$$\mu_k = \frac{\|u_k\|_{2,\Omega}^2 - \|u_{k-1}\|_{2,\Omega}^2}{\|s(u_k)\|_{2,\Omega}^2 - \|s(u_{k-1})\|_{2,\Omega}^2}, \qquad \eta_k = \frac{\sum_{i=k-1}^{k} \big( (I - Ds(u_i))\, DJ(u_i),\ DJ(u_i) \big)}{\sum_{i=k-1}^{k} \|DJ(u_i)\|_{2,\Omega}^2}.$$

Additionally, we define the classical steplength

$$\alpha_k = \frac{\Delta DJ(u_k)\, \Delta u_k}{\|\Delta u_k\|_{2,\Omega}^2}, \qquad \Delta u_k = u_k - u_{k-1}, \qquad \Delta DJ(u_k) = DJ(u_k) - DJ(u_{k-1}).$$

The identity matrix and the identity operator are denoted by the same letter $I$. The steplength $\eta_k$ is a preconditioned version of the steplength

$$\hat{\eta}_k = \frac{\sum_{i=k-1}^{k} \big( D^2 J(u_i)\, DJ(u_i),\ DJ(u_i) \big)}{\sum_{i=k-1}^{k} \|DJ(u_i)\|_{2,\Omega}^2}.$$

The steplength $\hat{\eta}_k$ gives better results than $\eta_k$, but $\hat{\eta}_k$ essentially increases the computational complexity of the problem. That is why we further analyze the method with the steplength $\mu_k$; a similar approach can be applied when $\eta_k$ or $\hat{\eta}_k$ is used. Theorem 2 assures the convergence of the Barzilai-Borwein method independently of the initial guesses.

Theorem 2 (The main result). Let the weight matrix $W$ be $V$-positive definite and

$$\|W\| < 4. \qquad (6)$$

Then the sequence $\{u_k\}$ generated by the two-point step size gradient method (5) with the steplength $\mu_k$ converges Q-linearly to the weak solution $u$.
Proof. The theorem is proved for random initial guesses; we only suppose that $u_0, u_1 \in \hat{V}_h$. The equality

$$(u_{k+1}, v) = (u_k, v) - \frac{1}{\mu_k}\, DJ(u_k)v, \quad k \ge 1,$$

is transformed into the equality

$$(e_{k+1}, v) = (e_k, v) - \frac{1}{\mu_k} \big( DJ(u_k)v - DJ(u)v \big),$$

having in mind that $u$ is the solution of (3) and $e_k = u_k - u$ is the error in the approximate solution $u_k$. Applying the mean value theorem and replacing $v = e_{k+1}$, we obtain

$$(e_{k+1}, e_{k+1}) = (e_k, e_{k+1}) - \frac{1}{\mu_k}\, D^2 J(\tilde{u}_k)(e_k, e_{k+1}), \qquad (7)$$

where $\tilde{u}_k = u + \theta_k e_k$ for some $\theta_k \in (0, 1)$. Since

$$0 < (\Lambda(v)\xi, \xi) \le \frac{1}{4}\|\xi\|^2$$

for every $v$ and every nonzero $\xi$, the steplength $\mu_k$ is estimated by $\frac{3}{4} \le \mu_k \le \frac{4}{3}$. The positive definiteness of $D^2 J(\tilde{u}_k)$ follows from (6). We continue by estimating the left-hand side of (7):

$$\|e_{k+1}\|_{2,\Omega}^2 = (e_{k+1}, e_{k+1}) \le \left\| I - \frac{1}{\mu_k} D^2 J(\tilde{u}_k) \right\| \|e_k\|_{2,\Omega} \|e_{k+1}\|_{2,\Omega} \le q_k \|e_k\|_{2,\Omega} \|e_{k+1}\|_{2,\Omega}.$$

The contraction factor $q_k$ is bounded above by a constant $q$, $0 < q < 1$, which does not depend on $k$. Finally,

$$\|e_{k+1}\|_{2,\Omega} \le q \|e_k\|_{2,\Omega} \le \cdots \le q^k \|e_1\|_{2,\Omega}, \quad k \in \mathbb{N}.$$
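A minimal sketch of the two-point step size iteration (5), written in the coordinate form used later in the experiments, is given below. The classical steplength $\alpha_k$ is computed here with the Euclidean inner product; this is a simplification, since the steplengths of the paper are defined through the finite element inner product, and the function names are illustrative assumptions.

```python
import numpy as np

def sigm(x):
    """Componentwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def two_point_gradient(W, a, u0, u1, tol=1e-13, max_iter=20000):
    """Two-point step size (Barzilai-Borwein type) iteration for u = Sigm(a + W u),
    in the practical form u_{k+1} = (1/lam) Sigm(a + W u_k) + (1 - 1/lam) u_k,
    using the classical steplength lam = (dg . du) / (du . du)."""
    grad = lambda u: u - sigm(a + W @ u)       # coordinate form of the gradient
    u_prev, u = u0.copy(), u1.copy()
    g_prev = grad(u_prev)
    for k in range(1, max_iter + 1):
        g = grad(u)
        if np.linalg.norm(g) < tol:            # stop when the gradient norm is small
            return u, k
        du, dg = u - u_prev, g - g_prev
        denom = du @ du
        lam = (dg @ du) / denom if denom > 0.0 else 1.0
        if lam == 0.0:                          # guard against a zero steplength
            lam = 1.0
        u_prev, g_prev = u, g
        u = (1.0 / lam) * sigm(a + W @ u) + (1.0 - 1.0 / lam) * u
    return u, max_iter
```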
Further we suppose that all hidden units in the $i$-th layer have the same length $m_i$ and $m = \max_{i=1,2,\ldots,p} m_i$. Let $h_{ij}$ be the $j$-th hidden unit in the $i$-th hidden layer and $W(h_{ij}, h_{kl})$ be the weight matrix generated by the connections between the neurons $h_{ij}$ and $h_{kl}$, $i \ne k$. We only consider the particular case $0 < m_i < m_k < m$; in this case the weight matrix

$$W(h_{ij}, h_{kl}) = \begin{pmatrix} A_{m_i \times m_i} & B_{m_i \times (m_k - m_i)} \end{pmatrix}$$

is a block rectangular matrix. A similar approach can be used in the other cases. To obtain a problem corresponding to (2) we extend all weight matrices and hidden units as follows:

$$h_{ij} \to \hat{h}_{ij} = (h_{ij}, o_i), \qquad h_{kl} \to \hat{h}_{kl} = (h_{kl}, o_k), \qquad W(h_{ij}, h_{kl}) \to \hat{W} = \begin{pmatrix} A & B & C \\ O & I_{22} & D \\ C^T & D^T & I_{33} \end{pmatrix},$$

where $o_i \in \mathbb{R}^{m - m_i}$ and $o_k \in \mathbb{R}^{m - m_k}$ are zero vectors, $O \in \mathbb{R}^{(m_k - m_i) \times m_i}$, $C \in \mathbb{R}^{m_i \times (m - m_k)}$ and $D \in \mathbb{R}^{(m_k - m_i) \times (m - m_k)}$ are zero matrices, and $I_{22} \in \mathbb{R}^{(m_k - m_i) \times (m_k - m_i)}$ and $I_{33} \in \mathbb{R}^{(m - m_k) \times (m - m_k)}$ are identity matrices. Then $\hat{h}_{ij}, \hat{h}_{kl} \in \mathbb{R}^m$ and $\hat{W}$ is an $m \times m$ matrix. Thus we reduce the rectangular version of the mean field approximation problem to a square one.
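The extension above can be sketched in a few lines of NumPy; the block sizes follow the construction of $\hat{W}$, while the function names are illustrative assumptions.

```python
import numpy as np

def extend_weight_matrix(Wr, m):
    """Embed a rectangular block Wr = [A | B] of size m_i x m_k (m_i < m_k <= m)
    into an m x m matrix W_hat with zero blocks O, C, D and identity blocks."""
    mi, mk = Wr.shape
    W_hat = np.zeros((m, m))
    W_hat[:mi, :mk] = Wr                       # blocks A and B
    W_hat[mi:mk, mi:mk] = np.eye(mk - mi)      # block I_22
    W_hat[mk:, mk:] = np.eye(m - mk)           # block I_33
    return W_hat

def extend_unit(h, m):
    """Pad a hidden unit of length m_i with zeros up to length m."""
    return np.concatenate([h, np.zeros(m - h.size)])
```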
5. Experiments

From a computational point of view, method (5) takes the following attractive form:

$$u_{k+1} = \frac{1}{\lambda_k} \mathrm{Sigm}(a + W u_k) + \left( 1 - \frac{1}{\lambda_k} \right) u_k, \quad k \ge 1,$$

with the steplengths

$$\alpha_k = \frac{\Delta s_k \cdot \Delta u_k}{\|\Delta u_k\|^2}, \qquad \mu_k = \frac{\|u_k\|^2 - \|u_{k-1}\|^2}{\|\mathrm{Sigm}(a + W u_k)\|^2 - \|\mathrm{Sigm}(a + W u_{k-1})\|^2}, \qquad \eta_k = \frac{\sum_{i=k-1}^{k} s_i \cdot (I - \Lambda(u_i) W) s_i}{\sum_{i=k-1}^{k} \|s_i\|^2},$$

where $s_k = u_k - \mathrm{Sigm}(a + W u_k)$ and $\Delta s_k = s_k - s_{k-1}$. We consider different cases with respect to the features of the weight matrix. The stop criterion

$$\|DJ(u_k)\| < \varepsilon \qquad (8)$$

with $\varepsilon = 10^{-13}$ is used throughout all our considerations. The number of necessary iterations $\nu = \nu(\lambda_k, n, \|W\|)$ for satisfying (8) is the object of interest in this section. The fixed-point iteration is denoted by FPI in all tables.

We start with the case where $W$ is $V$-positive definite and $\|W\| < 4$, i.e. the weight matrix satisfies the requirements of Theorem 2. In this case we have very fast convergence for all steplengths. The same rate of convergence is also established for the fixed-point iteration. Tables 1 and 2 indicate that the rate of convergence depends neither on the number of hidden neurons nor on the initial guesses.

Table 1. The number of necessary iterations for satisfying the stop criterion when $\|W\| = 1$.

  $\lambda_k$ \ $n$     8    100    512   1024   2048   4096   6400
  $\alpha_k$            7      8      8      8      9      9      9
  $\mu_k$               8      9      9      9      9      9      9
  $\eta_k$             11     13     14     15     15     15     16
  FPI                   7      8      8      9      9      9      9

Table 2. The number $\nu(\lambda_k, n, 2)$ obtained with random initial guesses.

  $\lambda_k$ \ $n$     8    100    512   1024   2048   4096   6400
  $\alpha_k$            9     10     11     11     11     11     13
  $\mu_k$              11     11     12     12     12     13     13
  $\eta_k$             12     16     18     18     18     18     19
  FPI                  11     11     12     12     12     12     13

Further we analyze the case when $\|W\| > 4$ and the requirement for $V$-positivity is violated. We establish a low rate of convergence in the cases when $4 < \|W\| < 20$; some examples are presented in Table 3. In this case the traditional steplength proposed by J. Barzilai and J. Borwein [9] yields nonmonotone convergence of the error norm. The sequence $\{e_k\}$ converges monotonically to zero when the method is applied with the steplengths $\mu_k$ and $\eta_k$. The best results for ill-conditioned matrices are obtained with the steplength $\eta_k$. The number $\nu(\lambda_k, n, \|W\| > 4)$ grows nonlinearly when the number of unknowns $n$ tends to infinity. The convergence of the two-point step size method is not influenced by whether the weight matrix is singular or not. The initial guesses do not affect the rate of convergence either. All computational tests with $\|W\| > 20.5$ and $n > 4096$ have failed, since the Barzilai-Borwein method diverges with all considered steplengths, as does the fixed-point iteration.

Table 3. The number of necessary iterations for satisfying the stop criterion when $\|W\| > 4$.

  $\lambda_k$        $n = 4096$, $\|W\| = 20.1$     $n = 6400$, $\|W\| = 17$
  $\alpha_k$         divergence                     930
  $\mu_k$            1296                           237
  $\eta_k$           1037                           204
  FPI                1616                           281
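A sketch of the experiment described in this section is given below. It reuses the functions from the previous sketches (sigm, mean_field_fixed_point, two_point_gradient) and generates random test matrices with a prescribed spectral norm; the random data are an assumption and not the authors' test matrices.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_weight_matrix(n, target_norm):
    """Random weight matrix with zero diagonal and prescribed spectral norm."""
    W = rng.standard_normal((n, n))
    np.fill_diagonal(W, 0.0)
    return target_norm * W / np.linalg.norm(W, 2)

# Count the iterations nu needed by the fixed-point iteration and by the
# two-point method for several problem sizes in the ||W|| < 4 regime.
for n in (8, 100, 512):
    W = random_weight_matrix(n, 2.0)
    a = rng.standard_normal(n)
    u0 = rng.uniform(0, 1, n)
    u1 = sigm(a + W @ u0)                      # second starting point
    _, nu_fpi = mean_field_fixed_point(W, a, u0)
    _, nu_bb = two_point_gradient(W, a, u0, u1)
    print(n, nu_fpi, nu_bb)
```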
6. Conclusion

The Barzilai-Borwein method for solving the mean field approximation problem is studied in the present paper. Two original steplengths, which assure a monotone decrease of the error norm of the approximate solutions, are found. The mean field approximation problem is reduced to a minimization problem, and the existence and uniqueness of the weak solution are proved. A rigorous proof of the convergence theorem for the Barzilai-Borwein method with the steplength $\mu_k$ is given. The same approach can be applied for the steplength $\eta_k$, but it is totally inapplicable for the steplength $\alpha_k$: the classical steplength $\alpha_k$, introduced by J. Barzilai and J. Borwein [9] for quadratics, yields a nonmonotone decrease of the norm $\|e_k\|$.

A comparison between the two-point step size gradient method and the fixed-point iteration is made. The computational tests indicate that

$$\nu(\lambda_k, n, \|W\|) \approx \nu(\mathrm{FPI}, n, \|W\|) \ \text{ for } \|W\| < 4, \qquad \nu(\lambda_k, n, \|W\|) < \nu(\mathrm{FPI}, n, \|W\|) \ \text{ for } \|W\| > 4.$$

Moreover, the steplength $\eta_k$ is superior in the case of ill-conditioned weight matrices. The features of the weight matrix affect the convergence of both considered methods. The mean field approximation problem does not have a solution in every case, and the lack of a solution leads to improper operation of the corresponding neural network. Any divergent mean field approximation procedure generates confusion in the process of deep machine learning. To avoid this difficulty it is best to work with normalized weight matrices; note that a smaller norm of the weight matrix yields a higher rate of convergence, which is very important from a practical point of view.

Finally, designing artificial neural networks based on the mean field approximation without control over the features of the weight matrix can lead to confusion in the deep learning process.

REFERENCES

1. Salakhutdinov R., Hinton G. E., Deep Boltzmann machines, In Proc. of the Int. Conf. on Artificial Intelligence and Statistics (AISTATS 2009), pp. 448-455, 2009.
2. Salakhutdinov R., Learning Deep Boltzmann Machines using Adaptive MCMC, In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
3. Salakhutdinov R., Larochelle H., Efficient Learning of Deep Boltzmann Machines, Journal of Machine Learning Research, vol. 9, pp. 693-700, 2010.
4. Hinton G., Salakhutdinov R., An Efficient Learning Procedure for Deep Boltzmann Machines, Neural Computation, vol. 24, no. 8, pp. 1967-2006, 2012.
5. Cho K., Raiko T., Ilin A., Karhunen J., A Two-Stage Pretraining Algorithm for Deep Boltzmann Machines, Artificial Neural Networks and Machine Learning - ICANN 2013, Lecture Notes in Computer Science, vol. 8131, pp. 106-113, 2013.
6. Cho K., Raiko T., Ilin A., Gaussian-Bernoulli deep Boltzmann machine, IEEE International Joint Conference on Neural Networks, Dallas, Texas, USA, pp. 1-7, 2013.
7. Dremeau A., Boltzmann Machine and Mean-Field Approximation for Structured Sparse Decompositions, IEEE Transactions on Signal Processing, vol. 60, issue 7, pp. 3425-3438, 2012.
8. Srivastava N., Salakhutdinov R., Multimodal learning with Deep Boltzmann Machines, Journal of Machine Learning Research, vol. 15, pp. 2949-2980, 2014.
9. Barzilai J., Borwein J. M., Two-point step size gradient methods, IMA Journal of Numerical Analysis, vol. 8, issue 1, pp. 141-148, 1988.
10. Birgin E. G., Martinez J. M., Raydan M., Spectral Projected Gradient Methods: Review and Perspectives, Journal of Statistical Software, vol. 60, issue 3, pp. 1-21, 2014.
11. Raydan M., On the Barzilai and Borwein choice of steplength for the gradient method, IMA Journal of Numerical Analysis, vol. 13, pp. 321-326, 1993.
12. Todorov T. D., Nonlocal problem for a general second-order elliptic operator, Computers & Mathematics with Applications, vol. 69, issue 5, pp. 411-422, 2015.

Authors' contacts:
Department of Mathematics and Informatics,
Technical University of Gabrovo,
5300 Gabrovo,
e-mail: t.todorov@yahoo.com
