Abstract: The development of deep machine learning is essentially inspired by the activity of the human brain. The main goal in this area is to obtain a finite software model of the human recognition approach. The deep learning architecture considered here is based on the mean field approximation. Many authors assume a priori that the mean field approximation problem has a solution for all random initial guesses. Some of them simply apply a few steps of the fixed-point iteration to establish whether the neurons in the hidden layers are active or not. The present paper describes the domain of applicability of the mean field approximation for training a multilayer Boltzmann machine. We give a rigorous proof that the mean field approximations do not converge for all random initial guesses: the convergence depends strongly on the norms of the weight matrices. The new results are supported by computer-implemented examples.
Key words: Deep Boltzmann Machine, mean field approximation, gradient iterative methods.
Fig. 1. Neural network where all the hidden units in all layers have the same length.
Fig. 2. Neural network with various lengths of the hidden units.
3. The weak formulation and the associated unconstrained minimization problem

Let τ_h be a uniform triangulation of the interval [0, 1] with n linear finite elements. The associated finite element space V_h is spanned by the nodal basis functions φ_i, i = 1, 2, ..., n. The space is provided with the inner scalar product

    (u, v) = ∫ u v dx.

A linear operator T : ℝⁿ → V_h, defined by

    T v = v · φ = Σ_{i=1}^{n} v_i φ_i,  φ = (φ_1, φ_2, ..., φ_n),

generates a subset V̂_h = { T v | v ∈ ℝⁿ } of V_h. The weak problem is formulated as follows: find u ∈ V̂_h such that

    (u, v) − (s(u), v) = 0,  ∀ v ∈ V̂_h,   (3)

where s(v) = T Sigm(a + W v̄) and v̄ = T⁻¹ v. The weak formulation is a very important part of our investigation: we look for a finite-dimensional Hilbert space of continuous functions and a bijection between ℝⁿ and this space. Note that the Gramian matrix of the basis functions is positive definite. As a particular example in our considerations we chose the finite element space V̂_h; in this case the Gramian matrix is actually the mass matrix. The interval [0, 1] is not essential for the proof of the main result: any other finite closed interval can be applied successfully, but we use the canonical one-dimensional simplex for the sake of simplicity. The choice of the piecewise linear trial functions is made for the same reasons. The operator T is the desired bijection.

Lemma 1. All solutions of the weak problem are also solutions of equation (2).

Proof. The zero vector does not belong to the set V̂_h, and the mass matrix M is symmetric and positive definite. Therefore the assertion of the lemma results from the following equivalences:

    (u, v) − (s(u), v) = 0  ⟺  uᵀ M v − Sigm(a + W u)ᵀ M v = 0  ⟺  (u − Sigm(a + W u))ᵀ M v = 0.

Definition 1. The square matrix W is said to be V-positive definite if vᵀ W v > 0 for all v ∈ V.

The main goal of the present investigation is to present problem (2) as a minimization problem. Therefore we define an objective functional

    J(v) = (1/2)(v, v) − (∫₀ᵛ s(t) dt, 1)

to associate the weak form (3) with the following minimization problem:

    arg min_{v ∈ V̂_h} J(v).   (4)

Theorem 1 establishes the existence and uniqueness of the solution of the minimization problem (4). To prove this theorem we need the following notation: sigm(x) is the well-known logistic function; diag(λ_i, i = 1, 2, ..., n) is an n × n diagonal matrix with λ_i on the main diagonal; I is the n × n identity matrix;

    Λ(v) = diag(Sigm(a + W v)) diag(ê − Sigm(a + W v)),  ê = (1, 1, ..., 1)ᵀ.

Theorem 1. If the weight matrix W is V-positive definite with spectral norm ‖W‖ < 4, then problem (4) has a unique solution.

Proof. The functional J(v) is continuous, so we have to prove that J(v) is bounded below, coercive and convex. We estimate the second term in the objective functional using the Hölder inequality,

    (∫₀ᵛ s(t) dt, 1) ≤ ‖s‖_0 ‖v‖_2,

and obtain for the functional

    J(v) ≥ (1/2) ‖v‖_2² − ‖s‖_0 ‖v‖_2 = (1/2)(‖v‖_2 − 2 ‖s‖_0) ‖v‖_2,  ∀ v ∈ V̂_h.

The latter inequality shows that J(v) is bounded below and coercive. It remains to prove that J(v) is convex. For this purpose we calculate the Fréchet derivatives of J(v):

    D J(u) v = (u, v) − (s(u), v),
    D² J(u)(v, v) = (v, v) − (Λ(u) W v, v) = vᵀ M v − (Λ(u) W v)ᵀ M v = vᵀ (I − (Λ(u) W)ᵀ) M v.

Since ‖Λ(u)‖ ≤ 1/4 and ‖W‖ < 4, the second derivative satisfies D² J(u)(v, v) > 0 and the objective functional is strongly convex.

Corollary 1. If the conditions of Theorem 1 hold, then problem (3) has a unique solution.
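For orientation, the coordinate form of the mean field consistency condition, u = Sigm(a + W u), can be explored numerically by direct fixed-point iteration. The following sketch is not from the paper: the function names, problem sizes, and random test data are illustrative only, and a plain componentwise logistic Sigm is assumed.

```python
import numpy as np

def sigm(x):
    # componentwise logistic function
    return 1.0 / (1.0 + np.exp(-x))

def fixed_point(a, W, u0, tol=1e-13, max_iter=10_000):
    # iterate u <- Sigm(a + W u) until successive iterates agree to tol
    u = u0.copy()
    for k in range(1, max_iter + 1):
        u_new = sigm(a + W @ u)
        if np.linalg.norm(u_new - u) < tol:
            return u_new, k
        u = u_new
    return u, max_iter

rng = np.random.default_rng(0)
n = 8
a = rng.standard_normal(n)
W = rng.standard_normal((n, n))
W /= np.linalg.norm(W, 2)      # normalize the spectral norm to 1 (< 4)
u, iters = fixed_point(a, W, rng.random(n))
print(iters)                   # the map is a contraction here, so few iterations
```

Since the Jacobian of the map is Λ(v)W with ‖Λ(v)‖ ≤ 1/4, normalizing the spectral norm of W below 4 makes the iteration a contraction, which is exactly the regime of Theorem 1; for larger ‖W‖ the same loop may fail to converge.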
the steplength θ̂_k is estimated by

    θ̂_k = Σ_{i=1}^{k} ((I − Ds(u_i)) DJ(u_i), DJ(u_i)) / Σ_{i=1}^{k} ‖DJ(u_i)‖²_2.

Additionally, we define the classical steplength

    α_k = (ΔDJ(u_k), Δu_k) / ‖Δu_k‖²_2,  where Δu_k = u_k − u_{k−1},  ΔDJ(u_k) = DJ(u_k) − DJ(u_{k−1}).

The identity matrix and the identity operator are denoted by the same letter I. The steplength θ̂_k is a preconditioned version of the steplength

    Σ_{i=1}^{k} (D²J(u_i) DJ(u_i), DJ(u_i)) / Σ_{i=1}^{k} ‖DJ(u_i)‖²_2.

The inequality 0 < θ_k D²J(u_k) < I follows from (6). We continue estimating the left-hand side of (7):

    ‖e_{k+1}‖_2 = ‖(I − θ_k D²J(u_k)) e_k‖_2 ≤ μ_k ‖e_k‖_2.

The contraction factor μ_k is bounded above by a constant μ_0 < 1. Finally,

    ‖e_{k+1}‖_2 ≤ μ_k ‖e_k‖_2 ≤ ... ≤ μ_0^k ‖e_1‖_2 → 0,  k → ∞.
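The two-point step size iteration analyzed above can be sketched in code. This is a hypothetical illustration rather than the authors' implementation: it drops the mass matrix (working directly with the coordinate residual g(u) = u − Sigm(a + W u)) and uses the classical Barzilai-Borwein steplength of [9]; all names and random test data are made up.

```python
import numpy as np

def sigm(x):
    # componentwise logistic function
    return 1.0 / (1.0 + np.exp(-x))

def bb_gradient(a, W, u0, tol=1e-13, max_iter=10_000):
    # two-point step size gradient method on g(u) = u - Sigm(a + W u),
    # with the classical Barzilai-Borwein steplength (BB1) of [9]
    u = u0.copy()
    g = u - sigm(a + W @ u)
    step = 1.0                          # first iteration: plain residual step
    for k in range(1, max_iter + 1):
        u_new = u - step * g
        g_new = u_new - sigm(a + W @ u_new)
        if np.linalg.norm(g_new) < tol:
            return u_new, k
        du, dg = u_new - u, g_new - g
        step = (du @ du) / (du @ dg)    # BB1 two-point steplength
        u, g = u_new, g_new
    return u, max_iter

rng = np.random.default_rng(1)
n = 100
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
W = A + A.T
W /= np.linalg.norm(W, 2)              # spectral norm 1, well inside ||W|| < 4
u_bb, it_bb = bb_gradient(a, W, rng.random(n))
print(it_bb)
```

With ‖W‖ normalized below 4 the Jacobian I − Λ(u)W of the residual stays positive definite, so du·dg > 0 and the steplength is well defined; for large ‖W‖ this sketch can diverge, in line with the behavior reported in the experiments.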
The steplength θ̂_k gives better results than θ_k, but θ̂_k essentially increases the computational complexity of the problem. That is why in what follows we analyze the method with the steplength θ_k. A similar approach can be applied when α_k or θ̂_k is used. Theorem 2 assures the convergence of the Barzilai-Borwein method independently of the initial guesses.

Theorem 2 (the main result). Let the weight matrix W be V-positive definite with spectral norm ‖W‖ < 4. Then the Barzilai-Borwein method with the steplength θ_k converges for any initial guess.

Further we suppose that all hidden units in the i-th layer have the same length m_i, and let m = max_{i=1,2,...,p} m_i. Let h_ij be the j-th hidden unit in the i-th hidden layer and W(h_ij, h_lk) the weight matrix generated by the connection between the neurons h_ij and h_lk, i ≠ k. We consider only the particular case 0 < m_i < m_k < m. In this case the weight matrix

    W(h_ij, h_lk) = [ A_{m_i × m_i}  B_{m_i × (m_k − m_i)} ]

is a block rectangular matrix. A similar approach can be used in the other cases. To obtain a problem
corresponding to (2) we extend all the weight matrices and hidden units as follows:

    ĥ_ij = (h_ij, o_i),  ĥ_kl = (h_kl, o_k),

                     | A    B     C    |
    Ŵ(h_ij, h_kl) =  | O    I_22  D    |
                     | Cᵀ   Dᵀ    I_33 |,

where o_i ∈ ℝ^(m − m_i) and o_k ∈ ℝ^(m − m_k) are zero vectors; O ∈ ℝ^((m_k − m_i) × m_i), C ∈ ℝ^(m_i × (m − m_k)) and D ∈ ℝ^((m_k − m_i) × (m − m_k)) are zero matrices; and I_22 ∈ ℝ^((m_k − m_i) × (m_k − m_i)) and I_33 ∈ ℝ^((m − m_k) × (m − m_k)) are identity matrices. Then ĥ_ij, ĥ_kl ∈ ℝ^m and Ŵ is an m × m matrix. Thus we reduce the rectangular version of the mean field approximation problem to a square one.

5. Experiments

From the computational point of view, method (5) takes the following attractive form:

    u_{k+1} = u_k + (1/θ_k)(v_k − u_k),  v_k = Sigm(a + W u_k),  k ≥ 1,

with the steplengths

    θ_k = (s_k, Δu_k) / ‖Δu_k‖²_2,  s_k = Δu_k − (Sigm(a + W u_k) − Sigm(a + W u_{k−1})),

    α_k = ‖Δu_k‖²_2 / ‖Sigm(a + W u_k) − Sigm(a + W u_{k−1})‖²_2,

    θ̂_k = Σ_{i=1}^{k} (s_i, Δu_i) / Σ_{i=1}^{k} ‖s_i‖²_2.

We consider different cases with respect to the features of the weight matrix. The stop criterion

    ‖DJ(u_k)‖ < ε,   (8)

with ε = 10⁻¹³, is used throughout all our considerations. The number of necessary iterations k(ε, n, ‖W‖) for satisfying (8) is the object of interest in this section. The fixed-point iteration is denoted by FPI in all tables.

Let us start with the case where W is V-positive definite and ‖W‖ < 4, i.e. the weight matrix satisfies the requirements of Theorem 2. In this case we have very fast convergence for all steplengths. The same rate of convergence is also established for the fixed-point iteration. Tables 1 and 2 indicate that the rate of convergence depends neither on the number of hidden neurons nor on the initial guesses.

Table 1. The number of necessary iterations for satisfying the stop criterion when ‖W‖ = 1.

    v_k \ n |  8  100  512  1024  2048  4096  6400
    θ_k     |  7    8    8     8     9     9     9
    θ̂_k     |  8    9    9     9     9     9     9
    α_k     | 11   13   14    15    15    15    16
    FPI     |  7    8    8     9     9     9     9

Table 2. The number k(ε, n, 2) obtained by random initial guesses.

    v_k \ n |  8  100  512  1024  2048  4096  6400
    θ_k     |  9   10   11    11    11    11    13
    θ̂_k     | 11   11   12    12    12    13    13
    α_k     | 12   16   18    18    18    18    19
    FPI     | 11   11   12    12    12    12    13

Further we analyze the case when ‖W‖ > 4 and the requirement for V-positivity is broken. We establish a low rate of convergence in the cases when 4 < ‖W‖ < 20; some examples are presented in Table 3. In this case the traditional steplength proposed by J. Barzilai and J. Borwein [9] assures nonmonotone convergence of the error norm. The sequence e_k converges monotonically to zero when the method is applied with the steplengths θ_k and θ̂_k. The best results for an ill-conditioned matrix are obtained by the steplength θ_k. The number k(ε, n, ‖W‖ > 4) grows nonlinearly as the number of unknowns n tends to infinity. The convergence of the two-point step size method is not influenced by whether the weight matrix is singular or not. The initial guesses do not affect the rate of convergence either. All computational tests with ‖W‖ ≥ 20.5 and n ≥ 4096 have failed, since the Barzilai-Borwein method diverges with all considered steplengths, as does the fixed-point iteration.

6. Conclusion

The Barzilai-Borwein method for solving the mean field approximation problem is studied in the present paper. Two original steplengths, which assure monotone decreasing of the norm of the error in the approximate solutions, are found. The mean field approximation problem is reduced to a minimization one. The uniqueness and existence of
Copyright by Technical University - Sofia, Plovdiv branch, Bulgaria
the weak solution are proved. A rigorous proof of the convergence theorem for the Barzilai-Borwein method with the steplength θ_k is made.

Table 3. The number of necessary iterations for satisfying the stop criterion when ‖W‖ > 4.

    v_k  | n = 4096, ‖W‖ = 20.1 | n = 6400, ‖W‖ = 17
    α_k  | divergence           |  930
    θ̂_k  | 1296                 |  237
    θ_k  | 1037                 |  204
    FPI  | 1616                 |  281

The same approach can be applied for the steplength θ̂_k, but it is totally inapplicable for the steplength α_k. The classical steplength α_k introduced by J. Barzilai and J. Borwein [9] for quadratics assures nonmonotone decreasing of the norm ‖e_k‖. A comparison between the two-point step size gradient method and the fixed-point iteration method is made. The computational tests indicate that

    k(ε, n, ‖W‖) ≈ FPI(ε, n, ‖W‖),  ‖W‖ < 4,
    k(ε, n, ‖W‖) < FPI(ε, n, ‖W‖),  ‖W‖ > 4.

Moreover, the steplength θ_k is superior in the case of ill-conditioned weight matrices. The features of the weight matrix affect the convergence of both considered methods. The mean field approximation problem does not have a solution in every case. The lack of solutions leads to improper working of the corresponding neural network. Any divergent mean field approximation procedure generates confusion in the process of deep machine learning. To avoid this difficulty, it is best to work with normalized weight matrices. Note that a smaller norm of the weight matrix yields a higher rate of convergence, which is very important from the practical point of view.

Finally, designing artificial neural networks based on the mean field approximation without control on the features of the weight matrix can lead to confusion in the deep learning process.

REFERENCES

1. Salakhutdinov R., Hinton G. E., Deep Boltzmann machines, Proc. of the Int. Conf. on Artificial Intelligence and Statistics (AISTATS 2009), pp. 448-455, 2009.
2. Salakhutdinov R., Learning Deep Boltzmann Machines using Adaptive MCMC, Proc. of the 27th International Conference on Machine Learning, Haifa, Israel, 2010.
3. Salakhutdinov R., Larochelle H., Efficient Learning of Deep Boltzmann Machines, Journal of Machine Learning Research, vol. 9, pp. 693-700, 2010.
4. Hinton G., Salakhutdinov R., An Efficient Learning Procedure for Deep Boltzmann Machines, Neural Computation, vol. 24, no. 8, pp. 1967-2006, 2012.
5. Cho K., Raiko T., Ilin A., Karhunen J., A Two-Stage Pretraining Algorithm for Deep Boltzmann Machines, Artificial Neural Networks and Machine Learning - ICANN 2013, Lecture Notes in Computer Science, vol. 8131, pp. 106-113.
6. Cho K., Raiko T., Ilin A., Gaussian-Bernoulli deep Boltzmann machine, IEEE International Joint Conference on Neural Networks, Dallas, Texas, USA, pp. 1-7, 2013.
7. Dremeau A., Boltzmann Machine and Mean-Field Approximation for Structured Sparse Decompositions, IEEE Transactions on Signal Processing, vol. 60, issue 7, pp. 3425-3438, 2012.
8. Srivastava N., Salakhutdinov R., Multimodal learning with Deep Boltzmann Machines, Journal of Machine Learning Research, vol. 15, pp. 2949-2980, 2014.
9. Barzilai J., Borwein J. M., Two-point step size gradient methods, IMA Journal of Numerical Analysis, vol. 8, issue 1, pp. 141-148, 1988.
10. Birgin E. G., Martinez J. M., Raydan M., Spectral Projected Gradient methods: Review and Perspectives, Journal of Statistical Software, vol. 60, issue 3, pp. 1-21, 2014.
11. Raydan M., On the Barzilai and Borwein choice of steplength for the gradient method, IMA Journal of Numerical Analysis, vol. 13, pp. 321-326, 1993.
12. Todorov T. D., Nonlocal problem for a general second-order elliptic operator, Computers & Mathematics with Applications, vol. 69, issue 5, pp. 411-422, 2015.

Authors' contacts:
Department of Mathematics and Informatics,
Technical University of Gabrovo,
5300 Gabrovo,
e-mail: t.todorov@yahoo.com