Citation: AIP Conference Proceedings 1907, 030052 (2017); https://doi.org/10.1063/1.5012674
Published by the American Institute of Physics

Neural Network Learning Dynamics Analysis
Elley I. Shamaev 1,2,a), Aleksandra S. Bosikova 1,b), Grigory A. Spirov 1,c) and Eugene G. Malkovich 2,d)

1 Ammosov North-Eastern Federal University, Yakutsk, 677000, Russia
2 Sobolev Institute of Mathematics, Novosibirsk, 630090, Russia

a) Corresponding author: ei.shamaev@s-vfu.ru
b) bosikova.as.svfu@mail.ru
c) spirovykt@gmail.com
d) malkovich@math.nsc.ru

Abstract. We analyze the learning dynamics of nonlinear multilayer neural networks with diagonal weight matrices. We show exponential decay of the loss function in the nonlinear case and give evidence that small initial weights slow down nonlinear network learning. We extend the theoretical understanding of the learning dynamics of linear multilayer neural nets, started by Saxe, McClelland and Ganguli, to nonlinear ones.

INTRODUCTION

Over the past five years, state-of-the-art speech [1] and image recognition [2] systems have surpassed human performance on several datasets. The core of this success in machine learning is iteratively learnable compositions of variable linear and fixed non-linear functions. A mapping constructed by such compositions is called a neural network. Learning is performed by stochastic gradient descent, which minimizes loss functions defined as the average squared error over mini-batches of data [3], [4]. The learning dynamics of the simplest multilayer neural network (MNN)

    N(x) = a_n a_{n-1} \cdots a_1 x, \qquad x \in \mathbb{R},

which transforms input 1 into output 1, was explained in [5] by the following approximation of the behavior of N(x) during learning:

    N(x) \approx \sigma\left(nct - \frac{1}{u_0}\right) x, \qquad \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad (1)

where c is the step size of gradient descent, t measures time in units of iterations (it appears when the learning rule is approximated by an ordinary differential equation for a small enough gradient descent step size), and u_0 is a small enough positive coefficient of the initial random network N(x) = u_0 x. The approximation (1) shows that, surprisingly, increasing the depth of the multilayer network does not slow down learning; it is the small initial value u_0 that is responsible for the slowdown. The well-known random orthogonal initialization [6] in deep learning was also explained by this analysis in [5].
We recall some definitions. A multilayer neural network with (n + 1) layers is a mapping N : \mathbb{R}^K \to \mathbb{R}^M,

    N(x) = \tau(A_n \tilde{\sigma}(\ldots A_2 \tilde{\sigma}(A_1 x + b_1) + b_2 \ldots) + b_n), \qquad x \in \mathbb{R}^K,

where A_1, \ldots, A_n are linear operators, b_1, \ldots, b_n are vectors of the appropriate vector spaces, \tilde{\sigma} acts on each component of a vector by a fixed non-decreasing continuous function \sigma, and \tau is a voting mapping that replaces the maximal coordinate by one and all other coordinates by zero (it is usually approximated by the softmax function). The stochastic gradient descent (SGD) algorithm is used when the data set is too large, in order to reduce the computational cost. SGD is organized as follows.
Pairs of input and output vectors are arranged, uniformly at random, into mini-batches D_1, \ldots, D_N. Then each entry a of the matrices of the linear operators A_1, \ldots, A_n is iteratively updated by the so-called backpropagation rule

    a \mapsto a - c\,\frac{\partial \varepsilon_i}{\partial a}, \qquad \varepsilon_i = \varepsilon_i(A_1, \ldots, A_n) = \frac{1}{2}\sum_{(x,y) \in D_i} \|y - N(x)\|^2, \qquad (2)

where c > 0 is the learning rate and \varepsilon_i is the loss function of the mini-batch D_i. The biases b_1, \ldots, b_n are also updated by (2).
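As a small illustration (ours, not part of the paper; helper names such as loss_and_grad are our own), the following Python sketch applies the update rule (2) to the scalar chain N(x) = a_n \cdots a_1 x on the two patterns (x, y) = (1, 1) and (0, 0) considered later in the text:

```python
# Minimal sketch (ours, not from the paper): the update rule (2) for the scalar
# chain N(x) = a_n * ... * a_1 * x on the two patterns (x, y) = (1, 1), (0, 0).
import numpy as np

def loss_and_grad(a, x, y):
    """Loss 0.5*(y - N(x))**2 and its gradient with respect to each a_i."""
    N = np.prod(a) * x
    # dN(x)/da_i = x * prod_{j != i} a_j
    dN = np.array([x * np.prod(np.delete(a, i)) for i in range(len(a))])
    return 0.5 * (y - N) ** 2, -(y - N) * dN

n_layers, c = 5, 0.05                              # depth n and learning rate c
rng = np.random.default_rng(0)
a = 0.5 + 0.01 * rng.standard_normal(n_layers)     # initial weights near d = 0.5

batch = [(1.0, 1.0), (0.0, 0.0)]                   # the two patterns
for step in range(2000):
    for x, y in batch:                             # (0, 0) contributes no gradient here
        _, grad = loss_and_grad(a, x, y)
        a = a - c * grad                           # update rule (2)

print("N(1) after training:", np.prod(a))          # close to 1
```

For this chain the (0, 0) pattern is fitted exactly by any choice of coefficients, so the updates effectively minimize \frac{1}{2}(1 - N(1))^2, which is the quantity analyzed below.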
Empirically it has been shown that successive minimization of the non-convex \varepsilon_i approximates minimization of \varepsilon = \frac{1}{N}\sum_{i=1}^{N} \varepsilon_i. For small enough c, the backpropagation update (2) is a step of Euler's method for the numerical integration of the ordinary differential equation

    \dot{a} = -c\,\frac{\partial \varepsilon_i}{\partial a}. \qquad (3)

Further, we denote the MNNs N(x) and S(x) whose coefficients evolve continuously according to (3) by N(x, t) and S(x, t), respectively. We also assume N = 1 and that learning is performed by gradient descent of \varepsilon to recognize only two patterns: x = 1, y = 1 and x = 0, y = 0. In this case we can consider the learning dynamics \dot{N}(x, t) = \frac{d}{dt} N(x, t). Using the chain rule \dot{N}(x, t) = \sum_{i=1}^{n} \frac{\partial N(x, t)}{\partial a_i} \dot{a}_i and \dot{a}_i = -\frac{c}{2} \frac{\partial}{\partial a_i} (y - N(x, t))^2, we obtain

    \dot{N}(x, t) = c\,(y - N(x, t)) \sum_{i=1}^{n} \left(\frac{\partial N(x, t)}{\partial a_i}\right)^2. \qquad (4)
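As a quick sanity check (our own script; variable names are ours), one can integrate the gradient flow (3) with Euler steps for the linear chain and verify numerically that the product N(1, t) evolves according to equation (4):

```python
# Small sanity check (ours): evolve the coefficients a_i by the gradient flow (3)
# for the linear chain and verify that N(1, t) obeys equation (4),
# i.e. dN/dt = c*(y - N)*sum_i (dN/da_i)^2 with y = 1.
import numpy as np

n, c, dt = 3, 1.0, 1e-4
a = np.array([0.4, 0.5, 0.6])                  # some positive initial coefficients

for step in range(20000):
    N = np.prod(a)
    dN_da = N / a                              # dN/da_i = prod_{j != i} a_j (all a_i > 0)
    rhs = c * (1.0 - N) * np.sum(dN_da ** 2)   # right-hand side of (4)

    a_next = a + dt * c * (1.0 - N) * dN_da    # Euler step for (3)
    dN_dt = (np.prod(a_next) - N) / dt         # finite-difference estimate of dN/dt

    assert abs(dN_dt - rhs) < 1e-3 * (1.0 + abs(rhs))
    a = a_next

print("equation (4) matches the simulated dynamics; N(1, t) =", np.prod(a))
```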
Our main results are stated in Theorems 1 and 2.
Theorem 1. Consider a multilayer neural network

    N(x) = a_n \rho(\ldots a_2 \rho(a_1 x) \ldots),

where \rho(z) = z or \rho is the rectified linear unit \rho(z) = \frac{1}{2}(z + |z|). Let a_1 = a_1(t), \ldots, a_n = a_n(t) be defined by (3) with the loss function \varepsilon_1 = \frac{1}{2}(1 - N(1))^2. Assume that the initial values a_1(0), \ldots, a_n(0) are chosen in a small enough neighborhood of some d > 0. Then N(x, t) satisfies the inequalities

    \sigma\left(nct - \frac{1}{u_0} - \log\frac{1 - u_0}{u_0} + 1\right) x \;\le\; N(x, t) \;\le\; \sigma\left(nct - \log\frac{1 - u_0}{u_0}\right) x, \qquad \sigma(z) = \frac{1}{1 + e^{-z}},

where u_0 = a_1(0) \cdots a_n(0).
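These bounds can be probed numerically. The sketch below (ours; it assumes equal positive initial coefficients, for which the ReLU chain coincides with the linear one) integrates the gradient flow (3) and checks both sigmoid estimates at x = 1 along the trajectory:

```python
# Numerical probe of Theorem 1 (our own script): equal positive initial
# coefficients, so the ReLU chain coincides with the linear chain rho(z) = z.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

n, c, dt, T = 3, 1.0, 1e-3, 30.0
a = np.full(n, 0.3)                        # initial coefficients near d = 0.3 > 0
u0 = np.prod(a)                            # u0 = a_1(0) * ... * a_n(0)

ok, t = True, 0.0
while t < T:
    N = np.prod(a)                         # N(1, t)
    lower = sigma(n * c * t - 1.0 / u0 - np.log((1 - u0) / u0) + 1.0)
    upper = sigma(n * c * t - np.log((1 - u0) / u0))
    ok = ok and (lower <= N + 1e-6) and (N <= upper + 1e-6)
    a += dt * c * (1.0 - N) * (N / a)      # Euler step for the gradient flow (3)
    t += dt
print("Theorem 1 bounds hold along the trajectory:", ok)
```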
The proof of Theorem 1 is based on the following Lemma 1.
Lemma 1 (Exact Solution). The functions

    a_1(t) = \sqrt{a_1^2(0) + h(t)}, \quad \ldots, \quad a_n(t) = \sqrt{a_n^2(0) + h(t)},

where \delta = \pm 1 has the same sign as a_1(0) \cdots a_n(0) and h(t) is defined as the inverse function of

    t = \int_0^{h} \frac{dz}{\left(\delta - \sqrt{R(z)}\right)\sqrt{R(z)}}, \qquad R(z) = (z + a_1^2(0)) \cdots (z + a_n^2(0)),

are a globally defined solution of the Cauchy problem (3) under the assumptions of Theorem 1.
In the case of tanh and sigmoid activation functions we can estimate the learning dynamics only from below.
Theorem 2. Consider a non-linear multilayer neural network

    S(x) = a_n \sigma(\ldots a_2 \sigma(a_1 x) \ldots),

where \sigma(x) = \frac{1}{1 + e^{-x}} or \sigma(x) = \tanh(x), and a_1(t), \ldots, a_n(t) are defined by (3) with the loss function \varepsilon_1 = \frac{1}{2}(1 - S(1))^2. Assume that the initial values a_1(0), \ldots, a_n(0) are chosen in a small enough neighborhood of some d > 0, and that a_1(t), \ldots, a_n(t) are monotonically increasing and well defined for all t \ge 0. Then S(x, t) satisfies the inequality

    \sigma\left(c\,\frac{s_0^2}{a_n^2(0)}\,t - \frac{1 - s_0}{s_0}\right) \;\le\; S(x, t),

where s_0 = S(1, 0).
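Theorem 2 can likewise be checked numerically. The following sketch (ours; the helpers forward and grad_S are our own) trains a sigmoid chain on the single pattern (x, y) = (1, 1) by the gradient flow (3) and compares S(1, t) with the lower bound above:

```python
# Numerical check of Theorem 2 (ours): gradient flow (3) for the sigmoid chain
# S(x) = a_n*sigma(...sigma(a_1*x)...) on the pattern (x, y) = (1, 1).
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, x):
    """Return S(x) and the activations h_0 = x, h_1, ..., h_{n-1}."""
    h = [x]
    for ai in a[:-1]:
        h.append(sigma(ai * h[-1]))
    return a[-1] * h[-1], h

def grad_S(a, x):
    """dS/da_i via backpropagation through the chain."""
    S, h = forward(a, x)
    g = np.zeros_like(a)
    g[-1] = h[-1]                          # dS/da_n = h_{n-1}
    back = a[-1]                           # running product of downstream factors
    for i in range(len(a) - 2, -1, -1):
        z = a[i] * h[i]
        back *= sigma(z) * (1.0 - sigma(z))    # sigma'(z)
        g[i] = back * h[i]
        back *= a[i]
    return S, g

n, c, dt, T = 4, 1.0, 1e-3, 60.0
a = np.full(n, 0.6)                        # initial coefficients near d = 0.6
s0, _ = forward(a, 1.0)                    # s0 = S(1, 0)
an0 = a[-1]

ok, t = True, 0.0
while t < T:
    S, g = grad_S(a, 1.0)
    bound = sigma(c * (s0 / an0) ** 2 * t - (1.0 - s0) / s0)
    ok = ok and (bound <= S + 1e-6)
    a += dt * c * (1.0 - S) * g            # Euler step for the gradient flow (3)
    t += dt
print("lower bound of Theorem 2 holds along the trajectory:", ok)
```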

Theorem 1 confirms the results of [5] in the case of an MNN with ReLU. Theorem 2 gives evidence that small initial data slows down learning in the case of tanh and sigmoid activation functions. Although our mathematical model extends the previous linear-case analysis of [5], it is not a fully adequate model of deep networks: it does not take into account noise, stochasticity, the voting mapping, three or more patterns, or biases. Backpropagation for the simple MNNs considered here does not suffer from local minima, but it gets stuck on the plateau in a neighborhood of the point a_1 = \ldots = a_n = 0.
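This plateau is easy to observe. In the small experiment below (ours), the time needed for the linear chain to fit the pattern grows quickly as the initial coefficients, and hence u_0, shrink:

```python
# Small experiment (ours): for the linear chain, the time needed to fit the
# pattern grows quickly as the initial coefficients (and hence u0) shrink --
# the plateau in a neighborhood of a_1 = ... = a_n = 0.
import numpy as np

def time_to_fit(d, n=4, c=1.0, dt=1e-3, target=0.9, t_max=1e4):
    a = np.full(n, d)
    t = 0.0
    while np.prod(a) < target and t < t_max:
        N = np.prod(a)
        a += dt * c * (1.0 - N) * (N / a)      # gradient flow (3)
        t += dt
    return t

for d in (0.4, 0.3, 0.2):
    print(f"u0 = {d**4:.4f}: time to reach N(1, t) = 0.9 is about {time_to_fit(d):.1f}")
```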

PROOFS

Proof of Lemma 1. Straightforward substitution of a_i(t) = \sqrt{a_i^2(0) + h(t)} for each a_i in (3) gives the equation

    \frac{\dot{h}(t)}{2\sqrt{a_i^2(0) + h(t)}} = \left(1 - \delta\sqrt{R(h(t))}\right) \frac{\delta\sqrt{R(h(t))}}{\sqrt{a_i^2(0) + h(t)}}, \qquad i = 1, \ldots, n. \qquad (5)

The method of integration of equations with separable variables ensures that the function h(t) defined in Lemma 1 solves equation (5) with the initial condition h(0) = 0. □
Proof of Theorem 1. Write N_n(t) = a_1(t) \cdots a_n(t) = N(1, t). From Lemma 1 it follows that: 1) N_n^2(t) = R(h(t)) converges exponentially to 1 as t \to +\infty; 2) the functions a_1(t), \ldots, a_n(t) are monotonically increasing, therefore N_n(t) is also monotonic and does not exceed 1; 3) a small perturbation of the initial condition preserves the inequality \frac{1}{2} \le \frac{N_n(t)}{N_n^{2/n}(t)}, n \ge 2; 4) if a_1(0) = \ldots = a_n(0), then

    \frac{1}{a_1^2(t)} + \ldots + \frac{1}{a_n^2(t)} = \frac{n}{N_n^{2/n}(t)}.

From 3) and 4) it follows that for a small perturbation of the initial condition the following inequality holds:

    \frac{1}{a_1^2(t)} + \ldots + \frac{1}{a_n^2(t)} \;\le\; \frac{n}{N(1, t)}. \qquad (6)

The inequality

    \frac{n}{N^{2/n}(1, t)} \;\le\; \frac{1}{a_1^2(t)} + \ldots + \frac{1}{a_n^2(t)} \qquad (7)

follows from the arithmetic–geometric mean inequality

    \sqrt[n]{\frac{1}{a_1^2(t) \cdots a_n^2(t)}} \;\le\; \frac{1}{n}\left(\frac{1}{a_1^2(t)} + \ldots + \frac{1}{a_n^2(t)}\right).

Applying the inequalities (7) and (6) to the equation, obtained from (4) by a straightforward calculation,

    \dot{N}(1, t) = c\,(1 - N(1, t))\,N^2(1, t)\left(\frac{1}{a_1^2(t)} + \ldots + \frac{1}{a_n^2(t)}\right), \qquad (8)

we get the inequalities

    nc\,(1 - N(1, t))\,N^2(1, t) \;\le\; \dot{N}(1, t) \;\le\; nc\,(1 - N(1, t))\,N(1, t).

Now, Theorem 1 follows from the technical Lemma 2.

Lemma 2. If a smooth function u = u(t) satisfies for all t \ge 0 the inequalities

    1) \; 0 < u < 1; \qquad 2) \; \kappa^2 \le \frac{\dot{u}}{(1 - u)u^2}; \qquad 3) \; \frac{\dot{u}}{(1 - u)u} \le \lambda^2, \qquad \kappa, \lambda \in \mathbb{R},

then u(t) satisfies the estimate

    \sigma\left(\kappa^2 t - \frac{1}{u_0} - \log\frac{1 - u_0}{u_0} + 1\right) < u(t) < \sigma\left(\lambda^2 t - \log\frac{1 - u_0}{u_0}\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},

where u_0 = u(0).
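A quick numerical test (ours): the solution of \dot{u} = \kappa^2 (1 - u) u^2 with 0 < u(0) < 1 satisfies conditions 1)–3) with \lambda = \kappa (since u \le 1), so it must stay between the two sigmoid estimates; the sketch below verifies this along an Euler-integrated trajectory:

```python
# Numerical test of Lemma 2 (ours): u with du/dt = k^2*(1-u)*u^2 satisfies
# conditions 1)-3) with kappa = lambda = k, so it must stay between the
# two sigmoid estimates of the lemma.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

k, u0, dt, T = 1.0, 0.05, 1e-3, 60.0
u, t, ok = u0, 0.0, True
while t < T:
    lower = sigma(k**2 * t - 1.0 / u0 - np.log((1 - u0) / u0) + 1.0)
    upper = sigma(k**2 * t - np.log((1 - u0) / u0))
    ok = ok and (lower <= u + 1e-6) and (u <= upper + 1e-6)
    u += dt * k**2 * (1.0 - u) * u**2      # Euler step
    t += dt
print("Lemma 2 estimates hold:", ok)
```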

Proof. The definite integral of 2) from 0 to t > 0 gives the first term of the inequalities

    \kappa^2 t \;\le\; -\frac{1}{u} + \log\frac{u}{1 - u} + \frac{1}{u_0} + \log\frac{1 - u_0}{u_0} \;<\; \log\frac{u}{1 - u} + \frac{1}{u_0} + \log\frac{1 - u_0}{u_0} - 1. \qquad (10)

The second term of (10) is obtained from the inequality -\frac{1}{u} < -1. Inverting (10) gives the left-hand inequality of Lemma 2. The integral of 3) from 0 to t gives \log\frac{u}{1 - u} \le \lambda^2 t - \log\frac{1 - u_0}{u_0}. By a straightforward calculation we get the right-hand inequality of Lemma 2. □


Theorem 1 is proven. □
Remark 1. For small positive u_0 the value of 1/u_0 is much larger than log((1 − u_0)/u_0).

Remark 2. Consider the case where \rho(z) is the rectified linear unit. If at least one of a_1, \ldots, a_{n-1} is not positive, then the common multiplier \rho'(N_{n-1}) equals zero. If two or more of the initial a_1, \ldots, a_{n-1} are not positive, then the network coefficients remain constant. If all initial coefficients are positive, then the a_i are updated by the rule (2) as in the linear case. If exactly one of a_1, \ldots, a_{n-1} is not positive, then the other network coefficients remain constant while the non-positive coefficient rises until it becomes positive.
Proof of Theorem 2. Denote \tilde{S}(x, t) = \sigma(a_{n-1}(t)\,\sigma(\ldots a_2(t)\,\sigma(a_1(t)\,x) \ldots)). Then S(x, t) = a_n \tilde{S}(x, t) and \partial S(x, t)/\partial a_n = \tilde{S}(x, t). The monotonicity of a_1(t), \ldots, a_n(t) and of \tanh z and \frac{1}{1 + e^{-z}} implies the monotonicity of \tilde{S}(x, t) in t. From the inequality 1 > S(x, t) and the monotonicity of \tilde{S}(x, t) we obtain

    \sum_{i=1}^{n} \left(\frac{\partial S(x, t)}{\partial a_i}\right)^2 \;>\; \left(\frac{\partial S(x, t)}{\partial a_n}\right)^2 = \tilde{S}^2(x, t) \;\ge\; \tilde{S}^2(x, 0) \;>\; \tilde{S}^2(x, 0)\,S(x, t).

Consequently, a solution S(x, t) of (4) satisfies the inequality

    \dot{S}(x, t) \;>\; c\left(\frac{S(x, 0)}{a_n(0)}\right)^2 S(x, t)\,(y - S(x, t)).

By standard methods for ODEs with separable variables, integrating on [0, t], we get

    \sigma\left(c\left(\frac{s_0}{a_n(0)}\right)^2 t - \log\frac{1 - s_0}{s_0}\right) \;\le\; S(x, t). \qquad □


ACKNOWLEDGMENTS
The reported study was funded by RFBR according to the research project 17-29-07077.

REFERENCES
[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., IEEE Signal Processing Magazine 29, 82–97 (2012).
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (2012), pp. 1097–1105.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” Tech. Rep. (DTIC Document, 1985).
[4] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (2015), pp. 448–456.
[5] A. M. Saxe, J. L. McClelland, and S. Ganguli, arXiv:1312.6120 (2013).
[6] M. Henaff, A. Szlam, and Y. LeCun, “Recurrent orthogonal networks and long-memory tasks,” in International Conference on Machine Learning (2016), pp. 2034–2042.
[7] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (2014), pp. 3104–3112.
[8] H. Guo and S. B. Gelfand, IEEE Transactions on Circuits and Systems 38, 883–894 (1991).
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
