The Loss Surfaces of Multilayer Networks
Anna Choromanska Mikael Henaff Michael Mathieu Gérard Ben Arous Yann LeCun
achoroma@cims.nyu.edu mbh305@nyu.edu mathieu@cs.nyu.edu benarous@cims.nyu.edu yann@cs.nyu.edu
Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

Abstract

We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits behavior similar to the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks, where for the latter poor-quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant, as the global minimum often leads to overfitting.

1 Introduction

Deep learning methods have enjoyed a resurgence of interest in the last few years for such applications as image recognition [Krizhevsky et al., 2012], speech recognition [Hinton et al., 2012], and natural language processing [Weston et al., 2014]. Some of the most popular methods use multi-stage architectures composed of alternated layers of linear transformations and max functions. In a particularly popular version, the max functions are known as ReLUs (Rectified Linear Units) and compute the mapping y = max(x, 0) in a pointwise fashion [Nair and Hinton, 2010]. In other architectures, such as convolutional networks [LeCun et al., 1998a] and maxout networks [Goodfellow et al., 2013], the max operation is performed over a small set of variables within a layer.

The vast majority of practical applications of deep learning use supervised learning with very deep networks. The supervised loss function, generally a cross-entropy or hinge loss, is minimized using some form of stochastic gradient descent (SGD) [Bottou, 1998], in which the gradient is evaluated using the back-propagation procedure [LeCun et al., 1998b].

The general shape of the loss function is very poorly understood. In the early days of neural nets (late 1980s and early 1990s), many researchers and engineers were experimenting with relatively small networks, whose convergence tended to be unreliable, particularly when using batch optimization. Multilayer neural nets earned a reputation of being finicky and unreliable, which in part caused the community to focus on simpler methods with convex loss functions, such as kernel machines and boosting.

However, several researchers experimenting with larger networks and SGD had noticed that, while multilayer nets do have many local minima, the results of multiple experiments consistently give very similar performance. This suggests that, while local minima are numerous, they are relatively easy to find, and they are all more or less equivalent in terms of performance on the test set. The present paper attempts to explain this peculiar property through the use of random matrix theory.
The work [Nakanishi and Takayama, 1997] examined the nature of the spin-glass transition in the Hopfield neural network model. None of these works, however, make the attempt to explain the paradigm of optimizing the highly non-convex neural network objective function through the prism of spin-glass theory, and thus in this respect our approach is very novel.

3 Deep network and spin-glass model

3.1 Preliminaries

For the theoretical analysis, we consider a simple model of the fully-connected feed-forward deep network with a single output and rectified linear units. We call the network N. We focus on a binary classification task. Let X be the random input vector of dimensionality d. Let (H − 1) denote the number of hidden layers in the network; we will refer to the input layer as the 0th layer and to the output layer as the Hth layer. Let n_i denote the number of units in the ith layer (note that n_0 = d and n_H = 1). Let W_i be the matrix of weights between the (i − 1)th and ith layers of the network. Also, let σ denote the activation function that converts a unit's weighted input to its output activation. We consider linear rectifiers, thus σ(x) = max(0, x). We can therefore write the (random) network output Y as

Y = q\,\sigma(W_H^\top \sigma(W_{H-1}^\top \cdots \sigma(W_1^\top X)\cdots)),

where q = (n_0 n_1 \cdots n_H)^{(H-1)/2H} is simply a normalization factor. The same expression for the output of the network can be re-expressed in the following way:

Y = q \sum_{i=1}^{n_0} \sum_{j=1}^{\gamma} X_{i,j}\, A_{i,j} \prod_{k=1}^{H} w_{i,j}^{(k)},    (1)

where the first summation is over the network inputs and the second one is over all paths from a given network input to its output, and γ is the total number of such paths (note that γ = n_1 n_2 \cdots n_H). Also, for all i = {1, 2, . . . , n_0}: X_{i,1} = X_{i,2} = · · · = X_{i,γ}. Furthermore, w_{i,j}^{(k)} is the weight of the kth segment of the path indexed with (i, j), which connects layer (k − 1) with layer k of the network. Note that each path corresponds to a certain set of H weights, which we refer to as a configuration of weights, which are multiplied by each other. Finally, A_{i,j} denotes whether a path (i, j) is active (A_{i,j} = 1) or not (A_{i,j} = 0).
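To make the path decomposition of Equation 1 concrete, the following sketch (ours, not from the paper) checks numerically, for a toy one-hidden-layer ReLU network (H = 2), that the forward-pass output coincides with the sum over input-to-output paths weighted by the activation indicators A_{i,j}; the normalization factor q is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n1 = 4, 3                         # n0 = d inputs, n1 hidden units, single output (H = 2)
W1 = rng.normal(size=(d, n1))        # weights between layer 0 and layer 1
W2 = rng.normal(size=(n1, 1))        # weights between layer 1 and the output layer
x = rng.normal(size=d)
relu = lambda z: np.maximum(z, 0.0)

# Forward pass (the normalization factor q is dropped for clarity).
h = relu(W1.T @ x)
y_forward = relu(W2.T @ h)[0]

# Path decomposition of Equation 1: one path per (input i, hidden unit j);
# A = 1 iff every rectifier along the path is switched on.
hidden_on = (W1.T @ x) > 0
output_on = (W2.T @ h)[0] > 0
y_paths = sum(x[i] * float(hidden_on[j] and output_on) * W1[i, j] * W2[j, 0]
              for i in range(d) for j in range(n1))

assert np.isclose(y_forward, y_paths)
```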
Definition 3.1. The mass of the network Ψ is the total number of all paths between all network inputs and outputs: Ψ = \prod_{i=0}^{H} n_i. Also let Λ be defined as Λ = \sqrt[H]{Ψ}.

Definition 3.2. The size of the network N is the total number of network parameters: N = \sum_{i=0}^{H-1} n_i n_{i+1}.

The mass and the size of the network depend on each other, as captured in Theorem 3.1. All proofs in this paper are deferred to the Supplementary material.

Theorem 3.1. Let Ψ be the mass of the network, d be the number of network inputs and H be the depth of the network. The size of the network is bounded as

\sqrt[H]{\Psi^2}\, H = \Lambda^2 H \;\geq\; N \;\geq\; \frac{\sqrt[H]{\Psi^2}\, H}{d} \;\geq\; \sqrt[H]{\Psi} = \Lambda.

We assume the depth of the network H is bounded. Therefore N → ∞ iff Ψ → ∞, and N → ∞ iff Λ → ∞.
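As a quick numerical illustration of Definitions 3.1 and 3.2 and of the quantities appearing in Theorem 3.1 (our sketch; the layer sizes below are an arbitrary, hypothetical example, not taken from the paper):

```python
import numpy as np

n = [784, 500, 500, 1]        # hypothetical layer sizes n_0, ..., n_H, so H = 3
H = len(n) - 1
d = n[0]

Psi = int(np.prod(n))                          # mass: total number of input-output paths (Def. 3.1)
N = sum(n[i] * n[i + 1] for i in range(H))     # size: total number of parameters (Def. 3.2)
Lam = Psi ** (1.0 / H)                         # Lambda, the H-th root of the mass

# For this example the chain of Theorem 3.1 (as reconstructed above) holds:
# Lambda^2 * H >= N >= Lambda^2 * H / d >= Lambda
print(Psi, N, Lam)
print(Lam**2 * H, ">=", N, ">=", Lam**2 * H / d, ">=", Lam)
```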
In the rest of this section we will be establishing a connection between the loss function of the neural network and the Hamiltonian of the spin-glass model. We next provide the outline of our approach.

3.2 Outline of the approach

In Subsection 3.3 we introduce randomness to the model by assuming X's and A's are random. We make certain assumptions regarding the neural network model. First, we assume certain distributions and mutual dependencies concerning the random variables X's and A's. We also introduce a spherical constraint on the model weights. We finally make two other assumptions regarding the redundancy of network parameters and their uniformity, both of which are justified by empirical evidence in the literature. These assumptions will allow us to show in Subsection 3.4 that the loss function of the neural network, after re-indexing terms¹, has the form of a centered Gaussian process on the sphere S = S^{Λ−1}(√Λ), which is equivalent to the Hamiltonian of the H-spin spherical spin-glass model, given as

L_{\Lambda,H}(\tilde w) = \frac{1}{\Lambda^{(H-1)/2}} \sum_{i_1,i_2,\dots,i_H=1}^{\Lambda} X_{i_1,i_2,\dots,i_H}\, \tilde w_{i_1}\tilde w_{i_2}\cdots\tilde w_{i_H},    (2)

with spherical constraint

\frac{1}{\Lambda}\sum_{i=1}^{\Lambda}\tilde w_i^2 = 1.    (3)

The redundancy and uniformity assumptions will be explained in Subsection 3.3 in detail. However, at a high level of generality, the redundancy assumption enables us to skip the superscript (k) appearing next to the weights in Equation 1 (note that it does not appear next to the weights in Equation 2) by determining a set of unique network weights of size no larger than N, and the uniformity assumption ensures that all ordered products of unique weights appear in Equation 2 the same number of times.

¹ The terms are re-indexed in Subsection 3.3; this is done to preserve consistency with the notation in [Auffinger et al., 2010], where the proofs of the results of Section 4 can be found.
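A minimal sketch (ours) of the Hamiltonian in Equation 2 together with the spherical constraint of Equation 3; the coupling tensor X has Λ^H entries, so this is only feasible for small Λ and H.

```python
import numpy as np

def spin_glass_loss(X, w):
    """Equation 2: (1 / Lambda^((H-1)/2)) * sum_{i_1..i_H} X[i_1,..,i_H] w[i_1]...w[i_H]."""
    Lam, H = X.shape[0], X.ndim
    s = X
    for _ in range(H):                      # contract X with w along each of its H axes
        s = np.tensordot(s, w, axes=([0], [0]))
    return float(s) / Lam ** ((H - 1) / 2)

rng = np.random.default_rng(0)
Lam, H = 40, 3
X = rng.normal(size=(Lam,) * H)             # i.i.d. N(0, 1) couplings
w = rng.normal(size=Lam)
w *= np.sqrt(Lam) / np.linalg.norm(w)       # Equation 3: w lies on the sphere S^{Lambda-1}(sqrt(Lambda))
print(spin_glass_loss(X, w))
```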
An asymptotic evaluation of the complexity of H-spin spherical spin-glass models via random matrix theory was studied in the literature [Auffinger et al., 2010], where a precise description of the energy landscape for the Hamiltonians of these models is provided. In this paper (Section 4) we use these results to explain the optimization problem in neural networks.

3.3 Approximation

Input We assume each input X_{i,j} is a normal random variable such that X_{i,j} ∼ N(0, 1). Clearly the model contains several dependencies, as one input is associated with many paths in the network. That poses a major theoretical problem in analyzing these models, as it is unclear how to account for these dependencies. In this paper we instead study a fully decoupled model [De la Peña and Giné, 1999], where the X_{i,j}'s are assumed to be independent. We allow this simplification as, to the best of our knowledge, there exists no theoretical description of the optimization paradigm with neural networks in the literature, either under the independence assumption or when the dependencies are allowed. Also note that statistical learning theory heavily relies on this assumption [Hastie et al., 2001] even when the model under consideration is much simpler than a neural network. Under the independence assumption we will demonstrate the similarity of this model to the spin-glass model. We emphasize that despite the presence of high dependencies in real neural networks, both models exhibit high similarities, as will be empirically demonstrated.

Paths We assume each path in Equation 1 is equally likely to be active, thus the A_{i,j}'s will be modeled as Bernoulli random variables with the same probability of success ρ. By assuming the independence of X's and A's we get the following

E_A[Y] = q \sum_{i=1}^{n_0}\sum_{j=1}^{\gamma} X_{i,j}\,\rho \prod_{k=1}^{H} w_{i,j}^{(k)}.    (4)
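The expectation in Equation 4 can be sanity-checked with a small Monte Carlo simulation (ours; q is set to 1 and the H weights of each path are collapsed into a single product):

```python
import numpy as np

rng = np.random.default_rng(0)
n0, gamma, H, rho = 3, 4, 2, 0.7
X = rng.normal(size=(n0, gamma))              # decoupled inputs X_{i,j} ~ N(0, 1)
W = rng.normal(size=(n0, gamma, H))           # w^{(k)}_{i,j}: the H weights along path (i, j)
path_w = W.prod(axis=2)                       # product of the weights on each path

def Y(A):                                     # Equation 1 with q = 1
    return np.sum(X * A * path_w)

# Average Y over Bernoulli(rho) activations and compare with Equation 4.
draws = [Y(rng.random((n0, gamma)) < rho) for _ in range(100_000)]
print(np.mean(draws), np.sum(X * rho * path_w))   # the two values should nearly agree
```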
Redundancy in network parametrization Let W = {w_1, w_2, . . . , w_N} be the set of all weights of the network. Let A denote the set of all H-length configurations of weights chosen from W (the order of the weights in a configuration does matter). Note that the size of A is therefore N^H. Also let B be a set such that each element corresponds to a single configuration of weights from Equation 4, thus B = {(w^{(1)}_{1,1}, w^{(2)}_{1,1}, . . . , w^{(H)}_{1,1}), (w^{(1)}_{1,2}, w^{(2)}_{1,2}, . . . , w^{(H)}_{1,2}), . . . , (w^{(1)}_{n_0,γ}, w^{(2)}_{n_0,γ}, . . . , w^{(H)}_{n_0,γ})}, where every single weight comes from the set W (note that B ⊂ A). Thus Equation 4 can be equivalently written as

Y_N := E_A[Y] = q \sum_{i_1,i_2,\dots,i_H=1}^{N} \sum_{j=1}^{r_{i_1,i_2,\dots,i_H}} X^{(j)}_{i_1,i_2,\dots,i_H}\,\rho \prod_{k=1}^{H} w_{i_k}.    (5)

We will now explain the notation. It is over-complicated on purpose, as this notation will be useful later on. r_{i_1,i_2,...,i_H} denotes whether the configuration (w_{i_1}, w_{i_2}, . . . , w_{i_H}) appears in Equation 4 or not, thus r_{i_1,i_2,...,i_H} ∈ {0, 1}, and {X^{(j)}_{i_1,i_2,...,i_H}}_{j=1}^{r_{i_1,i_2,...,i_H}} denotes a set of random variables corresponding to the same weight configuration (since r_{i_1,i_2,...,i_H} ∈ {0, 1} this set has at most one element). Also, r_{i_1,i_2,...,i_H} = 0 implies that the summand X^{(j)}_{i_1,i_2,...,i_H}\,ρ \prod_{k=1}^{H} w_{i_k} is zeroed out. Furthermore, the following condition has to be satisfied: \sum_{i_1,i_2,...,i_H=1}^{N} r_{i_1,i_2,...,i_H} = Ψ. In the notation Y_N, the index N refers to the number of unique weights of a network (this notation will also be helpful later).

Consider a family of networks which have the same graph of connections as network N but different edge weighting such that they only have s unique weights and s ≤ N (by notational analogy the expected output of this network will be called Y_s). It was recently shown [Denil et al., 2013, Denton et al., 2014] that for large-size networks a large number of network parameters (according to [Denil et al., 2013] even up to 95%) are redundant and can either be learned from a very small set of unique parameters or even not learned at all, with almost no loss in prediction accuracy.

Definition 3.3. A network M which has the same graph of connections as N and s unique weights satisfying s ≤ N is called a (s, ε)-reduction image of N for some ε ∈ [0, 1] if the prediction accuracies of N and M differ by no more than ε (thus they classify at most an ε fraction of data points differently).

Theorem 3.2. Let N be a neural network giving the output whose expectation Y_N is given in Equation 5. Let M be its (s, ε)-reduction image for some s ≤ N and ε ∈ [0, 0.5]. By analogy, let Y_s be the expected output of network M. Then the following holds:

corr(sign(Y_s), sign(Y_N)) \ge \frac{1 - 2ε}{1 + 2ε},

where corr denotes the correlation defined as corr(A, B) = \frac{E[(A - E[A])(B - E[B])]}{std(A)\,std(B)}, std is the standard deviation and sign(·) denotes the sign of the prediction (sign(Y_s) and sign(Y_N) are both random).

The redundancy assumption implies that one can preserve ε to be close to 0 even with s ≪ N.
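A toy Monte Carlo illustration (ours) of the bound in Theorem 3.2: we draw symmetric ±1 predictions for sign(Y_N), flip an ε fraction of them to play the role of the reduction image, and compare the empirical correlation with the lower bound (1 − 2ε)/(1 + 2ε).

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 200_000, 0.1                       # eps: fraction of points classified differently
sign_N = rng.choice([-1.0, 1.0], size=n)    # sign(Y_N), assumed symmetric for this toy example
flips = rng.random(n) < eps                 # the (s, eps)-reduction image disagrees on ~eps of the points
sign_s = np.where(flips, -sign_N, sign_N)   # sign(Y_s)

corr = np.mean((sign_N - sign_N.mean()) * (sign_s - sign_s.mean())) / (sign_N.std() * sign_s.std())
bound = (1 - 2 * eps) / (1 + 2 * eps)
print(corr, bound)                          # empirical corr ~ 1 - 2*eps = 0.8 >= bound ~ 0.67
```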
Uniformity Consider the network M to be a (s, ε)-reduction image of N for some s ≤ N and ε ∈ [0, 1]. The output Y_s of the image network can in general be expressed as

Y_s = q \sum_{i_1,\dots,i_H=1}^{s} \sum_{j=1}^{t_{i_1,\dots,i_H}} X^{(j)}_{i_1,\dots,i_H}\,\rho \prod_{k=1}^{H} w_{i_k},

where t_{i_1,...,i_H} ∈ Z^+ ∪ {0} is the number of times each configuration (w_{i_1}, w_{i_2}, . . . , w_{i_H}) repeats in Equation 5 and \sum_{i_1,\dots,i_H=1}^{s} t_{i_1,\dots,i_H} = Ψ. We assume that unique weights are close to being evenly distributed on the graph of connections of network M. We call this assumption a uniformity assumption. Thus this assumption implies that for all (i_1, i_2, . . . , i_H) : i_1, i_2, . . . , i_H ∈ {1, 2, . . . , s} there exists a positive constant c ≥ 1 such that the following holds

\frac{1}{c}\cdot\frac{\Psi}{s^H} \;\le\; t_{i_1,i_2,\dots,i_H} \;\le\; c\cdot\frac{\Psi}{s^H}.    (6)

The factor Ψ/s^H comes from the fact that for the network where every weight is uniformly distributed on the graph of connections (thus with high probability ...).

We consider two loss functions, the (random) absolute loss

L^a_{Λ,H}(w) = E_A[\,|Y_t − Y|\,]

and the (random) hinge loss

L^h_{Λ,H}(w) = E_A[\max(0, 1 − Y_t Y)],

where Y_t is a random variable corresponding to the true data labeling that takes values −S or S in the case of the absolute loss, where S = sup_w Ŷ, and −1 or 1 in the case of the hinge loss. Also note that in the case of the hinge loss the max operator can be modeled as a Bernoulli random variable, which we assume is independent of Ŷ. Given that, one can show that after approximating E_A[Y] with Ŷ, both losses can be generalized to the following expression

L_{Λ,H}(\tilde w) = C_1 + C_2\, q \sum_{i_1,i_2,\dots,i_H=1}^{\Lambda} X_{i_1,i_2,\dots,i_H} \prod_{k=1}^{H} \tilde w_{i_k},

where C_1 and C_2 are constants.

Later in the paper, by critical values of the loss function that have non-diverging (fixed) index, or low-index, we mean the ones with index² non-diverging with Λ.

² The number of negative eigenvalues of the Hessian ∇²L_{Λ,H} at w is also called the index of ∇²L_{Λ,H} at w.
Let us also introduce the number that we will refer to as E∞. We will refer to this important threshold as the energy barrier and define it as

E_\infty = E_\infty(H) = 2\sqrt{\frac{H-1}{H}}.

Theorem 2.14 in [Auffinger et al., 2010] implies that for large-size networks all critical values of the loss function that are of non-diverging index must lie below the threshold −ΛE∞(H). Any critical point that lies above the energy barrier is a high-index saddle point with overwhelming probability. Thus for large-size networks all critical values of the loss function that are of non-diverging index must lie in the band (−ΛE0(H), −ΛE∞(H)).

Layered structure of low-index critical points From Theorem 2.15 in [Auffinger et al., 2010] it follows that for large-size networks finding a critical value with index larger than or equal to k (for any fixed integer k) below the energy level −ΛEk(H) is improbable, where −Ek(H) ∈ [−E0(H), −E∞(H)]. Furthermore, the sequence {Ek(H)}_{k∈N} is strictly decreasing and converges to E∞ as k → ∞ [Auffinger et al., 2010]. These results unravel a layered structure for the lowest critical values of the loss function of a large-size network, where with overwhelming probability the critical values above the global minimum (ground state) of the loss function are local minima exclusively. Above the band (−ΛE0(H), −ΛE1(H)) containing only local minima (critical points of index 0), there is another one, (−ΛE1(H), −ΛE2(H)), where one can only find local minima and saddle points of index 1, and above this band there exists another one, (−ΛE2(H), −ΛE3(H)), where one can only find local minima and saddle points of index 1 and 2, and so on.

Logarithmic asymptotics of the mean number of critical points We will now define two non-decreasing, continuous functions on R, Θ_H and Θ_{k,H} (their exemplary plots are captured in Figure 2):

\Theta_H(u) =
\begin{cases}
\frac{1}{2}\log(H-1) - \frac{(H-2)u^2}{4(H-1)} - I(u) & \text{if } u \le -E_\infty, \\
\frac{1}{2}\log(H-1) - \frac{(H-2)u^2}{4(H-1)} & \text{if } -E_\infty \le u \le 0, \\
\frac{1}{2}\log(H-1) & \text{if } 0 \le u,
\end{cases}

and for any integer k ≥ 0:

\Theta_{k,H}(u) =
\begin{cases}
\frac{1}{2}\log(H-1) - \frac{(H-2)u^2}{4(H-1)} - (k+1)I(u) & \text{if } u \le -E_\infty, \\
\frac{1}{2}\log(H-1) - \frac{H-2}{4(H-1)} & \text{if } u \ge -E_\infty.
\end{cases}

Figure 2: Functions Θ_H(u) and Θ_{k,H}(u) for k = {0, 1, . . . , 5}. Parameter H was set to H = 3. Black line: u = −E0(H), red line: u = −E∞(H). Figure must be read in color.

Also note that the following corollary holds.

Corollary 4.1. For all k > 0 and u < −E∞, Θ_{k,H}(u) < Θ_{0,H}(u).

Next we will show the logarithmic asymptotics of the mean number of critical points (the exact asymptotics of the mean number of critical points can be found in the Supplementary material). Here C_Λ(u) denotes the number of critical values of the loss function below the level Λu, and C_{Λ,k}(u) the number of such critical values of index k.

Theorem 4.1 ([Auffinger et al., 2010], Theorems 2.5 and 2.8). For all H ≥ 2,

\lim_{\Lambda\to\infty}\frac{1}{\Lambda}\log E[C_\Lambda(u)] = \Theta_H(u),

and for all H ≥ 2 and k ≥ 0 fixed,

\lim_{\Lambda\to\infty}\frac{1}{\Lambda}\log E[C_{\Lambda,k}(u)] = \Theta_{k,H}(u).

By Theorem 4.1 and Corollary 4.1, the number of critical points in the band (−ΛE0(H), −ΛE∞(H)) increases exponentially as Λ grows, and local minima dominate over saddle points, with this domination also growing exponentially as Λ grows. Thus for large-size networks the probability of recovering a saddle point in the band (−ΛE0(H), −ΛE∞(H)), rather than a local minimum, goes to 0.
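The threshold E∞(H) and the functions Θ_H and Θ_{k,H} can be transcribed directly into code (our sketch). The function I(u), needed on the branch u < −E∞, is defined in [Auffinger et al., 2010] and not reproduced in the text, so it is passed in as a callable and the example below only evaluates the branches that do not require it.

```python
import numpy as np

def E_inf(H):
    return 2.0 * np.sqrt((H - 1) / H)

def theta_H(u, H, I=None):
    base = 0.5 * np.log(H - 1) - (H - 2) * u**2 / (4 * (H - 1))
    if u <= -E_inf(H):
        if I is None:
            raise ValueError("this branch needs I(u) from [Auffinger et al., 2010]")
        return base - I(u)
    if u <= 0:
        return base
    return 0.5 * np.log(H - 1)

def theta_kH(u, k, H, I=None):
    if u <= -E_inf(H):
        if I is None:
            raise ValueError("this branch needs I(u) from [Auffinger et al., 2010]")
        return 0.5 * np.log(H - 1) - (H - 2) * u**2 / (4 * (H - 1)) - (k + 1) * I(u)
    return 0.5 * np.log(H - 1) - (H - 2) / (4 * (H - 1))

H = 3
print(E_inf(H))                              # ~1.633 for H = 3
# On the band -E_inf <= u <= 0, e.g. u = -1, the mean number of critical values below
# the level Lambda*u grows roughly like exp(Lambda * theta_H(u, H)) (Theorem 4.1).
print(theta_H(-1.0, H), np.exp(1000 * theta_H(-1.0, H)))
```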
Figure 1: Distribution of the mean number of critical points, local minima and low-index saddle points (original and zoomed). Parameters H and Λ were set to H = 3 and Λ = 1000. Black line: u = −ΛE0(H), red line: u = −ΛE∞(H). Figure must be read in color.

Figure 1 captures exemplary plots of the distributions of the mean number of critical points, local minima and low-index saddle points. Clearly, local minima and low-index saddle points are located in the band (−ΛE0(H), −ΛE∞(H)), whereas high-index saddle points can only be found above the energy barrier −ΛE∞(H). Figure 1 also reveals the layered structure for the lowest critical values of the loss function³. This 'geometric' structure plays a crucial role in the optimization problem. The optimizer, e.g. SGD, easily avoids the band of high-index critical points, which have many negative curvature directions, and descends to the band of low-index critical points, which lie closer to the global minimum. Thus finding a bad-quality solution, i.e. one far away from the global minimum, is highly unlikely for large-size networks.

³ The large mass of saddle points above −ΛE∞ is a consequence of Theorem 4.1 and the properties of the Θ functions.
Hardness of recovering the global minimum Note that the energy barrier to cross when starting from any (local) minimum, e.g. one from the band (−ΛEi(H), −ΛEi+1(H)), in order to reach the global minimum diverges with Λ, since it is bounded below by Λ(E0(H) − Ei(H)). Furthermore, suppose we are at a local minimum with a scaled energy of −E∞ − δ. In order to find a further low-lying minimum we must pass through a saddle point. Therefore we must go up at least to the level where there is an equal amount of saddle points to have a decent chance of finding a path that might possibly take us to another local minimum. This process takes an exponentially long time, so in practice finding the global minimum is not feasible.

Note that the variance of the loss in Equation 2 is Λ, which suggests that the extensive quantities should scale with Λ. In fact this is the reason behind the scaling factor in front of the summation in the loss. The relation to the logarithmic asymptotics is as follows: the number of critical values of the loss below the level Λu is roughly e^{ΛΘ_H(u)}. The gradient descent gets trapped roughly at the barrier denoted by −ΛE∞, as will be shown in the experimental section.
5 Experiments

The theoretical part of the paper considers the problem of training the neural network, whereas the empirical results focus on its generalization properties.

5.1 Experimental Setup

Spin-Glass To illustrate the theorems in Section 4, we conducted spin-glass simulations for different dimensions Λ from 25 to 500. For each value of Λ, we obtained an estimate of the distribution of minima by sampling 1000 initial points on the unit sphere and performing stochastic gradient descent (SGD) to find a minimum energy point. Note that throughout this section we will refer to the energy of the Hamiltonian of the spin-glass model as its loss.
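A sketch (ours) of the spin-glass part of this setup for H = 3: projected gradient descent on the Hamiltonian of Equation 2, started from a random point on the sphere. The dimension, learning rate, and number of steps below are illustrative and are not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
Lam, H = 100, 3                              # illustrative; the paper uses Lambda from 25 to 500
lr, steps = 0.05, 1500                       # illustrative optimization settings
X = rng.normal(size=(Lam,) * H)              # Gaussian couplings
scale = Lam ** ((H - 1) / 2)

def loss(w):
    return np.einsum('ijk,i,j,k->', X, w, w, w) / scale

def grad(w):
    # derivative of sum_{ijk} X_ijk w_i w_j w_k with respect to each w_m
    return (np.einsum('mjk,j,k->m', X, w, w)
            + np.einsum('imk,i,k->m', X, w, w)
            + np.einsum('ijm,i,j->m', X, w, w)) / scale

w = rng.normal(size=Lam)
w *= np.sqrt(Lam) / np.linalg.norm(w)        # random start on the sphere of Equation 3
for _ in range(steps):
    w -= lr * grad(w)
    w *= np.sqrt(Lam) / np.linalg.norm(w)    # project back onto the sphere

# Scaled energy of the point found; the theory predicts that, as Lambda grows, this value
# concentrates just above -E_inf(H), i.e. around -1.63 for H = 3.
print(loss(w) / Lam)
```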
Neural Network We performed an analogous experiment on a scaled-down version of MNIST, where each image was downsampled to size 10 × 10. Specifically, we trained 1000 networks with one hidden layer and n1 ∈ {25, 50, 100, 250, 500} hidden units (in the paper we also refer to the number of hidden units as nhidden), each one starting from a random set of parameters sampled uniformly within the unit cube. All networks were trained for 200 epochs using SGD with learning rate decay.

To verify the validity of our theoretical assumption of parameter redundancy, we also trained a neural network on a subset of MNIST using simulated annealing (SA), where 95% of parameters were assumed to be redundant. Specifically, we allowed the weights to take one of 3 values uniformly spaced in the interval [−1, 1]. We obtained less than a 2.5% drop in accuracy, which demonstrates the heavy over-parametrization of neural networks as discussed in Section 3.

Index of critical points It is necessary to verify that our solutions obtained through SGD are low-index critical points rather than high-index saddle points of poor quality. As observed by [Dauphin et al., 2014], certain optimization schemes have the potential to get trapped in the latter. We ran two tests to ensure that this was not the case in our experimental setup. First, for n1 = {10, 25, 50, 100} we computed the eigenvalues of the Hessian of the loss function at each solution and computed the index. All eigenvalues less than 0.001 in magnitude were set to 0. Figure 4 captures an exemplary distribution of normalized indices, which is the proportion of negative eigenvalues, for n1 = 25 (the results for n1 = {10, 50, 100} can be found in the Supplementary material). It can be seen that all solutions are either minima or saddle points of very low normalized index (of the order 0.01). Next, we compared simulated annealing to SGD on a subset of MNIST. Simulated annealing does not compute gradients and thus does not tend to become trapped in high-index saddle points. We found that SGD performed at least as well as simulated annealing, which indicates that becoming trapped in poor saddle points is not a problem in our experiments. The result of this comparison is in the Supplementary material.
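The normalized-index computation described above reduces to a few lines once the Hessian at a solution is available (our sketch; the random symmetric matrix below merely stands in for a Hessian, which in the experiments was computed from the trained networks).

```python
import numpy as np

def normalized_index(hessian, tol=1e-3):
    """Fraction of negative eigenvalues, with eigenvalues below tol in magnitude set to 0."""
    eig = np.linalg.eigvalsh(hessian)          # the Hessian is symmetric
    eig[np.abs(eig) < tol] = 0.0
    return float(np.mean(eig < 0.0))

# Stand-in Hessian: a shifted random symmetric matrix (not a real training-loss Hessian).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H_stand_in = (A + A.T) / 2 + 8.0 * np.eye(50)
print(normalized_index(H_stand_in))
```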
Scaling loss values To observe qualitative differences in behavior for different values of Λ or n1, it is necessary to rescale the loss values to make their expected values approximately equal. For spin-glasses, the expected value of the loss at critical points scales linearly with Λ, therefore we divided the losses by Λ (note that this normalization is in the statement of Theorem 4.1), which gave us the histogram of points at the correct scale. For the MNIST experiments, we empirically found that the loss with respect to the number of hidden units approximately follows an exponential power law: E[L] ∝ e^{α n1^β}. We fitted the coefficients α, β and scaled the loss values to L/e^{α n1^β}.
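The rescaling step can be reproduced with a standard nonlinear fit (our sketch; the mean losses below are made-up placeholders, since the measured values are not given in the text).

```python
import numpy as np
from scipy.optimize import curve_fit

n1 = np.array([25, 50, 100, 250, 500], dtype=float)   # numbers of hidden units used in the paper
mean_loss = np.array([0.30, 0.22, 0.17, 0.12, 0.09])  # hypothetical E[L] values, illustrative only

def exp_power_law(n, alpha, beta):
    return np.exp(alpha * n ** beta)                   # E[L] ~ exp(alpha * n1^beta)

(alpha, beta), _ = curve_fit(exp_power_law, n1, mean_loss, p0=(-0.5, 0.2))
scaled = mean_loss / np.exp(alpha * n1 ** beta)        # loss values rescaled to L / exp(alpha * n1^beta)
print(alpha, beta, scaled)
```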
5.2 Results

Figure 3: Distributions of the scaled test losses for the spin-glass (left) and the neural network (right) experiments. All figures in this paper should be read in color.

Figure 3 shows the distributions of the scaled test losses for both sets of experiments. For the spin-glasses (left plot), we see that for small values of Λ we obtain poor local minima in many experiments, while for larger values of Λ the distribution becomes increasingly concentrated around the energy barrier, where local minima have high quality. We observe that the left tails for all Λ touch the barrier that is hard to penetrate, and as Λ increases the values concentrate around −E∞. In fact this concentration result had long been predicted but was not proved until [Auffinger et al., 2010]. We see that qualitatively the distribution of losses for the neural network experiments (right plot) exhibits similar behavior. Even after scaling, the variance decreases with larger network sizes. This is also clearly captured in Figures 8 and 9 in the Supplementary material. This indicates that getting stuck in poor local minima is a major problem for smaller networks ... global minimum.

Table 1: Pearson correlation between training and test loss for different numbers of hidden units.

The theory and experiments thus far indicate that minima lie in a band which gets smaller as the network size increases. This indicates that computable solutions become increasingly equivalent with respect to training error, but how does this relate to error on the test set? To determine this, we computed the correlation ρ between training and test loss for all solutions for each network size. The results are captured in Table 1 and Figure 7 (the latter is in the Supplementary material). The training and test error become increasingly decorrelated as the network size increases. This provides further indication that attempting to find the absolute possible minimum is of limited use with regard to generalization performance.

6 Conclusion

This paper establishes a connection between the neural network and the spin-glass model. We show that under certain assumptions, the loss function of the fully decoupled large-size neural network of depth H has a landscape similar to the Hamiltonian of the H-spin spherical spin-glass model. We empirically demonstrate that both models studied here are highly similar in real settings, despite the presence of variable dependencies in real networks. To the best of our knowledge, our work is one of the first efforts in the literature to shed light on the theory of neural network optimization.
Supplementary material

where the last inequality is the direct consequence of the uniformity assumption of Equation 6.

To simplify the notation in Equation 11, we drop the letter accents and simply denote w̃ as w. We skip the constants S′ρ and C′ as they do not matter when minimizing the loss function. After substituting q = Ψ^{(H−1)/2H} we obtain

L_{Λ,H}(w) = \frac{1}{\Lambda^{(H-1)/2}} \sum_{i_1,i_2,\dots,i_H=1}^{\Lambda} X_{i_1,i_2,\dots,i_H}\, w_{i_1} w_{i_2} \cdots w_{i_H}.

11 Asymptotics of the mean number of critical points and local minima

Theorem 11.2. For H ≥ 3 and u < −E∞, the following holds as Λ → ∞:

E[C_{\Lambda,0}(u)] = \frac{h(v)\,\exp\big(I_1(v) - \tfrac{v}{2}\, I_1'(v)\big)}{\sqrt{2H\pi}\,\big(-\Phi'(v) + I_1'(v)\big)}\; \Lambda^{-1/2}\, \exp\big(\Lambda\,\Theta_H(u)\big)\,(1 + o(1)),

where v, Φ, h and I_1 were defined in Theorem 11.1.
[Supplementary figure residue: count/frequency histograms for the spin-glass experiments (Λ from 25 to 500) and the neural network experiments (nhidden ∈ {25, 50, 100, 250, 500}); the SA vs. SGD comparison (nh ∈ {2, 3, 5, 10, 15, 25}); normalized-index panels a) n1 = 2, b) n1 = 5, c) n1 = 10, d) n1 = 25.]
Figure 7: Test loss versus train loss for networks with different numbers of hidden units n1.

[Figure residue: plots of the test loss and the test loss variance.]