
The Loss Surfaces of Multilayer Networks

Anna Choromanska Mikael Henaff Michael Mathieu Gérard Ben Arous Yann LeCun
achoroma@cims.nyu.edu mbh305@nyu.edu mathieu@cs.nyu.edu benarous@cims.nyu.edu yann@cs.nyu.edu

Courant Institute of Mathematical Sciences
New York, NY, USA

arXiv:1412.0233v3 [cs.LG] 21 Jan 2015

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

Abstract

We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior to the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks, where for the latter poor-quality local minima have a non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant, as the global minimum often leads to overfitting.

1 Introduction

Deep learning methods have enjoyed a resurgence of interest in the last few years for such applications as image recognition [Krizhevsky et al., 2012], speech recognition [Hinton et al., 2012], and natural language processing [Weston et al., 2014]. Some of the most popular methods use multi-stage architectures composed of alternated layers of linear transformations and max functions. In a particularly popular version, the max functions are known as ReLUs (Rectified Linear Units) and compute the mapping y = max(x, 0) in a pointwise fashion [Nair and Hinton, 2010]. In other architectures, such as convolutional networks [LeCun et al., 1998a] and maxout networks [Goodfellow et al., 2013], the max operation is performed over a small set of variables within a layer.

The vast majority of practical applications of deep learning use supervised learning with very deep networks. The supervised loss function, generally a cross-entropy or hinge loss, is minimized using some form of stochastic gradient descent (SGD) [Bottou, 1998], in which the gradient is evaluated using the back-propagation procedure [LeCun et al., 1998b].

The general shape of the loss function is very poorly understood. In the early days of neural nets (late 1980s and early 1990s), many researchers and engineers were experimenting with relatively small networks, whose convergence tended to be unreliable, particularly when using batch optimization. Multilayer neural nets earned a reputation of being finicky and unreliable, which in part caused the community to focus on simpler methods with convex loss functions, such as kernel machines and boosting.

However, several researchers experimenting with larger networks and SGD noticed that, while multilayer nets do have many local minima, the results of multiple experiments consistently give very similar performance. This suggests that, while local minima are numerous, they are relatively easy to find, and they are all more or less equivalent in terms of performance on the test set. The present paper attempts to explain this peculiar property through the use of random matrix theory applied to the analysis of critical points of high-degree polynomials on the sphere.
We first establish that the loss function of a typical multilayer net with ReLUs can be expressed as a polynomial function of the weights in the network, whose degree is the number of layers, and whose number of monomials is the number of paths from inputs to output. As the weights (or the inputs) vary, some of the monomials are switched off and others become activated, leading to a piecewise, continuous polynomial whose monomials are switched in and out at the boundaries between pieces.

An important question concerns the distribution of critical points (maxima, minima, and saddle points) of such functions. Results from random matrix theory applied to spherical spin glasses have shown that these functions have a combinatorially large number of saddle points. Loss surfaces for large neural nets have many local minima that are essentially equivalent from the point of view of the test error, and these minima tend to be highly degenerate, with many eigenvalues of the Hessian near zero.

We empirically verify several hypotheses regarding learning with large-size networks:

• For large-size networks, most local minima are equivalent and yield similar performance on a test set.

• The probability of finding a "bad" (high value) local minimum is non-zero for small-size networks and decreases quickly with network size.

• Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.

The above hypotheses can be directly justified by our theoretical findings. We finally conclude the paper with a brief discussion of our results and future research directions in Section 6.

We confirm the intuition and empirical evidence expressed in previous works that the problem of training deep learning systems resides with avoiding saddle points and quickly "breaking the symmetry" by picking sides of saddle points and choosing a suitable attractor [LeCun et al., 1998b, Saxe et al., 2014, Dauphin et al., 2014].

What is new in this paper? To the best of our knowledge, this paper is the first work providing a theoretical description of the optimization paradigm with neural networks in the presence of a large number of parameters. It has to be emphasized, however, that this connection relies on a number of possibly unrealistic assumptions. It is also an attempt to shed light on the puzzling behavior of modern deep learning systems when it comes to optimization and generalization.

2 Prior work

In the 1990s, a number of researchers studied the convergence of gradient-based learning for multilayer networks using the methods of statistical physics, e.g. [Saad and Solla, 1995], and the edited works [Saad, 2009]. Recently, Saxe [Saxe et al., 2014] and Dauphin [Dauphin et al., 2014] explored the statistical properties of the error surface in multi-layer architectures, pointing out the importance of saddle points.

Earlier theoretical analyses [Baldi and Hornik, 1989, Wigner, 1958, Fyodorov and Williams, 2007, Bray and Dean, 2007] suggest the existence of a certain structure of critical points of random Gaussian error functions on high-dimensional continuous spaces. They imply that critical points whose error is much higher than the global minimum are exponentially likely to be saddle points with many negative and approximate plateau directions, whereas all local minima are likely to have an error very close to that of the global minimum (these results are conveniently reviewed in [Dauphin et al., 2014]). The work of [Dauphin et al., 2014] establishes a strong empirical connection between neural networks and the theory of random Gaussian fields by providing experimental evidence that the cost function of neural networks exhibits the same properties as Gaussian error functions on high-dimensional continuous spaces. Nevertheless, they provide no theoretical justification for the existence of this connection, which we instead provide in this paper.

This work is inspired by recent advances in random matrix theory and the work of [Auffinger et al., 2010] and [Auffinger and Ben Arous, 2013]. The authors of these works provided an asymptotic evaluation of the complexity of the spherical spin-glass model (the spin-glass model originates from condensed matter physics, where it is used to represent a magnet with irregularly aligned spins). They discovered and mathematically proved the existence of a layered structure of the low critical values for the model's Hamiltonian, which in fact is a Gaussian process. Their results are not discussed in detail here, as this will be done in Section 4 in the context of neural networks. We build the bridge between their findings and neural networks and show that the objective function used by neural networks is analogous to the Hamiltonian of the spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity, and thus their landscapes share the same properties. We emphasize that the connection between spin-glass models and neural networks was already explored in the past (a summary can be found in [Dotsenko, 1995]). For example, in [Amit et al., 1985] the authors showed that the long-term behavior of certain neural network models is governed by the statistical mechanics of infinite-range Ising spin-glass Hamiltonians. Another work [Nakanishi and Takayama, 1997] examined the nature of the spin-glass transition in the Hopfield neural network model. None of these works, however, attempts to explain the paradigm of optimizing the highly non-convex neural network objective function through the prism of spin-glass theory, and thus in this respect our approach is novel.
3 Deep network and spin-glass model

3.1 Preliminaries

For the theoretical analysis, we consider a simple model of the fully-connected feed-forward deep network with a single output and rectified linear units. We call the network N. We focus on a binary classification task. Let X be the random input vector of dimensionality d. Let (H − 1) denote the number of hidden layers in the network, and we will refer to the input layer as the 0th layer and to the output layer as the Hth layer. Let ni denote the number of units in the ith layer (note that n0 = d and nH = 1). Let Wi be the matrix of weights between the (i − 1)th and ith layers of the network. Also, let σ denote the activation function that converts a unit's weighted input to its output activation. We consider linear rectifiers, thus σ(x) = max(0, x). We can therefore write the (random) network output Y as

$$Y = q\,\sigma(W_H^{\top}\sigma(W_{H-1}^{\top}\cdots\sigma(W_1^{\top}X)\cdots)),$$

where $q = (n_0 n_1 \cdots n_H)^{-\frac{H-1}{2H}}$ is simply a normalization factor. The same expression for the output of the network can be re-expressed in the following way:

$$Y = q\sum_{i=1}^{n_0}\sum_{j=1}^{\gamma} X_{i,j}\,A_{i,j}\prod_{k=1}^{H} w_{i,j}^{(k)}, \qquad (1)$$

where the first summation is over the network inputs and the second one is over all paths from a given network input to its output, and γ is the total number of such paths (note that γ = n1 n2 . . . nH). Also, for all i = {1, 2, . . . , n0}: Xi,1 = Xi,2 = · · · = Xi,γ. Furthermore, $w_{i,j}^{(k)}$ is the weight of the kth segment of the path indexed with (i, j), which connects layer (k − 1) with layer k of the network. Note that each path corresponds to a certain set of H weights, which we refer to as a configuration of weights, which are multiplied by each other. Finally, Ai,j denotes whether path (i, j) is active (Ai,j = 1) or not (Ai,j = 0).

Definition 3.1. The mass of the network Ψ is the total number of all paths between all network inputs and outputs: $\Psi = \prod_{i=0}^{H} n_i$. Also let Λ be defined as $\Lambda = \sqrt[H]{\Psi}$.

Definition 3.2. The size of the network N is the total number of network parameters: $N = \sum_{i=0}^{H-1} n_i n_{i+1}$.

The mass and the size of the network depend on each other as captured in Theorem 3.1. All proofs in this paper are deferred to the Supplementary material.
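The path-sum form in Equation 1 can be checked numerically on a toy instance. The sketch below is our own illustration rather than part of the paper's experiments: the layer sizes, the random seed, and the choice q = 1 are arbitrary, and the activation indicators A are read off from an ordinary forward pass.

```python
# Sketch: verify the path-sum identity of Equation 1 on a tiny ReLU network (q = 1).
import itertools
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 2, 1]                       # n0 = d = 3, two hidden layers, nH = 1, H = 3
W = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(sizes[0])

# Forward pass with a ReLU after every layer, recording which units fire.
fired, h = [], x
for Wk in W:
    z = h @ Wk
    fired.append(z > 0)
    h = np.maximum(z, 0)
y_forward = h[0]

# Path sum: a path (i0, i1, ..., iH) contributes X_{i0} * A_path * (product of its H weights),
# where A_path = 1 iff every unit the path visits after the input layer fired.
y_paths = 0.0
for path in itertools.product(*[range(n) for n in sizes]):
    w_prod = np.prod([W[k][path[k], path[k + 1]] for k in range(len(W))])
    a_path = all(fired[k][path[k + 1]] for k in range(len(W)))
    y_paths += x[path[0]] * a_path * w_prod

assert np.isclose(y_forward, y_paths)
```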
Theorem 3.1. Let Ψ be the mass of the network, d be the number of network inputs and H be the depth of the network. The size of the network is bounded as

$$\Psi^{2}H = \Lambda^{2H}H \;\geq\; N \;\geq\; H\,\frac{\sqrt[H]{\Psi^{2}}}{\sqrt[H]{d}} \;\geq\; \sqrt[H]{\Psi} = \Lambda.$$
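As a quick numerical sanity check of these bounds as reconstructed above (our own illustration; the layer sizes are arbitrary), one can evaluate Ψ, Λ and N for a concrete architecture:

```python
# Sketch: evaluate the quantities appearing in Theorem 3.1 for one arbitrary architecture.
import numpy as np

n = [784, 500, 500, 1]                          # n0 = d, two hidden layers, nH = 1, so H = 3
H, d = len(n) - 1, n[0]
Psi = int(np.prod(n))                           # mass: number of input-output paths
N = sum(a * b for a, b in zip(n[:-1], n[1:]))   # size: number of parameters
Lam = Psi ** (1.0 / H)

upper = Psi ** 2 * H                            # = Lambda^{2H} * H
lower = H * Psi ** (2.0 / H) / d ** (1.0 / H)
print(upper >= N >= lower >= Lam)               # True for this configuration
```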
We assume the depth of the network H is bounded. Therefore N → ∞ iff Ψ → ∞, and N → ∞ iff Λ → ∞. In the rest of this section we will be establishing a connection between the loss function of the neural network and the Hamiltonian of the spin-glass model. We next provide the outline of our approach.

3.2 Outline of the approach

In Subsection 3.3 we introduce randomness to the model by assuming X's and A's are random. We make certain assumptions regarding the neural network model. First, we assume certain distributions and mutual dependencies concerning the random variables X's and A's. We also introduce a spherical constraint on the model weights. We finally make two other assumptions regarding the redundancy of network parameters and their uniformity, both of which are justified by empirical evidence in the literature. These assumptions will allow us to show in Subsection 3.4 that the loss function of the neural network, after re-indexing terms¹, has the form of a centered Gaussian process on the sphere $S = S^{\Lambda-1}(\sqrt{\Lambda})$, which is equivalent to the Hamiltonian of the H-spin spherical spin-glass model, given as

$$\mathcal{L}_{\Lambda,H}(\tilde{w}) = \frac{1}{\Lambda^{(H-1)/2}}\sum_{i_1,i_2,\dots,i_H=1}^{\Lambda} X_{i_1,i_2,\dots,i_H}\,\tilde{w}_{i_1}\tilde{w}_{i_2}\cdots\tilde{w}_{i_H}, \qquad (2)$$

with spherical constraint

$$\frac{1}{\Lambda}\sum_{i=1}^{\Lambda}\tilde{w}_i^2 = 1. \qquad (3)$$

The redundancy and uniformity assumptions will be explained in Subsection 3.3 in detail. However, at a high level of generality, the redundancy assumption enables us to skip the superscript (k) appearing next to the weights in Equation 1 (note it does not appear next to the weights in Equation 2) by determining a set of unique network weights of size no larger than N, and the uniformity assumption ensures that all ordered products of unique weights appear in Equation 2 the same number of times.

¹The terms are re-indexed in Subsection 3.3; this is done to preserve consistency with the notation in [Auffinger et al., 2010], where the proofs of the results of Section 4 can be found.
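For concreteness, the sketch below (our own illustration, not code from the paper; the values of Λ and H are arbitrary, and the dense coupling tensor restricts it to small Λ) draws the Gaussian couplings, samples a weight vector satisfying the spherical constraint of Equation 3, and evaluates the Hamiltonian of Equation 2.

```python
# Sketch: evaluate the H-spin spherical spin-glass Hamiltonian of Equation 2.
import numpy as np

def hamiltonian(X, w):
    """Lambda^{-(H-1)/2} * sum over i1..iH of X[i1,...,iH] * w[i1] * ... * w[iH]."""
    Lam, H = w.size, X.ndim
    out = X
    for _ in range(H):                        # contract one index of X with w at a time
        out = out @ w
    return float(out) / Lam ** ((H - 1) / 2.0)

rng = np.random.default_rng(0)
Lam, H = 50, 3                                # small Lambda so the dense tensor fits in memory
X = rng.standard_normal((Lam,) * H)           # i.i.d. N(0, 1) couplings
w = rng.standard_normal(Lam)
w *= np.sqrt(Lam) / np.linalg.norm(w)         # spherical constraint: (1/Lam) * sum w_i^2 = 1
print(hamiltonian(X, w))
```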
An asymptotic evaluation of the complexity of H-spin spherical spin-glass models via random matrix theory was studied in the literature [Auffinger et al., 2010], where a precise description of the energy landscape for the Hamiltonians of these models is provided. In this paper (Section 4) we use these results to explain the optimization problem in neural networks.

3.3 Approximation

Input  We assume each input Xi,j is a normal random variable such that Xi,j ∼ N(0, 1). Clearly the model contains several dependencies, as one input is associated with many paths in the network. That poses a major theoretical problem in analyzing these models, as it is unclear how to account for these dependencies. In this paper we instead study the fully decoupled model [De la Peña and Giné, 1999], where the Xi,j's are assumed to be independent. We allow this simplification as, to the best of our knowledge, there exists no theoretical description of the optimization paradigm with neural networks in the literature, either under the independence assumption or when the dependencies are allowed. Also note that statistical learning theory heavily relies on this assumption [Hastie et al., 2001] even when the model under consideration is much simpler than a neural network. Under the independence assumption we will demonstrate the similarity of this model to the spin-glass model. We emphasize that despite the presence of high dependencies in real neural networks, both models exhibit high similarities, as will be empirically demonstrated.

Paths  We assume each path in Equation 1 is equally likely to be active, thus the Ai,j's will be modeled as Bernoulli random variables with the same probability of success ρ. By assuming the independence of X's and A's we get the following

$$\mathbb{E}_A[Y] = q\sum_{i=1}^{n_0}\sum_{j=1}^{\gamma} X_{i,j}\,\rho\prod_{k=1}^{H} w_{i,j}^{(k)}. \qquad (4)$$

Redundancy in network parametrization  Let $\mathcal{W} = \{w_1, w_2, \dots, w_N\}$ be the set of all weights of the network. Let $\mathcal{A}$ denote the set of all H-length configurations of weights chosen from $\mathcal{W}$ (the order of the weights in a configuration does matter). Note that the size of $\mathcal{A}$ is therefore $N^H$. Also let $\mathcal{B}$ be a set such that each element corresponds to a single configuration of weights from Equation 4, thus $\mathcal{B} = \{(w_{1,1}^{1}, w_{1,1}^{2}, \dots, w_{1,1}^{H}), (w_{1,2}^{1}, w_{1,2}^{2}, \dots, w_{1,2}^{H}), \dots, (w_{n_0,\gamma}^{1}, w_{n_0,\gamma}^{2}, \dots, w_{n_0,\gamma}^{H})\}$, where every single weight comes from the set $\mathcal{W}$ (note that $\mathcal{B} \subset \mathcal{A}$). Thus Equation 4 can be equivalently written as

$$Y_N := \mathbb{E}_A[Y] = q\sum_{i_1,i_2,\dots,i_H=1}^{N}\;\sum_{j=1}^{r_{i_1,i_2,\dots,i_H}} X_{i_1,i_2,\dots,i_H}^{(j)}\,\rho\prod_{k=1}^{H} w_{i_k}. \qquad (5)$$

We will now explain the notation. It is over-complicated on purpose, as this notation will be useful later on. $r_{i_1,i_2,\dots,i_H}$ denotes whether the configuration $(w_{i_1}, w_{i_2}, \dots, w_{i_H})$ appears in Equation 4 or not, thus $r_{i_1,i_2,\dots,i_H} \in \{0 \cup 1\}$, and $\{X_{i_1,i_2,\dots,i_H}^{(j)}\}_{j=1}^{r_{i_1,i_2,\dots,i_H}}$ denotes a set of random variables corresponding to the same weight configuration (since $r_{i_1,i_2,\dots,i_H} \in \{0 \cup 1\}$ this set has at most one element). Also $r_{i_1,i_2,\dots,i_H} = 0$ implies that the summand $X_{i_1,i_2,\dots,i_H}^{(j)}\rho\prod_{k=1}^{H} w_{i_k}$ is zeroed out. Furthermore, the following condition has to be satisfied: $\sum_{i_1,i_2,\dots,i_H=1}^{N} r_{i_1,i_2,\dots,i_H} = \Psi$. In the notation $Y_N$, the index N refers to the number of unique weights of a network (this notation will also be helpful later).

Consider a family of networks which have the same graph of connections as network N but different edge weighting such that they only have s unique weights and s ≤ N (by notational analogy the expected output of this network will be called Ys). It was recently shown [Denil et al., 2013, Denton et al., 2014] that for large-size networks a large number of network parameters (according to [Denil et al., 2013] even up to 95%) are redundant and can either be learned from a very small set of unique parameters or even not learned at all with almost no loss in prediction accuracy.

Definition 3.3. A network M which has the same graph of connections as N and s unique weights satisfying s ≤ N is called a (s, ε)-reduction image of N for some ε ∈ [0, 1] if the prediction accuracies of N and M differ by no more than ε (thus they classify at most an ε fraction of data points differently).

Theorem 3.2. Let N be a neural network giving the output whose expectation YN is given in Equation 5. Let M be its (s, ε)-reduction image for some s ≤ N and ε ∈ [0, 0.5]. By analogy, let Ys be the expected output of network M. Then the following holds

$$\mathrm{corr}(\mathrm{sign}(Y_s), \mathrm{sign}(Y_N)) \;\geq\; \frac{1-2\epsilon}{1+2\epsilon},$$

where corr denotes the correlation defined as $\mathrm{corr}(A,B) = \frac{\mathbb{E}[(A-\mathbb{E}[A])(B-\mathbb{E}[B])]}{\mathrm{std}(A)\,\mathrm{std}(B)}$, std is the standard deviation and sign(·) denotes the sign of the prediction (sign(Ys) and sign(YN) are both random).

The redundancy assumption implies that one can preserve ε close to 0 even with s ≪ N.
that the size of A is therefore N H . Also let B be
a set such that each element corresponds to the sin- Uniformity Consider the network M to be a (s, )-
gle configuration of weights from Equation 4, thus reduction image of N for some s ≤ N and  ∈ [0, 1].
1 2 H 1 2 H
B = {(w1,1 , w1,1 , . . . , w1,1 ), (w1,2 , w1,2 , . . . , w1,2 ), . . . , The output Ys of the image network can in general be
1 2 H
(wn0 ,γ , wn0 ,γ , . . . , wn0 ,γ )}, where every single weight expressed as
comes from set W (note that B ⊂ A). Thus Equa- ti1 ,...,iH
s H
tion 4 can be equivalently written as X X (j)
Y
Ys = q Xi1 ,...,iH ρ w ik ,
N ri1,i2 ,...,iH H i1 ,...,iH =1 j=1 k=1
(j)
X X Y
YN := EA [Y ] = q Xi1 ,i2 ,...,iH ρ wik . (5)
i1 ,i2 ,...,iH =1 j=1 k=1 where ti1 ,...,iH ∈ {Z+ ∪ 0} is the number of times each
configuration
Ps (wi1 , wi2 , . . . , wiH ) repeats in Equation 5
We will now explain the notation. It is over- and i1 ,...,iH =1 ti1 ,...,iH = Ψ. We assume that unique
complicated for purpose, as this notation will be useful weights are close to being evenly distributed on the
Thus this assumption implies that for all $(i_1, i_2, \dots, i_H): i_1, i_2, \dots, i_H \in \{1, 2, \dots, s\}$ there exists a positive constant c ≥ 1 such that the following holds

$$\frac{1}{c}\cdot\frac{\Psi}{s^H} \;\leq\; t_{i_1,i_2,\dots,i_H} \;\leq\; c\cdot\frac{\Psi}{s^H}. \qquad (6)$$

The factor $\frac{\Psi}{s^H}$ comes from the fact that for the network where every weight is uniformly distributed on the graph of connections (thus with high probability every node is adjacent to an edge with any of the unique weights) it holds that $t_{i_1,i_2,\dots,i_H} = \frac{\Psi}{s^H}$. For simplicity assume $\frac{\Psi}{s^H} \in \mathbb{Z}^{+}$ and $\sqrt[H]{\Psi} \in \mathbb{Z}^{+}$. Consider therefore the following expression

$$\hat{Y}_s = q\sum_{i_1,\dots,i_H=1}^{s}\;\sum_{j=1}^{\Psi/s^H} X_{i_1,\dots,i_H}^{(j)}\,\rho\prod_{k=1}^{H} w_{i_k}, \qquad (7)$$

which corresponds to a network for which the lower bound and upper bound in Equation 6 match. Note that one can combine both summations in Equation 7 and re-index its terms to obtain

$$\hat{Y} := \hat{Y}_{(s=\Lambda)} = q\sum_{i_1,\dots,i_H=1}^{\Lambda} X_{i_1,\dots,i_H}\,\rho\prod_{k=1}^{H} w_{i_k}. \qquad (8)$$

The following theorem (Theorem 3.3) captures the connection between $\hat{Y}_s$ and $Y_s$.

Theorem 3.3. Under the uniformity assumption of Equation 6, the random variable $\hat{Y}_s$ in Equation 7 and the random variable $Y_s$ in Equation 5 satisfy: $\mathrm{corr}(\hat{Y}_s, Y_s) \geq \frac{1}{c^2}$.

Spherical constraint  We finally assume that for some positive constant C the weights satisfy the spherical condition

$$\frac{1}{\Lambda}\sum_{i=1}^{\Lambda} w_i^2 = C. \qquad (9)$$

Next we consider two frequently used loss functions, the absolute loss and the hinge loss, where we approximate $Y_N$ (recall $Y_N := \mathbb{E}_A[Y]$) with $\hat{Y}$.

3.4 Loss function as a H-spin spherical spin-glass model

Let $\mathcal{L}^{a}_{\Lambda,H}(w)$ and $\mathcal{L}^{h}_{\Lambda,H}(w)$ be the (random) absolute loss and the (random) hinge loss, which we define as

$$\mathcal{L}^{a}_{\Lambda,H}(w) = \mathbb{E}_A[\,|Y_t - Y|\,]$$

and

$$\mathcal{L}^{h}_{\Lambda,H}(w) = \mathbb{E}_A[\max(0, 1 - Y_t Y)],$$

where $Y_t$ is a random variable corresponding to the true data labeling that takes values −S or S in case of the absolute loss, where $S = \sup_{w}\hat{Y}$, and −1 or 1 in case of the hinge loss. Also note that in the case of the hinge loss the max operator can be modeled as a Bernoulli random variable, which we assume is independent of $\hat{Y}$. Given that, one can show that after approximating $\mathbb{E}_A[Y]$ with $\hat{Y}$ both losses can be generalized to the following expression

$$\mathcal{L}_{\Lambda,H}(\tilde{w}) = C_1 + C_2\, q\sum_{i_1,i_2,\dots,i_H=1}^{\Lambda} X_{i_1,i_2,\dots,i_H}\prod_{k=1}^{H}\tilde{w}_{i_k},$$

where $C_1, C_2$ are some constants and the weights $\tilde{w}$ are simply scaled weights $w$ satisfying $\frac{1}{\Lambda}\sum_{i=1}^{\Lambda}\tilde{w}_i^2 = 1$. In case of the absolute loss the term $Y_t$ is incorporated into the term $C_1$, and in case of the hinge loss it vanishes (note that $\hat{Y}$ is a symmetric random quantity, thus multiplying it by $Y_t$ does not change its distribution). We skip the technical details showing this equivalence and defer them to the Supplementary material. Note that after simplifying the notation by i) dropping the letter accents and simply denoting $\tilde{w}$ as $w$, ii) skipping the constants $C_1$ and $C_2$, which do not matter when minimizing the loss function, and iii) substituting $q = \frac{1}{\Psi^{(H-1)/2H}} = \frac{1}{\Lambda^{(H-1)/2}}$, we obtain the Hamiltonian of the H-spin spherical spin-glass model of Equation 2 with the spherical constraint captured in Equation 3.

4 Theoretical results

In this section we use the results of the theoretical analysis of the complexity of spherical spin-glass models of [Auffinger et al., 2010] to gain an understanding of the optimization of strongly non-convex loss functions of neural networks. These results show that for high-dimensional (large Λ) spherical spin-glass models the lowest critical values of the Hamiltonians of these models form a layered structure and are located in a well-defined band lower-bounded by the global minimum. Simultaneously, the probability of finding them outside the band diminishes exponentially with the dimension of the spin-glass model. We next present the details of these results in the context of neural networks. We first introduce the notation and definitions.

Definition 4.1. Let u ∈ R and let k be an integer such that 0 ≤ k < Λ. We will denote as $C_{\Lambda,k}(u)$ the random number of critical values of $\mathcal{L}_{\Lambda,H}(w)$ in the set $\Lambda B = \{\Lambda x : x \in (-\infty, u)\}$ with index² equal to k. Similarly, we will denote as $C_{\Lambda}(u)$ the random total number of critical values of $\mathcal{L}_{\Lambda,H}(w)$ in that set.

Later in the paper, by critical values of the loss function that have non-diverging (fixed) index, or low-index, we mean the ones whose index does not diverge with Λ.

²The number of negative eigenvalues of the Hessian $\nabla^2\mathcal{L}_{\Lambda,H}$ at $w$ is also called the index of $\nabla^2\mathcal{L}_{\Lambda,H}$ at $w$.
The existence of the band of low-index critical points  One can directly use Theorem 2.12 in [Auffinger et al., 2010] to show that for large-size networks (more precisely when Λ → ∞, but recall that Λ → ∞ iff N → ∞) it is improbable to find a critical value below a certain level −ΛE0(H) (which we call the ground state), where E0(H) is some real number.

Let us also introduce the number that we will refer to as E∞. We will refer to this important threshold as the energy barrier and define it as

$$E_{\infty} = E_{\infty}(H) = 2\sqrt{\frac{H-1}{H}}.$$

Theorem 2.14 in [Auffinger et al., 2010] implies that for large-size networks all critical values of the loss function that are of non-diverging index must lie below the threshold −ΛE∞(H). Any critical point that lies above the energy barrier is a high-index saddle point with overwhelming probability. Thus for large-size networks all critical values of the loss function that are of non-diverging index must lie in the band (−ΛE0(H), −ΛE∞(H)).

Layered structure of low-index critical points  From Theorem 2.15 in [Auffinger et al., 2010] it follows that for large-size networks finding a critical value with index larger than or equal to k (for any fixed integer k) below the energy level −ΛEk(H) is improbable, where −Ek(H) ∈ [−E0(H), −E∞(H)]. Furthermore, the sequence {Ek(H)}k∈N is strictly decreasing and converges to E∞ as k → ∞ [Auffinger et al., 2010].

These results unravel a layered structure for the lowest critical values of the loss function of a large-size network, where with overwhelming probability the critical values above the global minimum (ground state) of the loss function are local minima exclusively. Above the band (−ΛE0(H), −ΛE1(H)) containing only local minima (critical points of index 0), there is another one, (−ΛE1(H), −ΛE2(H)), where one can only find local minima and saddle points of index 1, and above this band there exists another one, (−ΛE2(H), −ΛE3(H)), where one can only find local minima and saddle points of index 1 and 2, and so on.

Logarithmic asymptotics of the mean number of critical points  We now define two non-decreasing, continuous functions on R, ΘH and Θk,H (exemplary plots are captured in Figure 2):

$$\Theta_H(u) = \begin{cases} \frac{1}{2}\log(H-1) - \frac{(H-2)u^2}{4(H-1)} - I(u) & \text{if } u \leq -E_{\infty} \\[4pt] \frac{1}{2}\log(H-1) - \frac{(H-2)u^2}{4(H-1)} & \text{if } -E_{\infty} \leq u \leq 0 \\[4pt] \frac{1}{2}\log(H-1) & \text{if } 0 \leq u, \end{cases}$$

and for any integer k ≥ 0:

$$\Theta_{k,H}(u) = \begin{cases} \frac{1}{2}\log(H-1) - \frac{(H-2)u^2}{4(H-1)} - (k+1)I(u) & \text{if } u \leq -E_{\infty} \\[4pt] \frac{1}{2}\log(H-1) - \frac{H-2}{H} & \text{if } u \geq -E_{\infty}, \end{cases}$$

where

$$I(u) = -\frac{u}{E_{\infty}^2}\sqrt{u^2 - E_{\infty}^2} - \log\!\left(-u + \sqrt{u^2 - E_{\infty}^2}\right) + \log E_{\infty}.$$

Figure 2: Functions ΘH(u) and Θk,H(u) for k = {0, 1, . . . , 5}. Parameter H was set to H = 3. Black line: u = −E0(H), red line: u = −E∞(H). Figure must be read in color.

Also note that the following corollary holds.

Corollary 4.1. For all k > 0 and u < −E∞, Θk,H(u) < Θ0,H(u).
Next we show the logarithmic asymptotics of the mean number of critical points (the exact asymptotics of the mean number of critical points can be found in the Supplementary material).

Theorem 4.1 ([Auffinger et al., 2010], Theorems 2.5 and 2.8). For all H ≥ 2,

$$\lim_{\Lambda\to\infty}\frac{1}{\Lambda}\log\mathbb{E}[C_{\Lambda}(u)] = \Theta_H(u),$$

and for all H ≥ 2 and k ≥ 0 fixed,

$$\lim_{\Lambda\to\infty}\frac{1}{\Lambda}\log\mathbb{E}[C_{\Lambda,k}(u)] = \Theta_{k,H}(u).$$

From Theorem 4.1 and Corollary 4.1, the number of critical points in the band (−ΛE0(H), −ΛE∞(H)) increases exponentially as Λ grows, and local minima dominate over saddle points, with this domination also growing exponentially with Λ. Thus for large-size networks the probability of recovering a saddle point in the band (−ΛE0(H), −ΛE∞(H)), rather than a local minimum, goes to 0.

Figure 1 captures exemplary plots of the distributions of the mean number of critical points, local minima and low-index saddle points. Clearly, local minima and low-index saddle points are located in the band (−ΛE0(H), −ΛE∞(H)), whereas high-index saddle points can only be found above the energy barrier −ΛE∞(H). Figure 1 also reveals the layered structure for the lowest critical values of the loss function³. This 'geometric' structure plays a crucial role in the optimization problem. The optimizer, e.g. SGD, easily avoids the band of high-index critical points, which have many negative curvature directions, and descends to the band of low-index critical points, which lie closer to the global minimum. Thus finding a bad-quality solution, i.e. one far away from the global minimum, is highly unlikely for large-size networks.

³The large mass of saddle points above −ΛE∞ is a consequence of Theorem 4.1 and the properties of the Θ functions.
Figure 1: Distribution of the mean number of critical points, local minima and low-index saddle points (original and zoomed). Parameters H and Λ were set to H = 3 and Λ = 1000. Black line: u = −ΛE0(H), red line: u = −ΛE∞(H). Figure must be read in color.

Hardness of recovering the global minimum  Note that the energy barrier to cross when starting from any (local) minimum, e.g. one from the band (−ΛEi(H), −ΛEi+1(H)), in order to reach the global minimum diverges with Λ, since it is bounded below by Λ(E0(H) − Ei(H)). Furthermore, suppose we are at a local minimum with a scaled energy of −E∞ − δ. In order to find a further low-lying minimum we must pass through a saddle point. Therefore we must go up at least to the level where there is an equal amount of saddle points to have a decent chance of finding a path that might possibly take us to another local minimum. This process takes an exponentially long time, so in practice finding the global minimum is not feasible.

Note that the variance of the loss in Equation 2 is Λ, which suggests that the extensive quantities should scale with Λ. In fact this is the reason behind the scaling factor in front of the summation in the loss. The relation to the logarithmic asymptotics is as follows: the number of critical values of the loss below the level Λu is roughly e^{ΛΘH(u)}. Gradient descent gets trapped roughly at the barrier denoted by −ΛE∞, as will be shown in the experimental section.

5 Experiments

The theoretical part of the paper considers the problem of training the neural network, whereas the empirical results focus on its generalization properties.

5.1 Experimental Setup

Spin-Glass  To illustrate the theorems in Section 4, we conducted spin-glass simulations for different dimensions Λ from 25 to 500. For each value of Λ, we obtained an estimate of the distribution of minima by sampling 1000 initial points on the unit sphere and performing stochastic gradient descent (SGD) to find a minimum energy point. Note that throughout this section we will refer to the energy of the Hamiltonian of the spin-glass model as its loss.
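A minimal version of this simulation might look as follows (our own sketch, not the authors' code: Λ, the step size, and the iteration count are arbitrary, and plain full-gradient descent with renormalization onto the sphere stands in for the SGD variant used in the paper).

```python
# Sketch: descend the spherical spin-glass Hamiltonian (H = 3) from a random point on the sphere.
import numpy as np

rng = np.random.default_rng(0)
Lam, steps, lr = 50, 2000, 0.02
X = rng.standard_normal((Lam, Lam, Lam))          # Gaussian couplings

def loss_and_grad(w):
    scale = Lam                                   # Lambda^{(H-1)/2} = Lambda for H = 3
    loss = np.einsum('ijk,i,j,k->', X, w, w, w) / scale
    grad = (np.einsum('ijk,j,k->i', X, w, w)
            + np.einsum('ijk,i,k->j', X, w, w)
            + np.einsum('ijk,i,j->k', X, w, w)) / scale
    return loss, grad

w = rng.standard_normal(Lam)
w *= np.sqrt(Lam) / np.linalg.norm(w)             # start on the sphere of Equation 3
for _ in range(steps):
    loss, g = loss_and_grad(w)
    w = w - lr * g
    w *= np.sqrt(Lam) / np.linalg.norm(w)         # project back onto the sphere
print(loss / Lam)                                 # scaled loss; compare with -E_inf(3), about -1.63
```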
Neural Network  We performed an analogous experiment on a scaled-down version of MNIST, where each image was downsampled to size 10 × 10. Specifically, we trained 1000 networks with one hidden layer and n1 ∈ {25, 50, 100, 250, 500} hidden units (in the paper we also refer to the number of hidden units as nhidden), each one starting from a random set of parameters sampled uniformly within the unit cube. All networks were trained for 200 epochs using SGD with learning rate decay.

To verify the validity of our theoretical assumption of parameter redundancy, we also trained a neural network on a subset of MNIST using simulated annealing (SA), where 95% of the parameters were assumed to be redundant. Specifically, we allowed the weights to take one of 3 values uniformly spaced in the interval [−1, 1]. We obtained less than a 2.5% drop in accuracy, which demonstrates the heavy over-parametrization of neural networks as discussed in Section 3.

Index of critical points  It is necessary to verify that our solutions obtained through SGD are low-index critical points rather than high-index saddle points of poor quality. As observed by [Dauphin et al., 2014], certain optimization schemes have the potential to get trapped in the latter. We ran two tests to ensure that this was not the case in our experimental setup. First, for n1 = {10, 25, 50, 100} we computed the eigenvalues of the Hessian of the loss function at each solution and computed the index. All eigenvalues less than 0.001 in magnitude were set to 0. Figure 4 captures an exemplary distribution of normalized indices, which is the proportion of negative eigenvalues, for n1 = 25 (the results for n1 = {10, 50, 100} can be found in the Supplementary material). It can be seen that all solutions are either minima or saddle points of very low normalized index (of the order of 0.01). Next, we compared simulated annealing to SGD on a subset of MNIST. Simulated annealing does not compute gradients and thus does not tend to become trapped in high-index saddle points. We found that SGD performed at least as well as simulated annealing, which indicates that becoming trapped in poor saddle points is not a problem in our experiments. The result of this comparison is in the Supplementary material.
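The normalized-index computation described above reduces to a few lines (a sketch under the stated convention of zeroing eigenvalues below 0.001 in magnitude; the random symmetric matrix below is a placeholder for a real Hessian of the training loss).

```python
# Sketch: normalized index of a critical point from the eigenvalues of its Hessian.
import numpy as np

def normalized_index(hessian, tol=1e-3):
    """Proportion of negative eigenvalues, with |eigenvalue| < tol treated as zero."""
    eigs = np.linalg.eigvalsh(hessian)            # Hessian is symmetric
    eigs = np.where(np.abs(eigs) < tol, 0.0, eigs)
    return float(np.mean(eigs < 0.0))

# Toy usage on a random symmetric matrix standing in for a real Hessian.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
print(normalized_index((A + A.T) / 2.0))          # close to 0.5 for such a random matrix
```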
Figure 3: Distributions of the scaled test losses for the spin-glass (left) and the neural network (right) experiments. All figures in this paper should be read in color.

Figure 4: Distribution of the normalized index of solutions for 25 hidden units.

Scaling loss values  To observe qualitative differences in behavior for different values of Λ or n1, it is necessary to rescale the loss values to make their expected values approximately equal. For spin-glasses, the expected value of the loss at critical points scales linearly with Λ, so we divided the losses by Λ (note that this normalization is in the statement of Theorem 4.1), which gave us the histogram of points at the correct scale. For the MNIST experiments, we empirically found that the loss with respect to the number of hidden units approximately follows an exponential power law: $\mathbb{E}[L] \propto e^{\alpha n_1^{\beta}}$. We fitted the coefficients α, β and scaled the loss values to $L/e^{\alpha n_1^{\beta}}$.
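The rescaling can be reproduced with an ordinary least-squares fit in log space, as in the sketch below (our own illustration; the mean losses are made-up placeholders for the measured values, and the fitted constant absorbs the proportionality factor).

```python
# Sketch: fit E[L] ~ exp(alpha * n1**beta) and rescale the losses accordingly.
import numpy as np
from scipy.optimize import curve_fit

n1 = np.array([25.0, 50.0, 100.0, 250.0, 500.0])
mean_loss = np.array([0.28, 0.21, 0.16, 0.12, 0.09])        # placeholder measurements

def log_law(n, c, alpha, beta):
    return c + alpha * n ** beta                             # log E[L] = c + alpha * n1^beta

(c, alpha, beta), _ = curve_fit(log_law, n1, np.log(mean_loss),
                                p0=(0.0, -0.3, 0.3), maxfev=10000)
scaled = mean_loss / np.exp(alpha * n1 ** beta)              # L / e^{alpha * n1^beta}
print(alpha, beta)
```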
5.2 Results

Figure 3 shows the distributions of the scaled test losses for both sets of experiments. For the spin-glasses (left plot), we see that for small values of Λ we obtain poor local minima in many experiments, while for larger values of Λ the distribution becomes increasingly concentrated around the energy barrier, where local minima have high quality. We observe that the left tails for all Λ touch the barrier that is hard to penetrate, and as Λ increases the values concentrate around −E∞. In fact this concentration result had long been predicted but not proved until [Auffinger et al., 2010]. We see that qualitatively the distribution of losses for the neural network experiments (right plot) exhibits similar behavior. Even after scaling, the variance decreases with larger network sizes. This is also clearly captured in Figures 8 and 9 in the Supplementary material. This indicates that getting stuck in poor local minima is a major problem for smaller networks but becomes gradually less important as the network size increases. This is because the critical points of large networks exhibit the layered structure where high-quality low-index critical points lie close to the global minimum.

5.3 Relationship between train and test loss

n1    25      50      100     250     500
ρ     0.7616  0.6861  0.5983  0.5302  0.4081

Table 1: Pearson correlation between training and test loss for different numbers of hidden units.

The theory and experiments thus far indicate that minima lie in a band which gets smaller as the network size increases. This indicates that computable solutions become increasingly equivalent with respect to training error, but how does this relate to error on the test set? To determine this, we computed the correlation ρ between training and test loss for all solutions for each network size. The results are captured in Table 1 and Figure 7 (the latter is in the Supplementary material). The training and test error become increasingly decorrelated as the network size increases. This provides further indication that attempting to find the absolute possible minimum is of limited use with regards to generalization performance.

6 Conclusion

This paper establishes a connection between the neural network and the spin-glass model. We show that under certain assumptions, the loss function of the fully decoupled large-size neural network of depth H has a similar landscape to the Hamiltonian of the H-spin spherical spin-glass model. We empirically demonstrate that both models studied here are highly similar in real settings, despite the presence of variable dependencies in real networks. To the best of our knowledge, our work is one of the first efforts in the literature to shed light on the theory of neural network optimization.
Acknowledgements

The authors thank L. Sagun and the referees for valuable feedback.

References

[Amit et al., 1985] Amit, D. J., Gutfreund, H., and Sompolinsky, H. (1985). Spin-glass models of neural networks. Phys. Rev. A, 32:1007–1018.

[Auffinger and Ben Arous, 2013] Auffinger, A. and Ben Arous, G. (2013). Complexity of random smooth functions on the high-dimensional sphere. arXiv:1110.5872.

[Auffinger et al., 2010] Auffinger, A., Ben Arous, G., and Cerny, J. (2010). Random matrices and complexity of spin glasses. arXiv:1003.1129.

[Baldi and Hornik, 1989] Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53–58.

[Bottou, 1998] Bottou, L. (1998). Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press.

[Bray and Dean, 2007] Bray, A. J. and Dean, D. S. (2007). The statistics of critical points of gaussian fields on large-dimensional spaces. Physics Review Letter.

[Dauphin et al., 2014] Dauphin, Y., Pascanu, R., Gülçehre, Ç., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS.

[De la Peña and Giné, 1999] De la Peña, V. H. and Giné, E. (1999). Decoupling: from dependence to independence: randomly stopped processes, U-statistics and processes, martingales and beyond. Probability and its applications. Springer.

[Denil et al., 2013] Denil, M., Shakibi, B., Dinh, L., Ranzato, M., and Freitas, N. D. (2013). Predicting parameters in deep learning. In NIPS.

[Denton et al., 2014] Denton, E., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS.

[Dotsenko, 1995] Dotsenko, V. (1995). An Introduction to the Theory of Spin Glasses and Neural Networks. World Scientific Lecture Notes in Physics.

[Fyodorov and Williams, 2007] Fyodorov, Y. V. and Williams, I. (2007). Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity. Journal of Statistical Physics, 129(5-6):1081–1116.

[Goodfellow et al., 2013] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In ICML.

[Hastie et al., 2001] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics.

[Hinton et al., 2012] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.

[LeCun et al., 1998a] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324.

[LeCun et al., 1998b] LeCun, Y., Bottou, L., Orr, G., and Muller, K. (1998b). Efficient backprop. In Neural Networks: Tricks of the Trade. Springer.

[Nair and Hinton, 2010] Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted boltzmann machines. In ICML.

[Nakanishi and Takayama, 1997] Nakanishi, K. and Takayama, H. (1997). Mean-field theory for a spin-glass model of neural networks: TAP free energy and the paramagnetic to spin-glass transition. Journal of Physics A: Mathematical and General, 30:8085.

[Saad, 2009] Saad, D. (2009). On-line Learning in Neural Networks, volume 17. Cambridge University Press.

[Saad and Solla, 1995] Saad, D. and Solla, S. A. (1995). Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74(21):4337.

[Saxe et al., 2014] Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR.

[Weston et al., 2014] Weston, J., Chopra, S., and Adams, K. (2014). #tagspace: Semantic embeddings from hashtags. In EMNLP.

[Wigner, 1958] Wigner, E. P. (1958). On the Distribution of the Roots of Certain Symmetric Matrices. The Annals of Mathematics, 67:325–327.
The Loss Surfaces of Multilayer Networks
(Supplementary Material)

7 Proof of Theorem 3.1

Proof. First we prove the lower bound on N. By the inequality between the arithmetic and geometric means, the mass and the size of the network are connected as follows

$$N \;\geq\; H\,\frac{\sqrt[H]{\Psi^{2}}}{\sqrt[H]{n_0 n_H}} \;=\; H\,\frac{\sqrt[H]{\Psi^{2}}}{\sqrt[H]{d}},$$

and since $\sqrt[H]{\frac{\Psi}{d}} = \sqrt[H]{\prod_{i=1}^{H} n_i} \geq 1$, then

$$N \;\geq\; H\,\frac{\sqrt[H]{\Psi^{2}}}{\sqrt[H]{d}} \;\geq\; \sqrt[H]{\Psi}.$$

Next we show the upper bound on N. Let $n_{\max} = \max_{i\in\{1,2,\dots,H\}} n_i$. Then

$$N \;\leq\; H n_{\max}^2 \;\leq\; H\Psi^2.$$

8 Proof of Theorem 3.2

Proof. We will first prove the following more general lemma.

Lemma 8.1. Let Y1 and Y2 be the outputs of two arbitrary binary classifiers. Assume that the first classifier predicts 1 with probability p where, without loss of generality, we assume p ≤ 0.5, and −1 otherwise. Furthermore, let the prediction accuracy of the second classifier differ from the prediction accuracy of the first classifier by no more than ε ∈ [0, p]. Then the following holds

$$\mathrm{corr}(\mathrm{sign}(Y_1), \mathrm{sign}(Y_2)) \;\geq\; \frac{1 - 2\epsilon - (1-2p)^2 - 2(1-2p)\epsilon}{4\sqrt{p(1-p)(p+\epsilon)(1-p+\epsilon)}}.$$

Proof. Consider two random variables Z1 = sign(Y1) and Z2 = sign(Y2). Let $\mathcal{X}^{+}$ denote the set of data points for which the first classifier predicts +1 and let $\mathcal{X}^{-}$ denote the set of data points for which the first classifier predicts −1 ($\mathcal{X}^{+}\cup\mathcal{X}^{-} = \mathcal{X}$, where $\mathcal{X}$ is the entire dataset). Also let $p = \frac{|\mathcal{X}^{+}|}{|\mathcal{X}|}$. Furthermore, let $\bar{\mathcal{X}}_{-}$ denote the dataset for which Z1 = +1 and Z2 = −1, and let $\bar{\mathcal{X}}_{+}$ denote the dataset for which Z1 = −1 and Z2 = +1, where $\epsilon = \frac{|\bar{\mathcal{X}}_{+}|+|\bar{\mathcal{X}}_{-}|}{|\mathcal{X}|}$. Also let $\epsilon_{+} = \frac{|\bar{\mathcal{X}}_{+}|}{|\mathcal{X}|}$ and $\epsilon_{-} = \frac{|\bar{\mathcal{X}}_{-}|}{|\mathcal{X}|}$. Therefore

$$Z_1 = \begin{cases} 1 & \text{if } x\in\mathcal{X}^{+} \\ -1 & \text{if } x\in\mathcal{X}^{-} \end{cases}$$

and

$$Z_2 = \begin{cases} 1 & \text{if } x\in\mathcal{X}^{+}\cup\bar{\mathcal{X}}_{+}\setminus\bar{\mathcal{X}}_{-} \\ -1 & \text{if } x\in\mathcal{X}^{-}\cup\bar{\mathcal{X}}_{-}\setminus\bar{\mathcal{X}}_{+}. \end{cases}$$

One can compute that $\mathbb{E}[Z_1] = 2p-1$, $\mathbb{E}[Z_2] = 2(p+\epsilon_{+}-\epsilon_{-})-1$, $\mathbb{E}[Z_1 Z_2] = 1-2\epsilon$, $\mathrm{std}(Z_1) = 2\sqrt{p(1-p)}$, and finally $\mathrm{std}(Z_2) = 2\sqrt{(p+\epsilon_{+}-\epsilon_{-})(1-p-\epsilon_{+}+\epsilon_{-})}$. Thus we obtain

$$\mathrm{corr}(\mathrm{sign}(Y_1),\mathrm{sign}(Y_2)) = \mathrm{corr}(Z_1, Z_2) = \frac{\mathbb{E}[Z_1 Z_2] - \mathbb{E}[Z_1]\mathbb{E}[Z_2]}{\mathrm{std}(Z_1)\,\mathrm{std}(Z_2)}$$
$$= \frac{1 - 2\epsilon - (1-2p)^2 + 2(1-2p)(\epsilon_{+}-\epsilon_{-})}{4\sqrt{p(1-p)(p+\epsilon_{+}-\epsilon_{-})(1-p-\epsilon_{+}+\epsilon_{-})}} \;\geq\; \frac{1 - 2\epsilon - (1-2p)^2 - 2(1-2p)\epsilon}{4\sqrt{p(1-p)(p+\epsilon)(1-p+\epsilon)}}. \qquad (10)$$

Note that when the first classifier is the network N considered in this paper and M is its (s, ε)-reduction image, then E[Y1] = 0 and E[Y2] = 0 (which follows from the fact that the X's in Equation 5 have zero mean). That implies p = 0.5, which, when substituted into Equation 10, gives the theorem statement.
9 Proof of Theorem 3.3

Proof. Note that $\mathbb{E}[\hat{Y}_s] = 0$ and $\mathbb{E}[Y_s] = 0$. Furthermore,

$$\mathbb{E}[\hat{Y}_s Y_s] = q^2\rho^2\sum_{i_1,i_2,\dots,i_H=1}^{s}\min\!\left(\frac{\Psi}{s^H},\, t_{i_1,i_2,\dots,i_H}\right)\prod_{k=1}^{H} w_{i_k}^2$$

and

$$\mathrm{std}(\hat{Y}_s) = q\rho\sqrt{\sum_{i_1,i_2,\dots,i_H=1}^{s}\frac{\Psi}{s^H}\prod_{k=1}^{H} w_{i_k}^2}, \qquad \mathrm{std}(Y_s) = q\rho\sqrt{\sum_{i_1,i_2,\dots,i_H=1}^{s} t_{i_1,i_2,\dots,i_H}\prod_{k=1}^{H} w_{i_k}^2}.$$

Therefore

$$\mathrm{corr}(\hat{Y}_s, Y_s) = \frac{\displaystyle\sum_{i_1,\dots,i_H=1}^{s}\min\!\left(\frac{\Psi}{s^H},\, t_{i_1,\dots,i_H}\right)\prod_{k=1}^{H} w_{i_k}^2}{\sqrt{\left(\displaystyle\sum_{i_1,\dots,i_H=1}^{s}\frac{\Psi}{s^H}\prod_{k=1}^{H} w_{i_k}^2\right)\left(\displaystyle\sum_{i_1,\dots,i_H=1}^{s} t_{i_1,\dots,i_H}\prod_{k=1}^{H} w_{i_k}^2\right)}} \;\geq\; \frac{1}{c^2},$$

where the last inequality is a direct consequence of the uniformity assumption of Equation 6.
10 Loss function as a H-spin spherical spin-glass model

We consider two loss functions, the (random) absolute loss $\mathcal{L}^{a}_{\Lambda,H}(w)$ and the (random) hinge loss $\mathcal{L}^{h}_{\Lambda,H}(w)$, defined in the main body of the paper. Recall that in case of the hinge loss the max operator can be modeled as a Bernoulli random variable, which we will refer to as M, with success (M = 1) probability $\rho' = \frac{C'}{\rho\sqrt{C^H}}$ for some non-negative constant C'. We assume M is independent of $\hat{Y}$. Therefore we obtain

$$\mathcal{L}^{a}_{\Lambda,H}(w) = \begin{cases} S - \hat{Y} & \text{if } Y_t = S \\ S + \hat{Y} & \text{if } Y_t = -S \end{cases}$$

and

$$\mathcal{L}^{h}_{\Lambda,H}(w) = \mathbb{E}_{M,A}[M(1 - Y_t\hat{Y})] = \begin{cases} \mathbb{E}_{M}[M(1 - \hat{Y})] & \text{if } Y_t = 1 \\ \mathbb{E}_{M}[M(1 + \hat{Y})] & \text{if } Y_t = -1. \end{cases}$$

Note that both cases can be generalized as

$$\mathcal{L}_{\Lambda,H}(w) = \begin{cases} \mathbb{E}_{M}[M(S - \hat{Y})] & \text{if } Y_t > 0 \\ \mathbb{E}_{M}[M(S + \hat{Y})] & \text{if } Y_t < 0, \end{cases}$$

where in case of the absolute loss ρ' = 1 and in case of the hinge loss S = 1. Furthermore, using the fact that the X's are Gaussian random variables, one can further generalize both cases as

$$\mathcal{L}_{\Lambda,H}(w) = S\rho' + q\sum_{i_1,i_2,\dots,i_H=1}^{\Lambda} X_{i_1,i_2,\dots,i_H}\,\rho\rho'\prod_{k=1}^{H} w_{i_k}.$$

Let $\tilde{w}_i = \sqrt[H]{\frac{\rho\rho'}{C'}}\,w_i$ for all $i = \{1, 2, \dots, \Lambda\}$. Note that $\tilde{w}_i = \frac{1}{\sqrt{C}}\,w_i$. Thus

$$\mathcal{L}_{\Lambda,H}(w) = S\rho' + qC'\sum_{i_1,\dots,i_H=1}^{\Lambda} X_{i_1,\dots,i_H}\prod_{k=1}^{H}\tilde{w}_{i_k}. \qquad (11)$$

Note that the spherical assumption in Equation 9 directly implies that

$$\frac{1}{\Lambda}\sum_{i=1}^{\Lambda}\tilde{w}_i^2 = 1.$$

To simplify the notation in Equation 11 we drop the letter accents and simply denote $\tilde{w}$ as $w$. We skip the constants $S\rho'$ and $C'$ as they do not matter when minimizing the loss function. After substituting $q = \frac{1}{\Psi^{(H-1)/2H}}$ we obtain

$$\mathcal{L}_{\Lambda,H}(w) = \frac{1}{\Lambda^{(H-1)/2}}\sum_{i_1,i_2,\dots,i_H=1}^{\Lambda} X_{i_1,i_2,\dots,i_H}\,w_{i_1}w_{i_2}\cdots w_{i_H}.$$
11 Asymptotics of the mean number of critical points and local minima

Below, we provide the asymptotics of the mean number of critical points (Theorem 11.1) and of the mean number of local minima (Theorem 11.2), which extend Theorem 4.1. These results are consequences of Theorem 2.17 and Corollary 2.18 of [Auffinger et al., 2010].

Theorem 11.1. For H ≥ 3, the following holds as Λ → ∞:

• For u < −E∞:

$$\mathbb{E}[C_{\Lambda}(u)] = \frac{h(v)\exp\!\left(I_1(v) - \frac{v}{2}I_1'(v)\right)}{\sqrt{2H\pi}\,\left(-\Phi'(v) + I_1'(v)\right)}\,\Lambda^{-\frac{1}{2}}\exp\!\left(\Lambda\Theta_H(u)\right)(1 + o(1)),$$

where $v = -u\sqrt{\frac{H}{2(H-1)}}$, $\Phi(v) = -\frac{H-2}{2H}v^2$, $h(v) = \left|\frac{v-\sqrt{2}}{v+\sqrt{2}}\right|^{\frac{1}{4}} + \left|\frac{v+\sqrt{2}}{v-\sqrt{2}}\right|^{\frac{1}{4}}$ and $I_1(v) = \int_{\sqrt{2}}^{v}\sqrt{|x^2 - 2|}\,dx$.

• For u = −E∞:

$$\mathbb{E}[C_{\Lambda}(u)] = \frac{2A'(0)\sqrt{2H-1}}{3(H-2)}\,\Lambda^{-\frac{1}{3}}\exp\!\left(\Lambda\Theta_H(u)\right)(1 + o(1)),$$

where A is the Airy function of the first kind.

• For u ∈ (−E∞, 0):

$$\mathbb{E}[C_{\Lambda}(u)] = \frac{2\sqrt{2H(E_{\infty}^2 - u^2)}}{(2-H)\pi u}\,\Lambda^{\frac{1}{2}}\exp\!\left(\Lambda\Theta_H(u)\right)(1 + o(1)).$$

• For u > 0:

$$\mathbb{E}[C_{\Lambda}(u)] = \frac{4\sqrt{2}}{\sqrt{\pi(H-2)}}\,\Lambda^{\frac{1}{2}}\exp\!\left(\Lambda\Theta_H(0)\right)(1 + o(1)).$$

Theorem 11.2. For H ≥ 3 and u < −E∞, the following holds as Λ → ∞:

$$\mathbb{E}[C_{\Lambda,0}(u)] = \frac{h(v)\exp\!\left(I_1(v) - \frac{v}{2}I_1'(v)\right)}{\sqrt{2H\pi}\,\left(-\Phi'(v) + I_1'(v)\right)}\,\Lambda^{-\frac{1}{2}}\exp\!\left(\Lambda\Theta_H(u)\right)(1 + o(1)),$$

where v, Φ, h and I1 were defined in Theorem 11.1.
12 Additional Experiments

12.1 Distribution of normalized indices of critical points

Figure 5 shows the distribution of normalized indices, which is the proportion of negative eigenvalues, for neural networks with n1 = {10, 25, 50, 100}. We see that all solutions are minima or saddle points of very low index.

Figure 5: Distribution of the normalized index of solutions for n1 = {10, 25, 50, 100} hidden units (panels a-d).

12.2 Comparison of SGD and SA

Figure 6 compares SGD with SA.

Figure 6: Test loss distributions for SGD and SA for different numbers of hidden units (nh).

12.3 Distributions of the scaled test losses

Figure 8 shows the distributions of the scaled test losses for the spin-glass experiment (with Λ = {25, 50, 100, 200, 300, 400, 500}) and the neural network experiment (with n1 = {25, 50, 100, 250, 500}). Figure 9 captures the boxplot generated based on the distributions of the scaled test losses for the neural network experiment (for n1 = {10, 25, 50, 100, 250, 500}) and its zoomed version (for n1 = {10, 25, 50, 100}).

Figure 8: Distributions of the scaled test losses for the spin-glass (with Λ = {25, 50, 100, 200, 300, 400, 500}) (top) and the neural network (with n1 = {25, 50, 100, 250, 500}) (bottom) experiments.
Figure 7: Test loss versus train loss for networks with different numbers of hidden units n1 (panels a-h, n1 = {2, 5, 10, 25, 50, 100, 250, 500}).

Figure 9: Top: Boxplot generated based on the distributions of the scaled test losses for the neural network experiment. Bottom: Zoomed version of the same boxplot for n1 = {10, 25, 50, 100}.

Figure 10 shows the mean value and the variance of the test loss as a function of the number of hidden units.

Figure 10: Mean value and the variance of the test loss as a function of the number of hidden units.

12.4 Correlation between train and test loss

Figure 7 captures the correlation between training and test loss for networks with different numbers of hidden units n1.