Why Does Deep and Cheap Learning Work So Well?
We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various "no-flattening theorems" showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss; for example, we show that n variables cannot be multiplied using fewer than 2^n neurons in a single hidden layer.
virtually all popular nonlinear activation functions σ(x) can approximate any smooth function to any desired accuracy — even using merely a single hidden layer. However, these theorems do not guarantee that this can be accomplished with a network of feasible size, and the following simple example explains why they cannot: there are 2^(2^n) different Boolean functions of n variables, so a network implementing a generic function in this class requires at least 2^n bits to describe, i.e., more bits than there are atoms in our universe if n > 260.

The fact that neural networks of feasible size are nonetheless so useful therefore implies that the class of functions we care about approximating is dramatically smaller. We will see below in Section II D that both physics and machine learning tend to favor Hamiltonians that are polynomials² — indeed, often ones that are sparse, symmetric and low-order. Let us therefore focus our initial investigation on Hamiltonians that can be expanded as a power series:

H_y(x) = h + Σ_i h_i x_i + Σ_{i≤j} h_{ij} x_i x_j + Σ_{i≤j≤k} h_{ijk} x_i x_j x_k + ··· .   (9)

If the vector x has n components (i = 1, ..., n), then there are (n + d)!/(n! d!) terms of degree up to d.

1. Continuous input variables

If we can accurately approximate multiplication using a small number of neurons, then we can construct a network efficiently approximating any polynomial H_y(x) by repeated multiplication and addition. We will now see that we can, using any smooth but otherwise arbitrary non-linearity σ that is applied element-wise. The popular logistic sigmoid activation function σ(x) = 1/(1 + e^{-x}) will do the trick.

Theorem: Let f be a neural network of the form f = A_2 σ A_1, where σ acts elementwise by applying some smooth non-linear function σ to each element. Let the input layer, hidden layer and output layer have sizes 2, 4 and 1, respectively. Then f can approximate a multiplication gate arbitrarily well.

To see this, let us first Taylor-expand the function σ around the origin:

σ(u) = σ_0 + σ_1 u + σ_2 u²/2 + O(u³).   (10)

Without loss of generality, we can assume that σ_2 ≠ 0: since σ is non-linear, it must have a non-zero second derivative at some point, so we can use the biases in A_1 to shift the origin to this point to ensure σ_2 ≠ 0. Equation (10) now implies that

m(u, v) ≡ [σ(u+v) + σ(−u−v) − σ(u−v) − σ(−u+v)] / (4σ_2) = uv [1 + O(u² + v²)],   (11)

where we will term m(u, v) the multiplication approximator. Taylor's theorem guarantees that m(u, v) is an arbitrarily good approximation of uv for arbitrarily small |u| and |v|. However, we can always make |u| and |v| arbitrarily small by scaling A_1 → λA_1 and then compensating by scaling A_2 → λ^{-2} A_2. In the limit λ → 0, this approximation becomes exact. In other words, arbitrarily accurate multiplication can always be achieved using merely 4 neurons. Figure 2 illustrates such a multiplication approximator. (Of course, a practical algorithm like stochastic gradient descent cannot achieve arbitrarily large weights, though a reasonably good approximation can be achieved already for λ^{-1} ∼ 10.)

Corollary: For any given multivariate polynomial and any tolerance ε > 0, there exists a neural network of fixed finite size N (independent of ε) that approximates the polynomial to accuracy better than ε. Furthermore, N is bounded by the complexity of the polynomial, scaling as the number of multiplications required times a factor that is typically slightly larger than 4.³

This is a stronger statement than the classic universal approximation theorems for neural networks [8, 9], which guarantee that for every ε there exists some N(ε), but allow for the possibility that N(ε) → ∞ as ε → 0. An approximation theorem in [10] provides an ε-independent bound on the size of the neural network, but at the price of choosing a pathological function σ.

...ules: any computable function can be accurately evaluated by a sufficiently large network of them. Just as NAND gates are not unique (NOR gates are also universal), nor is any particular neuron implementation — indeed, any generic smooth nonlinear activation function is universal [8, 9].

2 The class of functions that can be exactly expressed by a neural network must be invariant under composition, since adding more layers corresponds to using the output of one function as the input to another. Important such classes include linear functions, affine functions, piecewise linear functions (generated by the popular Rectified Linear Unit "ReLU" activation function σ(x) = max[0, x]), polynomials, continuous functions and smooth functions whose nth derivatives are continuous. According to the Stone-Weierstrass theorem, both polynomials and piecewise linear functions can approximate continuous functions arbitrarily well.

3 In addition to the four neurons required for each multiplication, additional neurons may be deployed to copy variables to higher layers, bypassing the nonlinearity in σ. Such linear "copy gates" implementing the function u → u are of course trivial to implement using a simpler version of the above procedure: using A_1 to shift and scale down the input to fall in a tiny range where σ'(u) ≠ 0, and then scaling it up and shifting accordingly with A_2.
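The multiplication approximator of equation (11) is easy to check numerically. The sketch below is only an illustration, using NumPy and the standard logistic sigmoid; the bias point b = 1 and the test inputs are arbitrary choices (the logistic sigmoid has vanishing second derivative at the origin, so the expansion point must be shifted, exactly as discussed above). It confirms that the error falls off as the inputs are scaled down by λ and the output is rescaled by λ^{-2}:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_dd(x):
    # Second derivative of the logistic sigmoid (plays the role of sigma_2 in Eq. (10)).
    s = sigmoid(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def mult_approx(u, v, b=1.0):
    # Four-neuron multiplication approximator m(u, v) of Eq. (11),
    # expanded around the bias point b, chosen so that sigma'' is non-zero.
    s2 = sigmoid_dd(b)
    num = (sigmoid(b + u + v) + sigmoid(b - u - v)
           - sigmoid(b + u - v) - sigmoid(b - u + v))
    return num / (4.0 * s2)

u, v = 0.7, -0.4                        # target product uv = -0.28
for lam in [1.0, 0.3, 0.1, 0.03]:
    approx = mult_approx(lam * u, lam * v) / lam**2   # scale inputs down, output up
    print(lam, approx, abs(approx - u * v))           # error shrinks roughly as lambda^2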
[FIG. 2 — Continuous multiplication gate (left) and binary multiplication gate (right).]

stretch the sigmoid to exploit the identity

∏_{i∈K} x_i = lim_{β→∞} σ[−β(k − 1/2 − Σ_{i∈K} x_i)].   (13)
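Reading equation (13) with k as the number of factors in the product (our interpretation of the reconstructed formula), a sufficiently steep sigmoid applied to the sum of binary inputs approaches their product. A small NumPy check with an arbitrary choice of three binary inputs:

import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_and(x, beta):
    # Eq. (13): sigma(-beta * (k - 1/2 - sum_i x_i)) for binary x_i in {0, 1}.
    k = len(x)
    return sigmoid(-beta * (k - 0.5 - sum(x)))

for beta in [2.0, 10.0, 50.0]:
    worst = max(abs(soft_and(x, beta) - np.prod(x))
                for x in product([0, 1], repeat=3))
    print(beta, worst)   # worst-case error over all binary inputs -> 0 as beta grows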
can be computed with n log₂ n rather than n² multiplications using the Fast Fourier transform.

III. WHY DEEP?

Above we investigated how probability distributions from physics and computer science applications lent themselves to "cheap learning", being accurately and efficiently approximated by neural networks with merely a handful of layers. Let us now turn to the separate question of depth, i.e., the success of deep learning: what properties of real-world probability distributions cause efficiency to further improve when networks are made deeper? This question has been extensively studied from a mathematical point of view [14–16], but mathematics alone cannot fully answer it, because part of the answer involves physics. We will argue that the answer involves the hierarchical/compositional structure of generative processes together with the inability to efficiently "flatten" neural networks reflecting this structure.

In our physics example (Figure 3, left), a set of cosmological parameters y_0 (the density of dark matter, etc.) determines the power spectrum y_1 of density fluctuations in our universe, which in turn determines the pattern of cosmic microwave background radiation y_2 reaching us from our early universe, which gets combined with foreground radio noise from our Galaxy to produce the frequency-dependent sky maps (y_3) that are recorded by a satellite-based telescope that measures linear combinations of different sky signals and adds electronic receiver noise. For the recent example of the Planck Satellite [17], these data sets y_0, y_1, ... contained about 10^1, 10^4, 10^8, 10^9 and 10^12 numbers, respectively.

More generally, if a given data set is generated by a (classical) statistical physics process, it must be described by an equation in the form of equation (16), since dynamics in classical physics is fundamentally Markovian: classical equations of motion are always first-order differential equations in the Hamiltonian formalism. This technically covers essentially all data of interest in the machine learning community, although the fundamental Markovian nature of the generative process of the data may be an inefficient description.
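As a toy illustration of such a Markovian generative hierarchy (every name, dimension and noise level below is an invented placeholder, loosely patterned on the cosmology example rather than taken from it), each level is sampled using only the level directly above it:

import numpy as np

rng = np.random.default_rng(0)

def generate_hierarchy():
    # y0: a handful of top-level "parameters"
    y0 = rng.normal(size=4)
    # y1: a larger "spectrum-like" vector determined by y0 plus noise
    y1 = np.repeat(y0, 25) + 0.1 * rng.normal(size=100)
    # y2: a "map-like" array generated from y1 alone (the Markov property)
    y2 = np.tanh(y1)[None, :] + 0.1 * rng.normal(size=(100, 100))
    return y0, y1, y2

y0, y1, y2 = generate_hierarchy()
print(y0.size, y1.size, y2.size)   # information-poor parameters, data-rich output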
FIG. 3: Causal hierarchy examples relevant to physics (left) and image classification (right). As information flows down the hierarchy y_0 → y_1 → ... → y_n = y, some of it is destroyed by random Markov processes. However, no further information is lost as information flows optimally back up the hierarchy as ŷ_{n−1} → ... → ŷ_0. The right example is deliberately contrived and over-simplified for pedagogy; for example, translation and scaling are more naturally performed before ray tracing, which in turn breaks down into multiple steps.
image of a galaxy, its entire probability distribution is defined by the standard model of particle physics with its 32 parameters [13], which together specify the process transforming primordial hydrogen gas into galaxies.

The same parameter-counting argument can also be applied to all artificial images of interest to machine learning: for example, giving the simple low-information-content instruction "draw a cute kitten" to a random sample of artists will produce a wide variety of images y with a complicated probability distribution over colors, postures, etc., as each artist makes random choices at a series of steps. Even the pre-stored information about cat probabilities in these artists' brains is modest in size.

Note that a random resulting image typically contains much more information than the generative process creating it; for example, the simple instruction "generate a random string of 10^9 bits" contains much fewer than 10^9 bits. Not only are the typical steps in the generative hierarchy specified by a non-astronomical number of parameters, but as discussed in Section II D, it is plausible that neural networks can implement each of the steps efficiently.⁵

5 Although our discussion is focused on describing probability distributions, which are not random, stochastic neural networks
A deep neural network stacking these simpler networks on top of one another would then implement the entire generative process efficiently. In summary, the data sets and functions we care about form a minuscule minority, and it is plausible that they can also be efficiently implemented by neural networks reflecting their generative process. So what is the remainder? Which are the data sets and functions that we do not care about?

Almost all images are indistinguishable from random noise, and almost all data sets and functions are indistinguishable from completely random ones. This follows from Borel's theorem on normal numbers [26], which states that almost all real numbers have a string of decimals that would pass any randomness test, i.e., are indistinguishable from random noise. Simple parameter counting shows that deep learning (and our human brains, for that matter) would fail to implement almost all such functions, and training would fail to find any useful patterns. To thwart pattern-finding efforts, cryptography therefore aims to produce random-looking patterns. Although we might expect the Hamiltonians describing human-generated data sets such as drawings, text and music to be more complex than those describing simple physical systems, we should nonetheless expect them to resemble the natural data sets that inspired their creation much more than they resemble random functions.

The sufficient statistic formalism enables us to state some simple but important results that apply to any hierarchical generative process cast in the Markov chain form of equation (16).

Theorem 2: Given a Markov chain described by our notation above, let T_i be a minimal sufficient statistic of P(y_i|y_n). Then there exist functions f_i such that T_i = f_i ∘ T_{i+1}. More casually speaking, the generative hierarchy of Figure 3 can be optimally reversed one step at a time: there are functions f_i that optimally undo each of the steps, distilling out all information about the level above that was not destroyed by the Markov process.

Here is the proof. Note that for any k ≥ 1, the "backwards" Markov property P(y_i|y_{i+1}, y_{i+k}) = P(y_i|y_{i+1}) follows from the Markov property via Bayes' theorem:

P(y_i|y_{i+k}, y_{i+1}) = P(y_{i+k}|y_i, y_{i+1}) P(y_i|y_{i+1}) / P(y_{i+k}|y_{i+1})
                        = P(y_{i+k}|y_{i+1}) P(y_i|y_{i+1}) / P(y_{i+k}|y_{i+1})
                        = P(y_i|y_{i+1}).   (17)

Using this fact, we see that

P(y_i|y_n) = Σ_{y_{i+1}} P(y_i|y_{i+1}, y_n) P(y_{i+1}|y_n)
           = Σ_{y_{i+1}} P(y_i|y_{i+1}) P(y_{i+1}|T_{i+1}(y_n)).   (18)
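The "backwards" Markov property in equation (17) can be verified numerically for a small chain. In the sketch below (NumPy; the chain, its state-space sizes and its transition matrices are randomly generated placeholders), P(y_0 | y_1, y_2) computed from the joint distribution agrees with P(y_0 | y_1) to machine precision:

import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(rows, cols):
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

n0, n1, n2 = 3, 4, 5
p0 = rng.random(n0); p0 /= p0.sum()     # prior over y0
P10 = random_stochastic(n0, n1)         # P(y1 | y0)
P21 = random_stochastic(n1, n2)         # P(y2 | y1)

# Joint distribution P(y0, y1, y2) of the Markov chain y0 -> y1 -> y2.
joint = p0[:, None, None] * P10[:, :, None] * P21[None, :, :]

cond_012 = joint / joint.sum(axis=0, keepdims=True)      # P(y0 | y1, y2)
joint01 = joint.sum(axis=2)
cond_01 = joint01 / joint01.sum(axis=0, keepdims=True)   # P(y0 | y1)

# Eq. (17): conditioning on the further-downstream y2 adds nothing given y1.
print(np.abs(cond_012 - cond_01[:, :, None]).max())      # ~1e-16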
could use the fluid approximation instead of working directly from the positions and momenta of ∼ 10^23 particles). But without specifying that we wish to predict (long-wavelength physics), there is nothing natural about an effective field theory approximation.

To be more explicit about the link between renormalization and deep learning, consider a toy model for natural images. Each image is described by an intensity field φ(r), where r is a 2-dimensional vector. We assume that an ensemble of images can be described by a quadratic Hamiltonian of the form

H_y(φ) = ∫ [y_0 φ² + y_1 (∇φ)² + y_2 (∇²φ)² + ···] d²r.   (23)

Each parameter vector y defines an ensemble of images; we could imagine that the fictitious classes of images that we are trying to distinguish are all generated by Hamiltonians H_y with the same above form but different parameter vectors y. We further assume that the function φ(r) is specified on pixels that are sufficiently close that derivatives can be well-approximated by differences. Derivatives are linear operations, so they can be implemented in the first layer of a neural network. The translational symmetry of equation (23) allows it to be implemented with a convnet. It can be shown [31] that for any coarse-graining operation that replaces each block of b × b pixels by its average and divides the result by b², the Hamiltonian retains the form of equation (23) but with the parameters y_i replaced by

y_i' = b^{2−2i} y_i.   (24)

This means that all parameters y_i with i ≥ 2 decay exponentially with b as we repeatedly renormalize and b keeps increasing, so that for modest b, one can neglect all but the first few y_i's. What would have taken an arbitrarily large neural network can now be computed on a neural network of finite and bounded size, assuming that we are only interested in classifying the data based only on the coarse-grained variables. These insufficient statistics will still have discriminatory power if we are only interested in discriminating Hamiltonians which all differ in their first few C_k. In this example, the parameters y_0 and y_1 correspond to what physicists call "relevant operators" and machine-learners call "signal", whereas the remaining parameters correspond to "irrelevant operators" to physicists and "noise" to machine-learners.

The fixed point structure of the transformation in this example is very simple, but one can imagine that in more complicated problems the fixed point structure of various transformations might be highly non-trivial. This is certainly the case in statistical mechanics problems where renormalization methods are used to classify various phases of matter; the point here is that the renormalization group flow can be thought of as solving the pattern-recognition problem of classifying the long-range behavior of various statistical systems.

In summary, renormalization can be thought of as a type of supervised learning⁷, where the large-scale properties of the system are considered the features. If the desired features are not large-scale properties (as in most machine learning cases), one might still expect a generalized formalism of renormalization to provide some intuition for the problem by replacing a scale transformation with some other transformation. But calling some procedure renormalization or not is ultimately a matter of semantics; what remains to be seen is whether or not semantics has teeth, namely, whether the intuition about fixed points of the renormalization group flow can provide concrete insight into machine learning algorithms.

In many numerical methods, the purpose of the renormalization group is to efficiently and accurately evaluate the free energy of the system as a function of macroscopic variables of interest such as temperature and pressure. Thus we can only sensibly talk about the accuracy of an RG-scheme once we have specified what macroscopic variables we are interested in.

F. No-flattening theorems

Above we discussed how Markovian generative models cause p(x|y) to be a composition of a number of simpler functions f_i. Suppose that we can approximate each function f_i with an efficient neural network for the reasons given in Section II. Then we can simply stack these networks on top of each other, to obtain a deep neural network efficiently approximating p(x|y).

But is this the most efficient way to represent p(x|y)? Since we know that there are shallower networks that accurately approximate it, are any of these shallow networks as efficient as the deep one, or does flattening necessarily come at an efficiency cost?

7 A subtlety regarding the above statements is presented by the Multi-scale Entanglement Renormalization Ansatz (MERA) [37]. MERA can be viewed as a variational class of wave functions whose parameters can be tuned to match a given wave function as closely as possible. From this perspective, MERA is an unsupervised machine learning algorithm, where classical probability distributions over many variables are replaced with quantum wavefunctions. Due to the special tensor network structure found in MERA, the resulting variational approximation of a given wavefunction has an interpretation as generating an RG flow. Hence this is an example of an unsupervised learning problem whose solution gives rise to an RG flow. This is only possible due to the extra mathematical structure in the problem (the specific tensor network found in MERA); a generic variational Ansatz does not give rise to any RG interpretation and vice versa.
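To make the coarse-graining step described around equations (23)–(24) concrete, here is a minimal NumPy sketch of the block operation used there: each b × b block of pixels is replaced by its average divided by b². The field size and block size are arbitrary placeholders:

import numpy as np

def coarse_grain(phi, b):
    # Replace each b x b block of the field phi by its average, divided by b**2.
    h, w = phi.shape
    assert h % b == 0 and w % b == 0
    blocks = phi.reshape(h // b, b, w // b, b)
    return blocks.mean(axis=(1, 3)) / b**2

rng = np.random.default_rng(1)
phi = rng.normal(size=(64, 64))    # a toy "image" intensity field phi(r)
phi2 = coarse_grain(phi, b=2)      # one renormalization step
print(phi.shape, "->", phi2.shape)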
To be precise, for a neural network f defined by equation (6), we will say that the neural network f_ℓ is the flattened version of f if its number ℓ of hidden layers is smaller and f_ℓ approximates f within some error ε (as measured by some reasonable norm). We say that f_ℓ is a neuron-efficient flattening if the sum of the dimensions of its hidden layers (sometimes referred to as the number of neurons N_n) is less than the corresponding number for f. We say that f_ℓ is a synapse-efficient flattening if the number N_s of non-zero entries (sometimes called synapses) in its weight matrices is less than the corresponding number for f. This lets us define the flattening cost of a network f as the two functions

C_n(f, ℓ, ε) ≡ min_{f_ℓ} N_n(f_ℓ) / N_n(f),   (25)
C_s(f, ℓ, ε) ≡ min_{f_ℓ} N_s(f_ℓ) / N_s(f),   (26)

specifying the factor by which optimal flattening increases the neuron count and the synapse count, respectively. We refer to results where C_n > 1 or C_s > 1 for some class of functions f as "no-flattening theorems", since they imply that flattening comes at a cost and efficient flattening is impossible. A complete list of no-flattening theorems would show exactly when deep networks are more efficient than shallow networks.

There has already been very interesting progress in this spirit, but crucial questions remain. On one hand, it has been shown that deep is not always better, at least empirically for some image classification tasks [38]. On the other hand, many functions f have been found for which the flattening cost is significant. Certain deep Boolean circuit networks are exponentially costly to flatten [39]. Two families of multivariate polynomials with an exponential flattening cost C_n are constructed in [14]. [6, 15, 16] focus on functions that have tree-like hierarchical compositional form, concluding that the flattening cost C_n is exponential for almost all functions in Sobolev space. For the ReLU activation function, [40] finds a class of functions that exhibit exponential flattening costs; [41] study a tailored complexity measure of deep versus shallow ReLU networks. [42] shows that given weak conditions on the activation function, there always exists at least one function that can be implemented in a 3-layer network which has an exponential flattening cost. Finally, [43, 44] study the differential geometry of shallow versus deep networks, and find that flattening is exponentially neuron-inefficient. Further work elucidating the cost of flattening various classes of functions will clearly be highly valuable.

G. Linear no-flattening theorems

In the meantime, we will now see that interesting no-flattening results can be obtained even in the simpler-to-model context of linear neural networks [45], where the σ operators are replaced with the identity and all biases are set to zero such that the A_i are simply linear operators (matrices). Every map is specified by a matrix of real (or complex) numbers, and composition is implemented by matrix multiplication.

One might suspect that such a network is so simple that the questions concerning flattening become entirely trivial: after all, successive multiplication with n different matrices is equivalent to multiplying by a single matrix (their product). While the effect of flattening is indeed trivial for expressibility (f can express any linear function, independently of how many layers there are), this is not the case for learnability, which involves non-linear and complex dynamics despite the linearity of the network [45]. We will show that the efficiency of such linear networks is also a very rich question.

Neuronal efficiency is trivially attainable for linear networks, since all hidden-layer neurons can be eliminated without accuracy loss by simply multiplying all the weight matrices together. We will instead consider the case of synaptic efficiency and set ℓ = ε = 0.

Many divide-and-conquer algorithms in numerical linear algebra exploit some factorization of a particular matrix A in order to yield significant reductions in complexity. For example, when A represents the discrete Fourier transform (DFT), the fast Fourier transform (FFT) algorithm makes use of a sparse factorization of A which only contains O(n log n) non-zero matrix elements, instead of the naive single-layer implementation, which contains n² non-zero matrix elements. As first pointed out in [46], this is an example where depth helps and, in our terminology, of a linear no-flattening theorem: fully flattening a network that performs an FFT of n variables increases the synapse count N_s from O(n log n) to O(n²), i.e., incurs a flattening cost C_s = O(n/log n) ∼ O(n). This argument applies also to many variants and generalizations of the FFT such as the Fast Wavelet Transform and the Fast Walsh-Hadamard Transform.

Another important example illustrating the subtlety of linear networks is matrix multiplication. More specifically, take the input of a neural network to be the entries of a matrix M and the output to be NM, where both M and N have size n × n. Since matrix multiplication is linear, this can be exactly implemented by a 1-layer linear neural network. Amazingly, the naive algorithm for matrix multiplication, which requires n³ multiplications, is not optimal: the Strassen algorithm [47] requires only O(n^ω) multiplications (synapses), where ω = log₂ 7 ≈ 2.81, and recent work has cut this scaling exponent down to ω ≈ 2.3728639 [48]. This means that fully optimized matrix multiplication on a deep neural network has a flattening cost of at least C_s = O(n^{0.6271361}).

Low-rank matrix multiplication gives a more elementary no-flattening theorem. If A is a rank-k matrix, we can factor it as A = BC, where B is an n × k matrix and C is a k × n matrix. Hence the number of synapses is n² for an ℓ = 0 network and 2nk for an ℓ = 1 network, giving a flattening cost C_s = n/2k > 1 as long as the rank k < n/2.
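The low-rank example is easy to make concrete. The NumPy sketch below (the sizes n = 64 and k = 4 are arbitrary) builds a rank-k map both as a single dense weight matrix and as a two-layer factorization, and recovers the synapse ratio n/(2k) behind the flattening cost C_s:

import numpy as np

n, k = 64, 4
rng = np.random.default_rng(2)

# A rank-k matrix A = B C, with B of shape (n, k) and C of shape (k, n).
B = rng.normal(size=(n, k))
C = rng.normal(size=(k, n))
A = B @ C

synapses_flat = A.size            # flattened (no hidden layer) network: n^2 weights
synapses_deep = B.size + C.size   # two-layer factorized network: 2 n k weights
print(synapses_flat, synapses_deep, synapses_flat / synapses_deep)   # ratio n/(2k) = 8.0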
Finally, let us consider flattening a network f = AB, where A and B are random sparse n × n matrices such that each element is 1 with probability p and 0 with probability 1 − p. Flattening the network results in a matrix F_ij = Σ_k A_ik B_kj, so the probability that F_ij = 0 is (1 − p²)^n. Hence the number of non-zero components will on average be [1 − (1 − p²)^n] n², so

C_s = [1 − (1 − p²)^n] n² / (2n²p) = [1 − (1 − p²)^n] / (2p).   (27)

Note that C_s ≤ 1/2p and that this bound is asymptotically saturated for n ≫ 1/p². Hence in the limit where n is very large, flattening multiplication by sparse matrices (p ≪ 1) is horribly inefficient.

H. A polynomial no-flattening theorem

In Section II, we saw that multiplication of two variables could be implemented by a flat neural network with 4 neurons in the hidden layer, using equation (11) as illustrated in Figure 2. In Appendix A, we show that equation (11) is merely the n = 2 special case of the formula

∏_{i=1}^n x_i = (1 / (2^n n! σ_n)) Σ_{{s}} s_1 ··· s_n σ(s_1 x_1 + ··· + s_n x_n),   (28)

where the sum is over all possible 2^n configurations of s_1, ..., s_n where each s_i can take on values ±1. In other words, multiplication of n variables can be implemented by a flat network with 2^n neurons in the hidden layer. We also prove in Appendix A that this is the best one can do: no neural network can implement an n-input multiplication gate using fewer than 2^n neurons in the hidden layer. This is another powerful no-flattening theorem, telling us that polynomials are exponentially expensive to flatten. For example, if n is a power of two, then the monomial x_1 x_2 ··· x_n can be evaluated by a deep network using only 4n neurons arranged in a deep neural network where n copies of the multiplication gate from Figure 2 are arranged in a binary tree with log₂ n layers (the fifth neuron at the top of Figure 2 need not be counted, as it is the input to whatever computation comes next). In contrast, a functionally equivalent flattened network requires a whopping 2^n neurons. For example, a deep neural network can multiply 32 numbers using 4n = 128 neurons while a shallow one requires 2^32 = 4,294,967,296 neurons. Since a broad class of real-world functions can be well approximated by polynomials, this helps explain why many useful neural networks cannot be efficiently flattened.

IV. CONCLUSIONS

We have shown that the success of deep and cheap (low-parameter-count) learning depends not only on mathematics but also on physics, which favors certain classes of exceptionally simple probability distributions that deep learning is uniquely suited to model. We argued that the success of shallow neural networks hinges on symmetry, locality, and polynomial log-probability in data from or inspired by the natural world, which favors sparse low-order polynomial Hamiltonians that can be efficiently approximated. These arguments should be particularly relevant for explaining the success of machine-learning applications to physics, for example using a neural network to approximate a many-body wavefunction [49]. Whereas previous universality theorems guarantee that there exists a neural network that approximates any smooth function to within an error ε, they cannot guarantee that the size of the neural network does not grow to infinity with shrinking ε or that the activation function σ does not become pathological. We show constructively that given a multivariate polynomial and any generic non-linearity, a neural network with a fixed size and a generic smooth activation function can indeed approximate the polynomial highly efficiently.

Turning to the separate question of depth, we have argued that the success of deep learning depends on the ubiquity of hierarchical and compositional generative processes in physics and other machine-learning applications. By studying the sufficient statistics of the generative process, we showed that the inference problem requires approximating a compositional function of the form f_1 ∘ f_2 ∘ ··· that optimally distills out the information of interest from irrelevant noise in a hierarchical process that mirrors the generative process. Although such compositional functions can be efficiently implemented by a deep neural network as long as their individual steps can, it is generally not possible to retain the efficiency while flattening the network. We extend existing "no-flattening" theorems [14–16] by showing that efficient flattening is impossible even for many important cases involving linear networks. In particular, we prove that flattening polynomials is exponentially expensive, with 2^n neurons required to multiply n numbers using a single hidden layer, a task that a deep network can perform using only ∼ 4n neurons.

Strengthening the analytic understanding of deep learning may suggest ways of improving it, both to make it more capable and to make it more robust. One promising area is to prove sharper and more comprehensive no-flattening theorems, placing lower and upper bounds on the cost of flattening networks implementing various classes of functions.

Acknowledgements: This work was supported by the Foundational Questions Institute http://fqxi.org/, the Rothberg Family Fund for Cognitive Science and NSF grant 1122374. We thank Scott Aaronson, Frank Ban, Yoshua Bengio, Rico Jonschkowski, Tomaso Poggio, Bart Selman, Viktoriya Krakovna, Krishanu Sankar and Boya Song for helpful discussions and suggestions, Frank Ban, Fernando Perez, Jared Jolton, and the anonymous referee for helpful corrections and the Center for Brains, Minds, and Machines (CBMM) for hospitality.
Appendix A: The polynomial no-flattening theorem

We saw above that a neural network can compute polynomials accurately and efficiently at linear cost, using only about 4 neurons per multiplication. For example, if n is a power of two, then the monomial ∏_{i=1}^n x_i can be evaluated using 4n neurons arranged in a binary tree network with log₂ n hidden layers. In this appendix, we will prove a no-flattening theorem demonstrating that flattening polynomials is exponentially expensive:

Theorem: Suppose we are using a generic smooth activation function σ(x) = Σ_{k=0}^∞ σ_k x^k, where σ_k ≠ 0 for 0 ≤ k ≤ n. Then for any desired accuracy ε > 0, there exists a neural network that can implement the function ∏_{i=1}^n x_i using a single hidden layer of 2^n neurons. Furthermore, this is the smallest possible number of neurons in any such network with only a single hidden layer.

This result may be compared to problems in Boolean circuit complexity, notably the question of whether TC⁰ = TC¹ [50]. Here circuit depth is analogous to number of layers, and the number of gates is analogous to the number of neurons. In both the Boolean circuit model and the neural network model, one is allowed to use neurons/gates which have an unlimited number of inputs. The constraint in the definition of TC^i that each of the gate elements be from a standard universal library (AND, OR, NOT, Majority) is analogous to our constraint to use a particular nonlinear function. Note, however, that our theorem is weaker by applying only to depth 1, while TC⁰ includes all circuits of depth O(1).

degree greater than n, since they do not affect the approximation. Thus, we wish to find the minimal m such that there exist constants a_ij and w_j satisfying

σ_n Σ_{j=1}^m w_j (Σ_{i=1}^n a_ij x_i)^n = ∏_{i=1}^n x_i,   (A2)

σ_k Σ_{j=1}^m w_j (Σ_{i=1}^n a_ij x_i)^k = 0,   (A3)

for all 0 ≤ k ≤ n − 1. Let us set m = 2^n, and enumerate the subsets of {1, . . . , n} as S_1, . . . , S_m in some order. Define a network of m neurons in a single hidden layer by setting a_ij equal to the function s_i(S_j), which is −1 if i ∈ S_j and +1 otherwise, and setting

w_j ≡ (1 / (2^n n! σ_n)) ∏_{i=1}^n a_ij = (−1)^{|S_j|} / (2^n n! σ_n).   (A4)

In other words, up to an overall normalization constant, all coefficients a_ij and w_j equal ±1, and each weight w_j is simply the product of the corresponding a_ij.

We must prove that this network indeed satisfies equations (A2) and (A3). The essence of our proof will be to expand the left hand side of Equation (A1) and show that all monomial terms except x_1 ··· x_n come in pairs that cancel. To show this, consider a single monomial p(x) = x_1^{r_1} ··· x_n^{r_n} where r_1 + . . . + r_n = r ≤ n.

If p(x) ≠ ∏_{i=1}^n x_i, then we must show that the coefficient of p(x) in σ_r Σ_{j=1}^m w_j (Σ_{i=1}^n a_ij x_i)^r is 0. Since p(x) ≠ ∏_{i=1}^n x_i, there must be some i_0 such that r_{i_0} = 0. In other words, p(x) does not depend on the variable x_{i_0}. Since the sum in Equation (A1) is over all combinations of ± signs for all variables, every term will be canceled by another term where the (non-present) x_{i_0} has the opposite sign and the weight w_j has the opposite sign:
Observe that the coefficient of p(x) is equal in (Σ_{i=1}^n s_i(S_j) x_i)^r and (Σ_{i=1}^n s_i(S_j ∪ {i_0}) x_i)^r, since r_{i_0} = 0. Therefore, the overall coefficient of p(x) in the above expression must vanish, which implies that (A3) is satisfied.

If instead p(x) = ∏_{i=1}^n x_i, then all terms have the same sign: the coefficient of p(x) in (Σ_{i=1}^n a_ij x_i)^n is n! ∏_{i=1}^n a_ij = (−1)^{|S_j|} n!, because all n! terms are identical and there is no cancellation. Hence, the coefficient of p(x) on the left-hand side of (A2) is

σ_n Σ_{j=1}^m [(−1)^{|S_j|} / (2^n n! σ_n)] (−1)^{|S_j|} n! = 1,

completing our proof that this network indeed approximates the desired product gate.

(k! / (k − |S|)!) σ_k Σ_{j=1}^m w_j (∏_{h∈S} a_hj) (Σ_{i=1}^n a_ij x_i)^{k−|S|} = 0,   (A6)

for all 0 ≤ k ≤ n − 1. Let A denote the 2^n × m matrix with elements

A_{S,j} ≡ ∏_{h∈S} a_hj.   (A7)

We will show that A has full row rank. Suppose, towards contradiction, that cᵀA = 0 for some non-zero vector c. Specifically, suppose that there is a linear dependence between rows of A given by

Σ_{ℓ=1}^r c_ℓ A_{S_ℓ,j} = 0,   (A8)

where the S_ℓ are distinct and c_ℓ ≠ 0 for every ℓ. Let s be the maximal cardinality of any S_ℓ. Defining the vector d whose components are

d_j ≡ w_j (Σ_{i=1}^n a_ij x_i)^{n−s},   (A9)

i.e., to a statement that a set of monomials are linearly dependent. Since all distinct monomials are in fact linearly independent, this is a contradiction of our assumption that the S_ℓ are distinct and c_ℓ are nonzero. We conclude that A has full row rank, and therefore that m ≥ 2^n, which concludes the proof.
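The construction of equations (A2)–(A4) can be checked numerically by summing over all 2^n sign patterns. In the sketch below (NumPy; purely for illustration we set every Taylor coefficient σ_k = 1, so the weights of Eq. (A4) reduce to products of signs divided by 2^n n!), the degree-k contributions vanish for k < n and the degree-n contribution reproduces x_1 ··· x_n:

import math
from itertools import product

import numpy as np

n = 4
rng = np.random.default_rng(3)
x = rng.normal(size=n)

# Rows are the 2^n sign patterns a_ij = +/-1 (one hidden neuron per subset S_j).
signs = np.array(list(product([1.0, -1.0], repeat=n)))
# Eq. (A4) with sigma_n = 1: each weight is the product of its neuron's signs.
w = signs.prod(axis=1) / (2**n * math.factorial(n))

pre = signs @ x                      # hidden-layer pre-activations sum_i a_ij x_i
for k in range(n + 1):
    print(k, np.dot(w, pre**k))      # ~0 for k < n (Eq. A3); the product for k = n (Eq. A2)
print("target:", x.prod())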