
Why does deep and cheap learning work so well?

Henry W. Lin and Max Tegmark


Dept. of Physics, Harvard University, Cambridge, MA 02138 and
Dept. of Physics & MIT Kavli Institute, Massachusetts Institute of Technology, Cambridge, MA 02139
(Dated: August 31, 2016)
We show how the success of deep learning depends not only on mathematics but also on physics:
although well-known mathematical theorems guarantee that neural networks can approximate arbi-
trary functions well, the class of functions of practical interest can be approximated through “cheap
learning” with exponentially fewer parameters than generic ones, because they have simplifying
properties tracing back to the laws of physics. The exceptional simplicity of physics-based functions
hinges on properties such as symmetry, locality, compositionality and polynomial log-probability,
and we explore how these properties translate into exceptionally simple neural networks approximat-
ing both natural phenomena such as images and abstract representations thereof such as drawings.
We further argue that when the statistical process generating the data is of a certain hierarchi-
cal form prevalent in physics and machine-learning, a deep neural network can be more efficient
than a shallow one. We formalize these claims using information theory and discuss the relation
to renormalization group procedures. Various “no-flattening theorems” show when these efficient
deep networks cannot be accurately approximated by shallow ones without efficiency loss — even
for linear networks.

I. INTRODUCTION

Deep learning works remarkably well, and has helped dramatically improve the state-of-the-art in areas ranging from speech recognition, translation and visual object recognition to drug discovery, genomics and automatic game playing [1]. However, it is still not fully understood why deep learning works so well. In contrast to GOFAI (“good old-fashioned AI”) algorithms that are hand-crafted and fully understood analytically, many algorithms using artificial neural networks are understood only at a heuristic level, where we empirically know that certain training protocols employing large data sets will result in excellent performance. This is reminiscent of the situation with human brains: we know that if we train a child according to a certain curriculum, she will learn certain skills — but we lack a deep understanding of how her brain accomplishes this.

This makes it timely and interesting to develop new analytic insights on deep learning and its successes, which is the goal of the present paper. Such improved understanding is not only interesting in its own right, and for potentially providing new clues about how brains work, but it may also have practical applications. Better understanding the shortcomings of deep learning may suggest ways of improving it, both to make it more capable and to make it more robust [2].

A. The swindle: why does “cheap learning” work?

Throughout this paper, we will adopt a physics perspective on the problem, to prevent application-specific details from obscuring simple general results related to dynamics, symmetries, renormalization, etc., and to exploit useful similarities between deep learning and statistical mechanics.

For concreteness, let us focus on the task of approximating functions. As illustrated in Figure 1, this covers most core sub-fields of machine learning, including unsupervised learning, classification and prediction. For example, if we are interested in classifying faces, then we may want our neural network to implement a function where we feed in an image represented by a million greyscale pixels and get as output the probability distribution over a set of people that the image might represent.

[Figure 1: three panels labeled Unsupervised learning p(x, y), Classification p(x|y) and Prediction p(y|x).]

FIG. 1: Neural networks can approximate probability distributions. Given many samples of random vectors x and y, both classification and prediction involve viewing y as a stochastic function of x and attempting to estimate the probability distributions for x given y and y given x, respectively. In contrast, unsupervised learning attempts to approximate the joint probability distribution of x and y without making any assumptions about causality. In all three cases, the neural network searches for patterns in the data that can be used to better model the probability distribution.

When investigating the quality of a neural net, there are
several important factors to consider:

• Expressibility: What class of functions can the neural network express?

• Efficiency: How many resources (neurons, parameters, etc.) does the neural network require to approximate a given function?

• Learnability: How rapidly can the neural network learn good parameters for approximating a function?

This paper is focused on expressibility and efficiency, and more specifically on the following paradox: How can neural networks approximate functions well in practice, when the set of possible functions is exponentially larger than the set of practically possible networks? For example, suppose that we wish to classify megapixel greyscale images into two categories, e.g., cats or dogs. If each pixel can take one of 256 values, then there are 256^1000000 possible images, and for each one, we wish to compute the probability that it depicts a cat. This means that an arbitrary function is defined by a list of 256^1000000 probabilities, i.e., way more numbers than there are atoms in our universe (about 10^78). Yet neural networks with merely thousands or millions of parameters somehow manage to perform such classification tasks quite well. How can deep learning be so “cheap”, in the sense of requiring so few parameters?

We will see below that neural networks perform a combinatorial swindle, replacing exponentiation by multiplication: if there are say n = 10^6 inputs taking v = 256 values each, this swindle cuts the number of parameters from v^n to v × n times some constant factor. We will show that the success of this swindle depends fundamentally on physics: although neural networks only work well for an exponentially tiny fraction of all possible inputs, the laws of physics are such that the data sets we care about for machine learning (natural images, sounds, drawings, text, etc.) are also drawn from an exponentially tiny fraction of all imaginable data sets. Moreover, we will see that these two tiny subsets are remarkably similar, enabling deep learning to work well in practice.

The rest of this paper is organized as follows. In Section II, we present results for shallow neural networks with merely a handful of layers, focusing on simplifications due to locality, symmetry and polynomials. In Section III, we study how increasing the depth of a neural network can provide polynomial or exponential efficiency gains even though it adds nothing in terms of expressivity, and we discuss the connections to renormalization, compositionality and complexity. We summarize our conclusions in Section IV and discuss a technical point about renormalization and deep learning in Appendix V.

II. EXPRESSIBILITY AND EFFICIENCY OF SHALLOW NEURAL NETWORKS

Let us now explore what classes of probability distributions p are the focus of physics and machine learning, and how accurately and efficiently neural networks can approximate them. Although our results will be fully general, it will help illustrate key points if we give the mathematical notation from Figure 1 concrete interpretations. For a machine-learning example, we might interpret x as an element of some set of animals {cat, dog, rabbit, ...} and y as the vector of pixels in an image depicting such an animal, so that p(y|x) for x = cat gives the probability distribution of images of cats with different coloring, size, posture, viewing angle, lighting condition, electronic camera noise, etc. For a physics example, we might interpret x as an element of some set of metals {iron, aluminum, copper, ...} and y as the vector of magnetization values for different parts of a metal bar. The prediction problem from Figure 1 is then to evaluate p(y|x), whereas the classification problem is to evaluate p(x|y).

Because of the above-mentioned “swindle”, accurate approximations are only possible for a tiny subclass of all probability distributions. Fortunately, as we will explore below, the function p(y|x) often has many simplifying features enabling accurate approximation, because it follows from some simple physical law or some generative model with relatively few free parameters: for example, its dependence on y may exhibit symmetry, locality and/or be of a simple form such as the exponential of a low-order polynomial. In contrast, the dependence of p(x|y) on x tends to be more complicated; it makes no sense to speak of symmetries or polynomials involving a variable x = cat.

Let us therefore start by tackling the more complicated case of modeling p(x|y). This probability distribution p(x|y) is determined by the hopefully simpler function p(y|x) via Bayes’ theorem:

p(x|y) = p(y|x) p(x) / Σx′ p(y|x′) p(x′),   (1)

where p(x) is the probability distribution over x (animals or metals, say) a priori, before examining the data vector y.

A. Probabilities and Hamiltonians

It is useful to introduce the negative logarithms of two of these probabilities:

Hx(y) ≡ −ln p(y|x),
µx ≡ −ln p(x).   (2)
Statisticians refer to −ln p as “self-information” or “surprisal”, and statistical physicists refer to Hx(y) as the Hamiltonian, quantifying the energy of y (up to an arbitrary and irrelevant additive constant) given the parameter x. These definitions transform equation (1) into the Boltzmann form

p(x|y) = (1/N(y)) e^(−[Hx(y)+µx]),   (3)

where

N(y) ≡ Σx e^(−[Hx(y)+µx]).   (4)

This recasting of equation (1) is useful because the Hamiltonian tends to have properties making it simple to evaluate. We will see in Section III that it also helps understand the relation between deep learning and renormalization.

B. Bayes theorem as a softmax

Since the variable x takes one of a discrete set of values, we will often write it as an index instead of as an argument, as px(y) ≡ p(x|y). Moreover, we will often find it convenient to view all values indexed by x as elements of a vector, written in boldface, thus viewing px, Hx and µx as elements of the vectors p, H and µ, respectively. Equation (3) thus simplifies to

p(y) = (1/N(y)) e^(−[H(y)+µ]),   (5)

using the standard convention that a function (in this case exp) applied to a vector acts on its elements.

We wish to investigate how well this vector-valued function p(y) can be approximated by a neural net. A standard n-layer feedforward neural network maps vectors to vectors by applying a series of linear and nonlinear transformations in succession. Specifically, it implements vector-valued functions of the form [1]

f(y) = σn An ··· σ2 A2 σ1 A1 y,   (6)

where the σi are relatively simple nonlinear operators on vectors and the Ai are affine transformations of the form Ai y = Wi y + bi for matrices Wi and so-called bias vectors bi. Popular choices for these nonlinear operators σi include

• Local function (apply some nonlinear function σ to each vector element),

• Max-pooling (compute the maximum of all vector elements),

• Softmax (exponentiate all vector elements and normalize them to sum to unity).

The softmax operator is therefore defined by

σ(y) ≡ e^y / Σi e^(yi).   (7)

This allows us to rewrite equation (5) in the extremely simple form

p(y) = σ[−H(y) − µ].   (8)

This means that if we can compute the Hamiltonian vector H(y) with some n-layer neural net, we can evaluate the desired classification probability vector p(y) by simply adding a softmax layer. The µ-vector simply becomes the bias term in this final layer.
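To make equation (8) concrete, here is a minimal numerical sketch (our own illustration, not code from the paper): the posterior p(x|y) is obtained by applying a softmax to −H(y) − µ, and agrees with a direct application of Bayes’ theorem. The toy likelihoods and priors are invented purely for the check.

import numpy as np

# Sketch of eq. (8): Bayes' theorem as a softmax layer with bias mu.
# H[x] = -log p(y|x) for each class x, mu[x] = -log p(x).

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(H, mu):
    """Return the posterior p(x|y) given Hamiltonians H_x(y) and biases mu_x."""
    return softmax(-(H + mu))

p_y_given_x = np.array([0.20, 0.05, 0.02])   # made-up likelihoods p(y|x) for 3 classes
p_x = np.array([0.5, 0.3, 0.2])              # made-up priors p(x)
H = -np.log(p_y_given_x)
mu = -np.log(p_x)
direct = p_y_given_x * p_x / np.sum(p_y_given_x * p_x)
print(np.allclose(classify(H, mu), direct))  # True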
C. What Hamiltonians can be approximated by feasible neural networks?

It has long been known that neural networks are universal approximators [3, 4], in the sense that networks with virtually all popular nonlinear activation functions σ(y) can approximate any smooth function to any desired accuracy — even using merely a single hidden layer. However, these theorems do not guarantee that this can be accomplished with a network of feasible size, and the following simple example explains why they cannot: There are 2^(2^n) different Boolean functions of n variables, so a network implementing a generic function in this class requires at least 2^n bits to describe, i.e., more bits than there are atoms in our universe if n > 260.

The fact that neural networks of feasible size are nonetheless so useful therefore implies that the class of functions we care about approximating is dramatically smaller. We will see below in Section II D that both physics and machine learning tend to favor Hamiltonians that are polynomials¹ — indeed, often ones that are sparse, symmetric and low-order. Let us therefore focus our initial investigation on Hamiltonians that can be expanded as a power series:

Hx(y) = h + Σi hi yi + Σi≤j hij yi yj + Σi≤j≤k hijk yi yj yk + ···.   (9)

If the vector y has n components (i = 1, ..., n), then there are (n + d)!/(n! d!) terms of degree up to d.

Footnote 1: The class of functions that can be exactly expressed by a neural network must be invariant under composition, since adding more layers corresponds to using the output of one function as the input to another. Important such classes include linear functions, affine functions, piecewise linear functions (generated by the popular Rectified Linear unit “ReLU” activation function σ(y) = max[0, y]), polynomials, continuous functions and smooth functions whose nth derivatives are continuous. According to the Stone-Weierstrass theorem, both polynomials and piecewise linear functions can approximate continuous functions arbitrarily well.
1. Continuous input variables

If we can accurately approximate multiplication using a small number of neurons, then we can construct a network efficiently approximating any polynomial Hx(y) by repeated multiplication and addition. We will now see that we can, using any smooth but otherwise arbitrary non-linearity σ that is applied element-wise. The popular logistic sigmoid activation function σ(y) = 1/(1 + e^(−y)) will do the trick.

Theorem: Let f be a neural network of the form f = A2 σ A1, where σ acts elementwise by applying some smooth non-linear function σ to each element. Let the input layer, hidden layer and output layer have sizes 2, 4 and 1, respectively. Then f can approximate a multiplication gate arbitrarily well.

To see this, let us first Taylor-expand the function σ around the origin:

σ(u) = σ0 + σ1 u + σ2 u² + O(u³).   (10)

Without loss of generality, we can assume that σ2 ≠ 0: since σ is non-linear, it must have a non-zero second derivative at some point, so we can use the biases in A1 to shift the origin to this point to ensure σ2 ≠ 0. Equation (10) now implies that

m(u, v) ≡ [σ(u+v) + σ(−u−v) − σ(u−v) − σ(−u+v)] / (8σ2)
        = uv [1 + O(u² + v²)],   (11)

where we will term m(u, v) the multiplication approximator. Taylor’s theorem guarantees that m(u, v) is an arbitrarily good approximation of uv for arbitrarily small |u| and |v|. However, we can always make |u| and |v| arbitrarily small by scaling A1 → λA1 and then compensating by scaling A2 → λ^(−2) A2. In the limit that λ → 0, this approximation becomes exact. In other words, arbitrarily accurate multiplication can always be achieved using merely 4 neurons. Figure 2 illustrates such a multiplication approximator using a logistic sigmoid σ.

[Figure 2: left panel, a continuous multiplication gate computing uv from inputs u and v using four σ-neurons with input weights ±λ and a bias µ; right panel, a binary multiplication gate computing uvw from inputs u, v, w and 1 using a single σ-neuron with input weights β and bias −2.5β.]

FIG. 2: Multiplication can be efficiently implemented by simple neural nets, becoming arbitrarily accurate as λ → 0 (left) and β → ∞ (right). Squares apply the function σ, circles perform summation, and lines multiply by the constants labeling them. The “1” input implements the bias term. The left gate requires σ''(0) ≠ 0, which can always be arranged by biasing the input to σ. The right gate requires the sigmoidal behavior σ(x) → 0 and σ(x) → 1 as x → −∞ and x → ∞, respectively.
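The following small numerical sketch (ours, not the authors’ code) implements this 4-neuron gate with a logistic sigmoid. Because the logistic sigmoid has σ''(0) = 0, the hidden units are biased to the arbitrarily chosen point b = 1, where σ'' ≠ 0, exactly as the argument above suggests, and the output is rescaled by λ^(−2); the approximation improves as λ → 0.

import numpy as np

# Sketch of the 4-neuron multiplication approximator of equation (11),
# with the hidden-layer inputs scaled by lam and the output rescaled by lam**-2.

def sigma(y):
    return 1.0 / (1.0 + np.exp(-y))

def multiply_approx(u, v, lam=1e-2, b=1.0):
    """Approximate u*v with 4 sigmoid neurons; exact in the limit lam -> 0."""
    d2 = sigma(b) * (1 - sigma(b)) * (1 - 2 * sigma(b))   # sigma''(b), nonzero at b = 1
    num = (sigma(b + lam * (u + v)) + sigma(b - lam * (u + v))
           - sigma(b + lam * (u - v)) - sigma(b - lam * (u - v)))
    return num / (4 * d2 * lam**2)            # 4*sigma''(b) = 8*sigma_2 in eq. (11)

print(multiply_approx(0.3, -1.7))             # ~ -0.51, close to 0.3 * (-1.7)
print(multiply_approx(0.3, -1.7, lam=1e-3))   # the error shrinks as lam -> 0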
Corollary: For any given multivariate polynomial and any tolerance ε > 0, there exists a neural network of fixed finite size N (independent of ε) that approximates the polynomial to accuracy better than ε. Furthermore, N is bounded by the complexity of the polynomial, scaling as the number of multiplications required times a factor that is typically slightly larger than 4.²

Footnote 2: In addition to the four neurons required for each multiplication, additional neurons may be deployed to copy variables to higher layers, bypassing the nonlinearity in σ. Such linear “copy gates” implementing the function u → u are of course trivial to implement using a simpler version of the above procedure: using A1 to shift and scale down the input to fall in a tiny range where σ′(u) ≠ 0, and then scaling it up and shifting accordingly with A2.

This is a stronger statement than the classic universal approximation theorems for neural networks [3, 4], which guarantee that for every ε there exists some N(ε), but allow for the possibility that N(ε) → ∞ as ε → 0. An approximation theorem in [5] provides an ε-independent bound on the size of the neural network, but at the price of choosing a pathological function σ.

2. Discrete input variables

For the simple but important case where y is a vector of bits, so that yi = 0 or yi = 1, the fact that yi² = yi makes things even simpler. This means that only terms where all variables are different need be included, which simplifies equation (9) to

Hx(y) = h + Σi hi yi + Σi<j hij yi yj + Σi<j<k hijk yi yj yk + ···.   (12)

The infinite series equation (9) thus gets replaced by a finite series with 2^n terms, ending with the term h1...n y1 ··· yn. Since there are 2^n possible bit strings y, the 2^n h-parameters in equation (12) suffice to exactly parametrize an arbitrary function Hx(y).

The efficient multiplication approximator above multiplied only two variables at a time, thus requiring multiple layers to evaluate general polynomials. In contrast, H(y) for a bit vector y can be implemented using merely three layers as illustrated in Figure 2, where the middle layer evaluates the bit products and the third layer takes a linear combination of them. This is because bits allow an accurate multiplication approximator that takes the product of an arbitrary number of bits at once, exploiting the fact that a product of bits can be trivially
determined from their sum: for example, the product y1 y2 y3 = 1 if and only if the sum y1 + y2 + y3 = 3. This sum-checking can be implemented using one of the most popular choices for a nonlinear function σ: the logistic sigmoid σ(y) = 1/(1 + e^(−y)), which satisfies σ(y) ≈ 0 for y ≪ 0 and σ(y) ≈ 1 for y ≫ 1. To compute the product of some set of k bits described by the set K (for our example above, K = {1, 2, 3}), we let A1 and A2 shift and stretch the sigmoid to exploit the identity

∏i∈K yi = lim β→∞ σ[−β(k − 1/2 − Σi∈K yi)].   (13)

Since σ decays exponentially fast toward 0 or 1 as β is increased, modestly large β-values suffice in practice; if, for example, we want the correct answer to D = 10 decimal places, we merely need β > D ln 10 ≈ 23. In summary, when y is a bit string, an arbitrary function px(y) can be evaluated by a simple 3-layer neural network: the middle layer uses sigmoid functions to compute the products from equation (12), and the top layer performs the sums from equation (12) and the softmax from equation (8).
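A minimal sketch (ours, not from the paper) of equation (13): a single sharply-biased sigmoid of the bit sum reproduces the product of the bits. The value β = 30 is an arbitrary, modestly large choice that already makes the output essentially 0 or 1.

import numpy as np

# Sketch of eq. (13): the product of k bits equals a biased sigmoid of their sum,
# since the product is 1 if and only if the sum equals k.

def sigma(y):
    return 1.0 / (1.0 + np.exp(-y))

def bit_product(bits, beta=30.0):
    k = len(bits)
    return sigma(-beta * (k - 0.5 - np.sum(bits)))

for bits in [(1, 1, 1), (1, 0, 1), (0, 0, 0)]:
    print(bits, bit_product(np.array(bits)), np.prod(bits))   # ~1, ~0, ~0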
D. What Hamiltonians do we want to approximate?

We have seen that polynomials can be accurately approximated by neural networks using a number of neurons scaling either as the number of multiplications required (for the continuous case) or as the number of terms (for the binary case). But polynomials per se are no panacea: with binary input, all functions are polynomials, and with continuous input, there are (n + d)!/(n! d!) coefficients in a generic polynomial of degree d in n variables, which easily becomes unmanageably large. We will now see how exceptionally simple polynomials that are sparse, symmetric and/or low-order play a special role in physics and machine-learning.

1. Low polynomial order

For reasons that are still not fully understood, our universe can be accurately described by polynomial Hamiltonians of low order d. At a fundamental level, the Hamiltonian of the standard model of particle physics has d = 4. There are many approximations of this quartic Hamiltonian that are accurate in specific regimes, for example the Maxwell equations governing electromagnetism, the Navier-Stokes equations governing fluid dynamics, the Alfvén equations governing magnetohydrodynamics and various Ising models governing magnetization — all of these approximations have Hamiltonians that are polynomials in the field variables, of degree d ranging from 2 to 4. This means that the number of polynomial coefficients is not infinite as in equation (9) or exponential in n as in equation (12), merely of order n², n³ or n⁴.

Thanks to the Central Limit Theorem [6], many probability distributions in machine-learning and statistics can be accurately approximated by multivariate Gaussians, i.e., of the form

p(y) = e^(h + Σi hi yi − Σij hij yi yj),   (14)

which means that the Hamiltonian H = −ln p is a quadratic polynomial. More generally, the maximum-entropy probability distribution subject to constraints on some of the lowest moments, say expectation values of the form ⟨y1^α1 y2^α2 ··· yn^αn⟩ for some integers αi ≥ 0, would lead to a Hamiltonian of degree no greater than d ≡ Σi αi [7].

Image classification tasks often exploit invariance under translation, rotation, and various nonlinear deformations of the image plane that move pixels to new locations. All such spatial transformations are linear functions (d = 1 polynomials) of the pixel vector y. Functions implementing convolutions and Fourier transforms are also d = 1 polynomials.

2. Locality

One of the deepest principles of physics is locality: that things directly affect only what is in their immediate vicinity. When physical systems are simulated on a computer by discretizing space onto a rectangular lattice, locality manifests itself by allowing only nearest-neighbor interaction. In other words, almost all coefficients in equation (9) are forced to vanish, and the total number of non-zero coefficients grows only linearly with n. For the binary case of equation (9), which applies to magnetizations (spins) that can take one of two values, locality also limits the degree d to be no greater than the number of neighbors that a given spin is coupled to (since all variables in a polynomial term must be different).

This can be stated more generally and precisely using the Markov network formalism [8]. View the spins as vertices of a Markov network; the edges represent dependencies. Let Nc be the clique cover number of the network (the smallest number of cliques whose union is the entire network) and let Sc be the size of the largest clique. Then the number of required neurons is ≤ Nc 2^Sc. For fixed Sc, Nc is proportional to the number of vertices, so locality means that the number of neurons scales only linearly with the number of spins n.

3. Symmetry

Whenever the Hamiltonian obeys some symmetry (is invariant under some transformation), the number of
independent parameters required to describe it is further reduced. For instance, many probability distributions in both physics and machine learning are invariant under translation and rotation. As an example, consider a vector y of air pressures yi measured by a microphone at times i = 1, ..., n. Assuming that the Hamiltonian describing it has d = 2 reduces the number of parameters N from ∞ to (n + 1)(n + 2)/2. Further assuming locality (nearest-neighbor couplings only) reduces this to N = 2n, after which requiring translational symmetry reduces the parameter count to N = 3. Taken together, the constraints of locality, symmetry and polynomial order reduce the number of continuous parameters in the Hamiltonian of the standard model of physics to merely 32 [9].

Symmetry can reduce not merely the parameter count, but also the computational complexity. For example, if a linear vector-valued function f(y) mapping a set of n variables onto itself happens to satisfy translational symmetry, then it is a convolution (implementable by a convolutional neural net; “convnet”), which means that it can be computed with n log2 n rather than n² multiplications using the Fast Fourier transform.
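As a concrete illustration (our own snippet, with an arbitrary random kernel): a translation-invariant linear map is a circulant matrix, i.e. a circular convolution, so applying it via the FFT costs O(n log n) operations and gives the same result as the O(n²) dense matrix multiply.

import numpy as np

# A translation-invariant linear map is a circulant matrix, hence a convolution
# computable with FFTs (convolution theorem).

rng = np.random.default_rng(0)
n = 8
kernel = rng.normal(size=n)                  # one row of parameters instead of n^2
y = rng.normal(size=n)

# Dense implementation: build the full circulant matrix C[i, j] = kernel[(i - j) mod n].
C = np.array([[kernel[(i - j) % n] for j in range(n)] for i in range(n)])
dense = C @ y

# FFT implementation: circular convolution of kernel and y.
fft = np.real(np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(y)))

print(np.allclose(dense, fft))               # True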
III. WHY DEEP?

Above we investigated how probability distributions from physics and computer science applications lent themselves to “cheap learning”, being accurately and efficiently approximated by neural networks with merely a handful of layers. Let us now turn to the separate question of depth, i.e., the success of deep learning: what properties of real-world probability distributions cause efficiency to further improve when networks are made deeper? This question has been extensively studied from a mathematical point of view [10–12], but mathematics alone cannot fully answer it, because part of the answer involves physics. We will argue that the answer involves the hierarchical/compositional structure of generative processes together with the inability to efficiently “flatten” neural networks reflecting this structure.

A. Hierarchical processes

One of the most striking features of the physical world is its hierarchical structure. Spatially, it is an object hierarchy: elementary particles form atoms which in turn form molecules, cells, organisms, planets, solar systems, galaxies, etc. Causally, complex structures are frequently created through a distinct sequence of simpler steps.

Figure 3 gives two examples of such causal hierarchies generating data vectors x0 ↦ x1 ↦ ... ↦ xn that are relevant to physics and image classification, respectively.

Both examples involve a Markov chain³ where the probability distribution p(xi) at the ith level of the hierarchy is determined from its causal predecessor alone:

pi = Mi pi−1,   (15)

where the probability vector pi specifies the probability distribution of p(xi) according to (pi)x ≡ p(xi) and the Markov matrix Mi specifies the transition probabilities between two neighboring levels, p(xi|xi−1). Iterating equation (15) gives

pn = Mn Mn−1 ··· M1 p0,   (16)

so we can write the combined effect of the entire generative process as a matrix product.

Footnote 3: If the next step in the generative hierarchy requires knowledge not merely of the present state but also of the past, the present state can be redefined to include this information as well, thus ensuring that the generative process is a Markov process.
ics alone cannot fully answer it, because part of the an-
swer involves physics. We will argue that the answer in- Our toy image classification example (Figure 3, right) is
volves the hierarchical/compositional structure of gener- deliberately contrived and over-simplified for pedagogy:
ative processes together with inability to efficiently “flat- x0 is a single bit signifying “cat or dog”, which deter-
ten” neural networks reflecting this structure. mines a set of parameters determining the animal’s col-
oration, body shape, posture, etc. using approxiate prob-
ability distributions, which determine a 2D image via
ray-tracing, which is scaled and translated by random
A. Hierarchical processess
amounts before a randomly background is added.
In both examples, the goal is to reverse this generative hi-
One of the most striking features of the physical world erarchy to learn about the input x ≡ x0 from the output
is its hierarchical structure. Spatially, it is an object xn ≡ y, specifically to provide the best possibile estimate
hierarchy: elementary particles form atoms which in turn
form molecules, cells, organisms, planets, solar systems,
galaxies, etc. Causally, complex structures are frequently
created through a distinct sequence of simpler steps. 3 If the next step in the generative hierarchy requires knowledge
of not merely of the present state but also information of the
Figure 3 gives two examples of such causal hierarchies past, the present state can be redefined to include also this in-
generating data vectors x0 7→ x1 7→ ... 7→ xn that are formation, thus ensuring that the generative process is a Markov
relevant to physics and image classification, respectively. process.
[Figure 3: two generative hierarchies. Left: cosmological parameters x0 (Ω, Ωb, Λ, τ, h, n, nT, Q, T/S) → power spectrum x1 → CMB sky map x2 → multi-frequency maps x3 → telescope data y = x4, via the steps “generate fluctuations”, “simulate sky map”, “add foregrounds” and “take linear combinations, add noise”. Right: category label x0 = x (“cat or dog?”) → object parameters x1 → ray-traced object x2 → transformed object x3 → final image x4, via “select color, shape & posture”, “ray trace”, “scale & translate” and “select background”. Each level is generated from the one above by a Markov matrix Mi and estimated from the data by distillation functions fi, with xi = Ti(y).]

FIG. 3: Causal hierarchy examples relevant to physics (left) and image classification (right). As information flows down the hierarchy x0 → x1 → ... → xn = y, some of it is destroyed by random Markov processes. However, no further information is lost as information flows optimally back up the hierarchy as x̂n−1 → ... → x̂0. The right example is deliberately contrived and over-simplified for pedagogy; for example, translation and scaling are more naturally performed before ray tracing, which in turn breaks down into multiple steps.

of the probability distribution p(x|y) = p(x0|xn) — i.e., to determine the probability distribution for the cosmological parameters and to determine the probability that the image is a cat, respectively.

B. Resolving the swindle

This decomposition of the generative process into a hierarchy of simpler steps helps resolve the “swindle” paradox from the introduction: although the number of parameters required to describe an arbitrary function of the input data y is beyond astronomical, the generative process can be specified by a more modest number of parameters, because each of its steps can. Whereas specifying an arbitrary probability distribution over multi-megapixel images y requires far more bits than there are atoms in our universe, the information specifying how to compute the probability distribution p(y|x) for a microwave background map fits into a handful of published journal articles or software packages [14–20]. For a megapixel image of a galaxy, its entire probability distribution is defined by the standard model of particle physics with its 32 parameters [9], which together specify the process transforming primordial hydrogen gas into galaxies.

The same parameter-counting argument can also be applied to all artificial images of interest to machine
learning: for example, giving the simple low-information-content instruction “draw a cute kitten” to a random sample of artists will produce a wide variety of images y with a complicated probability distribution over colors, postures, etc., as each artist makes random choices at a series of steps. Even the pre-stored information about cat probabilities in these artists’ brains is modest in size.

Note that a random resulting image typically contains much more information than the generative process creating it; for example, the simple instruction “generate a random string of 10^9 bits” contains far fewer than 10^9 bits. Not only are the typical steps in the generative hierarchy specified by a non-astronomical number of parameters, but as discussed in Section II D, it is plausible that neural networks can implement each of the steps efficiently.⁴

A deep neural network stacking these simpler networks on top of one another would then implement the entire generative process efficiently. In summary, the data sets and functions we care about form a minuscule minority, and it is plausible that they can also be efficiently implemented by neural networks reflecting their generative process. So what is the remainder? Which are the data sets and functions that we do not care about?

Almost all images are indistinguishable from random noise, and almost all data sets and functions are indistinguishable from completely random ones. This follows from Borel’s theorem on normal numbers [22], which states that almost all real numbers have a string of decimals that would pass any randomness test, i.e., are indistinguishable from random noise. Simple parameter counting shows that deep learning (and our human brains, for that matter) would fail to implement almost all such functions, and training would fail to find any useful patterns. To thwart pattern-finding efforts, cryptography therefore aims to produce random-looking patterns. Although we might expect the Hamiltonians describing human-generated data sets such as drawings, text and music to be more complex than those describing simple physical systems, we should nonetheless expect them to resemble the natural data sets that inspired their creation much more than they resemble random functions.

Footnote 4: Although our discussion is focused on describing probability distributions, which are not random, stochastic neural networks can generate random variables as well. In biology, spiking neurons provide a good random number generator, and in machine learning, stochastic architectures such as restricted Boltzmann machines [21] do the same.

C. Sufficient statistics and hierarchies

The goal of deep learning classifiers is to reverse the hierarchical generative process as well as possible, to make inferences about the input x from the output y. Let us now treat this hierarchical problem more rigorously using information theory.

Given P(x|y), a sufficient statistic T(y) is defined by the equation P(x|y) = P(x|T(y)) and has played an important role in statistics for almost a century [23]. All the information about x contained in y is contained in the sufficient statistic. A minimal sufficient statistic [23] is some sufficient statistic T∗ which is a sufficient statistic for all other sufficient statistics. This means that if T(y) is sufficient, then there exists some function f such that T∗(y) = f(T(y)). As illustrated in Figure 3, T∗ can be thought of as an information distiller, optimally compressing the data so as to retain all information relevant to determining x and discarding all irrelevant information.

The sufficient statistic formalism enables us to state some simple but important results that apply to any hierarchical generative process cast in the Markov chain form of equation (16).

Theorem 2: Given a Markov chain described by our notation above, let Ti be a minimal sufficient statistic of P(xi|xn). Then there exist functions fi such that Ti = fi ◦ Ti+1. More casually speaking, the generative hierarchy of Figure 3 can be optimally reversed one step at a time: there are functions fi that optimally undo each of the steps, distilling out all information about the level above that was not destroyed by the Markov process.

Here is the proof. Note that for any k ≥ 1, “backwards” Markovity P(xi|xi+1, xi+k) = P(xi|xi+1) follows from Markovity via Bayes’ theorem:

P(xi|xi+k, xi+1) = P(xi+k|xi, xi+1) P(xi|xi+1) / P(xi+k|xi+1)
                 = P(xi+k|xi+1) P(xi|xi+1) / P(xi+k|xi+1)   (17)
                 = P(xi|xi+1).

Using this fact, we see that

P(xi|xn) = Σxi+1 P(xi|xi+1, xn) P(xi+1|xn)
         = Σxi+1 P(xi|xi+1) P(xi+1|Ti+1(xn)).   (18)

Since the above equation depends on xn only through Ti+1(xn), this means that Ti+1 is a sufficient statistic for P(xi|xn). But since Ti is the minimal sufficient statistic, there exists a function fi such that Ti = fi ◦ Ti+1.

Corollary 2: With the same assumptions and notation as theorem 2, define the function f0(T0) = P(x0|T0) and
fn = Tn−1. Then

P(x0|xn) = (f0 ◦ f1 ◦ ··· ◦ fn)(xn).   (19)

The proof is easy. By induction,

T0 = f1 ◦ f2 ◦ ··· ◦ Tn−1,   (20)

which implies the corollary.

Roughly speaking, Corollary 2 states that the structure of the inference problem reflects the structure of the generative process. In this case, we see that the neural network trying to approximate P(x|y) must approximate a compositional function. We will argue below in Section III F that in many cases, this can only be accomplished efficiently if the neural network has ≳ n hidden layers.

In neuroscience parlance, the functions fi compress the data into forms with ever more invariance [24], containing features invariant under irrelevant transformations (for example background substitution, scaling and translation).

Let us denote the distilled vectors x̂i ≡ fi(x̂i+1), where x̂n ≡ y. As summarized by Figure 3, as information flows down the hierarchy x = x0 → x1 → ... → xn = y, some of it is destroyed by random processes. However, no further information is lost as information flows optimally back up the hierarchy as y → x̂n−1 → ... → x̂0.

D. Approximate information distillation

Although minimal sufficient statistics are often difficult to calculate in practice, it is frequently possible to come up with statistics which are nearly sufficient in a certain sense which we now explain.

An equivalent characterization of a sufficient statistic is provided by information theory [25, 26]. The data processing inequality [26] states that for any function f and any random variables x, y,

I(x, y) ≥ I(x, f(y)),   (21)

where I is the mutual information:

I(x, y) = Σx,y p(x, y) log [p(x, y) / (p(x) p(y))].   (22)

A sufficient statistic T(y) is a function f(y) for which “≥” gets replaced by “=” in equation (21), i.e., a function retaining all the information about x.

Even information distillation functions f that are not strictly sufficient can be very useful as long as they distill out most of the relevant information and are computationally efficient. For example, it may be possible to trade some loss of mutual information with a dramatic reduction in the complexity of the Hamiltonian; e.g., Hx(f(y)) may be considerably easier to implement in a neural network than Hx(y). Precisely this situation applies to the physics example from Figure 3, where a hierarchy of efficient near-perfect information distillers fi have been found, the numerical cost of f3 [19, 20], f2 [17, 18], f1 [15, 16] and f0 [13] scaling with the number of input parameters n as O(n), O(n^(3/2)), O(n²) and O(n³), respectively.

E. Distillation and renormalization

The systematic framework for distilling out desired information from unwanted “noise” in physical theories is known as Effective Field Theory [27]. Typically, the desired information involves relatively large-scale features that can be experimentally measured, whereas the noise involves unobserved microscopic scales. A key part of this framework is known as the renormalization group (RG) transformation [27, 28]. Although the connection between RG and machine learning has been studied or alluded to repeatedly [29–33], there are significant misconceptions in the literature concerning the connection which we will now attempt to clear up.

Let us first review a standard working definition of what renormalization is in the context of statistical physics, involving three ingredients: a vector y of random variables, a coarse-graining operation R and a requirement that this operation leaves the Hamiltonian invariant except for parameter changes. We think of y as the microscopic degrees of freedom — typically physical quantities defined at a lattice of points (pixels or voxels) in space. Its probability distribution is specified by a Hamiltonian Hx(y), with some parameter vector x. We interpret the map R : y → y as implementing a coarse-graining⁵ of the system. The random variable R(y) also has a Hamiltonian, denoted H′(R(y)), which we require to have the same functional form as the original Hamiltonian Hx, although the parameters x may change. In other words, H′(R(y)) = Hr(x)(R(y)) for some function r. Since the domain and the range of R coincide, this map R can be iterated n times, Rⁿ = R ◦ R ◦ ··· ◦ R, giving a Hamiltonian Hrⁿ(x)(Rⁿ(y)) for the repeatedly renormalized data.

Footnote 5: A typical renormalization scheme for a lattice system involves replacing many spins (bits) with a single spin according to some rule. In this case, it might seem that the map R could not possibly map its domain onto itself, since there are fewer degrees of freedom after the coarse-graining. On the other hand, if we let the domain and range of R differ, we cannot easily talk about the Hamiltonian as having the same functional form, since the renormalized Hamiltonian would have a different domain than the original Hamiltonian. Physicists get around this by taking the limit where the lattice is infinitely large, so that R maps an infinite lattice to an infinite lattice.
Similar to the case of sufficient statistics, P(x|Rⁿ(y)) will then be a compositional function.

Contrary to some claims in the literature, effective field theory and the renormalization group have little to do with the idea of unsupervised learning and pattern-finding. Instead, the standard renormalization procedures in statistical physics and quantum field theory are essentially a feature extractor for supervised learning, where the features typically correspond to long-wavelength/macroscopic degrees of freedom. In other words, effective field theory only makes sense if we specify what features we are interested in. For example, if we are given data y about the position and momenta of particles inside a mole of some liquid and are tasked with predicting from this data whether or not Alice will burn her finger when touching the liquid, a (nearly) sufficient statistic is simply the temperature of the object, which can in turn be obtained from some very coarse-grained degrees of freedom (for example, one could use the fluid approximation instead of working directly from the positions and momenta of ∼ 10^23 particles).

To obtain a more quantitative link between renormalization and deep-learning-style feature extraction, let us consider as a toy model for natural images (functions of a 2D position vector r) a generic two-dimensional Gaussian random field y(r) whose Hamiltonian satisfies translational and rotational symmetry:

Hx(y) = ∫ [x0 y² + x1 (∇y)² + x2 (∇²y)² + ···] d²r.   (23)

Thus the fictitious classes of images that we are trying to distinguish are all generated by Hamiltonians Hx with the same above form but different parameter vectors x. We assume that the function y(r) is specified on pixels that are sufficiently close that derivatives can be well-approximated by differences. Derivatives are linear operations, so they can be implemented in the first layer of a neural network. The translational symmetry of equation (23) allows it to be implemented with a convnet. It can be shown [27] that for any coarse-graining operation that replaces each block of b × b pixels by its average and divides the result by b², the Hamiltonian retains the form of equation (23) but with the parameters xi replaced by

x′i = b^(2−2i) xi.   (24)

This means that all parameters xi with i ≥ 2 decay exponentially with b as we repeatedly renormalize and b keeps increasing, so that for modest b, one can neglect all but the first few xi’s. In this example, the parameters x0 and x1 would be called “relevant operators” by physicists and “signal” by machine-learners, whereas the remaining parameters would be called “irrelevant operators” by physicists and “noise” by machine-learners.

In summary, renormalization is a special case of feature extraction and nearly sufficient statistics, typically treating small scales as noise. This makes it a special case of supervised learning, not unsupervised learning. We elaborate on this further in Appendix A, where we construct a counter-example to a recent claim [32] that a so-called “exact” RG is equivalent to perfectly reconstructing the empirical probability distribution in an unsupervised problem. The information-distillation nature of renormalization is explicit in many numerical methods, where the purpose of the renormalization group is to efficiently and accurately evaluate the free energy of the system as a function of macroscopic variables of interest such as temperature and pressure. Thus we can only sensibly talk about the accuracy of an RG-scheme once we have specified what macroscopic variables we are interested in.

A subtlety regarding the above statements is presented by the Multi-scale Entanglement Renormalization Ansatz (MERA) [34]. MERA can be viewed as a variational class of wave functions whose parameters can be tuned to match a given wave function as closely as possible. From this perspective, MERA is an unsupervised machine learning algorithm, where classical probability distributions over many variables are replaced with quantum wavefunctions. Due to the special tensor network structure found in MERA, the resulting variational approximation of a given wavefunction has an interpretation as generating an RG flow. Hence this is an example of an unsupervised learning problem whose solution gives rise to an RG flow. This is only possible due to the extra mathematical structure in the problem (the specific tensor network found in MERA); a generic variational Ansatz does not give rise to any RG interpretation and vice versa.

F. No-flattening theorems

Above we discussed how Markovian generative models cause p(y|x) to be a composition of a number of simpler functions fi. Suppose that we can approximate each function fi with an efficient neural network for the reasons given in Section II. Then we can simply stack these networks on top of each other, to obtain a deep neural network efficiently approximating p(y|x).

But is this the most efficient way to represent p(y|x)? Since we know that there are shallower networks that accurately approximate it, are any of these shallow networks as efficient as the deep one, or does flattening necessarily come at an efficiency cost?

To be precise, for a neural network f defined by equation (6), we will say that the neural network fℓ is the flattened version of f if its number ℓ of hidden layers is smaller and fℓ approximates f within some error ε (as measured by some reasonable norm). We say that fℓ is a neuron-efficient flattening if the sum of the dimensions of its hidden layers (sometimes referred to as the number of neurons Nn) is less than for f. We say that fℓ is a
synapse-efficient flattening if the number Ns of non-zero entries (sometimes called synapses) in its weight matrices is less than for f. This lets us define the flattening cost of a network f as the two functions

Cn(f, ℓ, ε) ≡ min over fℓ of Nn(fℓ)/Nn(f),   (25)
Cs(f, ℓ, ε) ≡ min over fℓ of Ns(fℓ)/Ns(f),   (26)

specifying the factor by which optimal flattening increases the neuron count and the synapse count, respectively. We refer to results where Cn > 1 or Cs > 1 for some class of functions f as “no-flattening theorems”, since they imply that flattening comes at a cost and efficient flattening is impossible. A complete list of no-flattening theorems would show exactly when deep networks are more efficient than shallow networks.

There has already been very interesting progress in this spirit, but crucial questions remain. On one hand, it has been shown that deep is not always better, at least empirically for some image classification tasks [35]. On the other hand, many functions f have been found for which the flattening cost is significant. Certain deep Boolean circuit networks are exponentially costly to flatten [36]. Two families of multivariate polynomials with an exponential flattening cost Cn are constructed in [10]. [11, 12, 37] focus on functions that have tree-like hierarchical compositional form, concluding that the flattening cost Cn is exponential for almost all functions in Sobolev space. For the ReLU activation function, [38] finds a class of functions that exhibit exponential flattening costs; [39] study a tailored complexity measure of deep versus shallow ReLU networks. [40] shows that given weak conditions on the activation function, there always exists at least one function that can be implemented in a 3-layer network which has an exponential flattening cost. Finally, [41, 42] study the differential geometry of shallow versus deep networks, and find that flattening is exponentially neuron-inefficient. Further work elucidating the cost of flattening various classes of functions will clearly be highly valuable.

G. Linear no-flattening theorems

In the mean time, we will now see that interesting no-flattening results can be obtained even in the simpler-to-model context of linear neural networks [43], where the σ operators are replaced with the identity and all biases are set to zero such that the Ai are simply linear operators (matrices). Every map is specified by a matrix of real (or complex) numbers, and composition is implemented by matrix multiplication.

One might suspect that such a network is so simple that the questions concerning flattening become entirely trivial: after all, successive multiplication with n different matrices is equivalent to multiplying by a single matrix (their product). While the effect of flattening is indeed trivial for expressibility (f can express any linear function, independently of how many layers there are), this is not the case for the learnability, which involves non-linear and complex dynamics despite the linearity of the network [43]. We will show that the efficiency of such linear networks is also a very rich question.

Neuronal efficiency is trivially attainable for linear networks, since all hidden-layer neurons can be eliminated without accuracy loss by simply multiplying all the weight matrices together. We will instead consider the case of synaptic efficiency and set ℓ = ε = 0.

Many divide-and-conquer algorithms in numerical linear algebra exploit some factorization of a particular matrix A in order to yield a significant reduction in complexity. For example, when A represents the discrete Fourier transform (DFT), the fast Fourier transform (FFT) algorithm makes use of a sparse factorization of A which only contains O(n log n) non-zero matrix elements, instead of the naive single-layer implementation, which contains n² non-zero matrix elements. This is our first example of a linear no-flattening theorem: fully flattening a network that performs an FFT of n variables increases the synapse count Ns from O(n log n) to O(n²), i.e., incurs a flattening cost Cs = O(n/log n) ∼ O(n). This argument applies also to many variants and generalizations of the FFT such as the Fast Wavelet Transform and the Fast Walsh-Hadamard Transform.

Another important example illustrating the subtlety of linear networks is matrix multiplication. More specifically, take the input of a neural network to be the entries of a matrix M and the output to be NM, where both M and N have size n × n. Since matrix multiplication is linear, this can be exactly implemented by a 1-layer linear neural network. Amazingly, the naive algorithm for matrix multiplication, which requires n³ multiplications, is not optimal: the Strassen algorithm [44] requires only O(n^ω) multiplications (synapses), where ω = log2 7 ≈ 2.81, and recent work has cut this scaling exponent down to ω ≈ 2.3728639 [45]. This means that fully optimized matrix multiplication on a deep neural network has a flattening cost of at least Cs = O(n^0.6271361).

Low-rank matrix multiplication gives a more elementary no-flattening theorem. If A is a rank-k n × n matrix, we can factor it as A = BC where B is an n × k matrix and C is a k × n matrix. Hence the number of synapses is n² for an ℓ = 0 network and 2nk for an ℓ = 1 network, giving a flattening cost Cs = n/(2k) > 1 as long as the rank k < n/2.
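A tiny numerical illustration of this synapse count (our own snippet; the sizes n = 100 and k = 10 are arbitrary):

import numpy as np

# A rank-k linear map costs n*n synapses as a single layer but only 2*n*k
# synapses as a two-layer factorization A = BC.

rng = np.random.default_rng(3)
n, k = 100, 10
B = rng.normal(size=(n, k))
C = rng.normal(size=(k, n))
A = B @ C                                    # rank-k n x n matrix

y = rng.normal(size=n)
print(np.allclose(A @ y, B @ (C @ y)))       # True: same linear map
print(A.size, B.size + C.size)               # 10000 synapses vs 2000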
Finally, let us consider flattening a network f = AB, where A and B are random sparse n × n matrices such that each element is 1 with probability p and 0 with probability 1 − p. Flattening the network results in a matrix Fij = Σk Aik Bkj, so the probability that Fij = 0 is (1 − p²)^n. Hence the number of non-zero components
will on average be [1 − (1 − p²)^n] n², so

Cs = [1 − (1 − p²)^n] n² / (2n²p) = [1 − (1 − p²)^n] / (2p).   (27)

Note that Cs ≤ 1/(2p) and that this bound is asymptotically saturated for n ≫ 1/p². Hence in the limit where n is very large, flattening multiplication by sparse matrices with p ≪ 1 is horribly inefficient.
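As a quick empirical sanity check of equation (27) (our own snippet; n, p and the number of trials are arbitrary choices):

import numpy as np

# Monte Carlo check of eq. (27): the synapse-count ratio between the flattened
# product A @ B and the two sparse layers approaches [1 - (1 - p^2)^n] / (2p).

rng = np.random.default_rng(4)
n, p, trials = 400, 0.05, 20

ratios = []
for _ in range(trials):
    A = (rng.random((n, n)) < p).astype(float)
    B = (rng.random((n, n)) < p).astype(float)
    deep = np.count_nonzero(A) + np.count_nonzero(B)   # synapses in the 2-layer net
    flat = np.count_nonzero(A @ B)                     # synapses after flattening
    ratios.append(flat / deep)

predicted = (1 - (1 - p**2)**n) / (2 * p)
print(np.mean(ratios), predicted)            # both close to ~6.3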

IV. CONCLUSIONS

We have shown that the success of deep and cheap (low-parameter-count) learning depends not only on mathematics but also on physics, which favors certain classes of exceptionally simple probability distributions that deep learning is uniquely suited to model. We argued that the success of shallow neural networks hinges on symmetry, locality, and polynomial log-probability in data from or inspired by the natural world, which favors sparse low-order polynomial Hamiltonians that can be efficiently approximated. Whereas previous universality theorems guarantee that there exists a neural network that approximates any smooth function to within an error ε, they cannot guarantee that the size of the neural network does not grow to infinity with shrinking ε or that the activation function σ does not become pathological. We show constructively that given a multivariate polynomial and any generic non-linearity, a neural network with a fixed size and a generic smooth activation function can indeed approximate the polynomial highly efficiently.

  Physics                        Machine learning
  Hamiltonian                    Surprisal −ln p
  Simple H                       Cheap learning
  Quadratic H                    Gaussian p
  Locality                       Sparsity
  Translationally symmetric H    Convnet
  Computing p from H             Softmaxing
  Spin                           Bit
  Free energy difference         KL-divergence
  Effective theory               Nearly lossless data distillation
  Irrelevant operator            Noise
  Relevant operator              Feature

TABLE I: Physics-ML dictionary.

Turning to the separate question of depth, we have argued that the success of deep learning depends on the ubiquity of hierarchical and compositional generative processes in physics and other machine-learning applications. By studying the sufficient statistics of the generative process, we showed that the inference problem requires approximating a compositional function of the form f1 ◦ f2 ◦ f3 ◦ ··· that optimally distills out the information of interest from irrelevant noise in a hierarchical process that mirrors the generative process. Although such compositional functions can be efficiently implemented by a deep neural network as long as their individual steps can, it is generally not possible to retain the efficiency while flattening the network. We extend existing “no-flattening” theorems [10–12] by showing that efficient flattening is impossible even for many important cases involving linear networks.

Strengthening the analytic understanding of deep learning may suggest ways of improving it, both to make it more capable and to make it more robust. One promising area is to prove sharper and more comprehensive no-flattening theorems, placing lower and upper bounds on the cost of flattening networks implementing various classes of functions. A concrete example is placing tight lower and upper bounds on the number of neurons and synaptic weights needed to approximate a given polynomial. We conjecture that approximating a multiplication gate x1 x2 ··· xn will require exponentially many neurons in n using non-pathological activation functions, whereas we have shown that allowing for log2 n layers allows us to use only ∼ 4n neurons.

Acknowledgements: This work was supported by the Foundational Questions Institute http://fqxi.org/. We thank Tomaso Poggio and Bart Selman for helpful discussions and suggestions and the Center for Brains, Minds, and Machines (CBMM) for hospitality.

V. APPENDIX

A. Why matching partition functions do not imply matching probability distributions

Let us interpret the random variable y as describing degrees of freedom which are in thermal equilibrium at unit temperature with respect to some Hamiltonian H(y):

p(y) = (1/Z) e^(−H(y)),   (28)

where the normalization Z ≡ Σy e^(−H(y)). (Unlike before, we only require in this section that H(y) = −ln p(y) + constant.) Let H(y, y′) be a Hamiltonian of two random variables y and y′, i.e.,

p̃(y, y′) = (1/Ztot) e^(−H(y,y′)),   (29)

where the normalization Ztot ≡ Σy,y′ e^(−H(y,y′)).

It has been claimed [32] that Ztot = Z implies p̃(y) ≡ Σy′ p̃(y, y′) = p(y). We will construct a family of counterexamples where Ztot = Z, but p̃(y) ≠ p(y).

Let y′ belong to the same space as y and take any non-constant function K(y). We choose the joint Hamiltonian

H(y, y′) = H(y) + H(y′) + K(y) + ln Z̃,   (30)
where Z̃ = Σy e^(−[H(y)+K(y)]). Then

Ztot = Σy,y′ e^(−H(y,y′))
     = (1/Z̃) Σy,y′ e^(−[H(y)+K(y)+H(y′)])
     = (1/Z̃) Σy e^(−[H(y)+K(y)]) Σy′ e^(−H(y′))   (31)
     = (1/Z̃) · Z̃ · Σy′ e^(−H(y′)) = Σy e^(−H(y)) = Z.

So the partition functions agree. However, the marginalized probability distributions do not:

p̃(y) = (1/Ztot) Σy′ e^(−H(y,y′))
     = (1/Z̃) e^(−[H(y)+K(y)]) ≠ p(y).   (32)

Hence the claim that Z = Ztot implies p̃(y) = p(y) is false. Note that our counterexample generalizes immediately to the case where there are one or more parameters x in the Hamiltonian H(y) → Hx(y) that we might want to vary. For example, x could be one component of an external magnetic field. In this case, we simply choose Hx(y, y′) = Hx(y) + Hx(y′) + K(y) + ln Z̃x. This means that all derivatives of ln Z and ln Ztot with respect to x can agree despite the fact that p̃ ≠ p. This is important because all macroscopic observables such as the average energy, magnetization, etc. can be written in terms of derivatives of ln Z. This illustrates the point that an exact Kadanoff RG scheme that can be accurately used to compute physical observables nevertheless can fail to accomplish any sort of unsupervised learning. In retrospect, this is unsurprising since the point of renormalization is to compute macroscopic quantities, not to solve an unsupervised learning problem in the microscopic variables.
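A small numerical check of this counterexample (our own snippet; H and K are random functions on an arbitrary 6-state space):

import numpy as np

# With the joint Hamiltonian of eq. (30), the partition functions match exactly
# while the marginal distribution p~(y) differs from p(y).

rng = np.random.default_rng(2)
n = 6
H = rng.normal(size=n)                 # Hamiltonian H(y) on a 6-state space
K = rng.normal(size=n)                 # any non-constant function K(y)

Z = np.exp(-H).sum()
p = np.exp(-H) / Z

Ztilde = np.exp(-(H + K)).sum()
# Joint Hamiltonian H(y,y') = H(y) + H(y') + K(y) + ln(Ztilde), eq. (30):
joint = np.exp(-(H[:, None] + H[None, :] + K[:, None] + np.log(Ztilde)))

print(np.isclose(joint.sum(), Z))      # True: partition functions agree
ptilde = joint.sum(axis=1) / joint.sum()
print(np.allclose(ptilde, p))          # False: the marginals differ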

[1] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[2] S. Russell, D. Dewey, and M. Tegmark, AI Magazine 36 (2015).
[3] K. Hornik, M. Stinchcombe, and H. White, Neural networks 2, 359 (1989).
[4] G. Cybenko, Mathematics of control, signals and systems 2, 303 (1989).
[5] A. Pinkus, Acta Numerica 8, 143 (1999).
[6] B. Gnedenko, A. Kolmogorov, B. Gnedenko, and A. Kolmogorov, Amer. J. Math. 105, 28 (1954).
[7] E. T. Jaynes, Physical review 106, 620 (1957).
[8] R. Kindermann and J. L. Snell (1980).
[9] M. Tegmark, A. Aguirre, M. J. Rees, and F. Wilczek, Physical Review D 73, 023505 (2006).
[10] O. Delalleau and Y. Bengio, in Advances in Neural Information Processing Systems (2011), pp. 666–674.
[11] H. Mhaskar, Q. Liao, and T. Poggio, ArXiv e-prints (2016), 1603.00988.
[12] H. Mhaskar and T. Poggio, arXiv preprint arXiv:1608.03287 (2016).
[13] R. Adam, P. Ade, N. Aghanim, Y. Akrami, M. Alves, M. Arnaud, F. Arroja, J. Aumont, C. Baccigalupi, M. Ballardini, et al., arXiv preprint arXiv:1502.01582 (2015).
[14] U. Seljak and M. Zaldarriaga, arXiv preprint astro-ph/9603033 (1996).
[15] M. Tegmark, Physical Review D 55, 5895 (1997).
[16] J. Bond, A. H. Jaffe, and L. Knox, Physical Review D 57, 2117 (1998).
[17] M. Tegmark, A. de Oliveira-Costa, and A. J. Hamilton, Physical Review D 68, 123523 (2003).
[18] P. Ade, N. Aghanim, C. Armitage-Caplan, M. Arnaud, M. Ashdown, F. Atrio-Barandela, J. Aumont, C. Baccigalupi, A. J. Banday, R. Barreiro, et al., Astronomy & Astrophysics 571, A12 (2014).
[19] M. Tegmark, The Astrophysical Journal Letters 480, L87 (1997).
[20] G. Hinshaw, C. Barnes, C. Bennett, M. Greason, M. Halpern, R. Hill, N. Jarosik, A. Kogut, M. Limon, S. Meyer, et al., The Astrophysical Journal Supplement Series 148, 63 (2003).
[21] G. Hinton, Momentum 9, 926 (2010).
[22] M. Émile Borel, Rendiconti del Circolo Matematico di Palermo (1884-1940) 27, 247 (1909).
[23] R. A. Fisher, Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222, 309 (1922).
[24] M. Riesenhuber and T. Poggio, Nature neuroscience 3, 1199 (2000).
[25] S. Kullback and R. A. Leibler, Ann. Math. Statist. 22, 79 (1951), URL http://dx.doi.org/10.1214/aoms/1177729694.
[26] T. M. Cover and J. A. Thomas, Elements of information theory (John Wiley & Sons, 2012).
[27] M. Kardar, Statistical physics of fields (Cambridge University Press, 2007).
[28] J. Cardy, Scaling and renormalization in statistical physics, vol. 5 (Cambridge university press, 1996).
[29] J. K. Johnson, D. M. Malioutov, and A. S. Willsky, ArXiv e-prints (2007), 0710.0013.
[30] C. Bény, ArXiv e-prints (2013), 1301.3124.
[31] S. Saremi and T. J. Sejnowski, Proceedings of the National Academy of Sciences 110, 3071 (2013), http://www.pnas.org/content/110/8/3071.full.pdf, URL http://www.pnas.org/content/110/8/3071.abstract.
[32] P. Mehta and D. J. Schwab, ArXiv e-prints (2014), 1410.3831.
[33] E. Miles Stoudenmire and D. J. Schwab, ArXiv e-prints (2016), 1605.05775.
[34] G. Vidal, Physical Review Letters 101, 110501 (2008), quant-ph/0610099.
[35] J. Ba and R. Caruana, in Advances in neural information processing systems (2014), pp. 2654–2662.
[36] J. Hastad, in Proceedings of the eighteenth annual ACM symposium on Theory of computing (ACM, 1986), pp. 6–20.
[37] T. Poggio, F. Anselmi, and L. Rosasco, Tech. Rep., Center for Brains, Minds and Machines (CBMM) (2015).
[38] M. Telgarsky, arXiv preprint arXiv:1509.08101 (2015).
[39] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, in Advances in neural information processing systems (2014), pp. 2924–2932.
[40] R. Eldan and O. Shamir, arXiv preprint arXiv:1512.03965 (2015).
[41] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, ArXiv e-prints (2016), 1606.05340.
[42] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, ArXiv e-prints (2016), 1606.05336.
[43] A. M. Saxe, J. L. McClelland, and S. Ganguli, arXiv preprint arXiv:1312.6120 (2013).
[44] V. Strassen, Numerische Mathematik 13, 354 (1969).
[45] F. Le Gall, in Proceedings of the 39th international symposium on symbolic and algebraic computation (ACM, 2014), pp. 296–303.
