Why Does Deep and Cheap Learning Work So Well?
Deep learning works remarkably well, and has helped dramatically improve the state-of-the-art in areas ranging from speech recognition, translation and visual object recognition to drug discovery, genomics and automatic game playing [1]. However, it is still not fully understood why deep learning works so well. In contrast to GOFAI ("good old-fashioned AI") algorithms that are hand-crafted and fully understood analytically, many algorithms using artificial neural networks are understood only at a heuristic level, where we empirically know that certain training protocols employing large data sets will result in excellent performance. This is reminiscent of the situation with human brains: we know that if we train a child according to a certain curriculum, she will learn certain skills — but we lack a deep understanding of how her brain accomplishes this.

For concreteness, let us focus on the task of approximating functions. As illustrated in Figure 1, this covers most core sub-fields of machine learning, including unsupervised learning, classification and prediction. For example, if we are interested in classifying faces, then we may want our neural network to implement a function where we feed in an image represented by a million greyscale pixels and get as output the probability distribution over a set of people that the image might represent.

[Figure 1 graphic not reproduced; its panels include unsupervised learning of p(x, y).]
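To make this concrete in code, here is a minimal sketch of a function with exactly this input-output signature. It is not from the paper: the number of people, the random untrained weights W, and the softmax helper are all illustrative assumptions.

    import numpy as np

    def softmax(z):
        z = z - z.max()                  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # Map a million-pixel greyscale image to a probability distribution
    # over a (hypothetical) set of 10 people.
    n_pixels, n_people = 1_000_000, 10
    rng = np.random.default_rng(0)
    W = rng.normal(scale=1e-3, size=(n_people, n_pixels))  # untrained weights
    image = rng.random(n_pixels)         # stand-in for a greyscale image

    probs = softmax(W @ image)
    print(probs.sum())                   # 1.0: a valid probability distribution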
determined from their sum: for example, the product y_1 y_2 y_3 = 1 if and only if the sum y_1 + y_2 + y_3 = 3. This sum-checking can be implemented using one of the most popular choices for a nonlinear function σ: the logistic sigmoid σ(y) = 1/(1 + e^{−y}), which satisfies σ(y) ≈ 0 for y ≪ 0 and σ(y) ≈ 1 for y ≫ 0. To compute the product of some set of k bits described by the set K (for our example above, K = {1, 2, 3}), we let A_1 and A_2 shift and stretch the sigmoid to exploit the identity

    ∏_{i∈K} y_i = lim_{β→∞} σ(−β(k − 1/2 − Σ_{i∈K} y_i)).   (13)

Since σ decays exponentially fast toward 0 or 1 as β is increased, modestly large β-values suffice in practice; if, for example, we want the correct answer to D = 10 decimal places, we merely need β > D ln 10 ≈ 23. In summary, when y is a bit string, an arbitrary function p_x(y) can be evaluated by a simple 3-layer neural network: the middle layer uses sigmoid functions to compute the products from equation (12), and the top layer performs the sums from equation (12) and the softmax from equation (8).

ranging from 2 to 4. This means that the number of polynomial coefficients is not infinite as in equation (9) or exponential in n as in equation (12), merely of order n^2, n^3 or n^4.

Thanks to the Central Limit Theorem [6], many probability distributions in machine learning and statistics can be accurately approximated by multivariate Gaussians, i.e., of the form

    p(y) = e^{h + Σ_i h_i y_i − Σ_{ij} h_{ij} y_i y_j},   (14)

which means that the Hamiltonian H = −ln p is a quadratic polynomial. More generally, the maximum-entropy probability distribution subject to constraints on some of the lowest moments, say expectation values of the form ⟨y_1^{α_1} y_2^{α_2} ··· y_n^{α_n}⟩ for some integers α_i ≥ 0, would lead to a Hamiltonian of degree no greater than d ≡ Σ_i α_i [7].

Image classification tasks often exploit invariance under translation, rotation, and various nonlinear deformations of the image plane that move pixels to new locations. All such spatial transformations are linear functions (d = 1 polynomials) of the pixel vector y. Functions implementing convolutions and Fourier transforms are also d = 1 polynomials.
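Identity (13) is easy to verify numerically. The sketch below assumes only NumPy; the value β = 30 is an arbitrary choice consistent with the β > D ln 10 estimate above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def product_via_sigmoid(bits, beta=30.0):
        # Eq. (13): sigma(-beta*(k - 1/2 - sum(bits))) -> 1 iff all k bits are 1
        k = len(bits)
        return sigmoid(-beta * (k - 0.5 - sum(bits)))

    for bits in [(1, 1, 1), (1, 0, 1), (0, 0, 0)]:
        print(bits, np.prod(bits), float(product_via_sigmoid(bits)))
    # (1, 1, 1) gives ~0.9999997; the other cases give ~3e-07 or smaller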
dependent parameters required to describe it is further reduced. For instance, many probability distributions in both physics and machine learning are invariant under translation and rotation. As an example, consider a vector y of air pressures y_i measured by a microphone at times i = 1, ..., n. Assuming that the Hamiltonian describing it has d = 2 reduces the number of parameters N from ∞ to (n + 1)(n + 2)/2. Further assuming locality (nearest-neighbor couplings only) reduces this to N = 2n, after which requiring translational symmetry reduces the parameter count to N = 3. Taken together, the constraints on locality, symmetry and polynomial order reduce the number of continuous parameters in the Hamiltonian of the standard model of physics to merely 32 [9].

Symmetry can reduce not merely the parameter count, but also the computational complexity. For example, if a linear vector-valued function f(y) mapping a set of n variables onto itself happens to satisfy translational symmetry, then it is a convolution (implementable by a convolutional neural net; "convnet"), which means that it can be computed with n log_2 n rather than n^2 multiplications using the Fast Fourier transform.

III. WHY DEEP?

Above we investigated how probability distributions from physics and computer science applications lent themselves to "cheap learning", being accurately and efficiently approximated by neural networks with merely a handful of layers. Let us now turn to the separate question of depth, i.e., the success of deep learning: what properties of real-world probability distributions cause efficiency to further improve when networks are made deeper? This question has been extensively studied from a mathematical point of view [10–12], but mathematics alone cannot fully answer it, because part of the answer involves physics. We will argue that the answer involves the hierarchical/compositional structure of generative processes together with the inability to efficiently "flatten" neural networks reflecting this structure.

A. Hierarchical processes

One of the most striking features of the physical world is its hierarchical structure. Spatially, it is an object hierarchy: elementary particles form atoms which in turn form molecules, cells, organisms, planets, solar systems, galaxies, etc. Causally, complex structures are frequently created through a distinct sequence of simpler steps.

Figure 3 gives two examples of such causal hierarchies generating data vectors x_0 ↦ x_1 ↦ ... ↦ x_n that are relevant to physics and image classification, respectively.

Both examples involve a Markov chain³ where the probability distribution p(x_i) at the ith level of the hierarchy is determined from its causal predecessor alone:

    p_i = M_i p_{i−1},   (15)

where the probability vector p_i specifies the probability distribution of p(x_i) according to (p_i)_x ≡ p(x_i) and the Markov matrix M_i specifies the transition probabilities between two neighboring levels, p(x_i|x_{i−1}). Iterating equation (15) gives

    p_n = M_n M_{n−1} ··· M_1 p_0,   (16)

so we can write the combined effect of the entire generative process as a matrix product.

In our physics example (Figure 3, left), a set of cosmological parameters x_0 (the density of dark matter, etc.) determines the power spectrum x_1 of density fluctuations in our universe, which in turn determines the pattern of cosmic microwave background radiation x_2 reaching us from our early universe, which gets combined with foreground radio noise from our Galaxy to produce the frequency-dependent sky maps (x_3) that are recorded by a satellite-based telescope that measures linear combinations of different sky signals and adds electronic receiver noise. For the recent example of the Planck Satellite [13], these data sets x_0, x_1, ... contained about 10^1, 10^4, 10^8, 10^9 and 10^12 numbers, respectively.

More generally, if a given data set is generated by a (classical) statistical physics process, it must be described by an equation in the form of equation (16), since dynamics in classical physics is fundamentally Markovian: classical equations of motion are always first-order differential equations in the Hamiltonian formalism. This technically covers essentially all data of interest in the machine learning community, although the fundamental Markovian nature of the generative process of the data may be an inefficient description.

Our toy image classification example (Figure 3, right) is deliberately contrived and over-simplified for pedagogy: x_0 is a single bit signifying "cat or dog", which determines a set of parameters determining the animal's coloration, body shape, posture, etc. using appropriate probability distributions, which determine a 2D image via ray-tracing, which is scaled and translated by random amounts before a random background is added.

³ If the next step in the generative hierarchy requires knowledge not merely of the present state but also of the past, the present state can be redefined to include this information as well, thus ensuring that the generative process is a Markov process.
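Equations (15) and (16) are straightforward to play with numerically. In the sketch below, the level sizes and the prior p_0 are illustrative assumptions, not values from the paper; each Markov matrix has columns that sum to one, so their product maps a distribution over x_0 to a distribution over x_n:

    import numpy as np

    rng = np.random.default_rng(0)

    def random_markov_matrix(n_out, n_in):
        # Column j holds p(x_i | x_{i-1} = j), so each column sums to 1.
        M = rng.random((n_out, n_in))
        return M / M.sum(axis=0)

    # A toy 2-step hierarchy x0 -> x1 -> x2 as in eq. (16).
    M1 = random_markov_matrix(5, 2)      # p(x1 | x0)
    M2 = random_markov_matrix(8, 5)      # p(x2 | x1)
    p0 = np.array([0.5, 0.5])            # e.g., a "cat or dog" prior over x0

    p2 = M2 @ M1 @ p0                    # p_n = M_n ... M_1 p_0
    print(p2.sum())                      # 1.0: still a probability distribution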
[FIG. 3 graphic: two parallel chains with downward generative steps M1–M4 and upward inference steps f0–f3. Left (physics): PARAMETERS (n, nT, Q, T/S) → SPECTRUM → CMB SKY MAP (x2) → MULTI-FREQUENCY MAPS (x3) → TELESCOPE DATA (y = x4), via steps such as "generate fluctuations", "simulate sky map", "take linear combinations", "add noise". Right (image classification): LABEL → PARAMETERS → RAY-TRACED OBJECT (x2) → TRANSFORMED OBJECT (x3) → FINAL IMAGE (x4), via "select color", "ray trace", "select background", "add". Numeric pixel tables omitted.]
FIG. 3: Causal hierarchy examples relevant to physics (left) and image classification (right). As information flows down the hierarchy x_0 → x_1 → ... → x_n = y, some of it is destroyed by random Markov processes. However, no further information is lost as information flows optimally back up the hierarchy as x̂_{n−1} → ... → x̂_0. The right example is deliberately contrived and over-simplified for pedagogy; for example, translation and scaling are more naturally performed before ray tracing, which in turn breaks down into multiple steps.

In both examples, the goal is to reverse this generative hierarchy to learn about the input x ≡ x_0 from the output x_n ≡ y, specifically to provide the best possible estimate
of the probability distribution p(x|y) = p(x_0|x_n) — i.e., to determine the probability distribution for the cosmological parameters and to determine the probability that the image is a cat, respectively.

B. Resolving the swindle

This decomposition of the generative process into a hierarchy of simpler steps helps resolve the "swindle" paradox from the introduction: although the number of parameters required to describe an arbitrary function of the input data y is beyond astronomical, the generative process can be specified by a more modest number of parameters, because each of its steps can. Whereas specifying an arbitrary probability distribution over multi-megapixel images y requires far more bits than there are atoms in our universe, the information specifying how to compute the probability distribution p(y|x) for a microwave background map fits into a handful of published journal articles or software packages [14–20]. For a megapixel image of a galaxy, its entire probability distribution is defined by the standard model of particle physics with its 32 parameters [9], which together specify the process transforming primordial hydrogen gas into galaxies.
The same parameter-counting argument can also be applied to all artificial images of interest to machine learning: for example, giving the simple low-information-content instruction "draw a cute kitten" to a random sample of artists will produce a wide variety of images y with a complicated probability distribution over colors, postures, etc., as each artist makes random choices at a series of steps. Even the pre-stored information about cat probabilities in these artists' brains is modest in size.

Note that a random resulting image typically contains much more information than the generative process creating it; for example, the simple instruction "generate a random string of 10^9 bits" contains much fewer than 10^9 bits. Not only are the typical steps in the generative hierarchy specified by a non-astronomical number of parameters, but as discussed in Section II D, it is plausible that neural networks can implement each of the steps efficiently.⁴

A deep neural network stacking these simpler networks on top of one another would then implement the entire generative process efficiently. In summary, the data sets and functions we care about form a minuscule minority, and it is plausible that they can also be efficiently implemented by neural networks reflecting their generative process. So what is the remainder? Which are the data sets and functions that we do not care about?

Almost all images are indistinguishable from random noise, and almost all data sets and functions are indistinguishable from completely random ones. This follows from Borel's theorem on normal numbers [22], which states that almost all real numbers have a string of decimals that would pass any randomness test, i.e., are indistinguishable from random noise. Simple parameter counting shows that deep learning (and our human brains, for that matter) would fail to implement almost all such functions, and training would fail to find any useful patterns. To thwart pattern-finding efforts, cryptography therefore aims to produce random-looking patterns. Although we might expect the Hamiltonians describing human-generated data sets such as drawings, text and music to be more complex than those describing simple physical systems, we should nonetheless expect them to resemble the natural data sets that inspired their creation much more than they resemble random functions.

C. Sufficient statistics and hierarchies

The goal of deep learning classifiers is to reverse the hierarchical generative process as well as possible, to make inferences about the input x from the output y. Let us now treat this hierarchical problem more rigorously using information theory.

Given P(x|y), a sufficient statistic T(y) is defined by the equation P(x|y) = P(x|T(y)) and has played an important role in statistics for almost a century [23]. All the information about x contained in y is contained in the sufficient statistic. A minimal sufficient statistic [23] is some sufficient statistic T_* which is a sufficient statistic for all other sufficient statistics. This means that if T(y) is sufficient, then there exists some function f such that T_*(y) = f(T(y)). As illustrated in Figure 3, T_* can be thought of as an information distiller, optimally compressing the data so as to retain all information relevant to determining x and discarding all irrelevant information.

The sufficient statistic formalism enables us to state some simple but important results that apply to any hierarchical generative process cast in the Markov chain form of equation (16).

Theorem 2: Given a Markov chain described by our notation above, let T_i be a minimal sufficient statistic of P(x_i|x_n). Then there exist functions f_i such that T_i = f_i ∘ T_{i+1}. More casually speaking, the generative hierarchy of Figure 3 can be optimally reversed one step at a time: there are functions f_i that optimally undo each of the steps, distilling out all information about the level above that was not destroyed by the Markov process.

Here is the proof. Note that for any k ≥ 1, "backwards" Markovity P(x_i|x_{i+1}, x_{i+k}) = P(x_i|x_{i+1}) follows from Markovity via Bayes' theorem:

    P(x_i|x_{i+k}, x_{i+1}) = P(x_{i+k}|x_i, x_{i+1}) P(x_i|x_{i+1}) / P(x_{i+k}|x_{i+1})
                            = P(x_{i+k}|x_{i+1}) P(x_i|x_{i+1}) / P(x_{i+k}|x_{i+1})   (17)
                            = P(x_i|x_{i+1}).

Using this fact, we see that

    P(x_i|x_n) = Σ_{x_{i+1}} P(x_i|x_{i+1}, x_n) P(x_{i+1}|x_n)
               = Σ_{x_{i+1}} P(x_i|x_{i+1}) P(x_{i+1}|T_{i+1}(x_n)).   (18)
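The backwards-Markovity step (17) can be checked numerically on a small chain. The following sketch (the dimensions and random seed are arbitrary assumptions) builds the joint distribution of a three-level chain x0 → x1 → x2 and confirms that P(x0|x1, x2) = P(x0|x1):

    import numpy as np

    rng = np.random.default_rng(1)

    def markov(n_out, n_in):
        M = rng.random((n_out, n_in))
        return M / M.sum(axis=0)            # columns hold p(next | prev)

    p0 = rng.random(3); p0 /= p0.sum()      # distribution over x0
    M1, M2 = markov(4, 3), markov(5, 4)

    # joint[i, j, k] = p(x0=i) p(x1=j | x0=i) p(x2=k | x1=j)
    joint = p0[:, None, None] * M1.T[:, :, None] * M2.T[None, :, :]

    # Backwards Markovity, eq. (17): p(x0 | x1, x2) equals p(x0 | x1).
    p_x0_given_x1x2 = joint / joint.sum(axis=0, keepdims=True)
    p_x0x1 = joint.sum(axis=2)
    p_x0_given_x1 = p_x0x1 / p_x0x1.sum(axis=0, keepdims=True)
    print(np.allclose(p_x0_given_x1x2, p_x0_given_x1[:, :, None]))  # True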
Similar to the case of sufficient statistics, P(x|R_n(y)) will then be a compositional function.

Contrary to some claims in the literature, effective field theory and the renormalization group have little to do with the idea of unsupervised learning and pattern-finding. Instead, the standard renormalization procedures in statistical physics and quantum field theory are essentially a feature extractor for supervised learning, where the features typically correspond to long-wavelength/macroscopic degrees of freedom. In other words, effective field theory only makes sense if we specify what features we are interested in. For example, if we are given data y about the positions and momenta of particles inside a mole of some liquid and are tasked with predicting from this data whether or not Alice will burn her finger when touching the liquid, a (nearly) sufficient statistic is simply the temperature of the object, which can in turn be obtained from some very coarse-grained degrees of freedom (for example, one could use the fluid approximation instead of working directly from the positions and momenta of ∼10^23 particles).

To obtain a more quantitative link between renormalization and deep-learning-style feature extraction, let us consider as a toy model for natural images (functions of a 2D position vector r) a generic two-dimensional Gaussian random field y(r) whose Hamiltonian satisfies translational and rotational symmetry:

    H_x(y) = ∫ [x_0 y^2 + x_1 (∇y)^2 + x_2 (∇^2 y)^2 + ···] d^2r.   (23)

Thus the fictitious classes of images that we are trying to distinguish are all generated by Hamiltonians H_x with the same above form but different parameter vectors x. We assume that the function y(r) is specified on pixels that are sufficiently close that derivatives can be well-approximated by differences. Derivatives are linear operations, so they can be implemented in the first layer of a neural network. The translational symmetry of equation (23) allows it to be implemented with a convnet. It can be shown [27] that for any coarse-graining operation that replaces each block of b × b pixels by its average and divides the result by b^2, the Hamiltonian retains the form of equation (23) but with the parameters x_i replaced by

    x′_i = b^{2−2i} x_i.   (24)

This means that all parameters x_i with i ≥ 2 decay exponentially with b as we repeatedly renormalize and b keeps increasing, so that for modest b, one can neglect all but the first few x_i's. In this example, the parameters x_0 and x_1 would be called "relevant operators" by physicists and "signal" by machine-learners, whereas the remaining parameters would be called "irrelevant operators" by physicists and "noise" by machine-learners.
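The coarse-graining operation quoted from [27] is simple to write down explicitly. Below is a minimal sketch of one renormalization step; note that the field is a stand-in white-noise image rather than a sample from Hamiltonian (23), and the grid size and b = 2 are arbitrary choices:

    import numpy as np

    def coarse_grain(y, b):
        # One RG step: replace each b x b block of pixels by its average,
        # then divide the result by b**2.
        n = y.shape[0] - y.shape[0] % b          # crop to a multiple of b
        blocks = y[:n, :n].reshape(n // b, b, n // b, b)
        return blocks.mean(axis=(1, 3)) / b**2

    rng = np.random.default_rng(2)
    y = rng.normal(size=(64, 64))                # toy stand-in for y(r)
    y1 = coarse_grain(y, 2)
    print(y.shape, "->", y1.shape)               # (64, 64) -> (32, 32)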
In summary, renormalization is a special case of feature extraction and nearly sufficient statistics, typically treating small scales as noise. This makes it a special case of supervised learning, not unsupervised learning. We elaborate on this further in Appendix A, where we construct a counter-example to a recent claim [32] that a so-called "exact" RG is equivalent to perfectly reconstructing the empirical probability distribution in an unsupervised problem. The information-distillation nature of renormalization is explicit in many numerical methods, where the purpose of the renormalization group is to efficiently and accurately evaluate the free energy of the system as a function of macroscopic variables of interest such as temperature and pressure. Thus we can only sensibly talk about the accuracy of an RG-scheme once we have specified what macroscopic variables we are interested in.

A subtlety regarding the above statements is presented by the Multi-scale Entanglement Renormalization Ansatz (MERA) [34]. MERA can be viewed as a variational class of wave functions whose parameters can be tuned to match a given wave function as closely as possible. From this perspective, MERA is an unsupervised machine learning algorithm, where classical probability distributions over many variables are replaced with quantum wavefunctions. Due to the special tensor network structure found in MERA, the resulting variational approximation of a given wavefunction has an interpretation as generating an RG flow. Hence this is an example of an unsupervised learning problem whose solution gives rise to an RG flow. This is only possible due to the extra mathematical structure in the problem (the specific tensor network found in MERA); a generic variational Ansatz does not give rise to any RG interpretation and vice versa.

F. No-flattening theorems

Above we discussed how Markovian generative models cause p(y|x) to be a composition of a number of simpler functions f_i. Suppose that we can approximate each function f_i with an efficient neural network for the reasons given in Section II. Then we can simply stack these networks on top of each other, to obtain a deep neural network efficiently approximating p(y|x).

But is this the most efficient way to represent p(y|x)? Since we know that there are shallower networks that accurately approximate it, are any of these shallow networks as efficient as the deep one, or does flattening necessarily come at an efficiency cost?

To be precise, for a neural network f defined by equation (6), we will say that the neural network f_ℓ is the flattened version of f if its number ℓ of hidden layers is smaller and f_ℓ approximates f within some error ε (as measured by some reasonable norm). We say that f_ℓ is a neuron-efficient flattening if the sum of the dimensions of its hidden layers (sometimes referred to as the number of neurons N_n) is less than for f. We say that f_ℓ is a synapse-efficient flattening if the number N_s of non-zero entries (sometimes called synapses) in its weight matrices is less than for f. This lets us define the flattening cost of a network f as the two functions

    C_n(f, ℓ, ε) ≡ min_{f_ℓ} N_n(f_ℓ) / N_n(f),   (25)
    C_s(f, ℓ, ε) ≡ min_{f_ℓ} N_s(f_ℓ) / N_s(f),   (26)

specifying the factor by which optimal flattening increases the neuron count and the synapse count, respectively. We refer to results where C_n > 1 or C_s > 1 for some class of functions f as "no-flattening theorems", since they imply that flattening comes at a cost and efficient flattening is impossible. A complete list of no-flattening theorems would show exactly when deep networks are more efficient than shallow networks.

There has already been very interesting progress in this spirit, but crucial questions remain. On one hand, it has been shown that deep is not always better, at least empirically for some image classification tasks [35]. On the other hand, many functions f have been found for which the flattening cost is significant. Certain deep Boolean circuit networks are exponentially costly to flatten [36]. Two families of multivariate polynomials with an exponential flattening cost C_n are constructed in [10]. [11, 12, 37] focus on functions that have a tree-like hierarchical compositional form, concluding that the flattening cost C_n is exponential for almost all functions in Sobolev space. For the ReLU activation function, [38] finds a class of functions that exhibit exponential flattening costs; [39] studies a tailored complexity measure of deep versus shallow ReLU networks. [40] shows that given weak conditions on the activation function, there always exists at least one function that can be implemented in a 3-layer network which has an exponential flattening cost. Finally, [41, 42] study the differential geometry of shallow versus deep networks, and find that flattening is exponentially neuron-inefficient. Further work elucidating the cost of flattening various classes of functions will clearly be highly valuable.

G. Linear no-flattening theorems

In the meantime, we will now see that interesting no-flattening results can be obtained even in the simpler-to-model context of linear neural networks [43], where the σ operators are replaced with the identity and all biases are set to zero such that the A_i are simply linear operators (matrices). Every map is specified by a matrix of real (or complex) numbers, and composition is implemented by matrix multiplication.

One might suspect that such a network is so simple that the questions concerning flattening become entirely trivial: after all, successive multiplication with n different matrices is equivalent to multiplying by a single matrix (their product). While the effect of flattening is indeed trivial for expressibility (f can express any linear function, independently of how many layers there are), this is not the case for learnability, which involves nonlinear and complex dynamics despite the linearity of the network [43]. We will show that the efficiency of such linear networks is also a very rich question.

Neuronal efficiency is trivially attainable for linear networks, since all hidden-layer neurons can be eliminated without accuracy loss by simply multiplying all the weight matrices together. We will instead consider the case of synaptic efficiency and set ℓ = ε = 0.

Many divide-and-conquer algorithms in numerical linear algebra exploit some factorization of a particular matrix A in order to yield significant reductions in complexity. For example, when A represents the discrete Fourier transform (DFT), the fast Fourier transform (FFT) algorithm makes use of a sparse factorization of A which only contains O(n log n) non-zero matrix elements, instead of the naive single-layer implementation, which contains n^2 non-zero matrix elements. This is our first example of a linear no-flattening theorem: fully flattening a network that performs an FFT of n variables increases the synapse count N_s from O(n log n) to O(n^2), i.e., incurs a flattening cost C_s = O(n/log n) ∼ O(n). This argument applies also to many variants and generalizations of the FFT such as the Fast Wavelet Transform and the Fast Walsh-Hadamard Transform.
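The FFT claim can be made tangible with the radix-2 Cooley-Tukey factorization: the dense n × n DFT matrix, with n^2 non-zero entries, factors into log_2 n butterfly matrices with about 2n non-zero entries each. The sketch below is ours, not from the paper; it assumes n is a power of 2, and the helper names are arbitrary:

    import numpy as np

    def butterfly(m):
        # Combines two length-m/2 DFTs into one length-m DFT.
        h = m // 2
        D = np.diag(np.exp(-2j * np.pi * np.arange(h) / m))
        I = np.eye(h)
        return np.block([[I, D], [I, -D]])

    def fft_factors(n):
        # Sparse factors of the DFT matrix (n a power of 2): one
        # butterfly stage per level, each with ~2n non-zero entries.
        factors, m = [], n
        while m >= 2:
            factors.append(np.kron(np.eye(n // m), butterfly(m)))
            m //= 2
        return factors

    def bit_reversal(n):
        bits = int(np.log2(n))
        idx = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
        P = np.zeros((n, n))
        P[np.arange(n), idx] = 1
        return P

    n = 16
    F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
    assert np.allclose(np.linalg.multi_dot(fft_factors(n)) @ bit_reversal(n), F)

    dense = np.count_nonzero(F)                                # n^2 = 256
    sparse = sum(np.count_nonzero(A) for A in fft_factors(n))  # 2n log2(n) = 128
    print(dense, sparse)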
Another important example illustrating the subtlety of linear networks is matrix multiplication. More specifically, take the input of a neural network to be the entries of a matrix M and the output to be NM, where both M and N have size n × n. Since matrix multiplication is linear, this can be exactly implemented by a 1-layer linear neural network. Amazingly, the naive algorithm for matrix multiplication, which requires n^3 multiplications, is not optimal: the Strassen algorithm [44] requires only O(n^ω) multiplications (synapses), where ω = log_2 7 ≈ 2.81, and recent work has cut this scaling exponent down to ω ≈ 2.3728639 [45]. This means that fully optimized matrix multiplication on a deep neural network has a flattening cost of at least C_s = O(n^0.6271361).

Low-rank matrix multiplication gives a more elementary no-flattening theorem. If A is a rank-k matrix, we can factor it as A = BC where B is an n × k matrix and C is a k × n matrix. Hence the number of synapses is n^2 for an ℓ = 0 network and 2nk for an ℓ = 1 network, giving a flattening cost C_s = n/2k > 1 as long as the rank k < n/2.

Finally, let us consider flattening a network f = AB, where A and B are random sparse n × n matrices such that each element is 1 with probability p and 0 with probability 1 − p. Flattening the network results in a matrix F_ij = Σ_k A_ik B_kj, so the probability that F_ij = 0 is (1 − p^2)^n. Hence the number of non-zero components will on average be [1 − (1 − p^2)^n] n^2, so

    C_s = [1 − (1 − p^2)^n] n^2 / (2n^2 p) = [1 − (1 − p^2)^n] / (2p).   (27)

Note that C_s ≤ 1/(2p) and that this bound is asymptotically saturated for n ≫ 1/p^2. Hence in the limit where n is very large, flattening multiplication by sparse matrices (p ≪ 1) is horribly inefficient.

    Physics                        | Machine learning
    -------------------------------|-----------------------------------
    Hamiltonian                    | Surprisal −ln p
    Simple H                       | Cheap learning
    Quadratic H                    | Gaussian p
    Locality                       | Sparsity
    Translationally symmetric H    | Convnet
    Computing p from H             | Softmaxing
    Spin                           | Bit
    Free energy difference         | KL-divergence
    Effective theory               | Nearly lossless data distillation
    Irrelevant operator            | Noise
    Relevant operator              | Feature
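Equation (27) can be checked by direct simulation. In this sketch (n = 200 and p = 0.02 are arbitrary assumptions), we count non-zero weights in the two-layer network {A, B} and in its flattened product AB:

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 200, 0.02

    A = (rng.random((n, n)) < p).astype(int)
    B = (rng.random((n, n)) < p).astype(int)
    F = A @ B                                    # the flattened single layer

    synapses_deep = A.sum() + B.sum()            # about 2 p n^2 non-zero weights
    synapses_flat = np.count_nonzero(F)          # about (1 - (1 - p^2)^n) n^2
    print(synapses_flat / synapses_deep)         # empirical flattening cost
    print((1 - (1 - p**2) ** n) / (2 * p))       # prediction of eq. (27), ~1.9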
where Z̃ = Σ_y e^{−[H(y)+K(y)]}. Then

    Z_tot = Σ_{y,y′} e^{−H(y,y′)}
          = (1/Z̃) Σ_{y,y′} e^{−[H(y)+K(y)+H(y′)]}
          = (1/Z̃) Σ_y e^{−[H(y)+K(y)]} Σ_{y′} e^{−H(y′)}   (31)
          = (1/Z̃) · Z̃ · Σ_{y′} e^{−H(y′)} = Σ_y e^{−H(y)} = Z.

So the partition functions agree. However, the marginalized probability distributions do not:

    p̃(y) = (1/Z_tot) Σ_{y′} e^{−H(y,y′)}
          = (1/Z̃) e^{−[H(y)+K(y)]} ≠ p(y).   (32)

Hence the claim that Z = Z_tot implies p̃(y) = p(y) is false. Note that our counterexample generalizes immediately to the case where there are one or more parameters x in the Hamiltonian H(y) → H_x(y) that we might want to vary. For example, x could be one component of an external magnetic field. In this case, we simply choose H_x(y, y′) = H_x(y) + H_x(y′) + K(y) + ln Z̃_x. This means that all derivatives of ln Z and ln Z_tot with respect to x can agree despite the fact that p̃ ≠ p. This is important because all macroscopic observables such as the average energy, magnetization, etc. can be written in terms of derivatives of ln Z. This illustrates the point that an exact Kadanoff RG scheme that can be accurately used to compute physical observables nevertheless can fail to accomplish any sort of unsupervised learning. In retrospect, this is unsurprising since the point of renormalization is to compute macroscopic quantities, not to solve an unsupervised learning problem in the microscopic variables.
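The algebra in equations (31) and (32) can also be confirmed numerically. The sketch below uses a tiny 6-state system with random H and K (arbitrary assumptions), with the joint Hamiltonian taking the form H(y, y′) = H(y) + H(y′) + K(y) + ln Z̃ used in the text; the partition functions match while the marginal distributions do not:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 6                                        # tiny discrete state space
    H = rng.normal(size=n)                       # H(y)
    K = rng.normal(size=n)                       # K(y)
    Ztilde = np.exp(-(H + K)).sum()

    # Joint Hamiltonian H(y, y') = H(y) + H(y') + K(y) + ln(Ztilde)
    Hjoint = H[:, None] + H[None, :] + K[:, None] + np.log(Ztilde)

    Z = np.exp(-H).sum()
    Ztot = np.exp(-Hjoint).sum()
    print(np.isclose(Z, Ztot))                   # True, as in eq. (31)

    p = np.exp(-H) / Z
    ptilde = np.exp(-Hjoint).sum(axis=1) / Ztot  # marginalize over y'
    print(np.allclose(p, ptilde))                # False, as in eq. (32)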
[1] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[2] S. Russell, D. Dewey, and M. Tegmark, AI Magazine 36 (2015).
[3] K. Hornik, M. Stinchcombe, and H. White, Neural Networks 2, 359 (1989).
[4] G. Cybenko, Mathematics of Control, Signals and Systems 2, 303 (1989).
[5] A. Pinkus, Acta Numerica 8, 143 (1999).
[6] B. Gnedenko, A. Kolmogorov, B. Gnedenko, and A. Kolmogorov, Amer. J. Math. 105, 28 (1954).
[7] E. T. Jaynes, Physical Review 106, 620 (1957).
[8] R. Kindermann and J. L. Snell (1980).
[9] M. Tegmark, A. Aguirre, M. J. Rees, and F. Wilczek, Physical Review D 73, 023505 (2006).
[10] O. Delalleau and Y. Bengio, in Advances in Neural Information Processing Systems (2011), pp. 666–674.
[11] H. Mhaskar, Q. Liao, and T. Poggio, ArXiv e-prints (2016), 1603.00988.
[12] H. Mhaskar and T. Poggio, arXiv preprint arXiv:1608.03287 (2016).
[13] R. Adam, P. Ade, N. Aghanim, Y. Akrami, M. Alves, M. Arnaud, F. Arroja, J. Aumont, C. Baccigalupi, M. Ballardini, et al., arXiv preprint arXiv:1502.01582 (2015).
[14] U. Seljak and M. Zaldarriaga, arXiv preprint astro-ph/9603033 (1996).
[15] M. Tegmark, Physical Review D 55, 5895 (1997).
[16] J. Bond, A. H. Jaffe, and L. Knox, Physical Review D 57, 2117 (1998).
[17] M. Tegmark, A. de Oliveira-Costa, and A. J. Hamilton, Physical Review D 68, 123523 (2003).
[18] P. Ade, N. Aghanim, C. Armitage-Caplan, M. Arnaud, M. Ashdown, F. Atrio-Barandela, J. Aumont, C. Baccigalupi, A. J. Banday, R. Barreiro, et al., Astronomy & Astrophysics 571, A12 (2014).
[19] M. Tegmark, The Astrophysical Journal Letters 480, L87 (1997).
[20] G. Hinshaw, C. Barnes, C. Bennett, M. Greason, M. Halpern, R. Hill, N. Jarosik, A. Kogut, M. Limon, S. Meyer, et al., The Astrophysical Journal Supplement Series 148, 63 (2003).
[21] G. Hinton, Momentum 9, 926 (2010).
[22] M. Émile Borel, Rendiconti del Circolo Matematico di Palermo (1884-1940) 27, 247 (1909).
[23] R. A. Fisher, Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222, 309 (1922).
[24] M. Riesenhuber and T. Poggio, Nature Neuroscience 3, 1199 (2000).
[25] S. Kullback and R. A. Leibler, Ann. Math. Statist. 22, 79 (1951), URL http://dx.doi.org/10.1214/aoms/1177729694.
[26] T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012).
[27] M. Kardar, Statistical Physics of Fields (Cambridge University Press, 2007).
[28] J. Cardy, Scaling and Renormalization in Statistical Physics, vol. 5 (Cambridge University Press, 1996).
[29] J. K. Johnson, D. M. Malioutov, and A. S. Willsky, ArXiv e-prints (2007), 0710.0013.
[30] C. Bény, ArXiv e-prints (2013), 1301.3124.
[31] S. Saremi and T. J. Sejnowski, Proceedings of the National Academy of Sciences 110, 3071 (2013), URL http://www.pnas.org/content/110/8/3071.abstract.
[32] P. Mehta and D. J. Schwab, ArXiv e-prints (2014), 1410.3831.
[33] E. Miles Stoudenmire and D. J. Schwab, ArXiv e-prints (2016), 1605.05775.
[34] G. Vidal, Physical Review Letters 101, 110501 (2008), quant-ph/0610099.
[35] J. Ba and R. Caruana, in Advances in Neural Information Processing Systems (2014), pp. 2654–2662.
[36] J. Hastad, in Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing (ACM, 1986), pp.