
The power of deeper networks for expressing natural functions

David Rolnick and Max Tegmark

Dept. of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139 and
Dept. of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139

arXiv:1705.05502v1 [cs.LG] 16 May 2017
(Dated: May 17, 2017)

It is well-known that neural networks are universal approximators, but that deeper networks tend to be much more efficient than shallow ones. We shed light on this by proving that the total number of neurons $m$ required to approximate natural classes of multivariate polynomials of $n$ variables grows only linearly with $n$ for deep neural networks, but grows exponentially when merely a single hidden layer is allowed. We also provide evidence that when the number of hidden layers is increased from $1$ to $k$, the neuron requirement grows exponentially not with $n$ but with $n^{1/k}$, suggesting that the minimum number of layers required for computational tractability grows only logarithmically with $n$.

I. INTRODUCTION

Deep learning has lately been shown to be a very powerful tool for a wide range of problems, from image segmentation to machine translation. Despite its success, many of the techniques developed by practitioners of artificial neural networks (ANNs) are heuristics without theoretical guarantees. Perhaps most notably, the power of feedforward networks with many layers (deep networks) has not been fully explained. The goal of this paper is to shed more light on this question and to suggest heuristics for how deep is deep enough.

It is well-known [1-3] that nonlinear neural networks with a single hidden layer can approximate any function under reasonable assumptions, but it is possible that the networks required will be extremely large. Recent authors have shown that some functions can be approximated by deeper networks much more efficiently (i.e. with fewer neurons) than by shallower ones. However, many of the functions in question are complicated or arise from existence proofs without explicit constructions, and the results often apply only to types of network rarely used in practice.

Deeper networks have been shown to have greater representational power with respect to various notions of complexity, including piecewise linear decision boundaries [4] and topological invariants [5]. Recently, Poole et al. [6] and Raghu et al. [7] showed that the trajectories of input variables attain exponentially greater length and curvature with greater network depth. Work including [8] and [9] shows that there exist functions that require exponential width to be approximated by a shallow network. Mhaskar, Liao, and Poggio [10], in considering compositional functions with this property, inquire whether explicit examples must be pathologically complicated.

Various authors have also considered the power of deeper networks of types other than the standard feedforward model. The problem has also been posed for sum-product networks [11] and restricted Boltzmann machines [12]. Cohen, Sharir, and Shashua [13] showed, using tools from tensor decomposition, that shallow arithmetic circuits can express only a measure-zero set of the functions expressible by deep circuits. A weak generalization of this result to convolutional neural networks was shown in [14].

In summary, recent years have seen a wide variety of theoretical demonstrations of the power of deep neural networks. It is important and timely to extend this work to make it more concrete and actionable, by deriving resource requirements for approximating natural classes of functions using today's most common neural network architectures. Lin, Tegmark, and Rolnick [15] recently proved that it is exponentially more efficient to use a deep network than a shallow network when approximating the product of input variables. In the present paper, we will greatly extend these results to include broad natural classes of multivariate polynomials, and to tackle the question of how resource use depends on the precise number of layers. Our results apply to standard feedforward ANNs with general nonlinearities and are borne out by empirical tests.

The rest of this paper is organized as follows. In §II B, we consider the complexity of approximating a multivariate polynomial p(x) using a feedforward neural network with input x = (x_1, ..., x_n). We show (Theorem II.4) that for general sparse p, exponentially many neurons are required if the network is shallow, but at most linearly many for a deep network. For monomials p, we calculate (Theorem II.1) exactly the minimum number of neurons required for a shallow network. These theorems apply for all nonlinear activation functions σ with nonzero Taylor coefficients; a slightly weaker result (Theorem II.2) holds for an even broader class of σ.

In §II C, we present similar results (Propositions II.5 and II.6) for approximating univariate polynomials. In this case, shallow networks require linearly many neurons, while for deep networks it suffices to use only logarithmically many neurons. In §II D, we tie the difficulty of approximating a polynomial to the complexity of tensor decomposition (Proposition II.7).

Contact: drolnick@mit.edu, tegmark@mit.edu

In §III A, we consider networks with a constant number k of hidden layers. For input of dimension n, we show (Theorem III.1) that products can be approximated with a number of neurons exponential in n^{1/k}, and justify our theoretical predictions with empirical results. While still exponential, this shows that problems unsolvable by shallow networks can be tractable even for k > 1 of modest size.

Finally, in §III B, we compare our results on feedforward neural networks to prior work on the complexity of Boolean circuits. We conclude that these problems are independent, and therefore that established hard problems in Boolean complexity do not provide any obstacle to analogous results for standard deep neural networks.

II. THE INEFFICIENCY OF SHALLOW NETWORKS

In this section, we compare the efficiency of shallow networks (those with a single hidden layer) and deep networks at approximating multivariate polynomials.

A. Definitions

Let σ(x) be a nonlinear function, k a positive integer, and p(x) a multivariate polynomial of degree d. We define m_k(p, σ) to be the minimum number of neurons (excluding input and output) required to approximate p with a neural net having k hidden layers and nonlinearity σ, where the error of approximation is of degree at least d + 1 in the input variables; we let m(p, σ) denote the minimum of m_k(p, σ) over all k. Thus, in particular, m_1(p, σ) is the minimal integer m such that:

$$\sum_{j=1}^{m} w_j\,\sigma\!\left(\sum_{i=1}^{n} a_{ij} x_i\right) = p(x) + O\!\left(x_1^{d+1} + \ldots + x_n^{d+1}\right).$$

Note that approximation up to degree d allows us to approximate any polynomial to high precision as long as the input variables are small enough. In particular, for homogeneous polynomials of degree d, we can adjust the weights so as to scale each variable by a small constant λ before input, and then scale the output by 1/λ^d, in order to achieve arbitrary precision.

In the subsections below, we show that there is an exponential gap between m_1(p, σ) and m(p, σ) for various classes of polynomials p.

B. Multivariate polynomials

The following theorem generalizes a result of Lin, Tegmark, and Rolnick [15] to arbitrary monomials. By setting r_1 = r_2 = ... = r_n = 1 below, we recover their result that the product of n numbers requires 2^n neurons in a shallow network but can be done with linearly many neurons in a deep network.

Theorem II.1. Let p(x) denote the monomial x_1^{r_1} x_2^{r_2} ⋯ x_n^{r_n}, with N = Σ_{i=1}^n r_i. Suppose that the nonlinearity σ(x) has nonzero Taylor coefficients up to x^N. Then:

$$m_1(p, \sigma) = \prod_{i=1}^{n} (r_i + 1), \qquad (1)$$

$$m(p, \sigma) \le \sum_{i=1}^{n} \left(7\lceil \log_2(r_i) \rceil + 4\right). \qquad (2)$$

Proof. Without loss of generality, suppose that r_i > 0 for i = 1, ..., n. Let X be the multiset¹ in which x_i occurs with multiplicity r_i.

We first show that ∏_{i=1}^n (r_i + 1) neurons are sufficient to approximate p(x). Appendix A in [15] demonstrates that for variables y_1, ..., y_N, the product y_1 ⋯ y_N can be approximated as a linear combination of the 2^N functions σ(±y_1 ± ⋯ ± y_N).

Consider setting y_1, ..., y_N equal to the elements of the multiset X. Then, we conclude that we can approximate p(x) as a linear combination of the functions σ(±y_1 ± ⋯ ± y_N). However, these functions are not all distinct: there are r_i + 1 distinct ways to assign signs to the r_i copies of x_i (ignoring permutations of the signs). Therefore, there are ∏_{i=1}^n (r_i + 1) distinct functions σ(±y_1 ± ⋯ ± y_N), proving that m_1(p, σ) ≤ ∏_{i=1}^n (r_i + 1).

We now adapt methods introduced in [15] to show that this number of neurons is also necessary for approximating p(x). Let m = m_1(p, σ) and suppose that σ(x) has the Taylor expansion Σ_{k=0}^∞ σ_k x^k. Then, by grouping terms of each order, we conclude that there exist constants a_{ij} and w_j such that

$$\sigma_N \sum_{j=1}^{m} w_j \left(\sum_{i=1}^{n} a_{ij} x_i\right)^{N} = p(x), \qquad (3)$$

$$\sigma_k \sum_{j=1}^{m} w_j \left(\sum_{i=1}^{n} a_{ij} x_i\right)^{k} = 0 \quad \text{for } 0 \le k \le N-1. \qquad (4)$$

¹ That is, a collection of elements in which each element is allowed to occur multiple times.
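As a concrete check, the sign-combination construction of [15] invoked above can be verified numerically for N = 2: the product y_1 y_2 is recovered from a linear combination of the 2^N = 4 functions σ(±y_1 ± y_2). The sketch below is ours, not the paper's code; it takes σ = exp as an example nonlinearity, whose second Taylor coefficient is σ_2 = 1/2:

```python
import math

def product_gate(y1, y2, sigma=math.exp, sigma2=0.5):
    """Approximate y1*y2 as a linear combination of sigma(+-y1 +- y2).

    The odd Taylor terms of sigma cancel in pairs, and the quadratic
    term isolates y1*y2; the leftover error is O(|y|^4).
    """
    s = (sigma(y1 + y2) + sigma(-y1 - y2)
         - sigma(y1 - y2) - sigma(-y1 + y2))
    return s / (8 * sigma2)

print(product_gate(0.1, 0.2))  # close to 0.02, with O(|y|^4) error
```

Since the error is of degree 4 in the inputs, the approximation improves rapidly as the inputs are scaled toward zero, matching the scaling remark in §II A.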

For each S ⊆ X, let us take the derivative of equations (3) and (4) by every variable that occurs in S, where we take multiple derivatives of variables that occur multiple times. This gives

$$\sigma_N \frac{N!}{|S|!} \sum_{j=1}^{m} w_j \left(\prod_{h \in S} a_{hj}\right) \left(\sum_{i=1}^{n} a_{ij} x_i\right)^{N-|S|} = \prod_{i \notin S} x_i, \qquad (5)$$

$$\sigma_k \frac{k!}{|S|!} \sum_{j=1}^{m} w_j \left(\prod_{h \in S} a_{hj}\right) \left(\sum_{i=1}^{n} a_{ij} x_i\right)^{k-|S|} = 0 \qquad (6)$$

for |S| ≤ k ≤ N − 1. Observe that there are ∏_{i=1}^n (r_i + 1) choices for S, since each variable x_i can be included anywhere from 0 to r_i times. Define A to be the (∏_{i=1}^n (r_i + 1)) × m matrix with entries A_{S,j} = ∏_{h∈S} a_{hj}.

We claim that A has full row rank. This would show that the number of columns m is at least the number of rows ∏_{i=1}^n (r_i + 1), proving the desired lower bound on m.

Suppose towards contradiction that the rows A_{S_ℓ,•} admit a linear dependence:

$$\sum_{\ell=1}^{r} c_\ell A_{S_\ell,\bullet} = 0,$$

where the coefficients c_ℓ are nonzero and the S_ℓ denote distinct subsets of X. Set s = max_ℓ |S_ℓ|. Then, take the dot product of each side of the above equation by the vector with entries (indexed by j) equal to w_j (Σ_{i=1}^n a_{ij} x_i)^{N−s}:

$$0 = \sum_{\ell=1}^{r} c_\ell \sum_{j=1}^{m} w_j \left(\prod_{h \in S_\ell} a_{hj}\right) \left(\sum_{i=1}^{n} a_{ij} x_i\right)^{N-s}$$
$$= \sum_{\ell : |S_\ell| = s} c_\ell \sum_{j=1}^{m} w_j \left(\prod_{h \in S_\ell} a_{hj}\right) \left(\sum_{i=1}^{n} a_{ij} x_i\right)^{N-|S_\ell|}$$
$$+ \sum_{\ell : |S_\ell| < s} c_\ell \sum_{j=1}^{m} w_j \left(\prod_{h \in S_\ell} a_{hj}\right) \left(\sum_{i=1}^{n} a_{ij} x_i\right)^{(N+|S_\ell|-s)-|S_\ell|}.$$

We can use (5) to simplify the first term and (6) (with k = N + |S_ℓ| − s) to simplify the second term, giving us:

$$0 = \sum_{\ell : |S_\ell| = s} c_\ell \frac{|S_\ell|!}{\sigma_N N!} \prod_{h \notin S_\ell} x_h + \sum_{\ell : |S_\ell| < s} c_\ell \cdot 0 = \sum_{\ell : |S_\ell| = s} \frac{|S_\ell|!}{\sigma_N N!}\, c_\ell \prod_{h \notin S_\ell} x_h.$$

Since the monomials ∏_{h∉S_ℓ} x_h are linearly independent, this contradicts our assumption that the c_ℓ are nonzero. We conclude that A has full row rank, and therefore that m_1(p, σ) = m ≥ ∏_{i=1}^n (r_i + 1). This completes the proof of equation (1).

For equation (2), it follows from Proposition II.6 part (2) below that for each i, we can approximate x_i^{r_i} using 7⌈log₂(r_i)⌉ neurons arranged in a deep network. Therefore, we can approximate all of the x_i^{r_i} using a total of Σ_i 7⌈log₂(r_i)⌉ neurons. From [15], we know that these n terms can be multiplied using 4n additional neurons, giving us a total of Σ_i (7⌈log₂(r_i)⌉ + 4). This completes the proof.

In order to approximate a degree-N polynomial effectively with a single hidden layer, we must naturally assume that σ has nonzero Nth Taylor coefficient. However, if we do not wish to make assumptions about the other Taylor coefficients of σ, we can still prove the following weaker result:

Theorem II.2. Once again, let p(x) denote the monomial x_1^{r_1} x_2^{r_2} ⋯ x_n^{r_n}. Then, for N = Σ_i r_i, we have m_1(p, σ) ≤ ∏_{i=1}^n (r_i + 1) if σ has nonzero Nth Taylor coefficient. For any σ, we have that m_1(p, σ) is at least the maximum coefficient in the univariate polynomial g(y) = ∏_i (1 + y + ... + y^{r_i}).

Proof outline. The first statement follows as in the proof of Theorem II.1. For the second, consider once again the equation (5). The left-hand side can be written, for any S, as a linear combination of basis functions of the form:

$$\left(\sum_{i=1}^{n} a_{ij} x_i\right)^{N-|S|}. \qquad (7)$$

Letting S vary over multisets of fixed size s, we see that the right-hand side of (5) attains every degree-s monomial that divides p(x). The number of such monomials is the coefficient C_s of the term y^s in the polynomial g(y). Since such monomials are linearly independent, we conclude that the number of basis functions of the form (7) above must be at least C_s. Picking s to maximize C_s gives us the desired result.

Corollary II.3. If p(x) is the product of the inputs (i.e. r_1 = ⋯ = r_n = 1), we conclude that $m_1(p, \sigma) \ge \binom{n}{\lfloor n/2 \rfloor} \approx 2^n \big/ \sqrt{\pi n/2}$, under the assumption that the nth Taylor coefficient of σ is nonzero.

It is natural to ask whether Theorem II.1 extends to approximating general polynomials. However, without further constraint, this is relatively uninstructive, because polynomials of degree d in n variables live within a space of dimension $\binom{n+d}{d}$, and therefore most require exponentially many neurons for any depth of network. We therefore consider polynomials of sparsity c: that is, those that can be represented as the sum of c monomials. This includes many natural functions.
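To make the dimension count above concrete, here is a one-line computation (ours, purely illustrative): even at modest degree, the space of all polynomials explodes with n, while a c-sparse polynomial is described by only c coefficients.

```python
from math import comb

def dim_polynomial_space(n, d):
    """Dimension binom(n+d, d) of the space of polynomials of degree <= d in n variables."""
    return comb(n + d, d)

print(dim_polynomial_space(10, 4))   # 1001
print(dim_polynomial_space(100, 4))  # 4598126
```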

The following theorem, when combined with Theorem II.1, shows that general polynomials p with subexponential sparsity have exponentially large m_1(p, σ), but subexponential m(p, σ).

Theorem II.4. Suppose that p(x) is a multivariate polynomial of degree N, with c monomials q_1(x), q_2(x), ..., q_c(x). Suppose that the nonlinearity σ(x) has nonzero Taylor coefficients up to x^N. Then:

1. $m_1(p, \sigma) \ge \frac{1}{c} \max_j m_1(q_j, \sigma)$.

2. $m(p, \sigma) \le \sum_j m(q_j, \sigma)$.

Proof outline. Our proof of Theorem II.1 relied upon the fact that all nonzero partial derivatives of a monomial are linearly independent. This fact is not true for general polynomials p; however, an exactly similar argument shows that m_1(p, σ) is at least the number of linearly independent partial derivatives of p, taken with respect to multisets of the input variables.

Consider the monomial q of p such that m_1(q, σ) is maximized, and suppose that q(x) = x_1^{r_1} x_2^{r_2} ⋯ x_n^{r_n}. By Theorem II.1, m_1(q, σ) is equal to the number ∏_{i=1}^n (r_i + 1) of distinct monomials that can be obtained by taking partial derivatives of q. Let Q be the set of such monomials, and let D be the set of (iterated) partial derivatives corresponding to them, so that for d ∈ D, we have d(q) ∈ Q.

Consider the set of polynomials P = {d(p) | d ∈ D}. We claim that there exists a linearly independent subset of P with size at least |D|/c. Suppose to the contrary that P′ is a maximal linearly independent subset of P with |P′| < |D|/c.

Since p has c monomials, every element of P has at most c monomials. Therefore, the total number of distinct monomials in elements of P′ is less than |D|. However, there are at least |D| distinct monomials contained in elements of P, since for d ∈ D, the polynomial d(p) contains the monomial d(q), and by definition all d(q) are distinct as d varies. We conclude that there is some polynomial p′ ∈ P ∖ P′ containing a monomial that does not appear in any element of P′. But then p′ is linearly independent of P′, a contradiction, since we assumed that P′ was maximal.

We conclude that P has size at least |D|/c, and therefore that the space of partial derivatives of p has rank at least |D|/c = m_1(q, σ)/c. This proves part (1) of the theorem. Part (2) follows immediately from the definition of m(p, σ).

C. Univariate polynomials

As with multivariate polynomials, depth can offer an exponential savings when approximating univariate polynomials. We show below (Proposition II.5) that a shallow network can approximate any degree-d univariate polynomial with a number of neurons at most linear in d. The monomial x^d requires d + 1 neurons in a shallow network (Proposition II.6), but can be approximated with only logarithmically many neurons in a deep network. Thus, depth allows us to reduce networks from linear to logarithmic size, while for multivariate polynomials the gap was between exponential and linear. The difference here arises because the dimensionality of the space of univariate degree-d polynomials is linear in d, while the dimensionality of the space of multivariate degree-d polynomials is exponential in d.

Proposition II.5. Let σ be a nonlinear function with nonzero Taylor coefficients. Then, we have m_1(p, σ) ≤ d + 1 for every univariate polynomial p of degree d.

Proof. Pick a_0, a_1, ..., a_d to be arbitrary, distinct real numbers. Consider the Vandermonde matrix A with entries A_{ij} = a_i^j. It is well-known that det(A) = ∏_{i<i′} (a_{i′} − a_i) ≠ 0. Hence, A is invertible, which means that multiplying its columns by nonzero values gives another invertible matrix. Suppose that we multiply the jth column of A by σ_j to get A′, where σ(x) = Σ_j σ_j x^j is the Taylor expansion of σ(x).

Now, observe that the ith row of A′ consists of exactly the coefficients of σ(a_i x), up to the degree-d term. Since A′ is invertible, the rows must be linearly independent, so the polynomials σ(a_i x), restricted to terms of degree at most d, must themselves be linearly independent. Since the space of degree-d univariate polynomials is (d + 1)-dimensional, these d + 1 linearly independent polynomials must span the space. Hence, m_1(p, σ) ≤ d + 1 for any univariate degree-d polynomial p. In fact, we can fix the weights from the input neuron to the hidden layer (to be a_0, a_1, ..., a_d, respectively) and still represent any polynomial p with d + 1 hidden neurons.

Proposition II.6. Let p(x) = x^d, and suppose that σ(x) is a nonlinear function with nonzero Taylor coefficients. Then:

1. $m_1(p, \sigma) = d + 1$.

2. $m(x^d, \sigma) \le 7\lceil \log_2(d) \rceil$.

Proof. Part (1) follows from part (1) of Theorem II.1 above, by setting n = 1 and r_1 = d.

For part (2), observe that we can approximate the square x² of an input x with three neurons in a single layer:

$$\frac{1}{2\sigma_2}\left(\sigma(x) + \sigma(-x) - 2\sigma(0)\right) = x^2 + O(x^4).$$

We refer to this construction as a square gate, and the construction of [15] as a product gate. We also use identity gate to refer to a neuron that simply preserves the input of a neuron from the preceding layer (this is equivalent to the skip connections in residual nets).
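The square-gate identity can be checked numerically. The sketch below is ours, again taking σ = exp as an example nonlinearity (second Taylor coefficient σ_2 = 1/2):

```python
import math

def square_gate(x, sigma=math.exp, sigma2=0.5):
    """Approximate x**2 with three evaluations of the nonlinearity.

    The odd Taylor terms cancel between sigma(x) and sigma(-x),
    leaving 2*sigma2*x**2 plus an O(x**4) error.
    """
    return (sigma(x) + sigma(-x) - 2 * sigma(0.0)) / (2 * sigma2)

print(square_gate(0.1))  # close to 0.01, with O(x**4) error
```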

Consider a network in which each layer contains a square gate (3 neurons) and either a product gate or an identity gate (4 or 1 neurons, respectively), according to the following construction: the square gate squares the output of the preceding square gate, yielding inductively a result of the form x^{2^k}, where k is the depth of the layer. Writing d in binary, we use a product gate if there is a 1 in the 2^{k−1}-place; if so, the product gate multiplies the output of the preceding product gate by the output of the preceding square gate. If there is a 0 in the 2^{k−1}-place, we use an identity gate instead of a product gate. Thus, each layer computes x^{2^k} and multiplies x^{2^{k−1}} into the computation if the 2^{k−1}-place in d is 1. The process stops when the product gate outputs x^d. Each layer uses at most 7 neurons, and there are at most ⌈log₂(d)⌉ layers, with a worst-case scenario where d + 1 is a power of 2.

D. Tensor decomposition

We conclude this section by noting interesting connections between the value m_1(p, σ) and the established hard problem of tensor decomposition. Specifically, we show (Proposition II.7) that the minimum number of neurons required to approximate a polynomial p is at least the symmetric tensor rank of a tensor constructed from the coefficients of p.

Let T be an order-d tensor of dimensions n × ⋯ × n. We say that T is symmetric if the entry T_{i_1 i_2 ⋯ i_d} is identical for any permutation of i_1, i_2, ..., i_d. For T symmetric, the symmetric tensor rank R_S(T) is defined to be the minimum r such that T can be written

$$T = \sum_{i=1}^{r} \lambda_i\, a_i \otimes \cdots \otimes a_i,$$

with λ_i ∈ ℝ and a_i ∈ ℝ^n for each i [16]. Thus, for d = 1, T is simply a real number and R_S(T) = 1. For d = 2, T is a symmetric matrix, and the symmetric tensor rank is simply the rank of T, where one can take the values λ_i above to be the eigenvalues of the matrix T.

For a multiset S, we let π(S) denote the number of distinct permutations of S. Thus, if q_1, ..., q_s are the frequencies of the elements of S, we have

$$\pi(S) = \binom{q_1 + \cdots + q_s}{q_1, \ldots, q_s}.$$

For p(u) = p(u_1, ..., u_n) a homogeneous multivariate polynomial of degree d, we define the monomial tensor T(p) as follows:

$$T(p)_{j_1 \cdots j_d} = [u_{j_1} \cdots u_{j_d}]\,p \,\big/\, \pi(\{j_1, \ldots, j_d\}),$$

where [·]p denotes the coefficient of a monomial in p.

Proposition II.7. For p(u) = p(u_1, ..., u_n) a homogeneous multivariate polynomial of degree d, we have m_1(p) ≥ R_S(T(p)).

Proof. Suppose that we can approximate p(u) using a neural net with m neurons in a single hidden layer. Then, there exist vectors a_1, a_2, ..., a_m of weights from the input to the hidden layer such that

$$p(u) \approx w_1 \sigma(a_1 \cdot u) + \cdots + w_m \sigma(a_m \cdot u).$$

If we consider only terms of degree d, we obtain

$$p(u) = \sigma_d \sum_{i=1}^{m} w_i (a_i \cdot u)^d. \qquad (8)$$

The symmetric tensor

$$T = \sum_{i=1}^{m} (\sigma_d w_i)\, a_i \otimes \cdots \otimes a_i$$

then satisfies, by equation (8),

$$p(u) = \sum_{1 \le j_1, \ldots, j_d \le n} T_{j_1 \cdots j_d}\, u_{j_1} \cdots u_{j_d}.$$

By the definition of T, the entries T_{j_1 ⋯ j_d} are equal up to permutation of j_1, ..., j_d. Therefore, T must equal the monomial tensor T(p), showing that m ≥ R_S(T(p)), as desired.

Corollary II.8. Let p(u) = p(u_1, ..., u_n) be a multivariate polynomial of degree d. For c = 0, 1, ..., d, let p_c(u) be the homogeneous polynomial obtained by taking all terms of p(u) with degree c. If R_S is the maximum symmetric rank of T(p_c(u)), then m_1(p) ≥ R_S.

The proof of this statement closely follows that of Proposition II.7.

Various results are known for the symmetric rank of tensors over the complex numbers ℂ [16]. Notably, Alexander and Hirschowitz [17] showed in a highly non-trivial proof that the symmetric rank of a generic symmetric tensor of dimension n and order k over ℂ equals $\lceil \binom{n+k-1}{k} / n \rceil$, with the exception of a few small values of k and n. However, this result does not hold over the real numbers ℝ, and in fact there can be several possible generic ranks for symmetric tensors over ℝ [16].

III. HOW EFFICIENCY IMPROVES WITH DEPTH

We now consider how m_k(p, σ) scales with k, interpolating between exponential in n (for k = 1) and linear in n (for k = log n). In practice, networks with modest k > 1 are effective at representing natural functions. We explain this theoretically by showing that the cost of approximating the product polynomial drops off rapidly as k increases.
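As a rough numerical illustration of this drop-off (ours, not from the paper), we can evaluate the upper bound $n^{(k-1)/k}\, 2^{n^{1/k}}$ established in Theorem III.1 below for a few depths:

```python
def depth_bound(n, k):
    """Upper bound n^((k-1)/k) * 2^(n^(1/k)) on m_k for the product of n inputs."""
    return n ** ((k - 1) / k) * 2 ** (n ** (1 / k))

n = 1024
for k in (2, 3, 5):
    print(k, depth_bound(n, k))
# k = 2 needs on the order of 1e11 neurons, k = 5 only about 4e3,
# while k = 1 would require 2^1024.
```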

By repeated application of the shallow network construction in Lin, Tegmark, and Rolnick [15], we obtain the following upper bound on m_k(p, σ), which we conjecture to be essentially tight. Our approach is reminiscent of tree-like network architectures discussed e.g. in [10], in which groups of input variables are recursively processed in successive layers.

Theorem III.1. For p(x) equal to the product x_1 x_2 ⋯ x_n, and for any σ with all nonzero Taylor coefficients, we have:

$$m_k(p, \sigma) = O\!\left(n^{(k-1)/k}\, 2^{n^{1/k}}\right). \qquad (9)$$

Proof. We construct a network in which the n inputs are recursively multiplied. The n inputs are first divided into groups of size b_1, and each group is multiplied in the first hidden layer using 2^{b_1} neurons (as described in [15]). Thus, the first hidden layer includes a total of 2^{b_1} n/b_1 neurons. This gives us n/b_1 values to multiply, which are in turn divided into groups of size b_2. Each group is multiplied in the second hidden layer using 2^{b_2} neurons. Thus, the second hidden layer includes a total of 2^{b_2} n/(b_1 b_2) neurons.

We continue in this fashion for k layers, under the condition b_1 b_2 ⋯ b_k = n, giving us one neuron which is the product of all of our inputs. By considering the total number of neurons used, we conclude

$$m_k(p, \sigma) \le \sum_{i=1}^{k} \frac{n}{\prod_{j=1}^{i} b_j}\, 2^{b_i} = \sum_{i=1}^{k} \left(\prod_{j=i+1}^{k} b_j\right) 2^{b_i}. \qquad (10)$$

Setting b_i = n^{1/k} for each i gives the bound (9).

In fact, we can solve for the choice of b_i such that the upper bound in (10) is minimized, under the condition b_1 b_2 ⋯ b_k = n. Using the technique of Lagrange multipliers, we know that the optimum occurs at a critical point of

$$L(\{b_i\}, \lambda) := \lambda\left(n - \prod_{i=1}^{k} b_i\right) + \sum_{i=1}^{k} \left(\prod_{j=i+1}^{k} b_j\right) 2^{b_i}.$$

Differentiating L with respect to b_i, we obtain the conditions

$$0 = -\lambda \prod_{j \ne i} b_j + \sum_{h=1}^{i-1} \frac{\prod_{j=h+1}^{k} b_j}{b_i}\, 2^{b_h} + (\log 2) \left(\prod_{j=i+1}^{k} b_j\right) 2^{b_i}, \quad \text{for } 1 \le i \le k, \qquad (11)$$

$$0 = n - \prod_{j=1}^{k} b_j. \qquad (12)$$

Dividing (11) by ∏_{j=i+1}^k b_j and rearranging gives us the recursion

$$2^{b_{i+1}} = \left(b_i - \frac{1}{\log 2}\right) 2^{b_i}. \qquad (13)$$

Thus, the optimal b_i are not exactly equal but very slowly increasing with i (see Figure 1).

FIG. 1: The optimal settings for {b_i}_{i=1}^k as n varies are shown for k = 1, 2, 3. Observe that the b_i converge to n^{1/k} for large n, as witnessed by a linear fit in the log-log plot. The exact values are given by equations (12) and (13).

The following conjecture states that the bound given in Theorem III.1 is (approximately) optimal.

Conjecture III.2. For p(x) equal to the product x_1 x_2 ⋯ x_n, and for any σ with all nonzero Taylor coefficients, we have

$$m_k(p, \sigma) = 2^{\Theta(n^{1/k})}, \qquad (14)$$

i.e., the exponent grows as n^{1/k} as n → ∞.

We empirically tested Conjecture III.2 by training ANNs to predict the product of input values x_1, ..., x_n with n = 20 (see Figure 2).

The observed transition from exponential to linear width aligns with our predictions.

In our experiments, we used feedforward networks with dense connections between successive layers, with nonlinearities instantiated as the hyperbolic tangent function. Similar results were also obtained with rectified linear units (ReLUs) as the nonlinearity, despite the fact that this function does not satisfy our hypothesis of being everywhere differentiable. The number of layers was varied, as was the number of neurons within a single layer. The networks were trained using the AdaDelta optimizer [18] to minimize the absolute value of the difference between the predicted and actual values. Input variables x_i were drawn uniformly at random from the interval [0, 2], so that the expected value of the output would be of manageable size.

FIG. 2: Performance of trained networks in approximating the product of 20 input variables, ranging from red (high mean error) to blue (low mean error). The curve $w = n^{(k-1)/k} 2^{n^{1/k}}$ for n = 20 is shown in black. In the region above and to the right of the curve, it is possible to effectively approximate the product function (Theorem III.1).

Eq. (14) provides a helpful rule of thumb for how deep is deep enough. Suppose, for instance, that we wish to keep typical layers no wider than about a thousand (≈ 2^{10}) neurons. Eq. (14) then implies n^{1/k} ≲ 10, i.e., that the number of layers should be at least k ≳ log₁₀ n.

It is interesting to consider how our results on the inapproximability of simple polynomials by polynomial-size neural networks compare to results for Boolean circuits. Recall that TC⁰ is defined as the set of problems that can be solved by a Boolean circuit of constant depth and polynomial size, where the circuit is allowed to use AND, OR, NOT, and MAJORITY gates of arbitrary fan-in.² It is an open problem whether TC⁰ equals the class TC¹ of problems solvable by circuits for which the depth is logarithmic in the size of the input.

In this section, we consider the feasibility of strong general no-flattening results. It would be very interesting if one could show that general polynomials p in n variables require a superpolynomial number of neurons to approximate for any constant number of hidden layers. That is, for each integer k ≥ 1, we would like to prove a lower bound on m_k(p, σ) that grows faster than polynomially in n.

At first glance, such a result might seem to require resolving the open problem of whether TC⁰ and TC¹ are equal. However, Boolean circuits compute using 0/1 values, while the neurons of our artificial neural networks take on arbitrary real values. To preserve 0/1 values at all neurons, we can restrict inputs to such values and take the nonlinear activation to be the Heaviside step function

$$\sigma(x) = \begin{cases} 0 & \text{if } x \le 0 \\ 1 & \text{if } x > 0, \end{cases}$$

inspired by McCulloch and Pitts [19]. We also allow each neuron a fixed bias constant: that is, a neuron receiving inputs x_1, x_2, ..., x_n is of the form σ(a_0 + a_1 x_1 + a_2 x_2 + ⋯ + a_n x_n), where a_0, a_1, a_2, ..., a_n are real constants. Such a neuron corresponds to a weighted threshold gate in Boolean circuits.

It follows from the work of [20] and [21] that such artificial neural nets (ANNs), with constant depth and polynomial size, have exactly the same power as TC⁰ circuits. That is, weighted threshold gates can simulate and be simulated by constant-depth, polynomial-size circuits of AND, OR, NOT, and MAJORITY gates.

² An AND gate evaluates True if all its inputs do so, an OR gate evaluates True if any of its inputs do so, a NOT gate evaluates True if its input is False and vice versa, and a MAJORITY gate evaluates True if at least half of its inputs do so.
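Each of these four gates is realizable as a single weighted threshold neuron. The sketch below is ours, checking the standard bias constants against the gates' truth tables:

```python
def heaviside(x):
    """Heaviside step activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0

# Each gate is one neuron heaviside(a0 + sum(ai * xi)), with 0/1 inputs.
def AND(xs):
    return heaviside(sum(xs) - len(xs) + 0.5)

def OR(xs):
    return heaviside(sum(xs) - 0.5)

def NOT(x):
    return heaviside(0.5 - x)

def MAJORITY(xs):
    return heaviside(sum(xs) - (len(xs) - 1) / 2)

print(AND([1, 1, 1]), OR([0, 0, 1]), NOT(1), MAJORITY([1, 0, 1, 0]))  # 1 1 0 1
```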

8

n ! layer is assumed to be connected only to a small subset of

X 1 neurons from the previous layer, rather than the entirety

AND(x1 , x2 , . . . , xn ) = xi n

2 of them (or some large fraction). In fact, we showed

i=1

n

! (e.g. Prop. II.6) that there exist natural functions that

X 1 can be computed in a linear number of neurons, where

OR(x1 , x2 , . . . , xn ) = xi

2 each neuron is connected to at most two neurons in the

i=1

preceding layer, which nonetheless cannot be computed

1 with fewer than exponentially many neurons in a single

NOT(x) = x

2 layer, no matter how may connections are used. Our

n

!

MAJORITY(x_1, x_2, \dots, x_n) = \left\lfloor \frac{1}{2} + \frac{1}{n} \sum_{i=1}^{n} x_i \right\rfloor

Thus, we should not hope easily to prove general no-flattening results for Boolean functions, but the case of polynomials in real-valued variables may be more tractable. Simply approximating a real value by a Boolean circuit requires arbitrarily many bits; therefore, performing direct computations on real values is clearly intractable for TC^0 circuits. Moreover, related work such as [4, 8, 13] has already proven gaps in expressivity for real-valued neural networks of different depths, for which the analogous results remain unknown in Boolean circuits.

IV. CONCLUSION

We have shown how the power of deeper ANNs can be quantified even for simple polynomials. We have proven that there is an exponential gap between the widths of shallow and deep networks required for approximating a given sparse polynomial: for n variables, a shallow network requires size exponential in n, while a deep network requires at most linearly many neurons. Networks with a constant number k > 1 of hidden layers appear to interpolate between these extremes, following a curve exponential in n^{1/k}. This suggests a rough heuristic for the number of layers required for approximating simple functions with neural networks. For example, if we want no layer to have more than 10^3 neurons, say, then the minimum number of layers required grows only as log_{10} n.

It is worth noting that our constructions enjoy the property of locality mentioned in [13], which is also a feature of convolutional neural nets. Our construction can also easily be framed with reference to the other properties mentioned in [13]: those of sharing (in which weights are shared between neural connections) and pooling (in which layers are gradually collapsed, as our construction essentially does with recursive combination of inputs).

This paper has focused exclusively on the resources (notably neurons and synapses) required to compute a given function. An important complementary challenge is to quantify the resources (e.g. training steps) required to learn the computation, i.e., to converge to appropriate weights using training data, possibly a fixed amount thereof, as suggested in [22]. There are simple functions that can be computed with polynomial resources but require exponential resources to learn [23]. It is quite possible that architectures we have not considered increase the feasibility of learning. For example, residual networks (ResNets) [24] and unitary nets (see e.g. [25, 26]) are no more powerful in representational ability than conventional networks of the same size, but by being less susceptible to the vanishing/exploding gradient problem, they are far easier to optimize in practice. We look forward to future work that will help us understand the power of neural networks to learn.

V. ACKNOWLEDGMENTS

This work was supported by the Foundational Questions Institute (http://fqxi.org/), the Rothberg Family Fund for Cognitive Science, and NSF grant 1122374. We thank Scott Aaronson, Surya Ganguli, David Budden, and Henry Lin for helpful discussions and suggestions.
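The depth heuristic from the conclusion can be checked numerically. Assuming the conjectured scaling, in which a network with k hidden layers needs on the order of exp(n^{1/k}) neurons per layer, the function below (a hypothetical helper named min_layers, not from the paper) finds the smallest k that keeps every layer within a given width budget; the result grows only slowly, roughly logarithmically, with n.

```python
import math

def min_layers(n, width_budget=1000):
    """Smallest k such that exp(n**(1/k)) <= width_budget, under the
    assumed exp(n**(1/k)) per-layer width scaling for k hidden layers."""
    k = 1
    # Compare in log space: exp(n**(1/k)) <= B  iff  n**(1/k) <= ln(B),
    # which avoids overflowing math.exp for large n.
    while n ** (1.0 / k) > math.log(width_budget):
        k += 1
    return k

# Depth needed under a 1000-neuron-per-layer budget grows slowly with n:
for n in (10, 100, 1000, 10000):
    print(n, min_layers(n))  # -> 2, 3, 4, 5 respectively
```

With a budget of 10^3 neurons per layer, multiplying n by 10 adds roughly one required layer, matching the log_{10} n heuristic.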

[1] G. Cybenko, Mathematics of Control, Signals, and Systems (MCSS) 2, 303 (1989).
[2] K. Hornik, M. Stinchcombe, and H. White, Neural Networks 2, 359 (1989).
[3] K.-I. Funahashi, Neural Networks 2, 183 (1989).
[4] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, in Advances in Neural Information Processing Systems (NIPS) (2014), pp. 2924–2932.
[5] M. Bianchini and F. Scarselli, IEEE Transactions on Neural Networks and Learning Systems 25, 1553 (2014).
[6] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, in Advances in Neural Information Processing Systems (NIPS) (2016), pp. 3360–3368.
[7] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, arXiv preprint arXiv:1611.08083 (2016).
[8] M. Telgarsky, Journal of Machine Learning Research (JMLR) 49 (2016).
[9] R. Eldan and O. Shamir, in 29th Annual Conference on Learning Theory (2016), pp. 907–940.
[10] H. Mhaskar, Q. Liao, and T. Poggio, arXiv preprint arXiv:1603.00988v4 (2016).
[11] O. Delalleau and Y. Bengio, in Advances in Neural Information Processing Systems (NIPS) (2011), pp. 666–674.
[12] J. Martens, A. Chattopadhya, T. Pitassi, and R. Zemel, in Advances in Neural Information Processing Systems (NIPS) (2013), pp. 2877–2885.
[13] N. Cohen, O. Sharir, and A. Shashua, Journal of Machine Learning Research (JMLR) 49 (2016).
[14] N. Cohen and A. Shashua, in International Conference on Machine Learning (ICML) (2016).
[15] H. W. Lin, M. Tegmark, and D. Rolnick, arXiv preprint arXiv:1608.08225v3 (2016).
[16] P. Comon, G. Golub, L.-H. Lim, and B. Mourrain, SIAM Journal on Matrix Analysis and Applications 30, 1254 (2008).
[17] J. Alexander and A. Hirschowitz, Journal of Algebraic Geometry 4, 201 (1995).
[18] M. D. Zeiler, arXiv preprint arXiv:1212.5701 (2012).
[19] W. S. McCulloch and W. Pitts, The Bulletin of Mathematical Biophysics 5, 115 (1943).
[20] A. K. Chandra, L. Stockmeyer, and U. Vishkin, SIAM Journal on Computing 13, 423 (1984).
[21] N. Pippenger, IBM Journal of Research and Development 31, 235 (1987).
[22] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, in Proceedings of the International Conference on Learning Representations (ICLR) (2017).
[23] S. Shalev-Shwartz, O. Shamir, and S. Shammah, arXiv preprint arXiv:1703.07950 (2017).
[24] K. He, X. Zhang, S. Ren, and J. Sun, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
[25] M. Arjovsky, A. Shah, and Y. Bengio, in International Conference on Machine Learning (2016), pp. 1120–1128.
[26] L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. Skirlo, M. Tegmark, and M. Soljacic, arXiv preprint arXiv:1612.05231 (2016).
