
Representation Benefits of Deep Feedforward Networks

Matus Telgarsky

arXiv:1509.08101v2 [cs.LG] 29 Sep 2015

Abstract

This note provides a family of classification problems, indexed by a positive integer k, where all shallow networks with fewer than exponentially (in k) many nodes exhibit error at least 1/6, whereas a deep network with 2 nodes in each of 2k layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated k times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities.

1 Overview

A neural network is a function whose evaluation is defined by a graph as follows. Root nodes compute x ↦ σ(w_0 + ⟨w, x⟩), where x is the input to the network, and σ : ℝ → ℝ is typically a nonlinear function, for instance the ReLU (Rectified Linear Unit) σ_r(z) = max{0, z}. Internal nodes perform a similar computation, but now their input vector is the collective output of their parents. The choices of w_0 and w may vary from node to node, and the possible set of functions obtained by varying these parameters gives the function class N(σ; m, l), which has l layers, each with at most m nodes.
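To make the definition concrete, the following sketch (not from the note; the weights are purely illustrative) evaluates a scalar-input network in N(σ_r; m, l) layer by layer, with each node computing σ_r(w_0 + ⟨w, h⟩) over its parents' outputs h:

```python
def relu(z):
    """The ReLU nonlinearity sigma_r(z) = max{0, z}."""
    return max(0.0, z)

def evaluate_network(x, layers):
    """Evaluate a network in N(sigma_r; m, l) on a scalar input x.

    `layers` is a list of layers; each layer is a list of nodes, and each
    node is a pair (w, w0): it computes sigma_r(w0 + <w, h>), where h is
    the vector of parent outputs (the input [x] for the first layer).
    """
    h = [x]
    for layer in layers:
        h = [relu(w0 + sum(wi * hi for wi, hi in zip(w, h))) for w, w0 in layer]
    return h

# Two layers with at most two nodes each (illustrative weights only).
layers = [[([1.0], 0.0), ([1.0], -0.5)],   # first layer: sigma_r(x), sigma_r(x - 1/2)
          [([2.0, -4.0], 0.0)]]            # second layer: sigma_r(2a - 4b)
evaluate_network(0.25, layers)  # [0.5]
```

Varying the pairs (w, w_0) while keeping the graph fixed sweeps out exactly the function class N(σ_r; 2, 2) in this example.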

The representation power of N(σ; m, l) will be measured via the classification error R_z. Namely, given a function f : ℝ^d → ℝ, let f̃ : ℝ^d → {0, 1} denote the corresponding classifier f̃(x) := 1[f(x) ≥ 1/2], and additionally, given a sequence of points ((x_i, y_i))_{i=1}^n with x_i ∈ ℝ^d and y_i ∈ {0, 1}, define R_z(f) := n^{−1} Σ_i 1[f̃(x_i) ≠ y_i].
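This error is just the fraction of thresholded mistakes; a minimal sketch (the example points and predictor are hypothetical, not from the note):

```python
def classifier(f, x):
    """The thresholded classifier f~(x) := 1[f(x) >= 1/2]."""
    return 1 if f(x) >= 0.5 else 0

def classification_error(f, points):
    """R_z(f) := n^{-1} * sum_i 1[f~(x_i) != y_i]."""
    mistakes = sum(1 for x, y in points if classifier(f, x) != y)
    return mistakes / len(points)

# A constant predictor errs on exactly the points labeled 1.
points = [(0.0, 1), (0.25, 0), (0.5, 1), (0.75, 0)]
classification_error(lambda x: 0.0, points)  # 0.5
```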

Theorem 1.1. Let positive integer k, number of layers l, and number of nodes per layer m be given with m ≤ 2^{(k−3)/l−1}. Then there exists a collection of n := 2^k points ((x_i, y_i))_{i=1}^n with x_i ∈ [0, 1] and y_i ∈ {0, 1} such that

    min_{f ∈ N(σ_r; 2, 2k)} R_z(f) = 0    and    min_{g ∈ N(σ_r; m, l)} R_z(g) ≥ 1/6.

For example, approaching the error of the 2k-layer network (which has O(k) nodes and weights) with 2 layers requires at least 2^{(k−3)/2−1} nodes, and with √(k−3) layers needs at least 2^{√(k−3)−1} nodes.
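These trade-offs can be evaluated numerically; a small sketch (the choice k = 28 is an arbitrary illustration) that simply plugs into the stated bounds:

```python
import math

def min_width(k, l):
    """Width threshold 2^((k-3)/l - 1): with fewer nodes per layer,
    an l-layer network errs on at least 1/6 of the 2^k points."""
    return 2.0 ** ((k - 3) / l - 1)

k = 28
print(min_width(k, 2))                        # depth 2: 2^11.5, roughly 2896 nodes
print(min_width(k, round(math.sqrt(k - 3))))  # depth sqrt(k-3) = 5: 2^4 = 16 nodes
print(2 * 2 * k)                              # the 2k-layer, 2-node-wide network: 4k = 112 nodes
```

The exponential-versus-linear gap between the first line and the last is exactly the separation the theorem formalizes.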

The purpose of this note is to provide an elementary proof of Theorem 1.1 and its refinement Theorem 1.2, which amongst other improvements will use a recurrent neural network in the upper bound. Section 2 will present the proof, and Section 3 will tie these results to the literature on neural network expressive power and circuit complexity, which by contrast makes use of product nodes rather than standard feedforward networks when showing the benefits of depth.

1.1 Refined bounds

There are three refinements to make: the classification problem will be specified, the perfect network will be an even simpler recurrent network, and σ need not be σ_r.

Let n-ap (the n-alternating-point problem) denote the set of n uniformly spaced points within [0, 1 − 2^{−n}] with alternating labels, as depicted in Figure 1; that is, the points ((x_i, y_i))_{i=1}^n with x_i = i2^{−n}, and y_i = 0 when i is even, otherwise y_i = 1.

Figure 1: The 3-ap.
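The alternating-point problem can be generated directly from its definition; a minimal sketch (not from the note) that follows the indexing given above, x_i = i·2^{−n} with labels alternating in the parity of i:

```python
def n_ap(n):
    """The n-alternating-point problem: ((x_i, y_i))_{i=1}^n with
    x_i = i * 2**-n, and y_i = 0 for even i, 1 for odd i."""
    return [(i * 2.0 ** -n, 0 if i % 2 == 0 else 1) for i in range(1, n + 1)]

n_ap(3)  # [(0.125, 1), (0.25, 0), (0.375, 1)]
```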

Secondly, say that σ : ℝ → ℝ is t-sawtooth if it is piecewise affine with t pieces, meaning ℝ is partitioned into t consecutive intervals, and σ is affine within each interval. σ_r is 2-sawtooth, but this class also includes many other functions; for instance, the decision stumps used in boosting are 2-sawtooth, and decision trees with t − 1 nodes correspond to t-sawtooths.

Lastly, let R(σ; m, l; k) denote k iterations of a recurrent network with l layers of at most m nodes each, defined as follows. Every f ∈ R(σ; m, l; k) consists of some fixed network g ∈ N(σ; m, l) applied k times: f(x) = g^k(x) = (g ∘ ⋯ ∘ g)(x), with k copies of g. Consequently, R(σ; m, l; k) ⊆ N(σ; m, lk), but the former has O(ml) parameters whereas the latter has O(mlk) parameters.

Theorem 1.2. Let positive integer k, number of layers l, and number of nodes per layer m be given. Given a t-sawtooth σ : ℝ → ℝ and n := 2^k points as specified by the n-ap, then

    min_{f ∈ R(σ_r; 2, 2; k)} R_z(f) = 0    and    min_{g ∈ N(σ; m, l)} R_z(g) ≥ (n − 4(tm)^l) / (3n).

This more refined result can thus say, for example, that on the 2^k-ap one needs exponentially (in k) many parameters when boosting decision stumps, linearly many parameters with a deep network, and constantly many parameters with a recurrent network.

2 Analysis

This section will first prove the lower bound via a counting argument, simply tracking the number of times a function within N(σ; m, l) can cross 1/2: a sawtooth function can not cross 1/2 very often, meaning it can't hope to match the quickly changing labels of the n-ap. The upper bound will exhibit a network in N(σ_r; 2, 2) which can be composed with itself k times to exactly fit the n-ap. Intuitively, the key is that adding a constant number of nodes in a flat network only corrects predictions on a constant number of points, whereas adding a constant number of nodes in a deep network can correct predictions on a constant fraction of the points; an early sign of the benefits of depth.

2.1 Lower bound

The lower bound is proved in two stages. First, composing and summing sawtooth functions must also yield a sawtooth function; thus elements of N(σ; m, l) are sawtooth whenever σ is. The key observation is that adding together sawtooth functions grows the number of regions very slowly, whereas composition grows the number very quickly.

Lemma 2.1. If σ is t-sawtooth, then every f ∈ N(σ; m, l) with f : ℝ → ℝ is (tm)^l-sawtooth.

The proof is straightforward and deferred momentarily. Secondly, given a sawtooth function, its classification error on the n-ap may be lower bounded as follows.

Lemma 2.2. Let ((x_i, y_i))_{i=1}^n be given according to the n-ap. Then every t-sawtooth function f : ℝ → ℝ satisfies R_z(f) ≥ (n − 4t)/(3n).

These bounds together prove Theorem 1.2, which in turn implies Theorem 1.1.

Proof of Lemma 2.2. Recall the notation f̃(x) := 1[f(x) ≥ 1/2], whereby R_z(f) := n^{−1} Σ_i 1[y_i ≠ f̃(x_i)]. Since f is piecewise monotonic with a corresponding partition of ℝ having at most t pieces, f has at most 2t − 1 crossings of 1/2: at most one within each interval of the partition, and at most 1 at the right endpoint of all but the last interval. Consequently, f̃ is piecewise constant, where the corresponding partition of ℝ is into at most 2t intervals; this means the n points must land in at most 2t buckets. As the x values pass from left to right, the labels change as often as possible; since buckets are intervals and signs must alternate within any such interval, at least a third of the points in any bucket with at least three points are labeled incorrectly by f̃. Lastly, the total number of points landing in buckets with at least three points is at least n − 4t, which gives the bound. ∎

To close this subsection, first note how adding and composing sawtooths grows their complexity.

Lemma 2.3. Let f : ℝ → ℝ and g : ℝ → ℝ be respectively k- and l-sawtooth. Then f + g is (k + l)-sawtooth, and f ∘ g is kl-sawtooth.

Proof of Lemma 2.3. Let I_f denote the partition of ℝ corresponding to f, and I_g the partition of ℝ corresponding to g. First consider f + g, and any intervals U_f ∈ I_f and U_g ∈ I_g. Necessarily, g is affine with a single slope along U_g, and likewise f along U_f, so f + g has a single slope along U_f ∩ U_g. Consequently, f + g is |I|-sawtooth, where I is the set of all intersections of intervals from I_f and I_g, meaning I := {U_f ∩ U_g : U_f ∈ I_f, U_g ∈ I_g}. By sorting the left endpoints of elements of I_f and I_g, it follows that |I| ≤ k + l (the other intersections are empty).

Now consider f ∘ g, and in particular consider the image f(g(U_g)) for some interval U_g ∈ I_g. g is affine with a single slope along U_g; therefore f is being considered along a single unbroken interval g(U_g). However, nothing prevents g(U_g) from hitting all the elements of I_f; consequently, since U_g was arbitrary, it holds that f ∘ g is (|I_f| · |I_g|)-sawtooth. ∎

The proof of Lemma 2.1 proceeds as follows.

Proof of Lemma 2.1. The proof proceeds by induction over layers, showing the output of each node in layer i is (tm)^i-sawtooth as a function of the neural network input. For the first layer, each node starts by computing x ↦ w_0 + ⟨w, x⟩, which is itself affine and thus 1-sawtooth, so the full node computation x ↦ σ(w_0 + ⟨w, x⟩) is t-sawtooth by Lemma 2.3. Thereafter, the input to layer i with i > 1 is a collection of functions (g_1, …, g_{m′}) with m′ ≤ m and each g_j being (tm)^{i−1}-sawtooth by the inductive hypothesis; consequently, x ↦ w_0 + Σ_j w_j g_j(x) is m(tm)^{i−1}-sawtooth by Lemma 2.3, whereby applying σ yields a (tm)^i-sawtooth function (once again by Lemma 2.3). ∎

2.2 Upper bound

Consider the mirror map f_m : ℝ → ℝ, depicted in Figure 2 and defined as

    f_m(x) := 2x          when 0 ≤ x ≤ 1/2,
              2(1 − x)    when 1/2 < x ≤ 1,
              0           otherwise.

Note that f_m ∈ N(σ_r; 2, 2); indeed, f_m(x) = σ_r(2σ_r(x) − 4σ_r(x − 1/2)). The upper bounds will use f_m^k ∈ R(σ_r; 2, 2; k) ⊆ N(σ_r; 2, 2k).

To assess the effect of the post-composition f_m ∘ g for any g : ℝ → ℝ, note that f_m ∘ g is 2g(x) whenever g(x) ∈ [0, 1/2], and 2(1 − g(x)) whenever g(x) ∈ (1/2, 1]. Visually, this has the effect of reflecting (or folding) the graph of g around the horizontal line through 1/2 and then rescaling by 2. Applying this reasoning to f_m leads to f_m^2 and f_m^3 in Figure 2, whose peaks and troughs match the 2^2-ap and 2^3-ap, and which moreover have the form of piecewise affine approximations to sinusoids.

Figure 2: f_m, f_m^2, and f_m^3.

Next, consider the pre-composition behavior g ∘ f_m for any g : ℝ → ℝ: (g ∘ f_m)(x) = g(2x) whenever x ∈ [0, 1/2], but (g ∘ f_m)(x) = g(2 − 2x) when x ∈ (1/2, 1]. Visually, whereas post-composition reflects around the horizontal line at 1/2 and then scales vertically by 2, pre-composition first scales horizontally by 1/2 and then reflects around the vertical line at 1/2, providing a condensed mirror image and motivating the name mirror map.

These compositions may be written as follows.

Lemma 2.4. Let real x ∈ [0, 1] and positive integer k be given, and choose the unique nonnegative integer i_k ∈ {0, …, 2^{k−1}} and real x_k ∈ [0, 1) so that x = (i_k + x_k)2^{1−k}. Then

    f_m^k(x) = 2x_k          when 0 ≤ x_k ≤ 1/2,
               2(1 − x_k)    when 1/2 < x_k < 1.

Proof of Lemma 2.4. The proof proceeds by induction on the number of compositions l. When l = 1, there is nothing to show. For the inductive step, it suffices to consider x ∈ [0, 1/2], since the mirroring property of pre-composition with f_m combined with the symmetry of f_m^l (by the inductive hypothesis) implies that every x ∈ [0, 1/2] satisfies (f_m^l ∘ f_m)(x) = (f_m^l ∘ f_m)(1 − x) = (f_m^l ∘ f_m)(x + 1/2). For such x, the mirroring property means (f_m^l ∘ f_m)(x) = f_m^l(2x). Since the unique nonnegative integer i_{l+1} and real x_{l+1} ∈ [0, 1) satisfy 2x = 2(i_{l+1} + x_{l+1})2^{−l−1} = (i_{l+1} + x_{l+1})2^{−l}, the inductive hypothesis applied to 2x grants

    (f_m^l ∘ f_m)(x) = f_m^l(2x) = 2x_{l+1}          when 0 ≤ x_{l+1} ≤ 1/2,
                                   2(1 − x_{l+1})    when 1/2 < x_{l+1} < 1,

which completes the proof. ∎

Before closing this subsection, it is interesting to view f_m in one more way, namely its effect on the points ((x_i, y_i))_{i=1}^n provided by the n-ap with n := 2^k. Observe that ((f_m(x_i), y_i))_{i=1}^n is an (n/2)-ap with all points duplicated, except x_1 = 0, and an additional point with x-coordinate 1.

2.3 Proof of Theorems 1.1 and 1.2

It suffices to prove Theorem 1.2, which yields Theorem 1.1 since σ_r is 2-sawtooth.

Proof of Theorems 1.1 and 1.2. For the lower bound, any g ∈ N(σ; m, l) is (tm)^l-sawtooth by Lemma 2.1, whereby Lemma 2.2 applied to the points ((x_i, y_i))_{i=1}^n provided by the n-ap with n := 2^k gives the lower bound of Theorem 1.2; for Theorem 1.1, the condition m ≤ 2^{(k−3)/l−1} implies

    (n − 4(2m)^l) / (3n) = 1/3 − (4/3)(2m)^l 2^{−k} ≥ 1/3 − (4/3) 2^{k−3} 2^{−k} = 1/3 − 1/6 = 1/6.

For the upper bound, note that f_m^k ∈ R(σ_r; 2, 2; k) by construction, and moreover f̃_m^k(x_i) = f_m^k(x_i) = y_i on every (x_i, y_i) in the n-ap by Lemma 2.4; the upper bound transfers to Theorem 1.1 since R(σ_r; 2, 2; k) ⊆ N(σ_r; 2, 2k). ∎

3 Related work

The standard classical result on the representation power of neural networks is due to Cybenko (1989), who proved that neural networks can approximate continuous functions over [0, 1]^d arbitrarily well. This result, however, is for flat networks.

An early result showing the benefits of depth is due to Håstad (1986), who established, via an incredible proof, that boolean circuits consisting only of and gates and or gates require exponential size in order to approximate the parity function well. These gates correspond to multiplication and addition over the boolean domain, and moreover the parity function is the Fourier basis over the boolean domain. Interestingly, f_m^k as used here is a piecewise affine approximation of a Fourier basis, and it was suggested before, by Bengio and LeCun (2007), that Fourier transforms are efficiently represented with deep networks. More generally, networks consisting of sum and product nodes, but now over the reals, have been studied in the machine learning literature, where it was showed by Bengio and Delalleau (2011) that again there is an exponential benefit to depth. While this result was again for a countable class of functions, more recent work by Cohen et al. (2015) aims to give a broader characterization. Note that Håstad (1986)'s work has one of the same weaknesses as the present result, namely of only controlling a countable family of functions which is in no sense dense.

Still on the topic of representation results, there is a far more classical result which deserves mention: the surreal result of Kolmogorov (1957) states that a continuous function f : [0, 1]^d → ℝ can be exactly represented by a network with O(d^2) nodes in 3 layers. However, this network needs multiple distinct nonlinearities and therefore is not an element of N(σ; O(d^2), 3) for a fixed σ. Kolmogorov's nonlinearities have fractal structure, similarly to the f_m^k used here; however, one can treat these specialized nonlinearities as goalposts for other representation results.

Lastly, it is worthwhile to mention the relevance of representation power to statistical questions. Namely, while this note was only concerned with finite sets of points, note that f_m^k has an exponentially large Lipschitz constant (exactly 2^k), and thus an elementary statistical analysis via Lipschitz constants and Rademacher complexity (Bartlett and Mendelson, 2002) can inadvertently erase the benefits of depth as presented here. By the seminal result of Anthony and Bartlett (1999, Theorem 8.14), however, the VC dimension of N(σ_r; m, l) is at most O(m^8 l^2), indicating that these exponential representation benefits directly translate into statistical savings.

References

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, Nov 2002.

Yoshua Bengio and Olivier Delalleau. Shallow vs. deep sum-product networks. In NIPS, 2011.

Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Léon Bottou, Olivier Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, 2007.

Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. arXiv:1509.05009 [cs.NE], 2015.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Johan Håstad. Computational Limitations of Small Depth Circuits. PhD thesis, Massachusetts Institute of Technology, 1986.

Andrey Nikolaevich Kolmogorov. On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition. 114:953–956, 1957.
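The mirror map and its k-fold composition can be sketched directly from the piecewise definition; the following minimal numerical check (not from the note) shows that f_m^k alternates between 0 and 1 on the 2^k uniformly spaced points j·2^{−k}, matching a rapidly alternating labeling that no shallow sawtooth function can track:

```python
def mirror_map(x):
    """The mirror map f_m: 2x on [0, 1/2], 2(1 - x) on (1/2, 1], else 0."""
    if 0.0 <= x <= 0.5:
        return 2.0 * x
    if 0.5 < x <= 1.0:
        return 2.0 * (1.0 - x)
    return 0.0

def iterate_mirror(x, k):
    """f_m^k(x): the k-fold composition, as in the recurrent upper bound."""
    for _ in range(k):
        x = mirror_map(x)
    return x

# Thresholded values alternate on the 2^k points j * 2^{-k}; the arithmetic
# is exact here since all quantities are binary fractions.
k = 3
vals = [iterate_mirror(j * 2.0 ** -k, k) for j in range(2 ** k)]
print(vals)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Each iteration doubles the number of oscillations, which is the quick growth under composition that the counting argument contrasts with the slow growth under addition.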
