Abstract— Recently, researchers in the artificial neural network field have focused their attention on connectionist models composed of several hidden layers. In fact, experimental results and heuristic considerations suggest that deep architectures are more suitable than shallow ones for modern applications, facing very complex problems, e.g., vision and human language understanding. However, the actual theoretical results supporting such a claim are still few and incomplete. In this paper, we propose a new approach to study how the depth of feedforward neural networks affects their ability to implement high-complexity functions. First, a new measure based on topological concepts is introduced, aimed at evaluating the complexity of the function implemented by a neural network used for classification purposes. Then, deep and shallow neural architectures with common sigmoidal activation functions are compared, by deriving upper and lower bounds on their complexity, and by studying how the complexity depends on the number of hidden units and on the activation function used. The obtained results seem to support the idea that deep networks actually implement functions of higher complexity, so that they are able, with the same number of resources, to address more difficult problems.

Index Terms— Betti numbers, deep neural networks, function approximation, topological complexity, Vapnik–Chervonenkis dimension (VC-dim).

I. INTRODUCTION

IN RECENT years, there has been a substantial growth of interest in feedforward neural networks with many layers [1]–[3]. The main idea underlying this research is that, even if common machine learning models often exploit only shallow architectures, deep architectures¹ are required to solve complex AI problems, such as image analysis or language understanding [6]. Such a claim is supported by considerable evidence coming from different research fields. For example, in neuroscience, it is well known that the neocortex of the mammalian brain is organized in a complex layered structure of neurons [7], [8]. In machine learning practice, complex applications are commonly approached by passing the data through a sequence of stages, e.g., preprocessing, prediction, and postprocessing. In the specific field of artificial neural networks, researchers have proposed and experimented with several models exploiting multilayer architectures, including convolutional neural networks [9], cascade correlation networks [10], deep neural networks [4], [5], layered graph neural networks [11], and networks with adaptive activations [12]. In addition, even recurrent/recursive models [13], [14] and graph neural networks [15], which cannot strictly be defined as deep architectures, actually implement the unfolding mechanism during learning, which, in practice, produces networks that are as deep as the data they have to model. All the above-mentioned architectures have found in recent years a wide application to challenging problems in pattern recognition, such as, for example, natural language processing [16], chemistry [17], image processing [18], [19], and web mining [20], [21].

¹ It is worth noting that, in this paper, the terms "deep architectures" and "deep networks" are completely interchangeable and are used to address the general class of feedforward networks with many hidden layers, whereas in [4] and [5] "deep network" identifies a specific model in the above class.

Nevertheless, from a theoretical point of view, the advantages of deep architectures are not yet completely understood. The existing results are limited to neural networks with boolean inputs and units, implementing logical gates or threshold activation functions, and to sum-product networks [22], [23]. In the first case, it has been proved that the implementation of some maps, for instance, the parity function, requires a different number of units according to whether a shallow or a deep architecture is exploited. However, such results deal only with specific cases and do not provide a general tool for studying the computational capabilities of deep neural networks. For sum-product networks, instead, deep architectures were proven to be more efficient in representing real-valued polynomials. However, there seems to be no way to directly extend those results to common neural networks with sigmoidal activation functions.

Intuitively, an important current limitation is that no measure is available to evaluate the complexity of the functions implemented by neural networks. In fact, the claim that deep networks can solve more complex problems can be reformulated as: deep neural networks can implement functions with higher complexity than shallow ones, when using the same number of resources. However, to the best of our knowledge, the concept of a high-complexity function has not been formally defined yet in the artificial neural network field. Actually, the complexity measures commonly used are not useful for our purposes, since they deal with different concepts,

Manuscript received May 6, 2013; revised October 12, 2013; accepted November 21, 2013. Date of publication January 2, 2014; date of current version July 14, 2014.
The authors are with the University of Siena, Siena 53100, Italy (e-mail: monica@dii.unisi.it; franco@dii.unisi.it).
Digital Object Identifier 10.1109/TNNLS.2013.2293637
2162-237X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
1554 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 25, NO. 8, AUGUST 2014
TABLE I
Upper and Lower Bounds on the Growth of $B(S_N)$ for Networks With $h$ Hidden Units, $n$ Inputs, and $l$ Hidden Layers. The Bound in the First Row Is a Well-Known Result Available in [26]

has not been proved for networks with the hyperbolic tangent function.⁵ Interestingly, we will clarify that the extension of such results may require some significant advances in algebraic topology, since they are related to some important and still unsolved problems.

⁵ It is worth mentioning that all the presented results for the hyperbolic tangent apply also to the logistic sigmoid, $\mathrm{logsig}(a) = 1/(1 + e^{-a})$, which can be obtained by translations and dilations from $\tanh(\cdot)$.

Finally, from a practical point of view, both propositions suggest that deep networks have the capability of implementing more complex sets than shallow ones. This is particularly interesting if compared with current results on the VC-dim of multilayer perceptrons. Actually, the available bounds on the VC-dim with respect to the number of network parameters⁶ do not depend on whether deep or shallow architectures are considered. Since the VC-dim is related to generalization, the two results together seem to suggest that deep networks allow the complexity of the implemented classifiers to be largely increased without significantly affecting their generalization behavior. This paper also includes an intuitive and preliminary discussion on the peculiarities that an application domain should have in order to take advantage of deep networks.

⁶ Current bounds range from $O(p \log p)$ to $O(p^4)$, according to the activation function used, where $p$ is the number of network parameters.

This paper is organized as follows. In Section II, the notation is introduced. Section III collects the main results. In Section IV, such results are discussed and compared with the state-of-the-art achievements. Finally, conclusions are drawn in Section V.

II. NOTATION

$$o_k^l = \sigma\left(\sum_{i=1}^{h_{l-1}} w_{k,i}^l\, o_i^{l-1} + b_k^l\right)$$

where $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function, and $h_0 = n$. We assume here that the activation function of the unique neuron in the output (last) layer is the identity function, i.e., $\sigma(a) = a$.⁷ On the other hand, for the activation function of the hidden layers, four common cases will be studied: 1) the inverse tangent function, i.e., $\sigma(y) = \arctan(y) = \tan^{-1}(y)$; 2) the hyperbolic tangent function, i.e., $\sigma(y) = \tanh(y) = (e^{y} - e^{-y})/(e^{y} + e^{-y})$; 3) polynomial activation functions, i.e., $\sigma(y) = p(y)$, where $p$ is a polynomial; and 4) sigmoid functions, i.e., $\sigma$ is a bounded monotone increasing function for which there exist $l_\sigma, r_\sigma \in \mathbb{R}$ such that $\lim_{y \to -\infty} \sigma(y) = l_\sigma$ and $\lim_{y \to \infty} \sigma(y) = r_\sigma$.

⁷ In common applications, output neurons with linear or sigmoidal activation functions are usually employed. In this paper, the analysis is carried out only for the former case, but most of the presented results can easily be extended to the latter.

Finally, the network output is given by the output of the neuron in the last layer. Thus, the function $f_N$, implemented by a three-layer network $N$ (one hidden layer), with $n$ inputs and $h_1$ hidden units, may be expressed in the form

$$f_N(x) = b_1^2 + \sum_{i_1=1}^{h_1} w_{1,i_1}^2\, \sigma\left(b_{i_1}^1 + \sum_{i_0=1}^{n} w_{i_1,i_0}^1\, x_{i_0}\right). \qquad (1)$$
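As a concrete illustration, the three-layer form of Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper; the sizes, random seed, and weight values are arbitrary placeholders:

```python
import numpy as np

def f_three_layer(x, W1, b1, w2, b2, sigma=np.arctan):
    """Eq. (1): f_N(x) = b^2_1 + sum_{i1} w^2_{1,i1} sigma(b^1_{i1} + sum_{i0} w^1_{i1,i0} x_{i0}).

    One hidden layer of h units; the single output neuron uses the identity activation.
    """
    return b2 + w2 @ sigma(W1 @ x + b1)

rng = np.random.default_rng(0)
n, h = 3, 5                       # input dimension and hidden units (arbitrary)
W1 = rng.standard_normal((h, n))  # weights w^1_{i1,i0}
b1 = rng.standard_normal(h)       # biases b^1_{i1}
w2 = rng.standard_normal(h)       # weights w^2_{1,i1}
b2 = rng.standard_normal()        # bias b^2_1

x = rng.standard_normal(n)
y = f_three_layer(x, W1, b1, w2, b2)
print(y, y > 0)  # scalar output, and membership of x in S_N = {x : f_N(x) > 0}
```

The same helper works for any of the four hidden activations studied here by passing, e.g., `sigma=np.tanh` or a polynomial lambda.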
On the other hand, a four-layer network with, respectively, $h_1$ and $h_2$ units in the first and in the second hidden layer, implements the function

$$f_N(x) = b_1^3 + \sum_{i_2=1}^{h_2} w_{1,i_2}^3\, \sigma\left(b_{i_2}^2 + \sum_{i_1=1}^{h_1} w_{i_2,i_1}^2\, \sigma\left(b_{i_1}^1 + \sum_{i_0=1}^{n} w_{i_1,i_0}^1\, x_{i_0}\right)\right)$$

and so on.

B. Betti Numbers and Pfaffian Functions

Formally, the $k$th Betti number, $b_k(S)$, is defined as the rank of the $k$th homology group of the space $S$. The reader is referred to textbooks on algebraic topology for a wider introduction to homology [25], [27]. Here, it suffices to say that homology theory establishes a link between topological and algebraic concepts, and provides a theoretical framework that makes it possible to compare objects according to their forms.

Homology theory starts from the observation that two topological spaces can be distinguished by looking at their holes, and provides a formal mathematical tool for determining different types of holes. Actually, homology is also defined to be based on cycles. In fact, a $k$-dimensional compact hole can be identified by its $(k-1)$-dimensional cyclic border. For example, the border of the 3-D hole inside a sphere is the surface of the sphere itself; a torus has two 2-D holes, corresponding to the cyclic paths around the inside (of the "donut") and around the tube, respectively (Fig. 1). The $k$th homology group of a space $S$, intuitively, collects a coding of all the different types of $(k+1)$-dimensional holes/cycles contained in $S$. Thus, the $k$th Betti number provides a sort of signature to recognize similar objects, by counting the number of $(k+1)$-dimensional holes.

Homology theory plays a fundamental role in many disciplines, including, for instance, physics [28], image processing [29], and sensor networks [30].

A sequence $(f_1, \ldots, f_\ell)$ of real analytic functions in $\mathbb{R}^n$ is a Pfaffian chain on the connected component $U \subseteq \mathbb{R}^n$ if there exists a set of polynomials $P_{i,j}$ such that, for each $x \in U$,

$$\frac{\partial f_i}{\partial x_j} = P_{i,j}(x, f_1(x), \ldots, f_i(x)), \quad 1 \le i \le \ell,\; 1 \le j \le n \qquad (2)$$

holds. A Pfaffian function in the chain $(f_1, \ldots, f_\ell)$ is a map $f$, expressed as a polynomial $P$ in $x$ and the $f_i(x)$, $i = 1, \ldots, \ell$:

$$f(x) = P(x, f_1(x), \ldots, f_\ell(x)).$$

The complexity (or the format) of the Pfaffian function $f$ is defined by a triplet $(\alpha, \beta, \ell)$, where $\alpha$ is the degree of the chain, i.e., the maximum degree of the polynomials $P_{i,j}$, $\beta$ is the degree of the Pfaffian function, i.e., the degree of $P$, and $\ell$ is the length of the chain.

The class of Pfaffian maps is very wide and includes most of the functions commonly used in engineering and computer science applications. Actually, most of the elementary functions are Pfaffian, e.g., the exponential, the trigonometric functions,⁸ the polynomials, and all the algebraic functions. In addition, the compositions and the inverses of Pfaffian functions are Pfaffian. Therefore, the natural logarithm, $\ln(x)$, is Pfaffian, being the inverse of the exponential function $e^x$, and the logistic sigmoid $\mathrm{logsig}(x) = 1/(1 + e^{-x})$ is Pfaffian, since it is an algebraic function of Pfaffian functions.

⁸ More precisely, $\sin(\cdot)$ and $\cos(\cdot)$ are Pfaffian within a period. For instance, $\sin(\cdot)$ and $\cos(\cdot)$ are Pfaffian in $[0, 2\pi]$.

III. THEORETICAL RESULTS

In this section, we collect some theoretical results on function complexity, which can be applied to neural networks used in the classification framework, in order to compare deep and shallow architectures. In Sections III-A and III-B, results on upper and lower bounds, evaluated for different activation functions (threshold, polynomial, and sigmoidal), are, respectively, reported. For the theorems' proofs, the reader is referred to the Appendix, which also contains a discussion on possible extensions of the presented results.

A. Upper Bounds

The first theorem we propose provides an upper bound on the sum of the Betti numbers of the positive set realized by three-layer networks with $\arctan(\cdot)$ activation function.

Theorem 1: Let $N$ be a three-layer neural network with $n$ inputs and $h$ hidden neurons, with arctangent activation in the hidden units. Then, for the set $S_N = \{x \in \mathbb{R}^n \mid f_N(x) > 0\}$, $B(S_N) \in O((n + h)^{n+2})$ holds.

Thus, $B(S_N)$ grows at most polynomially in the number of hidden neurons. Interestingly, a similar result can be obtained for analogous shallow architectures having a single hidden layer of polynomial units.

Theorem 2: Let $p$ be a polynomial of degree $r$. Let $N$ be a three-layer neural network with $n$ inputs and $h$ hidden neurons, having $p$ as activation function. Then, for the set $S_N = \{x \in \mathbb{R}^n \mid f_N(x) > 0\}$, it holds that

$$B(S_N) \le \frac{1}{2}(2 + r)(1 + r)^{n-1}.$$

In addition, when the network has only one input and the activation function is the arctangent, the problem of defining a bound on $B(S_N)$ becomes simpler. In fact, if $n = 1$, then $S_N$ is a set of intervals in $\mathbb{R}$ and the only nonnull Betti number is the first one, i.e., $b_0(S_N)$, which counts the number of intervals where $f_N(x)$ is nonnegative. In this case, as proved in the following theorem, the bound on $b_0(S_N)$ is linear.

Theorem 3: Let $N$ be a three-layer neural network with one input and $h$ hidden neurons. If the activation function used in the hidden units is the arctangent, then it holds that

$$b_0(S_N) \le h.$$

Finally, we derive some upper bounds for deep networks, which can have any number of hidden layers.

Theorem 4: Let $N$ be a network with $n$ inputs, one output, $l$ hidden layers, and $h$ hidden units with activation $\sigma$. If $\sigma$ is a Pfaffian function such that the polynomials $P_{i,j}$, defining its chain, do not depend on the input variable $x$ (see Eq. (2)), $\sigma$ has complexity $(\alpha, \beta, \ell)$, and $n \le \ell h$, then

$$B(S_N) \le 2^{\ell h(\ell h - 1)/2}\, O\left(\big(n((\alpha + \beta - 1 + \alpha\beta)l + \beta(\alpha + 1))\big)^{n + \ell h}\right)$$

holds.
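To make the notion of a Pfaffian chain concrete, here is a small numerical check, an illustrative sketch not taken from the paper: $\tanh$ satisfies the chain relation of Eq. (2) with a chain of length $\ell = 1$, since $\tanh'(x) = 1 - \tanh^2(x)$, i.e., $P_{1,1}(x, f) = 1 - f^2$ (so $\alpha = 2$), and $P_{1,1}$ does not depend on $x$, which is exactly the technical hypothesis used by Theorem 4; $\arctan$, by contrast, needs a chain of length $\ell = 2$ passing through $1/(1 + x^2)$:

```python
import numpy as np

# tanh'(x) = 1 - tanh(x)**2: a Pfaffian chain of length 1 whose defining
# polynomial P_{1,1}(x, f) = 1 - f**2 does not depend on x.
x = np.linspace(-3.0, 3.0, 1001)
eps = 1e-6
numeric_derivative = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
chain_rhs = 1.0 - np.tanh(x) ** 2
print(np.max(np.abs(numeric_derivative - chain_rhs)))  # tiny: the relation holds
```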
BIANCHINI AND SCARSELLI: ON THE COMPLEXITY OF NEURAL NETWORK CLASSIFIERS 1557
In addition, if $\sigma = \arctan(\cdot)$ and $n \le 2h$, then

$$B(S_N) \le 2^{h(2h-1)}\, O\left((nl + n)^{n+2h}\right)$$

and, if $\sigma = \tanh(\cdot)$ and $n \le h$,

$$B(S_N) \le 2^{h(h-1)/2}\, O\left((nl + n)^{n+h}\right).$$

It is worth noting that the hypothesis $n \le \ell h$ ($n \le 2h$, $n \le h$) is not restrictive for our purposes, since we are interested in the asymptotic behavior of deep architectures, with the number of hidden layers (and, therefore, the number of hidden units) tending to infinity. Similarly, the assumption on the $P_{i,j}$ defining the chain of $\sigma$ is satisfied by common activation functions and represents just a technical hypothesis added to simplify the proof of Theorem 4 (see the Appendix for more details).

Finally, for deep networks with polynomial activation functions, the upper bound of Theorem 4 can be improved as follows.

Corollary 1: Let $f_N(x)$ be the function implemented by a neural network with $n$ inputs, one output, and $l$ hidden layers, composed of neurons with a polynomial activation function of degree $r$. Then, it holds that

$$B(S_N) \le \frac{1}{2}(2 + r^l)(1 + r^l)^{n-1}.$$

B. Lower Bounds

In the following, we present some results on lower bounds of $B(S_N)$, for both deep and shallow networks. In particular, we study the growth of $B(S_N)$ when the number of layers is progressively increased. Actually, the presented results concern $b_0(S_N)$. However, since $B(S_N) \ge b_0(S_N)$, they also hold for $B(S_N)$.

In the next theorem, we prove that, for networks of growing depth, $B(S_N)$ can actually become exponential with respect to the number of hidden neurons. In fact, Theorem 5 provides an exponential lower bound on $b_0(S_{N_l})$, for deep networks with hidden neurons having sigmoid activation.

Theorem 5: For any positive integers $l$, $n$, and any sigmoid activation function, there exists a multilayer network $N_l$ with $n$ inputs, a unique output, $l$ hidden layers, and two neurons per layer, such that

$$b_0(S_{N_l}) \ge 2^{l-1} + 1.$$

Interestingly, Theorem 5 can be extended to networks with nonlinear continuously differentiable activation functions.

Theorem 6: Let $\sigma \in C^3$ be a nonlinear function in an open interval $U \subseteq \mathbb{R}$. Then, for any positive integers $l$, $n$, there exists a multilayer network $N_l$, with $n$ inputs, a unique output, $l$ hidden layers, and two neurons in each layer, with activation $\sigma$, such that

$$b_0(S_{N_l}) \ge 2^{l-1} + 1.$$

Notice that, even if both Theorems 5 and 6 can be applied to the common activation functions $\tanh(\cdot)$ and $\arctan(\cdot)$, they deal with slightly different classes of functions. In particular, Theorem 5 copes with general sigmoidal activation functions, including, e.g., piecewise continuous functions, whereas Theorem 6 deals with nonlinear $C^3$ activations, a class that includes all the polynomials with degree larger than one.

On the other hand, the obtained results are not useful for networks with a fixed number of layers, since, in this case, the bound becomes a constant. In the following, a lower bound for the complexity of three-layer networks is devised, varying the number of hidden neurons. More precisely, in the next theorem, we show that $b_0(S_N)$ can grow as $((h-1)/2n)^n$ for three-layer networks with sigmoid activation functions. Interestingly, the lower bound is $\Omega(h^n)$, and it is optimal for networks with the $\arctan(\cdot)$ function, since it equals the upper bound devised in Theorem 1. It is also optimal for networks with threshold activation (see [26], [31]).

Theorem 7: For any positive integers $n$, $m$, and any sigmoidal activation function $\sigma$, there exists a three-layer network $N$, with $n$ inputs and $h = n(2m - 1)$ hidden units, for which

$$b_0(S_N) \ge \left(\frac{h-1}{2n}\right)^n.$$

It is worth mentioning that a tighter lower bound, $b_0(S_N) \ge (h/n)^n$, can be obtained by assuming that the activation function is nonpolynomial and analytic at some point⁹ (see the theorem's proof for more details).

⁹ A function is analytic at a given point if its Taylor series, computed with respect to such a point, converges.

Instead, Theorem 7 cannot be extended to three-layer networks with polynomial activations, since they are not universal approximators. Actually, it can easily be seen that, in this case, $f_N$ is a polynomial having at most the degree of the activation function, independently of the number of hidden units. In fact, also the upper bound in Theorem 2 does not depend on the number of hidden neurons, but only on the degree of the activation function and on the input space dimension.

IV. RELATED LITERATURE AND DISCUSSION

This paper belongs to the wide class of studies whose intent has been that of characterizing the peculiarities of the functions implemented by artificial neural networks. In the early 90s, the capacity of multilayer networks (with sigmoidal, RBF, and threshold activation functions) of approximating any continuous map up to any degree of precision was established [32]–[35]. Such seminal studies have subsequently been extended in several ways, considering networks with very generic neurons (e.g., with analytic activation functions), expanding the class of approximable maps (e.g., to integrable functions [36]), or studying nonstatic networks (e.g., recurrent, recursive, and graph neural networks [37]–[39]). Unfortunately, those results do not allow one to distinguish between deep and shallow architectures, since they are based on common properties instead of on distinguishing features.

On the other hand, researchers in the field of logic networks have actually pursued the goal of defining the effect of the network depth on the amount of resources required to implement a given function. In this context, it has been shown that there exist boolean functions whose realization requires a polynomial number of logic gates (AND, OR, and NOT) using
a network with $l$ layers, whereas an exponential number of gates is needed for a smaller number of layers [40], [41]. A well-known function of this class is the parity function,¹⁰ which requires an exponential number of gates (in the number of inputs) if the network has two layers, whereas it can be implemented by a linear number of logic gates if a logarithmic number of layers (in the input dimension) is employed. It is worth mentioning that it has also been proved that most boolean functions require an exponential number of gates to be computed with a two-layer network [42].

¹⁰ The parity function, $f : \{0,1\}^n \to \{0,1\}$, is defined as $f(x) = 1$ if $\sum_{i=1}^n x_i$ is odd, and $f(x) = 0$ if $\sum_{i=1}^n x_i$ is even.

In addition, in [22], deep and shallow sum-product networks¹¹ were compared, using two classes of functions. Their implementation by deep sum-product networks requires a linear number of neurons with respect to the input dimension $n$ and the network depth $l$, whereas a two-layer network needs at least $O(2^{\sqrt{n}})$ and $O((n-1)^l)$ neurons, respectively.

¹¹ A sum-product network consists of neurons that compute either the product or a weighted sum of their inputs.

Finally, similar results were obtained for weighted threshold circuits.¹² For example, it has been proved that there are monotone functions¹³ $f_k$ that can be computed with depth $k$ and a linear number of logic (AND, OR) gates, but require an exponential number of gates to be computed by a weighted threshold circuit with depth $k - 1$ [44].

¹² Threshold circuits are multilayer networks with threshold activation functions, i.e., $\sigma(a) = 1$ if $a \ge 0$, and $\sigma(a) = 0$ if $a < 0$. Interestingly, threshold circuits are more powerful than logic networks and allow the parity function to be implemented by a two-layer network with $n$ threshold gates [43].

¹³ See [44] for a definition of monotone function, in this case.

The above-mentioned results, however, cannot be applied to common backpropagation neural networks. In this sense, as far as we know, the theorems reported in this paper are the first attempt to compare deep and shallow architectures that exploit the common $\tanh(\cdot)$ and $\arctan(\cdot)$ activation functions.

Interestingly, the presented theorems are also related to those on the VC-dim, both from a mathematical and a practical point of view. An exhaustive explanation of such a mathematical relationship would require the introduction of tools that are out of the scope of this paper. Here, it suffices to say that some of the results obtained for the VC-dim can be derived using the same theory, on Betti numbers and Pfaffian functions, that we will use in our proofs (see [45]). On the other hand, in order to clarify the practical relationship, it is worth pointing out that the VC-dim provides a way of measuring the generalization capability of a classifier. Currently available results show that the VC-dim is $O(p^2)$ for networks with $\arctan(\cdot)$ or $\tanh(\cdot)$ activation, where $p$ is the number of parameters [45]–[47]. In addition, for generic sigmoidal activation functions,¹⁴ a lower bound $\Omega(p^2)$ has been devised, so that the upper bound $O(p^2)$ for $\arctan(\cdot)$ and $\tanh(\cdot)$ is optimal. Thus, the VC-dim is polynomial with respect to the number of parameters and, as a consequence, it is polynomial also in the number of hidden units. These bounds do not depend on the number of layers in the network, thus suggesting that, in practice, the depth of a network has a larger impact on the complexity of the implemented function than on its generalization capability. In other words, using the same amount of resources, deep networks are able to face more complex applications without losing their generalization capability.

¹⁴ More precisely, this result holds if the activation function $\sigma$ has left and right limits and is differentiable at some point $a$ where $\sigma'(a) \ne 0$ also holds.

Let us now discuss the limits of the presented theorems. Further limits will be discussed in Subsection I of the Appendix. First, as mentioned in Section I, the claim that the sum of the Betti numbers related to sets implemented by common multilayer networks grows exponentially in the number of neurons if the network is deep, whereas it grows polynomially for a three-layer architecture, has been proved only for some activation functions. The most evident limitation of our results lies in the lack of a tight and polynomial upper bound for three-layer networks with hyperbolic tangent activation (Table I). Such a limitation may appear surprising, since a bound has been derived for the $\arctan(\cdot)$ function, which is, intuitively, very similar to $\tanh(\cdot)$. Actually, the difference between $\arctan(\cdot)$ and $\tanh(\cdot)$ is more technical than fundamental. In the Appendix, it will be explained how the problem of devising a tighter bound for three-layer networks with $\tanh(\cdot)$ activation is strictly related to a well-known and unsolved problem on the number of solutions of polynomial systems. In that context, it will also be discussed in more depth how the general goal of improving the bounds reported in Table I is related to unsolved challenging problems in mathematics as well.

In addition, we have considered only networks with Pfaffian activation functions. Though this is a general feature of the activations commonly used in backpropagation networks, such an assumption is fundamental and cannot be easily removed, because it ensures that the Betti numbers of the set implemented by the network are finite. Actually, in [48], it has been shown that the VC-dim is infinite for networks with one input and two hidden units, with activation function $\sigma(a) = \pi^{-1}\arctan(a) + (\alpha(1 + a^2))^{-1}\cos(a) + 1/2$, where $\alpha > 2\pi$. In fact, the presence of the sinusoidal component makes it possible to construct a single-input network $N$ for which $f_N$ has an infinite number of roots. Obviously, in this case, $B(S_N)$ is also infinite. However, such a degenerate behavior is not possible when the activation function is Pfaffian, since a system of $n$ Pfaffian equations in $n$ variables has a finite number of solutions [49].

Also, one may wonder whether there are alternatives to the sum of the Betti numbers, $B(S_N) = \sum_i b_i(S_N)$, for measuring the complexity of the regions realized by a classifier. Actually, it is easy to see that other measures exist, and we must admit that our choice of using $B(S_N)$ is due both to the availability of mathematical tools suitable to derive a set of useful bounds and to the fact that it provides a reasonable way of distinguishing topological complexity details. Indeed, $b_0(S_N)$ (the number of connected regions of $S_N$) is a possible alternative to $B(S_N)$, even if it captures less information on the set $S_N$,¹⁵ while we are not aware of any tighter bound that can be derived using $b_0$.

¹⁵ For example, using $B(S_N)$, we can recognize that the region representing the character B is more complex than the region D, which, in turn, is more complex than the region I, whereas all these regions have the same complexity according to $b_0(S_N)$.
V. C ONCLUSIONS
In this paper, we proposed a new measure, based on
topological concepts, to evaluate the complexity of functions
implemented by neural networks. The measure has been used
for comparing deep and shallow feedforward neural networks.
It has been shown that, with the same number of hidden units,
deep architectures can realize maps with a higher complexity
with respect to shallow ones. The above statement holds
for networks with both arctangent and polynomial activation
functions. This paper analyzes the limits of the obtained results
and describes technical difficulties, which prevented us from
Fig. 2. Plots of the set S f , for f = g (top left), f = g ◦ t (top right), extending them to other commonly used activation functions.
f = g ◦ t2 (bottom left), and f = g ◦ t3 (bottom right), where g = 1 −
x
2 , An informal discussion on the practical differences between
t (x) = [1 − 2x12 , 1 − 2x22 ], t2 = t ◦ t, and t3 = t ◦ t ◦ t hold.
deep and shallow networks is also included.
Interestingly, the proposed complexity measure provides a
tool that allows us to study connectionist models from a new
can be derived using b0 . On the other hand, it is not difficult perspective. Actually, it is a future matter of research the appli-
to design a measure that captures the fact that the character M cation of the proposed measure to more complex architectures,
looks more complex than the character I, a difference which such as convolutional neural networks [9], recurrent neural
is not recognized by B(SN ), but then it becomes challenging networks [13], [50], and neural networks for graphs [14], [15].
to derive useful bounds using such a measure. Hopefully, this analysis will help in comparing the peculiar-
Finally, it may help to provide an intuitive explanation of the ities of different models and in disclosing the role of their
possible advantages of deep architectures. First, notice that the architectural parameters.
kth hidden layer of a feedforward network realizes a function
that correlates the outputs of the neurons in layer k − 1 with A PPENDIX
the inputs of the neurons in layer k + 1. Such a correlation
In this appendix, we collect the proofs of the presented
can be represented as a map, so that the global function
theorems along with some comments on future research,
implemented by a deep network results in a composition of
several maps, whose number depends on the number of its hidden layers. On the other hand, the function composition mechanism intuitively allows to replicate the same behavior on different regions of the input space. Without providing a formal definition of such a statement, let us illustrate this concept with an example. Consider the composition f = g ∘ t, with g : D → IR, t : D → D, defined on the domain D ⊆ IR^n. In addition, let us assume that there exist m sets, A_1, ..., A_m ⊆ D, such that t(A_i) = D, 1 ≤ i ≤ m. We can observe that the number of connected regions b_0(S_f) of the set S_f = {x | f(x) ≥ 0} is at least m times b_0(S_g), where S_g = {x | g(x) ≥ 0}. In fact, intuitively, f behaves on each A_i as g behaves on the whole domain D. In addition, this argument can be extended to the case when g is composed with t several times, i.e., f = g ∘ t_k, where t_k = t ∘ t ∘ ... ∘ t for k occurrences of t. In this case, the ratio b_0(S_f)/b_0(S_g) is at least m^k. For example, Fig. 2 shows the set S_{g∘t}, when g = 1 − ‖x‖^2, t(x) = [1 − 2x_1^2, 1 − 2x_2^2], and D = {x | −1 ≤ x_1 ≤ 1, −1 ≤ x_2 ≤ 1}. Since t maps four distinct regions to D, then S_g is composed of a single connected region, S_{g∘t} of four connected regions, S_{g∘t_2} of 16 regions, S_{g∘t_3} of 64 regions, and so on. Therefore, we can intuitively conclude that deep networks are able to realize, with few resources, functions that replicate the same behavior on a number of regions of the input space that grows exponentially with the depth.

The proofs reported in the following exploit an important property of Pfaffian functions, according to which the solutions of systems of equalities and inequalities based on Pfaffian functions define sets whose Betti numbers are finite, and can be estimated by the format of the functions themselves. Therefore, let us start this section by introducing some theoretical properties of Pfaffian varieties.

Formally, a Pfaffian variety is a set V ⊂ IR^n defined by a system of equations based on Pfaffian functions g_1, ..., g_t, i.e., V = {x ∈ U | g_1(x) = 0, ..., g_t(x) = 0}, where U ⊂ IR^n is the connected component on which g_1, ..., g_t are defined. A semi-Pfaffian variety S on V is a subset of V defined by a set of sign conditions (inequalities or equalities) based on Pfaffian functions f_1, ..., f_s, i.e., S = {x ∈ U | f_1(x) R_1 0, ..., f_s(x) R_s 0}, where, for each i, R_i represents a partial order relation in {<, >, ≤, ≥, =}. For our purposes, we will use the following theorem, due to Zell [51].

Theorem 8: Let S be a compact semi-Pfaffian variety in U ⊂ IR^n, given on a compact Pfaffian variety V, of dimension n′, defined by s sign conditions on Pfaffian functions. If all the functions defining S have complexity at most (α, β, ℓ), then the sum of the Betti numbers satisfies

    B(S) ∈ s^{n′} 2^{ℓ(ℓ−1)/2} O((nβ + min(n, ℓ)α)^{n+ℓ}).   (3)
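The region-multiplication effect of composing with t can be checked numerically. The following Python sketch is not part of the original paper: all function names are ours, and the radius of g is slightly shrunk (0.9 instead of 1) so that the lobes of S_{g∘t_k} do not touch at isolated points, which keeps a grid-based count stable. It counts the 4-connected components of {x | (g ∘ t_k)(x) > 0} on a grid over D by flood fill.

```python
from collections import deque

def g(u, v):
    # Shrunk version of g(x) = 1 - ||x||^2 (radius 0.9 is our tweak).
    return 0.9 - u * u - v * v

def t(u, v):
    return 1.0 - 2.0 * u * u, 1.0 - 2.0 * v * v

def f(k, u, v):
    # (g o t_k)(x): apply t k times, then g.
    for _ in range(k):
        u, v = t(u, v)
    return g(u, v)

def count_regions(k, n=401):
    # Sample {f_k > 0} on an n x n grid over [-1, 1]^2 and count
    # 4-connected components by breadth-first flood fill.
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    inside = [[f(k, xs[i], xs[j]) > 0.0 for j in range(n)] for i in range(n)]
    seen = [[False] * n for _ in range(n)]
    regions = 0
    for i in range(n):
        for j in range(n):
            if inside[i][j] and not seen[i][j]:
                regions += 1
                q = deque([(i, j)])
                seen[i][j] = True
                while q:
                    a, b = q.popleft()
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        x, y = a + da, b + db
                        if 0 <= x < n and 0 <= y < n and inside[x][y] and not seen[x][y]:
                            seen[x][y] = True
                            q.append((x, y))
    return regions

for k in range(3):
    print(k, count_regions(k))
```

With this choice the counts follow the m^k pattern of the text (1, 4, 16, ...), each application of t multiplying the number of connected regions by four.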
1560 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 25, NO. 8, AUGUST 2014
In our case, Theorem 8 is applied to the set S = S_N = {x | f_N(x) ≥ 0}, considered on the whole IR^n.16 In this case, the term s^{n′} in (3) can be ignored, since s = 1.

Finally, it is worth noting that a bound is available also for the sum of the Betti numbers of semialgebraic sets, i.e., sets defined by inequalities involving polynomials [53].

Theorem 9: Let S be the set defined by s inequalities, q_1(x) ≥ 0, ..., q_s(x) ≥ 0, where the q_i are polynomials in n real variables. Let d be the sum of the degrees of the q_i. Then

    B(S) ≤ (1/2)(2 + d)(1 + d)^{n−1}   (4)

holds.

In the following, the proofs of the theorems are reported, together with some lemmas that are functional to such proofs.

A. Proof of Theorem 1

The next lemma defines the exact format of the function implemented by a three-layer network with arctangent activation.

Lemma 1: Let N be a three-layer neural network, having one output and h hidden neurons, with arctangent activation. Then, the function f_N(x), implemented by N, is Pfaffian, with complexity (2h, 2, 2).

Proof: Since d(arctan(a))/da = 1/(1 + a^2),

    ∂f_N(x)/∂x_k = Σ_{i_1=1}^{h} w^2_{1,i_1} w^1_{i_1,k} / (1 + (Σ_{i_0=1}^{n} w^1_{i_1,i_0} x_{i_0} + b^1_{i_1})^2) = Q(x)/T(x)

where Q and T are polynomials in x_1, ..., x_n, of degree at most 2(h − 1) and 2h, respectively.17 Let us now consider the sequence (f_1, f_2), with f_1(x) = Q(x) and f_2(x) = 1/T(x), which is a Pfaffian chain, with degree α = 2h and length ℓ = 2. Actually, f_1 is straightforwardly Pfaffian, since it is a polynomial, whereas for f_2 the following equation holds:

    ∂f_2(x)/∂x_j = −T_j(x)/T(x)^2 = −T_j(x) · (f_2(x))^2

being T_j the partial derivative of T with respect to x_j. Finally, since ∂f_N(x)/∂x_k can be computed as a polynomial in f_1(x) and f_2(x), namely ∂f_N(x)/∂x_k = f_1(x) · f_2(x), then f_N is a Pfaffian function with complexity (2h, 2, 2).

Proof of Theorem 1: Theorem 1 is a direct consequence of Theorem 8, which provides a bound for Pfaffian functions, and Lemma 1.

B. Proof of Theorem 2

The bound follows from Theorem 9, applied to f_N(x), with d = r, proving the thesis.

C. Proof of Theorem 3

The first Betti number b_0(S_N) equals the number of intervals in IR where f_N(x) is nonnegative. Thus, b_0(S_N) can be computed from the number of the roots of f_N and, in turn, according to the intermediate value theorem, from the number of extreme points (maxima and minima) of f_N. More precisely, f_N(x) has at most r + 1 roots, which define r + 2 intervals where f_N(x) has constant sign. Since the number of intervals where f_N(x) is nonnegative is at most a half of the total number of intervals, then

    b_0(S_N) ≤ ⌈(r + 2)/2⌉   (5)

holds, where ⌈·⌉ is the ceil operator. On the other hand, from Lemma 1, ∂f_N(x)/∂x = Q(x)/T(x), where Q has degree at most r = 2(h − 1). Thus, f_N(x) has at most 2(h − 1) extreme points, which, along with Eq. (5), proves the thesis.

D. Proof of Theorem 4

The strategy used to prove Theorem 4 is similar to that adopted for Theorem 1 and requires to first compute the bounds on the complexity of the Pfaffian function implemented by a multilayer perceptron.

Lemma 2: Let σ : IR → IR be a function for which there exist a Pfaffian chain c = (σ_1, ..., σ_ℓ) and ℓ + 1 polynomials, Q and P_i, 1 ≤ i ≤ ℓ, of degree β and α, respectively, such that

    dσ_i(a)/da = P_i(a, σ_1(a), ..., σ_i(a)),  1 ≤ i ≤ ℓ   (6)
    σ(a) = Q(σ_1(a), ..., σ_ℓ(a)).   (7)

Let f_N(x) be the function implemented by a neural network with n inputs, one output, l ≥ 1 hidden layers, and h hidden units with activation σ. Then, f_N(x) is Pfaffian with complexity bounded by (ᾱ, β, ℓh), where ᾱ = (α + β − 1 + αβ)l + αβ, in the general case, and ᾱ = (α + β − 1)l + α, if, for all i, P_i does not depend directly on a, i.e., P_i = P_i(σ_1(a), ..., σ_i(a)).

Proof: To prove the lemma, we first show that the activation level

    a^d_i = Σ_{j=1}^{h_{d−1}} w^d_{i,j} o^{d−1}_j + b^d_i   (8)

of the ith neuron in layer d, is a Pfaffian function in the chain s^d, constructed by applying all the σ_i, 1 ≤ i ≤ ℓ, on all the activation levels a^k_j (of neuron j in layer k) up to layer d − 1

    s^d = (σ_1(a^1_1), σ_2(a^1_1), ..., σ_ℓ(a^1_1), ..., σ_1(a^{d−1}_{h_{d−1}}), σ_2(a^{d−1}_{h_{d−1}}), ..., σ_ℓ(a^{d−1}_{h_{d−1}})).

16 Formally, IR^n is the special variety U = {x | 0 = 0}.
17 Actually, Q = Σ_{t=1}^{h} N_t(x) Π_{j≠t} D_j(x) and T = Π_{j=1}^{h} D_j(x), where N_t(x) = w^2_{1,t} w^1_{t,k} and D_j(x) = 1 + (Σ_{i_0=1}^{n} w^1_{j,i_0} x_{i_0} + b^1_j)^2 hold.
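The closed form for ∂f_N/∂x_k derived in the proof of Lemma 1 can be checked against a finite-difference approximation. The sketch below is ours, not part of the paper; the weight values are arbitrary, and it instantiates a three-layer arctangent network with n = 2 inputs and h = 3 hidden units.

```python
import math

# Fixed, arbitrary weights of a 3-hidden-unit, 2-input arctan network.
W1 = [[0.5, -1.2], [0.8, 0.3], [-0.7, 1.1]]   # w^1_{i1,i0}
b1 = [0.1, -0.4, 0.25]                        # b^1_{i1}
W2 = [1.5, -0.6, 0.9]                         # w^2_{1,i1}
b2 = 0.2

def f(x):
    # f_N(x) = sum_i w^2_{1,i} arctan(sum_j w^1_{i,j} x_j + b^1_i) + b^2
    return b2 + sum(W2[i] * math.atan(sum(W1[i][j] * x[j] for j in range(2)) + b1[i])
                    for i in range(3))

def grad_analytic(x, k):
    # Formula from the proof of Lemma 1:
    # df_N/dx_k = sum_i w^2_{1,i} w^1_{i,k} / (1 + a_i^2)
    total = 0.0
    for i in range(3):
        a = sum(W1[i][j] * x[j] for j in range(2)) + b1[i]
        total += W2[i] * W1[i][k] / (1.0 + a * a)
    return total

def grad_numeric(x, k, h=1e-6):
    xp, xm = list(x), list(x)
    xp[k] += h
    xm[k] -= h
    return (f(xp) - f(xm)) / (2.0 * h)

x = [0.3, -0.8]
for k in range(2):
    print(k, grad_analytic(x, k), grad_numeric(x, k))
```

The two columns printed for each k agree up to the accuracy of the central difference, as the rational form Q(x)/T(x) of the gradient predicts.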
BIANCHINI AND SCARSELLI: ON THE COMPLEXITY OF NEURAL NETWORK CLASSIFIERS 1561
Notice that the above hypothesis straightforwardly implies that f_N(x) is Pfaffian and that its length is ℓh, since the output of the network is just the activation level of the unique output neuron in layer l + 1, i.e., f_N(x) = a^{l+1}_1.

In fact, for d ≥ 2, from Eqs. (7) and (8) it follows:

    a^d_i = Σ_{j=1}^{h_{d−1}} w^d_{i,j} Q(σ_1(a^{d−1}_j), ..., σ_ℓ(a^{d−1}_j)) + b^d_i   (9)

so that a^d_i is a polynomial of degree β in the chain s^d. Thus, it remains to prove that s^d is a Pfaffian chain, i.e., the derivative of each function in s^d can be defined as a polynomial in the previous functions of the chain and in the input x. Actually, for each i, j, d, k

    ∂σ_i(a^d_j)/∂x_k = dσ_i(a)/da|_{a=a^d_j} · ∂a^d_j/∂x_k = P_i(a^d_j, σ_1(a^d_j), ..., σ_i(a^d_j)) · ∂a^d_j/∂x_k   (10)

holds, due to the chain rule. The first term in the right part of (10) is a polynomial with respect to the elements of the chain and the input. In fact, based on the previous discussion on a^d_j and on the hypothesis that P_i is a polynomial of degree α in its variables, it follows that P_i(a^d_j, σ_1(a^d_j), ..., σ_i(a^d_j)) has degree βα, in the general case, whereas it has degree α if P_i does not depend directly on a^d_j. Let us now expand the term ∂a^d_j/∂x_k, by iteratively applying the chain rule itself:

    ∂a^d_j/∂x_k = Σ_{r=1}^{h_{d−1}} w^d_{j,r} ∂o^{d−1}_r/∂x_k
    = Σ_{r=1}^{h_{d−1}} w^d_{j,r} dσ(a)/da|_{a=a^{d−1}_r} ∂a^{d−1}_r/∂x_k
    = Σ_{r=1}^{h_{d−1}} w^d_{j,r} dσ(a)/da|_{a=a^{d−1}_r} (Σ_{s=1}^{h_{d−2}} w^{d−1}_{r,s} dσ(a)/da|_{a=a^{d−2}_s} ∂a^{d−2}_s/∂x_k)
    = ···

Therefore, ∂a^d_j/∂x_k can be considered as a polynomial in the derivatives dσ(a)/da|_{a=a^t_{r_t}}, 1 ≤ t ≤ d − 1, where r_1, ..., r_{d−1} is a sequence of neuron indexes. The highest degree terms of such a polynomial are in the form Π_{t=1}^{d−1} dσ(a)/da|_{a=a^t_{r_t}}. In addition

    dσ(a)/da|_{a=a^d_j} = d Q(σ_1(a), ..., σ_ℓ(a))/da |_{a=a^d_j}
    = Σ_{i=1}^{ℓ} ∂Q(y_1, ..., y_ℓ)/∂y_i · P_i(a, y_1, ..., y_i),  with y_i = σ_i(a^d_j), a = a^d_j

holds, so that dσ(a)/da|_{a=a^d_j} is polynomial with respect to the chain s^{d+1} and the activation level a^d_j, which, in turn, is a polynomial in the chain s^d. More precisely, the degree of dσ(a)/da|_{a=a^d_j} is α + β − 1, if, for all i, P_i does not depend directly on a, and it is α + β − 1 + αβ, in the general case. Collecting the results for ∂a^d_j/∂x_k and dσ(a)/da|_{a=a^d_j}, it follows that ∂a^d_j/∂x_k is a polynomial (with respect to the chain s^d and x) whose degree is (α + β − 1 + αβ)(d − 1), in the general case, and (α + β − 1)(d − 1), if, for all i, P_i does not depend directly on a.

Finally, we can estimate the degree of ∂σ_i(a^d_j)/∂x_k in (10), by the bounds on the degrees of P_i(a^d_j, σ_1(a^d_j), ..., σ_i(a^d_j)) and ∂a^d_j/∂x_k. It follows that the degree of ∂σ_i(a^d_j)/∂x_k is at most (α + β − 1 + αβ)(d − 1) + αβ, in the general case, and (α + β − 1)(d − 1) + α, if P_i does not depend directly on a. The bound on ᾱ follows, observing that the unique output neuron belongs to the (l + 1)th layer in a network with l hidden layers.

Proof of Theorem 4: The proof is a direct consequence of Theorem 8 and Lemma 2. In fact, replacing the format given by Lemma 2 into the bound defined in (3), it follows that:

    B(S_N) ≤ 2^{ℓh(ℓh−1)/2} O((nβ + min(n, ℓh) · ((α + β − 1 + αβ)l + αβ))^{n+ℓh})
    ≤ 2^{ℓh(ℓh−1)/2} O((n((α + β − 1 + αβ)l + β(α + 1)))^{n+ℓh}).

On the other hand, the arctan function is Pfaffian in the chain c = (σ_1, σ_2), where σ_1(a) = (1 + a^2)^{−1} and σ_2 is the arctan itself. In fact, we have

    dσ_1(a)/da = −2a(σ_1(a))^2
    dσ_2(a)/da = σ_1(a)

so that the format of arctan is (3, 1, 2). Thus, if the network exploits the arctan function, the format of f_N(x) is (6l + 3, 1, 2h). In addition, by Theorem 8

    B(S_N) ≤ 2^{2h(2h−1)/2} O((n + (6l + 3) · min(n, 2h))^{n+2h})
    ≤ 2^{h(2h−1)} O((6nl + 4n)^{n+2h})
    = 2^{h(2h−1)} O((nl + n)^{n+2h}).

Finally, tanh is Pfaffian in the chain c = (σ_1), where σ_1 is tanh itself. Actually, we have

    d tanh(a)/da = 1 − (tanh(a))^2.

Thus, the format of tanh is (2, 1, 1) and P_1 does not depend directly on a. Then, according to Lemma 2, the complexity of the function implemented by the network with tanh activation function is (2l + 2, 1, h). Then, by Theorem 8

    B(S_N) ≤ 2^{h(h−1)/2} O((n + (2l + 2) · min(n, h))^{n+h})
    ≤ 2^{h(h−1)/2} O((2nl + 3n)^{n+h})
    = 2^{h(h−1)/2} O((nl + n)^{n+h}).

Interestingly, from the proofs of Lemma 2 and Theorem 4 it is clear that an upper bound can also be derived for generic Pfaffian functions (without the hypothesis on the polynomials defining the chain).
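The chain relations used above for arctan and tanh, and the resulting degree bounds ᾱ of Lemma 2, can be sanity-checked numerically. The following sketch is ours (not part of the derivation); it verifies the derivative identities by finite differences and reproduces the formats (6l + 3, 1, 2h) and (2l + 2, 1, h).

```python
import math

def num_deriv(fun, a, h=1e-6):
    # Central finite difference.
    return (fun(a + h) - fun(a - h)) / (2.0 * h)

# arctan chain: c = (s1, s2), with s1(a) = 1/(1+a^2) and s2 = arctan.
s1 = lambda a: 1.0 / (1.0 + a * a)
for a in (-1.3, 0.0, 0.7):
    assert abs(num_deriv(s1, a) - (-2.0 * a * s1(a) ** 2)) < 1e-6   # ds1/da = -2a s1^2
    assert abs(num_deriv(math.atan, a) - s1(a)) < 1e-6              # ds2/da = s1

# tanh chain: c = (tanh), with d tanh/da = 1 - tanh^2.
for a in (-0.9, 0.2, 1.8):
    assert abs(num_deriv(math.tanh, a) - (1.0 - math.tanh(a) ** 2)) < 1e-6

def alpha_bar(alpha, beta, l, depends_on_a):
    # Degree bound from Lemma 2 for a network with l hidden layers.
    if depends_on_a:
        return (alpha + beta - 1 + alpha * beta) * l + alpha * beta
    return (alpha + beta - 1) * l + alpha

# arctan: (alpha, beta, ell) = (3, 1, 2), P_i depends on a      -> alpha_bar = 6l + 3
# tanh:   (alpha, beta, ell) = (2, 1, 1), P_1 independent of a  -> alpha_bar = 2l + 2
for l in range(1, 5):
    assert alpha_bar(3, 1, l, True) == 6 * l + 3
    assert alpha_bar(2, 1, l, False) == 2 * l + 2
print("chain identities and degree bounds check out")
```

The asymmetry between the two activations visible here (arctan pays an extra chain element but a comparable degree) is exactly what produces the different exponents 2h and h in the two Betti-number bounds above.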
E. Proof of Corollary 1

Since a neural network with l hidden layers and a polynomial activation of degree r implements a polynomial map of degree r^l, the proof straightforwardly follows from (4).

F. Proof of Theorem 5

Intuitively, the proof of this theorem is based on the observation that the map implemented by a layered network is the composition of a set of functions, each of which is realized by a single layer. Hence, the proof is carried out in two steps: 1) a function g is described such that its repeated composition produces a map f = g ∘ g ∘ ... ∘ g, for which B({x ∈ IR^n | f(x) ≥ 0}) grows exponentially in the number of repetitions of g and 2) then, it is proved that each g can be realized by a single layer in a multilayer network.

Lemma 3: Let g : IR → IR be a continuous function for which g(−1) = 1, g(0) = −1, g(1) = 1 hold. Let g_k be the function obtained by the composition of k copies of g, i.e., g_1 = g and, for k > 1, g_k = g ∘ g_{k−1}. Then, the set S = {x ∈ IR | g_k(x) ≥ 0} contains at least 2^{k−1} + 1 disconnected intervals.

Proof: To prove the lemma, the following assertion is demonstrated by induction on k: for each k, there exist 2^k + 1 points, y_1, ..., y_{2^k+1}, in [−1, 1] such that:
1) y_1, ..., y_{2^k+1} are arranged in an ascending order, i.e., y_i < y_{i+1}, for 1 ≤ i ≤ 2^k;
2) g_k is alternatively −1 or 1 on such points, i.e., g_k(y_i) = 1, if i is odd, and g_k(y_i) = −1, if i is even.
Actually, the above statement directly implies the thesis of the lemma since, for each odd i_1, there exists an interval which contains y_{i_1} and where g_k is positive and, vice versa, for each even i_2 there exists an interval which contains y_{i_2} and where g_k is negative. Thus, the intervals containing the odd indexes, which are disconnected because they are separated by those with an even index, are the intervals required by the lemma.

Step 1: It is proved that there exists a network N̄, with one hidden layer, which implements the function g defined in Lemma 3.
Step 2: Based on N̄, a network N_l is constructed, which implements the composition g_l = g ∘ g ∘ ... ∘ g.

To prove (1), notice that a network with one hidden layer, containing two hidden units, computes the function

    g(x) = w_1 σ(v_1 x + b_1) + w_2 σ(v_2 x + b_2) + c

where, for simplicity, v_1, v_2 are the input-to-hidden weights, w_1, w_2 are the hidden-to-output weights, and b_1, b_2, and c are the biases of the hidden units and the output unit, respectively. Let us set the weights so that the hypothesis of Lemma 3 holds. Actually, the hypothesis consists of three equalities g(−1) = 1, g(0) = −1, and g(1) = 1, which can be rewritten as

    A [w_1, w_2, c]' = B   (11)

where

    A = [ σ(−v_1 + b_1)  σ(−v_2 + b_2)  1
          σ(b_1)         σ(b_2)         1
          σ(v_1 + b_1)   σ(v_2 + b_2)   1 ],   B = [1, −1, 1]'.

Equation (11) is a linear system in the unknowns [w_1, w_2, c], which can be uniquely solved if A is a full rank matrix. On the other hand, it can be easily seen that there exist values of v_1, v_2, b_1, b_2 for which the determinant of A, det(A), is nonnull. In fact, let us set v_1 = v_2 = α, b_1 = α, and b_2 = −α, for some real value α. In this case

    det(A) = σ(0) · σ(−α) + σ(−2α) · σ(2α) + σ(α) · σ(0) − σ(2α) · σ(−α) − (σ(0))^2 − σ(α) · σ(−2α).

In addition, let r_σ = lim_{α→∞} σ(α) and l_σ = lim_{α→−∞} σ(α) be the right and the left limit of σ, respectively. Then

    lim_{α→∞} det(A) = l_σ σ(0) + r_σ σ(0) − (σ(0))^2 − l_σ r_σ = −(l_σ − σ(0))(r_σ − σ(0))

which is nonnull whenever σ(0) differs from both l_σ and r_σ.
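Lemma 3 can be illustrated with the concrete choice g(x) = 2x^2 − 1, for which g(−1) = 1, g(0) = −1, g(1) = 1. The sketch below is ours; it counts the maximal intervals of [−1, 1] where g_k is nonnegative by scanning a fine grid, recovering the 2^{k−1} + 1 lower bound of the lemma.

```python
def g(x):
    return 2.0 * x * x - 1.0   # g(-1) = 1, g(0) = -1, g(1) = 1

def g_k(k, x):
    # Composition of k copies of g.
    for _ in range(k):
        x = g(x)
    return x

def nonneg_intervals(k, n=20001):
    # Count maximal runs of the grid of [-1, 1] where g_k >= 0.
    count = 0
    prev = False
    for i in range(n):
        x = -1.0 + 2.0 * i / (n - 1)
        cur = g_k(k, x) >= 0.0
        if cur and not prev:
            count += 1
        prev = cur
    return count

for k in range(1, 6):
    print(k, nonneg_intervals(k))   # 2, 3, 5, 9, 17 = 2^(k-1) + 1
```

This g is the Chebyshev map T_2, so g_k oscillates like cos(2^k θ) on [−1, 1], which is why one extra composition, realizable by one extra layer, doubles the number of nonnegative intervals.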
When layer l − 1 adopts a linear activation function, using the notation of (1), we derive

    o^l_k = σ(w^l_{k,1} o^{l−1}_1 + w^l_{k,2} o^{l−1}_2 + b^l_k)
    = σ(w^l_{k,1}(w^{l−1}_{1,1} o^{l−2}_1 + w^{l−1}_{1,2} o^{l−2}_2 + b^{l−1}_1) + w^l_{k,2}(w^{l−1}_{2,1} o^{l−2}_1 + w^{l−1}_{2,2} o^{l−2}_2 + b^{l−1}_2) + b^l_k)
    = σ((w^l_{k,1} w^{l−1}_{1,1} + w^l_{k,2} w^{l−1}_{2,1}) o^{l−2}_1 + (w^l_{k,1} w^{l−1}_{1,2} + w^l_{k,2} w^{l−1}_{2,2}) o^{l−2}_2 + (w^l_{k,1} b^{l−1}_1 + w^l_{k,2} b^{l−1}_2 + b^l_k)).

Thus, the output o^l_k does not change if the lth layer is connected directly to the (l − 2)th layer and the new weights of the neurons are w̄^l_{k,1} = w^l_{k,1} w^{l−1}_{1,1} + w^l_{k,2} w^{l−1}_{2,1}, w̄^l_{k,2} = w^l_{k,1} w^{l−1}_{1,2} + w^l_{k,2} w^{l−1}_{2,2}, and b̄^l_k = w^l_{k,1} b^{l−1}_1 + w^l_{k,2} b^{l−1}_2 + b^l_k.

For small v_1, v_2, the entries of A can be expanded as

    σ(−v_1 + b_1) = σ(b_1) − v_1 σ′(b_1) + (1/2) v_1^2 σ″(b_1) + O(v_1^3)
    σ(−v_2 + b_2) = σ(b_2) − v_2 σ′(b_2) + (1/2) v_2^2 σ″(b_2) + O(v_2^3)
    σ(v_1 + b_1) = σ(b_1) + v_1 σ′(b_1) + (1/2) v_1^2 σ″(b_1) + O(v_1^3)
    σ(v_2 + b_2) = σ(b_2) + v_2 σ′(b_2) + (1/2) v_2^2 σ″(b_2) + O(v_2^3)

so that det(A) = v_1^2 v_2 σ″(b_1)σ′(b_2) − v_1 v_2^2 σ′(b_1)σ″(b_2) + higher order terms, which is positive and nonnull if v_1, v_2 are sufficiently close to 0 and v_1, v_2 have the same sign of −σ′(b_1)σ″(b_2) and σ″(b_1)σ′(b_2), respectively.18

H. Proof of Theorem 7

To prove the theorem, the following lemma is required: it provides a lower bound for the complexity of three-layer networks, varying the number of their hidden neurons.

Lemma 4: Let n be a positive integer and N̄ be a three-layer neural network with one input, one output, and h̄ hidden units, having activation function σ. Suppose that f_N̄(x) ≤ 1, for any x ∈ IR, and that there exist 2m − 1 consecutive disjoint intervals U_1, ..., U_{2m−1} ⊂ IR such that, for any i, 1 ≤ i ≤ 2m − 1:
1) if i is even, f_N̄(x) < −n holds for any x ∈ U_i;
2) if i is odd, f_N̄(x) > 0 holds for any x ∈ U_i.
Then, there exists a network N, with n inputs, one output, and h = nh̄ hidden units, having activation function σ, such that b_0(S_N) ≥ m^n.

Proof: The network N is obtained by replicating N̄ on each of the n input variables, so that f_N(x) = Σ_{k=1}^{n} f_N̄(x_k), and, for any sequence of indexes i_1, ..., i_n, let H_{i_1,...,i_n} = U_{i_1} × ··· × U_{i_n}. If the sequence contains only odd numbers and x ∈ H_{i_1,...,i_n} then

    f_N(x) = Σ_{k=1}^{n} f_N̄(x_k) > 0

holds, because, by hypothesis, f_N̄(x_k) > 0 for any x_k ∈ U_i with odd i. If, on the other hand, the sequence contains at least an even number, then f_N(x) < 0 holds, because, by hypothesis, f_N̄(x_k) ≤ 1 for any x_k ∈ IR and f_N̄(x_k) < −n for any x_k ∈ U_i with even index. Thus, f_N is positive in the m^n regions H_{i_1,...,i_n} that can be defined choosing the indexes i_1, ..., i_n among the odd integers in [1, 2m − 1]. On the other hand, f_N is negative in the regions H_{i_1,...,i_n}, where at least an i_j is even. Since each region H_{i_1,...,i_n} of the former kind is surrounded

18 By hypothesis, σ″(b_1)σ′(b_2) may be zero: in such a case, v_2 can be either positive or negative.
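The layer-collapsing step above is easy to verify mechanically. The following sketch is ours, with arbitrary weight values; it checks, for a pair of two-unit layers in which layer l − 1 is linear, that routing the lth layer directly to the (l − 2)th layer with the composed weights w̄ and b̄ leaves the outputs o^l_k unchanged.

```python
import math

# Arbitrary weights of layer l-1 (linear) and layer l (sigmoidal), two units each.
Wl1 = [[0.4, -1.1], [0.9, 0.5]]    # w^{l-1}_{i,j}
bl1 = [0.3, -0.2]                  # b^{l-1}_i
Wl  = [[1.2, -0.7], [-0.3, 0.8]]   # w^l_{k,i}
bl  = [0.1, -0.6]                  # b^l_k

def out_two_layers(o):
    # o^{l-1} = W^{l-1} o^{l-2} + b^{l-1} (linear), then o^l = sigma(W^l o^{l-1} + b^l).
    mid = [sum(Wl1[i][j] * o[j] for j in range(2)) + bl1[i] for i in range(2)]
    return [math.tanh(sum(Wl[k][i] * mid[i] for i in range(2)) + bl[k]) for k in range(2)]

def out_collapsed(o):
    # Composed weights: wbar^l = W^l W^{l-1}, bbar^l = W^l b^{l-1} + b^l.
    Wbar = [[sum(Wl[k][i] * Wl1[i][j] for i in range(2)) for j in range(2)] for k in range(2)]
    bbar = [sum(Wl[k][i] * bl1[i] for i in range(2)) + bl[k] for k in range(2)]
    return [math.tanh(sum(Wbar[k][j] * o[j] for j in range(2)) + bbar[k]) for k in range(2)]

o = [0.25, -1.4]
print(out_two_layers(o), out_collapsed(o))
```

The agreement of the two outputs (up to floating-point rounding) is what allows copies of the one-hidden-layer block N̄ to be stacked without paying for the intermediate linear layers.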
by regions of the latter kind, it follows that S_N contains at least m^n disconnected sets, i.e., b_0(S_N) ≥ m^n holds.

Proof of Theorem 7: The proof consists of defining a three-layer network, with sigmoidal activation functions, which satisfies the hypothesis of Lemma 4. More precisely, the chosen network has 2m − 1 hidden units and approximates a staircase19 function on a set of 2m open intervals U_i ⊆ [i, i + 1], 1 ≤ i ≤ 2m. The staircase function is discontinuous on the points i, 1 ≤ i ≤ 2m; it is constant and equal to 1/2 on the intervals U_i for which i is odd, and to −n − 1 on the intervals for which i is even. Notice that, provided that the above network can be defined, the thesis directly follows from Lemma 4, observing that m = (h − 1)/(2n) holds. Actually, such a network can be constructed using the same reasoning adopted for proving that neural networks with sigmoidal activation functions are universal approximators for continuous functions on IR [36]. In fact, we can first observe that any staircase function with 2m − 1 steps can be approximated by a three-layer neural network with 2m − 1 hidden nodes, exploiting the Heaviside (step) activation function, and then that the Heaviside activation function can be approximated by a sigmoid function.20

It is worth mentioning that a tighter lower bound can be obtained by using the fact that the properties of function f_N̄ in Lemma 4 can be satisfied also by a polynomial of degree 2m − 1. In addition, a polynomial of degree 2m − 1 can be approximated on compact intervals by a three-layer network with m hidden units and any nonpolynomial and analytic21 activation function [36]. Therefore, by adding the hypothesis that the activation function is analytic, the lower bound b_0(S_N) ≥ (h/n)^n is obtained.

I. Further Comments on the Limits of the Presented Results

A discussion on Zell's bound (3), which has been used in Theorems 1, 3, and 4, may be of help to understand some peculiarities and limits of the presented results.

First of all, Zell's bound (3) is exponential in the number of inputs n and in the length of the considered Pfaffian function. In Lemma 2, it is proved that the length of the function implemented by a network with a Pfaffian activation depends linearly on the number of its hidden units. Thus, in deep networks, the derived upper bounds are exponential in the number of hidden units and inputs, and, obviously, also with respect to the network depth. On the other hand, the function implemented by three-layer networks with arctan(·) activation has the peculiar property of having a length that does not depend on the number of hidden units (see Lemma 1). Such a peculiarity allowed us to derive a polynomial upper bound for such a class of networks. The fact that the peculiarity is not shared with networks having tanh(·) activation explains why a polynomial bound was not derived in this case.

A careful inspection of the presented results suggests that they are prone to be improved, both because most of the upper bounds are not close to the corresponding lower bounds, and because the differences between the upper bounds for networks with arctan(·) and tanh(·) activations are counterintuitive. However, (3) is probably not tight for general Pfaffian functions, and a completely different technique may be required to improve our upper bounds. Actually, Zell's theorem is based on Khovanskiĭ bounds [49] on the number of the solutions of systems of Pfaffian equations, and it is well known that those bounds can be significantly larger than the actual ones.

An example can help in explaining the importance of Khovanskiĭ's theory and its relationship to the problem we faced in this paper. Let us consider a system of n equations of polynomials in n variables, x = [x_1, ..., x_n]

    p_1(x) = 0, ..., p_n(x) = 0   (12)

where the polynomial p_i(x) contains m_i monomials, i.e., p_i(x) = Σ_{j=1}^{m_i} t_{i,j} and t_{i,j} = a_{i,j} Π_{k=1}^{n} x_k^{v_{i,j,k}}, with parameters a_{i,j} and exponents v_{i,j,k} both belonging to IR. An important problem in mathematics consists in defining a bound on the number of nondegenerated solutions of such a system with respect to the total number of monomials, m = Σ_i m_i. When n = 1, a tight bound is provided by Descartes' rule of signs, which states that the number of roots is at most equal to the number of sign changes between consecutive monomials, when the monomials are sorted by their exponents. In the more general case n > 1, no tight bound is known. Actually, Khovanskiĭ's theory [49] provides a bound, 2^{(m^2−3m+2)/2}(n + 1)^{m−1}, which is exponential with respect to m, but some authors claim that the actual bound might be polynomial. Unfortunately, despite the long time efforts, such a claim has not been proved, nor a counterexample has been found, yet (see [54]).

On the other hand, let us consider the function f(x) = −Σ_i (p_i(x))^2; the problem of counting the nondegenerated solutions of the system (12) can be transformed into the problem of computing the number of the connected components of the set S = {x | f(x) = 0}, i.e., the first Betti number b_0(S) of S. In addition, notice that, by the variable substitution y_i = e^{x_i}, each term t_{i,j} = a_{i,j} e^{Σ_k v_{i,j,k} x_k} is equal to the output of a hidden neuron in a three-layer network, where the activation function is the exponential e^x. By simple algebra, it can be easily shown that, more generally, f(x) = −Σ_i (p_i(x))^2 is the output of a three-layer network with exponential activation functions, and where the number of hidden neurons is h = Σ_i (2m_i + m_i^2). Thus, the problem of bounding the solutions of (12) can be reduced to the problem of bounding the number of connected components of the set realized by a three-layer network with a Pfaffian activation function. As a consequence, if we could find a tighter bound for B(S_N) for networks with generic Pfaffian activation, then we could also find a solution to another theoretical problem that has been open for a long time.

Obviously, the above discussion provides only an informal viewpoint to understand the difficulty of the

19 A staircase function is a piecewise constant function with discontinuity points.
20 More precisely, in this way we obtain a network N that approximates, up to any degree of precision, the target staircase function, t : IR → IR, on the whole IR except on the discontinuity points. The construction is sufficient for our goal, since the discontinuity points are outside the intervals U_i.
21 A function is analytic if its Taylor series converges. A neural network is a universal approximator, if the activation function of the hidden neurons is not polynomial and analytic in a neighborhood of some point. The functions tanh(·) and arctan(·) satisfy such a property.
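Descartes' rule of signs mentioned above is straightforward to implement. The sketch below is ours; it bounds the number of positive roots by the sign changes of the coefficient sequence and compares the bound with a brute-force root count for two sample polynomials.

```python
def descartes_bound(monomials):
    # monomials: list of (coefficient, exponent) pairs, coefficients nonzero.
    # The bound on positive roots is the number of sign changes between
    # consecutive monomials, once they are sorted by exponent.
    coeffs = [c for c, _ in sorted(monomials, key=lambda m: m[1])]
    return sum(1 for a, b in zip(coeffs, coeffs[1:]) if a * b < 0)

def positive_roots(monomials, lo=1e-6, hi=10.0, n=100000):
    # Brute force: count sign changes of p on a fine grid of (0, hi].
    def p(x):
        return sum(c * x ** e for c, e in monomials)
    count, prev = 0, p(lo)
    for i in range(1, n + 1):
        cur = p(lo + (hi - lo) * i / n)
        if prev * cur < 0:
            count += 1
        prev = cur
    return count

p1 = [(-1.0, 1), (1.0, 3)]            # x^3 - x: one positive root (x = 1)
p2 = [(2.0, 0), (-3.0, 1), (1.0, 2)]  # x^2 - 3x + 2 = (x-1)(x-2): two positive roots
for p in (p1, p2):
    print(descartes_bound(p), positive_roots(p))
```

In both examples the bound is attained, which is the sense in which the n = 1 case is tight; no analogue of this tightness is known for n > 1, and that is exactly the gap the reduction above connects to neural network classifiers.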
considered problem. In fact, the functions implemented by neural networks belong to a subclass of the Pfaffian functions, so that tighter bounds could be more easily obtainable in this case.

REFERENCES

[1] Y. Bengio, Y. LeCun, R. Salakhutdinov, and H. Larochelle, Proc. Deep Learn. Workshop, Found. Future Directions NIPS, 2007.
[2] H. Lee, M. Ranzato, Y. Bengio, G. Hinton, Y. LeCun, and A. Ng, Proc. Deep Learn. Unsupervised Feature Learn. Workshop NIPS, 2010.
[3] K. Yu, R. Salakhutdinov, Y. LeCun, G. E. Hinton, and Y. Bengio, Proc. Workshop Learn. Feature Hierarchies ICML, 2009.
[4] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Proc. Adv. NIPS, 2007, pp. 153-160.
[5] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527-1554, 2006.
[6] Y. Bengio, "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1-127, 2009.
[7] T. S. Lee, D. Mumford, R. Romero, and V. A. F. Lamme, "The role of the primary visual cortex in higher level vision," Vis. Res., vol. 38, nos. 15-16, pp. 2429-2454, 1998.
[8] T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, and T. Poggio, "A quantitative theory of immediate visual recognition," Progr. Brain Res., vol. 165, pp. 33-56, Feb. 2007.
[9] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, M. Arbib, Ed. Cambridge, MA, USA: MIT Press, 1995, pp. 255-258.
[10] S. E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in Advances in Neural Information Processing Systems 2. San Mateo, CA, USA: Morgan Kaufmann, 1990, pp. 524-532.
[11] N. Bandinelli, M. Bianchini, and F. Scarselli, "Learning long-term dependencies using layered graph neural networks," in Proc. IJCNN, 2010, pp. 1-8.
[12] I. Castelli and E. Trentin, "Supervised and unsupervised co-training of adaptive activation functions in neural nets," in Proc. 1st IAPR Workshop PSL, 2011, pp. 52-61.
[13] J. L. Elman, "Finding structure in time," Cognit. Sci., vol. 14, no. 2, pp. 179-211, 1990.
[14] P. Frasconi, M. Gori, and A. Sperduti, "A general framework for adaptive processing of data structures," IEEE Trans. Neural Netw., vol. 9, no. 5, pp. 768-786, Sep. 1998.
[15] F. Scarselli, M. Gori, A.-C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61-80, Jan. 2009.
[16] C. L. Giles, S. Lawrence, and S. Fong, "Natural language grammatical inference with recurrent neural networks," IEEE Trans. Knowl. Data Eng., vol. 12, no. 1, pp. 126-140, Jan./Feb. 2000.
[17] H.-T. Su, "Identification of chemical processes using recurrent networks," in Proc. Amer. Control Conf., 1991, pp. 2311-2319.
[18] M. Bianchini, M. Maggini, L. Sarti, and F. Scarselli, "Recursive neural networks for processing graphs with labelled edges: Theory and applications," Neural Netw., vol. 18, no. 8, pp. 1040-1050, 2005.
[19] M. Bianchini, M. Maggini, L. Sarti, and F. Scarselli, "Recursive neural networks learn to localize faces," Pattern Recognit. Lett., vol. 26, no. 12, pp. 1885-1895, 2005.
[20] A. Pucci, M. Gori, M. Hagenbuchner, F. Scarselli, and A.-C. Tsoi, "Investigation into the application of graph neural networks to large-scale recommender systems," Syst. Sci., vol. 32, no. 4, pp. 17-26, 2006.
[21] A.-C. Tsoi, M. Hagenbuchner, and F. Scarselli, "Computing customized page ranks," ACM Trans. Internet Technol., vol. 6, no. 4, pp. 381-414, 2006.
[22] Y. Bengio and O. Delalleau, "Shallow vs. deep sum-product networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 24, 2011, pp. 666-674.
[23] O. Delalleau and Y. Bengio, "On the expressive power of deep architectures," in Proc. 22nd Int. Conf. Algorithmic Learn. Theory, 2011, pp. 18-36.
[24] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," J. ACM, vol. 36, no. 4, pp. 929-965, 1989.
[25] G. E. Bredon, Topology and Geometry (Graduate Texts in Mathematics). New York, NY, USA: Springer-Verlag, 1993.
[26] R. Rojas, Neural Networks: A Systematic Introduction. New York, NY, USA: Springer-Verlag, 1996.
[27] A. Hatcher, Algebraic Topology. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[28] M. Nakahara, Geometry, Topology and Physics (Graduate Student Series in Physics), 2nd ed. New York, NY, USA: Taylor & Francis, 2003.
[29] T. Kaczynski, K. Mischaikow, and M. Mrozek, Computational Homology, vol. 157. New York, NY, USA: Springer-Verlag, 2004.
[30] R. Ghrist and A. Muhammad, "Coverage and hole-detection in sensor networks via homology," in Proc. 4th Int. Symp. Inf. Process. Sensor Netw., 2005, pp. 254-260.
[31] S. Zimmerman, "Slicing space," College Math. J., vol. 32, no. 2, pp. 126-128, 2001.
[32] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Control, Signals Syst., vol. 2, no. 4, pp. 303-314, 1989.
[33] K. I. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Netw., vol. 2, no. 3, pp. 183-192, 1989.
[34] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Netw., vol. 4, no. 2, pp. 251-257, 1991.
[35] J. Park and I. W. Sandberg, "Universal approximation using radial-basis-function networks," Neural Comput., vol. 3, no. 2, pp. 246-257, 1991.
[36] F. Scarselli and A.-C. Tsoi, "Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results," Neural Netw., vol. 11, no. 1, pp. 15-37, 1998.
[37] K. I. Funahashi and Y. Nakamura, "Approximation of dynamical systems by continuous time recurrent neural networks," Neural Netw., vol. 6, no. 6, pp. 801-806, 1993.
[38] B. Hammer, "Approximation capabilities of folding networks," in Proc. 7th ESANN, 1999, pp. 33-38.
[39] F. Scarselli, M. Gori, A.-C. Tsoi, M. Hagenbuchner, and G. Monfardini, "Computational capabilities of graph neural networks," IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 81-102, Jan. 2009.
[40] J. Håstad, "Almost optimal lower bounds for small depth circuits," in Proc. 18th ACM Symp. Theory Comput., 1986, pp. 6-20.
[41] A. C. C. Yao, "Separating the polynomial-time hierarchy by oracles," in Proc. 26th Annu. Symp. Found. Comput. Sci., 1985, pp. 1-10.
[42] I. Wegener, The Complexity of Boolean Functions. New York, NY, USA: Wiley, 1987.
[43] H.-K. Fung and L. K. Li, "Minimal feedforward parity networks using threshold gates," Neural Comput., vol. 13, no. 2, pp. 319-326, 2001.
[44] J. Håstad and M. Goldmann, "On the power of small-depth threshold circuits," Comput. Complex., vol. 1, no. 2, pp. 113-129, 1991.
[45] M. Karpinski, "Polynomial bounds of VC dimension of sigmoidal and general Pfaffian neural networks," J. Comput. Syst. Sci., vol. 54, no. 1, pp. 169-176, 1997.
[46] P. L. Bartlett and W. Maass, "Vapnik-Chervonenkis dimension of neural nets," in The Handbook of Brain Theory and Neural Networks, 2nd ed., M. Arbib, Ed. Cambridge, MA, USA: MIT Press, 2003, pp. 1188-1192.
[47] E. D. Sontag, "VC dimension of neural networks," in Neural Networks and Machine Learning, C. M. Bishop, Ed. New York, NY, USA: Springer-Verlag, 1998, pp. 69-95.
[48] E. D. Sontag, "Feedforward nets for interpolation and classification," J. Comput. Syst. Sci., vol. 45, no. 1, pp. 20-48, 1992.
[49] A. G. Khovanskiĭ, Fewnomials, vol. 88. Providence, RI, USA: AMS, 1991.
[50] B. A. Pearlmutter, "Learning state space trajectories in recurrent neural networks," Neural Comput., vol. 1, no. 2, pp. 263-269, 1989.
[51] T. Zell, "Betti numbers of semi-Pfaffian sets," J. Pure Appl. Algebra, vol. 139, no. 1, pp. 323-338, 1999.
[52] T. Zell, "Quantitative study of semi-Pfaffian sets," Ph.D. dissertation, School of Math., Georgia Inst. Technol., Atlanta, GA, USA, 2004.
[53] J. Milnor, "On the Betti numbers of real varieties," Proc. Amer. Math. Soc., vol. 15, no. 2, pp. 275-280, 1964.
[54] F. Sottile, Real Solutions to Equations from Geometry, vol. 57. Providence, RI, USA: AMS, 2011.

Authors' photographs and biographies not available at the time of publication.