
UNIVERSAL APPROXIMATION OF NEURAL NETWORKS

(Hung T. Nguyen, October 6, 2010)
These notes present the mathematical justification of the usefulness of
neural networks.
1. Generalities on neural networks
Neural Networks (NN), or Artificial Neural Networks to be more precise,
are (simple) mathematical models of the human brain. They are motivated by
the recognition that the human brain computes in an entirely different way
from the conventional digital (von Neumann) computer. The brain is a highly
complex, nonlinear and parallel computer (for information processing). It has
the capability to organize its structural constituents (known as neurons) so
as to perform necessary computations (e.g. pattern recognition, perception,
and motor control) many times faster than the fastest digital computer in
existence today. Thus, to imitate this fantastic capability, we create ANNs,
of course, in the simplest form of the brain! A NN is a machine that is
designed to model the way in which the brain performs a particular task or
function of interest. NNs are usually implemented using electronic
components or are simulated in software on a digital computer. So we need
to create NNs. How? Well, we might start out by imitating the simplest
structure of the brain.
The purpose of this lecture is to give you a mathematical sketch of the
reason why NNs could produce the expected results. For that, we will only need
to consider a simple model, namely the feed-forward neural network with one
hidden layer (also called a multilayer perceptron with one hidden layer).
The four basic components of a NN model are:
(i) Synapses or connecting links: each is characterized by a weight (or
strength) of its own. Specifically, a signal $x_j$ at the input of synapse $j$ connected
to a neuron is multiplied by the synaptic weight $w_j$.
(ii) An adder: for summing the input signals, weighted by the respective
synapses of the neuron.
(iii) An activation function: for limiting the amplitude of the output of
a neuron.
(iv) Bias: the model of a neuron also includes an external bias (denoted
by $b$) which has the effect of increasing or lowering the net input of the
activation function.
Thus, consider a NN with $n$ inputs ($x_j$, $j = 1, 2, \ldots, n$) and one output
in the one-hidden-layer model; the output of a neuron can then be written as
$$ y = \varphi\Big(\sum_{j=1}^{n} w_j x_j + b\Big). $$
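As a concrete illustration, here is a minimal Python sketch of this single-neuron computation; the particular input values, weights, bias, and the choice of the sigmoid as $\varphi$ are assumptions made only for the example, not values from these notes.

```python
# A minimal sketch of the single-neuron computation
#   y = phi(sum_j w_j * x_j + b),
# using the sigmoid as the activation phi (an illustrative choice).
import math

def neuron_output(x, w, b):
    """Weighted sum of the inputs plus bias, passed through a sigmoid activation."""
    v = sum(wj * xj for wj, xj in zip(w, x)) + b   # adder + bias
    return 1.0 / (1.0 + math.exp(-v))              # activation phi

# Example with 3 inputs; the weights and bias are hypothetical values.
print(neuron_output(x=[0.5, -1.0, 2.0], w=[0.3, 0.8, -0.1], b=0.2))
```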
Note that the activation function $\varphi$ could be:
a) The threshold function:
$$ \varphi(x) = 1_{[0,\infty)}(x) $$
b) A piecewise linear function:
$$ \varphi(x) = \begin{cases} 0 & \text{for } x < -1/2 \\ x + 1/2 & \text{for } -1/2 \le x \le 1/2 \\ 1 & \text{for } x > 1/2 \end{cases} $$
c) The sigmoid function (or logistic function):
$$ \varphi(x) = \frac{1}{1 + e^{-x}} $$
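The following short Python sketch implements the three activation functions exactly as listed above; the function names and the sample evaluation points are my own choices.

```python
# A small sketch of the three activation functions listed above.
import math

def threshold(x):
    """(a) Threshold: indicator of [0, infinity)."""
    return 1.0 if x >= 0 else 0.0

def piecewise_linear(x):
    """(b) Piecewise linear: 0 below -1/2, x + 1/2 in between, 1 above 1/2."""
    if x < -0.5:
        return 0.0
    if x <= 0.5:
        return x + 0.5
    return 1.0

def sigmoid(x):
    """(c) Sigmoid (logistic): 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

# Evaluate all three at a few points.
for x in (-1.0, -0.25, 0.0, 0.25, 1.0):
    print(x, threshold(x), piecewise_linear(x), sigmoid(x))
```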
Now, why could such a structure produce good results? The answer
is in a theorem in the theory of approximation of functions. But before
that, recall how we use NNs. With a given number of inputs, layers, and
a specified activation function, the weights are parameters. They will be
chosen according to which task we want the NN to perform. Thus, for each
type of computation, we tune the parameters from its exemplars (data from
the task to perform). There exist several algorithms to do that, e.g. the
well-known backpropagation.
Saying that, given exemplars, we can train a NN to perform a desired
task is nothing else than saying that we can find a NN structure (i.e. a set of
appropriate weights) to approximate an unknown function $f$ to an arbitrarily
good degree of accuracy.
The point is this. We seek to approximate an unknown function $f$, from
a sample of data points, by using NNs. But this is a well-known type of
curve-fitting problem in Statistics (e.g. regression)! Besides
fast computations, NNs provide complex nonlinear approximations without
postulating the form of the function to be approximated, i.e. NNs are a model-
free regression mechanism.
Without knowing the form of $f$, we know something else about it. In
practical situations, the domain of $f$ is a closed and bounded set of the real
line $\mathbb{R}$ (or, more generally, a compact subset of the Euclidean space $\mathbb{R}^k$), and
$f$ is a continuous function. Thus, the question is: "Can NNs approximate
continuous functions defined on a compact domain?"
As is often the case, we are standing on the shoulders of giants! The above question
can be answered by a theorem in mathematics that we will discuss next.
2. The Stone-Weierstrass Theorem
The fact that every continuous function defined on an interval $[a, b]$ can be
uniformly approximated as closely as desired by a polynomial function is the
classical theorem of Karl Weierstrass (1885) in mathematical analysis. Note
that polynomials are among the simplest functions and computers can evaluate
them directly.
Just for fun (!), let me give a quick proof of this theorem, not the original
one, but one biased by my probability background! Here is a probabilistic proof
of the Weierstrass theorem.
Let $f$ be a continuous function defined on $[0, 1]$ (without loss of generality).
For each $x \in [0, 1]$, build a biased coin with probability $x$ of landing
heads in a single toss. And consider tossing that coin indefinitely (!). Let
$X_n$ denote the outcome of the $n$th toss, where 1 stands for heads, and 0 for
tails. Let $S_n = \sum_{j=1}^{n} X_j$. Then $S_n$ is binomial, so that
$$ P(S_n = k) = \binom{n}{k} x^k (1 - x)^{n-k} \quad \text{for } 0 \le k \le n. $$
We have
$$ E\big(f(S_n/n)\big) = \sum_{k=0}^{n} f(k/n) \binom{n}{k} x^k (1 - x)^{n-k}, $$
which is a polynomial in $x$, denoted as $p_n(x)$.
Now, by the Strong Law of Large Numbers, $S_n/n$ converges with probability
one, as $n \to \infty$, to the mean $x$. On the other hand, since $f$ is continuous on
$[0, 1]$, it is uniformly continuous and bounded, so we have that
$$ E\big(f(S_n/n)\big) \to E\big(f(x)\big) = f(x). $$
Thus $p_n(x) \to f(x)$ for each $x$. In fact, it can be checked that this
convergence is uniform in $x$, so that for any degree of accuracy $\varepsilon > 0$ and for $n$
sufficiently large, say $n \ge N(\varepsilon)$, we have
$$ |p_n(x) - f(x)| \le \varepsilon \quad \text{for any } x \in [0, 1]. $$
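A small numerical sketch of this construction: it evaluates the Bernstein polynomial $p_n$ for a sample target $f(x) = |x - 1/2|$ (my own choice, made only for illustration) and shows the sup-norm error shrinking as $n$ grows.

```python
# Sketch of the probabilistic construction above: the Bernstein polynomial
#   p_n(x) = sum_{k=0}^n f(k/n) C(n,k) x^k (1-x)^{n-k} = E[f(S_n/n)],
# evaluated for the illustrative target f(x) = |x - 1/2|.
from math import comb

def bernstein(f, n, x):
    """Evaluate the degree-n Bernstein polynomial of f at x in [0, 1]."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)                    # illustrative continuous target
grid = [i / 200 for i in range(201)]          # evaluation grid on [0, 1]
for n in (10, 50, 200):
    err = max(abs(bernstein(f, n, x) - f(x)) for x in grid)
    print(f"n = {n:4d}   sup-norm error on grid ~ {err:.4f}")
```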
In 1937, Marshall H. Stone generalized this theorem to a much more
general setting, now known as the Stone-Weierstrass Theorem. Specifically,
the framework of Weierstrass is extended as follows. A closed and bounded
interval $[a, b]$ is replaced by an arbitrary compact topological (Hausdorff)
space $X$, and polynomials are replaced by elements of subalgebras of the set
$C(X)$ of continuous functions on $X$.
The idea is this. Our unknown function $f \in C(X)$. We seek a subset
$\mathcal{A} \subseteq C(X)$ such that there is an element $p \in \mathcal{A}$ which is close to $f$, i.e.
approximates $f$. Mathematically speaking, we seek a dense subset of $C(X)$.
$\mathcal{A}$ is said to be dense in $C(X)$ if the (topological) closure of $\mathcal{A}$ is equal to $C(X)$.
More specifically, $\mathcal{A}$ is dense in $C(X)$ if for any $f \in C(X)$ and $\varepsilon > 0$, there exists
$p \in \mathcal{A}$ such that $\|p - f\| \le \varepsilon$ (where $\|\cdot\|$ denotes the sup-norm on $C(X)$).
For that to be possible, $\mathcal{A}$ has to have some special properties. The
discovery of these properties forms Stone's theorem.
Here they are:
(i) $\mathcal{A}$ is a subalgebra of $C(X)$, i.e. if $g, h \in \mathcal{A}$ and $\alpha, \beta \in \mathbb{R}$, then $\alpha g + \beta h \in \mathcal{A}$
and $gh \in \mathcal{A}$,
(ii) $\mathcal{A}$ contains a non-zero constant function (or $\mathcal{A}$ vanishes nowhere, i.e.
for every $x \in X$, there is $g \in \mathcal{A}$ such that $g(x) \ne 0$),
(iii) $\mathcal{A}$ separates points, i.e. for every two distinct points $x, y \in X$, there
exists $g \in \mathcal{A}$ such that $g(x) \ne g(y)$.
Stone-Weierstrass Theorem.
Let $X$ be a compact (Hausdorff) space, and let $C(X)$ be the space of all
continuous real-valued functions defined on $X$. If $\mathcal{A} \subseteq C(X)$ is a
subalgebra that contains a non-zero constant function and separates points,
then $\mathcal{A}$ is dense in $C(X)$.
The proof of this theorem can be found in any text on Real Analysis, and
is hence omitted here.
Since the space of polynomials on $[a, b]$, as a subset of $C([a, b])$, clearly
satisfies the conditions of the Stone-Weierstrass theorem, we see that Weierstrass'
theorem is a special case of the Stone-Weierstrass theorem.
The practical meaning of the Stone-Weierstrass theorem is that elements
of $C(X)$ can be approximated by those of $\mathcal{A}$.
3. Application to Neural Networks
In applications, the compact space $X$ is a compact subset of $\mathbb{R}^n$. When
applying the Stone-Weierstrass theorem to NNs, we get the following result,
which is referred to as the universal approximation property of NNs:
every multilayer feed-forward neural network with a single hidden layer that
contains a finite number of hidden neurons, and with a suitable (nonconstant,
bounded, continuous) activation function, is a universal approximator on a
compact subset of $\mathbb{R}^n$. In other words, every continuous function $f$ (defined
on a compact subset of $\mathbb{R}^n$) can be uniformly approximated, to any degree of
accuracy, by such a NN.
Specifically, for any $\varepsilon > 0$, there are a number $N$ of hidden neurons and
weights $c_i$, $w_{ij}$, $b_i$ ($i = 1, \ldots, N$, $j = 1, \ldots, n$) such that
$$ \Big| \sum_{i=1}^{N} c_i \, \varphi\Big(\sum_{j=1}^{n} w_{ij} x_j + b_i\Big) - f(x) \Big| \le \varepsilon \quad \text{for any } x = (x_1, \ldots, x_n) \in X. $$
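To make the statement concrete, here is a Python sketch in one dimension, assuming the target $f(x) = \sin(2\pi x)$ on $[0, 1]$ and a sigmoid activation (both my own choices); the hidden weights and biases are fixed at random and only the output weights are fitted by least squares, which is enough to exhibit the approximation but is not the training method discussed in these notes.

```python
# Sketch of the universal approximation statement in one dimension:
# approximate f(x) = sin(2*pi*x) on [0, 1] by a one-hidden-layer network
#   F(x) = sum_i c_i * sigmoid(w_i * x + b_i).
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 50
w = rng.normal(scale=10.0, size=n_hidden)      # hidden-layer weights (random, fixed)
b = rng.uniform(-10.0, 10.0, size=n_hidden)    # hidden-layer biases (random, fixed)

def hidden(x):
    """Hidden-layer activations sigmoid(w_i * x + b_i) for each sample in x."""
    return 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))

f = lambda x: np.sin(2 * np.pi * x)            # illustrative continuous target
x_train = np.linspace(0.0, 1.0, 400)
c, *_ = np.linalg.lstsq(hidden(x_train), f(x_train), rcond=None)  # output weights

x_test = np.linspace(0.0, 1.0, 1000)
print("max |F - f| on [0, 1]:", np.max(np.abs(hidden(x_test) @ c - f(x_test))))
```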
This universal approximation property of NNs follows from verifying that
the set of all possible NN outputs, i.e. the functions
$\sum_{i=1}^{N} c_i \, \varphi\big(\sum_{j=1}^{n} w_{ij} x_j + b_i\big)$, for a given $\varphi$ but
with all possible weights, satisfies the sufficient conditions
of the Stone-Weierstrass theorem. Usually, the result is credited to L.X.
Wang (1992), although there are earlier indications of such results.
The universal approximation property of NNs says that NNs can produce
good results. From a practical viewpoint, having a good approximation
scheme, several questions arise. The Stone-Weierstrass theorem is an existence
theorem. It does not tell you which element of $\mathcal{A}$ is the approximator of $f$, for
a given $\varepsilon$. The backpropagation algorithm is a way to identify a candidate.
But the theoretical result does give you the confidence that you are on the
right track!
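For completeness, here is a minimal backpropagation sketch (plain gradient descent on the squared error) for the same one-hidden-layer architecture; the network size, learning rate, number of steps, and target function are all illustrative assumptions, not a prescription from these notes.

```python
# Minimal backpropagation sketch for F(x) = sum_i c_i * sigmoid(w_i * x + b_i),
# fitting the illustrative target f(x) = sin(2*pi*x) on [0, 1] by gradient
# descent on the mean squared error.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0.0, 1.0, 200)
y = f(x)

H = 20                                       # number of hidden neurons (assumption)
w = rng.normal(size=H)                       # hidden weights
b = rng.normal(size=H)                       # hidden biases
c = rng.normal(size=H)                       # output weights
lr = 0.05                                    # learning rate (assumption)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    z = np.outer(x, w) + b                   # pre-activations, shape (N, H)
    a = sigmoid(z)                           # hidden activations
    err = a @ c - y                          # output error
    grad_c = a.T @ err / len(x)              # dE/dc
    da = np.outer(err, c) * a * (1 - a)      # backprop through the sigmoid
    grad_w = (da * x[:, None]).sum(axis=0) / len(x)
    grad_b = da.sum(axis=0) / len(x)
    c -= lr * grad_c
    w -= lr * grad_w
    b -= lr * grad_b

final = sigmoid(np.outer(x, w) + b) @ c
print("max |NN - f| after training:", np.max(np.abs(final - y)))
```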