
A BRIEF INTRODUCTION TO THE

MATHEMATICS OF NEURAL
NETWORKS

BY

ASHIMI BLESSING

Ashimiblessing@hotmail.com

(Draft Version, Published January 2013)


DEDICATION

To God Almighty, my compassionate creator, and my wonderful parents.
Acknowledgements

I am indeed grateful to Almighty God for his infinite mercy.
Words cannot contain how indebted I am to the members of my family for their support; accept my hearty thanks.
My close friends, time and space will not let me thank you enough: I hail you all!

Ashimi Blessing A.

Contents

1 Introduction To Neural Networks
  1.1 What Is An Artificial Neural Network?
    1.1.1 Motivation Of Study
    1.1.2 A Brief History Of Artificial Neural Networks
    1.1.3 The Biological Neuron
    1.1.4 The Artificial Neuron
    1.1.5 The Activation Functions
  1.2 Examples Of Neural Networks
    1.2.1 Perceptrons
    1.2.2 Feedforward Networks
  1.3 Mathematical Definitions And Background Information
    1.3.1 Measure Theory
    1.3.2 σ Algebra and Borel Algebra
    1.3.3 More On Measure And Measure Spaces

2 Learning In Neural Networks
  2.1 Neural Network Training
    2.1.1 The Back Propagation Algorithm
  2.2 Probabilistic Model Of Learning
    2.2.1 Uniform Convergence Results
    2.2.2 Application To Successful Learning

3 Function Approximation With Neural Networks
  3.1 Introduction And Useful Theorems
    3.1.1 Brief introduction
    3.1.2 Useful Theorems
  3.2 The Universal Approximation Theorem
    3.2.1 Main Results
    3.2.2 Application Of Results
    3.2.3 Generalization Of Result

4 Practical Applications
  4.1 Anna: The Well Behaved Robot
    4.1.1 Problem Statement
    4.1.2 Design
    4.1.3 Anna's Brain
    4.1.4 Training Anna
    4.1.5 Adding Reality To Anna

5 Conclusion And References
  5.1 Conclusion
Abstract

Neural networks are mathematical representations and models that mimic the processes behind the operations of the human brain. With important applications in mathematical analysis, computer science, physics, engineering and the biological sciences, neural networks can solve very hard problems.
This project aims to explore the concept of neural networks and a few of their applications, and to investigate concise mathematical results that establish, among other properties, their function approximation capability: such networks can approximate functions with an infinite number of discontinuities to any desired accuracy.
Chapter 1

Introduction To Neural
Networks

1.1 What Is An Artificial Neural Network?

An artificial neural network (or, in short, a neural network) is a generalization based on the workings of biological neural systems. In other words, a neural network is a paradigm or a mathematical model that emulates, electronically via a computer, the processes behind the operations of biological systems such as the human brain.
A neural network is man's crude way of mimicking, electronically, how the brain works.
The brain performs astonishing tasks. We can talk, read, write, and recognize hundreds of faces and objects, irrespective of their distance, orientation or direction. We can drink
from a cup, give a lecture or even take a course on neural
networks.
Though it is not the brain that entirely does these things, it plays an essential role in each process.
As powerful as the brain is, it is considered to consist of a large number of not-so-intelligent but highly connected processing elements known as neurons, whose interconnections form a network. It has been estimated that the (human) brain consists of about $10^{11}$ neurons (or more), each having as many as $10^4$ interconnections, with each neuron communicating with other neurons via signals. Since an artificial neural network works like the brain, it is very useful in solving a wide array of problems. From robot control to stock prediction, time series analysis and the creation of computer frameworks that can mimic human thinking, neural networks are very useful, not only in mathematical analysis, but in very real day-to-day applications.
1.1.1 Motivation Of Study

The main feature of neural networks that accounts for their uniqueness is the ability to learn, and thus to generalize. Neural networks are capable of approximating a very wide class of functions to an arbitrary precision or accuracy. Inspired by this universal approximation ability, this report aims to demonstrate function approximation, among other useful applications of neural networks.

1.1.2 A Brief History Of Artificial Neural Networks

The development of neural network models dates back to the early 1940s, when two scientists, McCulloch and Pitts, presented the first ever model of the neuron. In 1950, the neuropsychologist Karl Lashley defended the thesis that information processing in the brain is realized as a distributed system, in harmony with neural network models. His thesis was based on experiments on rats, in which only the extent, but not the location, of destroyed nerve tissue influenced the rats' performance in finding their way out of a labyrinth.

Activity continued in this field, and in 1958 the first successful neuro-computer, the Mark I Perceptron, was developed by Frank Rosenblatt. The field of neural networks continued to look promising until 1969, when Marvin Minsky and Seymour Papert published a precise mathematical analysis of the perceptron, showing its weakness in many areas and its incapability of representing many important mathematical problems. This dealt a huge blow to research and funding, and only a few researchers were left in the field.

The field of neural networks, however, regained significance and popularity in the late 1980s (and up to the present day) when new learning algorithms were discovered. In 1989, George Cybenko and other mathematicians of notable pedigree published mathematical proofs, based on the Hahn-Banach theorem, showing that neural networks can serve as universal function approximators.
1.1.3 The Biological Neuron

It will be most appropriate to discuss briefly the operation of the biological neuron, as any work on artificial neural networks would be incomplete without a quick look at its roots in neuroscience. The neuron has three major components:

• The dendrites (constituting a vastly multi-branching, tree-like structure which collects input from other cells)

• The cell body (the processing part, called the soma)

• The axon (which carries electrical pulses or signals to other cells)

Each neuron has only one axon, which may branch out to reach thousands of other cells; there are many dendrites. The outgoing signal is in the form of a pulse down the axon. On arrival at a synapse (the junction where the axon meets a dendrite), molecules known as neurotransmitters are released, and they attach themselves very selectively to the receptor sites on the receiving neuron.
The membrane of the target neuron is chemically affected, and its inclination to pass on the received signal, or "fire", may be either enhanced or decreased. Thus, an incoming signal can be either excitatory or inhibitory.
The containing wall of the neuron keeps most molecules from passing either in or out of the cell, but there are special channels allowing the passage of ions such as $\mathrm{Na}^+$, $\mathrm{K}^+$, $\mathrm{Cl}^-$ and $\mathrm{Ca}^{2+}$.
By allowing such ions to pass, a potential is generated and maintained between the inside and outside of the cell. When an action potential reaches a synapse, it causes a change in the permeability of the membrane carrying the pulse, which results in an influx of $\mathrm{Ca}^{2+}$ ions. This leads to the release of neurotransmitters into the synaptic cleft, which diffuse towards the receptor sites of the receiving cells.
As the neuron continuously receives signals from its input channels, it sums up these inputs in some way. If the end result is greater than a predefined threshold, the neuron is activated and generates an output signal, which it passes along to the neighbouring neurons.
1.1.4 The Artificial Neuron

An artificial neuron is an extremely simplified model of the biological neuron. Essentially, it is a function approximator which transforms input to output to the best of its ability. In analogy to its biological counterpart, each artificial neuron receives inputs, weighs them, and compares the result with a threshold.
In a neural network, the neurons are arranged in layers. Each neuron is a simple processing unit which associates a weight with every input received. These weights are variable and can be adjusted to get a desired output from the connections.
Let $X = (x_1, x_2, x_3, \ldots, x_n)$, where the $x_i$ are real numbers, be the vector of inputs to the neuron, and let $W = (w_1, w_2, w_3, \ldots, w_n)$ be the associated weight vector corresponding to $X$.

Looking at a neuron $j$, we will usually find several neurons with a connection to $j$, i.e. which transfer their output to $j$. For a neuron $j$, the propagation function receives the outputs of the other neurons connected to $j$ and transforms them, in consideration of the connecting weights $w_{i,j}$, into the network input $u$ that can be further processed by the activation function. Thus, the network input is the result of the propagation function.
Assuming the inputs add up linearly, the neuron produces a weighted sum, or net sum, given as

$$u = \sum_{i=1}^{n} w_i x_i$$

Let $\theta$ be the threshold. The variable $v = u - \theta$ is called the activation potential. The output $y$ of the neuron is a function of the weighted sum and the threshold, or bias. This function is known as the activation function.
Let this function be denoted by $\psi$. Then

$$y = \psi(u - \theta)$$
1.1.5 The Activation Functions

One of the most commonly used activation functions is the step function, or linear threshold function. In using this function, the inputs to the neuron are summed (each having been multiplied by a weight), and this sum is compared with a threshold, $\theta$. If the sum is greater than the threshold, the neuron fires and has an activation level of $+1$. Otherwise, it is inactive and has an activation level of zero. (In some networks, when the sum does not exceed the threshold, the activation level is taken to be $-1$ instead of $0$.) Hence, the behaviour of the neuron can be expressed as follows:

$$X = \sum_{i=1}^{n} w_i x_i$$

$X$ is the weighted sum of the $n$ inputs to the neuron, $x_1$ to $x_n$, where each input $x_i$ is multiplied by its corresponding weight $w_i$. For example, let us consider a simple neuron that has just two inputs. Each of these inputs has a weight associated with it, as follows:

$$w_1 = 0.8, \qquad w_2 = 0.4$$

The inputs to the neuron are $x_1$ and $x_2$:

$$x_1 = 0.7, \qquad x_2 = 0.9$$

So, the summed weight of these inputs is

$$(0.8 \times 0.7) + (0.4 \times 0.9) = 0.92$$

The activation level $Y$ is defined for this neuron as

$$Y = \begin{cases} +1 & \text{for } X > \theta \\ 0 & \text{for } X \leq \theta \end{cases}$$
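The computation above is small enough to check directly; below is a minimal sketch of this step-function neuron in Python, using the weights and inputs of the worked example (the threshold value of 0.5 is an arbitrary illustrative choice, since the text leaves θ unspecified):

```python
# Step-function (linear threshold) neuron for the two-input example above.
weights = [0.8, 0.4]
inputs  = [0.7, 0.9]
theta   = 0.5            # illustrative threshold; the text leaves theta unspecified

X = sum(w * x for w, x in zip(weights, inputs))   # weighted sum of the inputs
Y = 1 if X > theta else 0                          # step activation

print(round(X, 2), Y)    # 0.92 1  -> the neuron fires
```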
The second is the sigmoid function, which is used when the data are considered continuous. A sigmoid function is any differentiable function $\Psi(\cdot)$ such that $\Psi(v) \to 0$ as $v \to -\infty$ and $\Psi(v) \to 1$ as $v \to \infty$; a common choice is

$$\Psi(v) = \frac{1}{1 + e^{-\alpha v}}$$

where $\alpha$ is a parameter.

The larger $\alpha$ is, the greater the slope of $\Psi(v)$. When $\alpha \to \infty$, the sigmoid function becomes the binary threshold function.

[Figure: The sigmoid function]
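A minimal sketch of the sigmoid with its slope parameter $\alpha$, illustrating that larger $\alpha$ approaches the binary threshold function (the sample values of $\alpha$ and $v$ are arbitrary):

```python
import math

def sigmoid(v, alpha=1.0):
    """Psi(v) = 1 / (1 + exp(-alpha * v)); larger alpha gives a steeper slope."""
    return 1.0 / (1.0 + math.exp(-alpha * v))

for alpha in (1, 5, 50):                    # increasingly steep sigmoids
    print(alpha, [round(sigmoid(v, alpha), 3) for v in (-1, -0.1, 0.1, 1)])
# As alpha grows, outputs for v < 0 approach 0 and outputs for v > 0 approach 1,
# i.e. the sigmoid approaches the binary threshold function.
```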
1.2 Examples Of Neural Networks

1.2.1 Perceptrons

The perceptron was the first neural network with the capability to learn. It is made up of single input and output neuron pairs. The input neurons have two states, ON and OFF, and the output uses the discrete threshold activation function.

In its basic form, the perceptron can only solve linearly separable problems.

[Figure: A typical artificial neuron]
1.2.2 Feedforward Networks

A general feedforward network is a function which maps a vector of $n$ real-valued inputs $X = (x_1, \ldots, x_n)$ to a vector of $m$ real-valued outputs.
This type of network can be trained to encode mappings whose functional form is unknown by repeatedly presenting known input and output pairs to the network.
A feedforward network is one whose topology has no closed paths: its input neurons have no connections coming into them and its output neurons have no connections leading away. It also consists of hidden layers, which are included to improve the complexity and computational power of the network.

[Figure: A typical feedforward network]
As in the figure above, the weight vectors and the associated threshold values are denoted by $w_j$ and $\theta_j$ respectively; the weights associated with the single output unit are denoted by $\beta_j$, and the input vector by $x$.
With this notation, we see that the function a multilayered feedforward network computes is

$$f(x) = \sum_{j=1}^{k} \beta_j \, \sigma(w_j \cdot x - \theta_j)$$

$k$ being the number of processing units in the hidden layer.
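A minimal sketch of this computation in Python, assuming a single hidden layer of $k$ sigmoid units and one linear output (all parameter values below are arbitrary illustrative choices):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def feedforward(x, W, thetas, betas):
    """f(x) = sum_j beta_j * sigmoid(w_j . x - theta_j) for one hidden layer."""
    out = 0.0
    for w_j, theta_j, beta_j in zip(W, thetas, betas):
        u = sum(w * xi for w, xi in zip(w_j, x)) - theta_j   # hidden unit input
        out += beta_j * sigmoid(u)                           # weighted hidden output
    return out

# Two inputs, three hidden units, one output (illustrative parameters).
W      = [[0.5, -1.2], [1.0, 0.3], [-0.7, 0.8]]
thetas = [0.1, -0.2, 0.0]
betas  = [1.5, -0.5, 0.9]
print(feedforward([0.7, 0.9], W, thetas, betas))
```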

1.3 Mathematical Definitions And Background Information

1.3.1 Measure Theory

Measure On A Set

A measure on a set is a systematic way of assigning to each suitable subset a number intuitively interpreted as the size of that subset.
In this sense, a measure is a generalization of the concepts of area, length and volume. A particular example is the Lebesgue measure on a Euclidean space $\mathbb{R}^n$.

Limit Of A Sequence

A sequence $x_1, x_2, \ldots$ is said to converge to the point $x$, or to have limit $x$, if for every $\varepsilon > 0$ there exists a natural number $N$ such that the neighbourhood $O(x, \varepsilon)$ (the ball with centre $x$ and radius $\varepsilon$) contains all the points $x_n$ with $n > N$. Intuitively, it means that the elements of the sequence eventually become arbitrarily close to $x$.
More precisely, a real number $x$ is the limit of a sequence $(x_n)$ if for every $\varepsilon > 0$ there exists a natural number $N$ such that for all $n \geq N$ we have

$$|x_n - x| < \varepsilon$$

The sequence $(x_n)$ is then said to converge to the limit $x$.

Cauchy sequence

A sequence $x_1, x_2, \ldots$ is said to be a Cauchy sequence if for every $\varepsilon > 0$ there exists an $N$ such that $d(x_{n'}, x_{n''}) < \varepsilon$ for all $n', n'' \geq N$. Intuitively, it means that the elements of the sequence eventually become close to each other.
However, Cauchy sequences are not the same as convergent sequences (sequences having a limit), except in certain spaces (for example in $\mathbb{R}$). Note that

• every Cauchy sequence is bounded;

• every convergent sequence is Cauchy, but a Cauchy sequence is not necessarily convergent (it is convergent in complete spaces such as $\mathbb{R}$).

1.3.2 σ Algebra and Borel Algebra

Generally in analysis, $\sigma$ is used to denote countability.
A $\sigma$-algebra $S'$ is a collection of subsets of a set $S$ which is closed under countable set operations, i.e. the complement of a member, and countable unions and intersections of members of $S'$, are also members of $S'$.
In formal terms, an algebra $S'$ of subsets of a set $S$ is a $\sigma$-algebra if $S'$ contains the limit of every monotone sequence of its sets.
The pair $(S, S')$ is then known as a measurable space, and the sets in $S'$ are said to be measurable.
The Borel algebra on the real line is the minimal $\sigma$-algebra that contains all open (or, equivalently, all closed) sets of the real line. The elements of the Borel algebra are called Borel sets.

1.3.3 More On Measure And Measure Spaces

A set function is said to be countably additive if, for a finite or countably infinite sequence of pairwise disjoint sets, the measure of the union of these sets is equal to the sum of the measures of these sets, i.e.

$$\phi\left(\bigcup_{t=1}^{\infty} S_t\right) = \sum_{t=1}^{\infty} \phi(S_t)$$

where the sets $S_t \in S'$ are pairwise disjoint.

If there is a countably additive set function $\mu$ defined on a $\sigma$-algebra $S'$ of subsets of the set $S$, then the triplet $(S, S', \mu)$ is a measure space. An example is a Euclidean space with the Lebesgue measure.
The sets in $S'$ are called measurable sets and the function $\mu$ is called a measure, with the following properties:

• $\mu$ is non-negative, i.e. $\mu(A) \geq 0$ for every $A \in S'$

• $\mu$ is countably additive

• $\mu(\emptyset) = 0$

• $\mu$ obeys monotonicity: $\mu(A) \leq \mu(B)$ whenever $A \subseteq B$

Limit Points

A point $\wp$ is called a limit point of a set $E$ if, however small $\delta$ is, there are points of $E$ other than $\wp$ in the interval $(\wp - \delta, \wp + \delta)$.

Lebesgue Measure

Suppose $S \subseteq [a, b]$. The outer (or exterior) measure of $S$, denoted by $\mu_e(S)$, is defined as the greatest lower bound of the measures of all open sets which contain $S$. It is clear that

$$0 \leq \mu_e(S) \leq (b - a)$$

The inner (or interior) measure is defined as

$$\mu_i(S) = (b - a) - \mu_e(CS)$$

where $CS$ denotes the complement of $S$ in $[a, b]$. If $\mu_i(S) = \mu_e(S)$, then $S$ is said to be measurable and the common value is known as the Lebesgue measure of $S$.
Chapter 2

Learning In Neural Networks

2.1 Neural Network Training

There are two main training types for neural networks:

• Supervised Training

• Unsupervised Training

In supervised training, the neural network is supplied with input and output patterns, and the network's response to each input is measured. The weights are modified to reduce the difference between the actual and the desired outputs.
During unsupervised training, only inputs are provided. The neural network adjusts its own weights so that similar inputs produce similar outputs in the future. The network identifies the patterns and differences without any external assistance.

2.1.1 The Back Propagation Algorithm

This is the most common form of supervised training. A set of training patterns is assembled, with each case consisting of a problem statement (which represents the input to the network) and the corresponding solution (which represents the desired network output).
The actual output of the network is compared to the expected output, which yields an error value (the difference).

The connection weights are gradually adjusted, working backwards from the output layer, through the hidden layer, to the inputs, until the correct output is produced.
The process of fine-tuning weights in this manner produces the effect of teaching the network how to produce the correct outputs for a particular input; i.e. the network learns.
The information processing carried out by the back propagation algorithm is the approximation of a mapping

$$f : A \subset \mathbb{R}^n \to \mathbb{R}^m$$

from a bounded subset $A$ of $n$-dimensional Euclidean space to a bounded subset $f[A]$ of $m$-dimensional Euclidean space, by means of training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ of the mapping, where $y_k = f(x_k)$.

It is assumed that such examples are generated by select-


ing vectors xk randomly in accordance with a fixed proba-
bility density function ρ(x).
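As a small sketch of this setup, the following generates a training set for a known one-dimensional target mapping by sampling inputs from a fixed density (here a uniform density on [0, 1]; both the target function and the density are illustrative choices, not specified by the text):

```python
import random

def target(x):
    """Illustrative target mapping f to be approximated by the network."""
    return 3.0 * x * (1.0 - x)

def make_training_set(n):
    """Draw x_k from a uniform density on [0, 1] and label it with y_k = f(x_k)."""
    samples = []
    for _ in range(n):
        x = random.random()          # x_k ~ rho(x), here uniform on [0, 1]
        samples.append((x, target(x)))
    return samples

print(make_training_set(5))
```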

2.2 Probabilistic Model Of Learning

Let $X$ be a set of examples, which are elements of $\mathbb{R}^n$ if the network has $n$ real inputs, and let $Y \subseteq \mathbb{R}$ be the set of possible outputs. A pair $(x, y) \in X \times Y$ (with $X \times Y$ denoted $Z$), consisting of an example together with a possible output, is called a labelled example, while a finite sequence of labelled examples is known as a training sample.
A learning algorithm is a function

$$A : \bigcup_{n=1}^{\infty} Z^n \to H$$
which takes a randomly generated training sample of la-
beled examples, (each called a training example) and pro-
duces a function

h : X → [0, 1]

chosen from some class H of functions.

Our goal is to produce a hypothesis $h$ that is a good fit to the process generating the labelled examples.
The learning algorithm accepts the training sample and
alters the state of the network in some way in response to
the information provided by the sample. It is to be hoped
that, in the resulting state, the function computed by the
network is a better approximation to the target concept
than the function computed beforehand.
More precisely, we assume that there is some fixed, but
unknown, probability measure µ on Z.

(There is a fixed $\sigma$-algebra $\Sigma$ on $Z$, which, when $Z \subseteq \mathbb{R}^n$, we shall take to be the Borel $\sigma$-algebra. Then $\mu$ denotes a probability measure on $(Z, \Sigma)$. We assume that each training example is generated independently according to $\mu$, so, if the training sample is of length $n$, then it is generated according to the product probability measure $\mu^n$.)

We say that the learning algorithm is successful if, with high $\mu^n$-probability, it produces an output hypothesis $A(z)$ which is almost as good a fit to the distribution $\mu$ as exists in the class $H$.
More precisely, we have in mind some loss function

$$\ell : [0, 1] \times Y \to [0, 1]$$

and we hope that $A(z)$ has a relatively small loss, where, for $h \in H$, the loss of $h$ is the expectation

$$L(h) = \mathbb{E}\,\ell(h(x), y)$$

(where the expectation is with respect to $\mu$).

Examples of loss functions are

$$\ell(r, s) = |r - s|, \qquad \ell(r, s) = (r - s)^2,$$

and the discrete loss, given by

$$\ell(r, s) = \begin{cases} 0 & \text{if } r = s \\ 1 & \text{if } r \neq s \end{cases}$$
The best loss one could hope to be near is $L^* = \inf_{h \in H} L(h)$; we want $A(z)$ to have loss close to $L^*$, with high probability, provided the sample size $n$ is large enough.
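Since these loss definitions are simple, a minimal Python sketch may help fix ideas; it computes the three losses above and an average loss of a hypothesis over a sample (the hypothesis and sample are illustrative placeholders, not from the text):

```python
# The three loss functions from the text, and the average loss of a hypothesis
# over a list of labelled examples (a finite-sample estimate of L(h)).

def absolute_loss(r, s):
    return abs(r - s)

def squared_loss(r, s):
    return (r - s) ** 2

def discrete_loss(r, s):
    return 0 if r == s else 1

def average_loss(h, sample, loss):
    """Average loss of hypothesis h over a list of (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

# Illustrative hypothesis and sample (not from the text).
h = lambda x: 1 if x > 0.5 else 0
sample = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 1)]
print(average_loss(h, sample, discrete_loss))   # 0.25
```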

Definition

With the above notation, $A$ is a successful learning algorithm for $H$ if for all $\varepsilon, \delta \in (0, 1)$ there is some $n_0(\varepsilon, \delta)$ (depending on $\varepsilon$ and $\delta$ only) such that if $n > n_0(\varepsilon, \delta)$, then with probability at least $1 - \delta$,

$$L(A(z)) \leq L^* + \varepsilon$$

The minimal such $n_0(\varepsilon, \delta)$ is referred to as the sample complexity of $A$ and is denoted $n_A(\varepsilon, \delta)$.

We note that if $A$ is successful, then there is some function $\varepsilon_0(n, \delta)$ of $n$ and $\delta$, with the property that for all $\delta$,

$$\lim_{n \to \infty} \varepsilon_0(n, \delta) = 0$$

and such that for any probability measure $\mu$ on $Z$, with probability at least $1 - \delta$ we have

$$L(A(z)) \leq L^* + \varepsilon_0(n, \delta)$$

The minimal $\varepsilon_0(n, \delta)$ is called the estimation error of the algorithm.
When $H$ is a set of binary functions, meaning each function in $H$ maps into $\{0, 1\}$, if $Y = \{0, 1\}$, and if we use the discrete loss function, then we shall say that we have a binary learning problem. If $L^* = 0$, then we say that we have a realizable learning problem.

In this situation there is some $t \in H$ such that, with probability 1, $y = t(x)$ for all $(x, y) \in Z$.
For a binary, realizable learning problem the definition of successful learning is quite simple to understand: for any $h \in H$, $L(h)$ is the probability that, on a randomly drawn element $(x, y)$ of $Z$, $h$ and $t$ disagree on $x$, that is, $h(x) \neq t(x)$. So what the definition says is that, provided the sample is large enough, then, with probability at least $1 - \delta$, $A$ produces a hypothesis which agrees with the target function with probability at least $1 - \varepsilon$ on a further randomly drawn example.

2.2.1 Uniform Convergence Results

Borel-Cantelli lemma

Stated thus: let $(E_n)$ be a sequence of events in some probability space. If the sum of the probabilities of the $E_n$ is finite, i.e.

$$\sum_{n=1}^{\infty} \Pr(E_n) < \infty,$$

then the probability that infinitely many of them occur is 0; that is,

$$\Pr\left(\limsup_{n \to \infty} E_n\right) = 0.$$

Assumptions

Suppose that $F$ is a set of (measurable) functions from $Z$ to $[0, 1]$ and that $\mu$ is a probability measure on $Z$. Denote the expectation $\mathbb{E}_\mu f$ by $\mu(f)$ and, for $z = (z_1, z_2, \ldots, z_n) \in Z^n$, let us denote by $\mu_n(f)$ the empirical measure of $f$ on $z$,

$$\mu_n(f) = n^{-1} \sum_{i=1}^{n} f(z_i)$$

Definition

We say that F is a uniform Glivenko-Cantelli class if it has


the following property:

$$\forall \varepsilon > 0 \quad \lim_{n \to \infty}\, \sup_{\mu}\, P\!\left(\sup_{m \geq n}\, \sup_{f \in F} |\mu(f) - \mu_m(f)| > \varepsilon\right) = 0$$

The strong law of large numbers of classical probability theory tells us that, for each $\mu$ and for each fixed $f$,

$$P\!\left(\sup_{m \geq n} |\mu(f) - \mu_m(f)| > \varepsilon\right) \to 0$$

as $n \to \infty$.
For a class to be a uniform Glivenko-Cantelli class, we
must, additionally, be able to bound the rate of convergence
uniformly over all f ∈ F , and over all probability measures
µ.

If $F$ is finite, then it is a uniform Glivenko-Cantelli class. To see this explicitly, we can use Hoeffding's inequality, which tells us that for any $\mu$ and for each $f \in F$,

$$P\left(|\mu(f) - \mu_m(f)| > \varepsilon\right) < 2e^{-2\varepsilon^2 m}.$$

It follows that

$$P\!\left(\sup_{f \in F} |\mu(f) - \mu_m(f)| > \varepsilon\right) = P\!\left(\bigcup_{f \in F} \left\{|\mu(f) - \mu_m(f)| > \varepsilon\right\}\right) \leq \sum_{f \in F} P\left(|\mu(f) - \mu_m(f)| > \varepsilon\right) \leq 2\,|F|\,e^{-2\varepsilon^2 m}.$$

We now apply the Borel-Cantelli lemma, together with the observation that the bound just given is independent of $\mu$. Since, for each $\varepsilon > 0$,

$$\sum_{m=1}^{\infty} P\!\left(\sup_{f \in F} |\mu(f) - \mu_m(f)| > \varepsilon\right) \leq \sum_{m=1}^{\infty} 2\,|F|\,e^{-2\varepsilon^2 m} < \infty,$$

we have

$$\lim_{n \to \infty}\, \sup_{\mu}\, P\!\left(\sup_{m \geq n}\, \sup_{f \in F} |\mu(f) - \mu_m(f)| > \varepsilon\right) = 0,$$

and the class has the uniform Glivenko-Cantelli property.
The above derivation was straightforward, but it demonstrates a key technique: we bounded the probability of a union by the sum of the probabilities of the events involved.
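As a quick numeric illustration of this bound, a sketch computing how large the sample must be before $2|F|e^{-2\varepsilon^2 m} \leq \delta$ (the values of $|F|$, $\varepsilon$ and $\delta$ are arbitrary):

```python
import math

def sample_size_for_finite_class(card_F, eps, delta):
    """Smallest m with 2*|F|*exp(-2*eps^2*m) <= delta (finite-class Hoeffding bound)."""
    return math.ceil(math.log(2 * card_F / delta) / (2 * eps ** 2))

print(sample_size_for_finite_class(card_F=1000, eps=0.05, delta=0.01))  # -> 2442
```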

2.2.2 Application To Successful Learning

With the notation above, define the loss class (corresponding to $\ell$ and $H$) to be

$$\ell_H = \{\ell_h : h \in H\}$$

where, for $z = (x, y)$, $\ell_h(z) = \ell(h(x), y)$.
Suppose that $\ell_H$ is a uniform Glivenko-Cantelli class. For $z \in Z^n$, the empirical loss of $h \in H$ on $z$ is defined to be

$$L_z(h) = \mu_n(\ell_h) = n^{-1} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$

where $z_i = (x_i, y_i)$. Let us say that $A$ is an approximate empirical loss minimization algorithm if for all $z \in Z^n$,

$$L_z(A(z)) < \frac{1}{n} + \inf_{h \in H} L_z(h).$$

Then $A$ is a successful learning algorithm. (In the binary case, the infimum is a minimum, and the $1/n$ is not needed.)
This follows from the following argument. Suppose $\varepsilon > 0$ and $\delta > 0$ are given, and let $h^* \in H$ be such that

$$L(h^*) < L^* + \frac{\varepsilon}{4}.$$

Suppose $n > 4/\varepsilon$, so that $1/n < \varepsilon/4$. By the uniform Glivenko-Cantelli property for $\ell_H$, there is $n_0(\varepsilon/4, \delta)$ such that for all $n > n_0$, with probability at least $1 - \delta$,

$$\sup_{h \in H} |L(h) - L_z(h)| < \frac{\varepsilon}{4}.$$

So, with probability at least $1 - \delta$,

$$L(A(z)) < L_z(A(z)) + \frac{\varepsilon}{4} < \inf_{h \in H} L_z(h) + \frac{1}{n} + \frac{\varepsilon}{4} < L_z(h^*) + \frac{2\varepsilon}{4} < L(h^*) + \frac{\varepsilon}{4} + \frac{\varepsilon}{2} < L^* + \frac{\varepsilon}{4} + \frac{3\varepsilon}{4} = L^* + \varepsilon.$$

So $A$ is a successful learning algorithm, and its sample complexity is no more than $\max\!\left(4/\varepsilon,\; n_0(\varepsilon/4, \delta)\right)$.
In fact, as we have stated the definition of learning, the apparently weaker form of convergence

$$\forall \varepsilon > 0 \quad \lim_{n \to \infty}\, \sup_{\mu}\, P\!\left(\sup_{f \in F} |\mu(f) - \mu_n(f)| > \varepsilon\right) = 0,$$

where $F = \ell_H$, suffices for this type of learning algorithm to be successful.
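In the binary realizable setting this is easy to picture; below is a small sketch of empirical loss minimization over a finite class of threshold hypotheses (the hypothesis class, target and sample here are illustrative choices, not from the text):

```python
import random

# Finite hypothesis class: binary threshold functions h_c(x) = 1 if x >= c else 0.
thresholds = [i / 10 for i in range(11)]
H = {c: (lambda x, c=c: 1 if x >= c else 0) for c in thresholds}

target_c = 0.6                                     # unknown target t = h_0.6
sample = [(x, H[target_c](x)) for x in (random.random() for _ in range(200))]

def empirical_loss(h, sample):
    """Discrete empirical loss L_z(h): fraction of disagreements on the sample."""
    return sum(h(x) != y for x, y in sample) / len(sample)

best_c = min(thresholds, key=lambda c: empirical_loss(H[c], sample))
print(best_c, empirical_loss(H[best_c], sample))    # expect best_c == 0.6, loss 0.0
```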
Chapter 3

Function Approximation
With Neural Networks

3.1 Introduction And Useful Theorems

3.1.1 Brief introduction

The goal of this section is to demonstrate the function approximation capabilities of neural networks, specifically multilayered feedforward networks.

According to Kolmogorov, any continuous function $f$ from a compact subset of $\mathbb{R}^n$ to $\mathbb{R}^m$ can be realized by some multilayered feedforward neural network.
3.1.2 Useful Theorems

Theorem (Kolmogorov)

For any continuous function $f : [0, 1]^n \to \mathbb{R}$ (on the $n$-dimensional unit cube), there are continuous functions $h_1, \ldots, h_{2n+1}$ on $\mathbb{R}$ and continuous monotone increasing functions $g_{ij}$, for $1 \leq i \leq n$ and $1 \leq j \leq 2n+1$, such that

$$f(x_1, \ldots, x_n) = \sum_{j=1}^{2n+1} h_j\!\left(\sum_{i=1}^{n} g_{ij}(x_i)\right).$$

The functions $g_{ij}$ do not depend on $f$.

Hahn-Banach Theorem

The Hahn-Banach theorem allows the extension of bounded linear functionals defined on a subspace of some vector space to the whole space. It also shows that there are enough continuous linear functionals defined on every normed space.

Riesz Representation Theorem

This theorem states that every continuous linear functional $A[f]$ over $C[0, 1]$ can be represented as

$$A[f] = \int_0^1 f(x)\, d\alpha(x),$$

where $\alpha(x)$ is a function of bounded variation on $[0, 1]$ and the integral is a Riemann-Stieltjes integral.

Dominated Convergence Theorem

In measure theory, the Lebesgue dominated convergence theorem provides sufficient conditions under which two limit processes commute, namely Lebesgue integration and almost-everywhere convergence.
Let $\{f_n\}$ be a sequence of real-valued measurable functions on a measure space $(S, \Sigma, \mu)$. Suppose that the sequence converges pointwise to a function $f$ and is dominated by some integrable function $g$, in the sense that

$$|f_n(x)| \leq g(x)$$

for all $n$ in the index set of the sequence and all points $x \in S$. Then $f$ is integrable and

$$\lim_{n \to \infty} \int_S |f_n - f|\, d\mu = 0,$$

which also implies

$$\lim_{n \to \infty} \int_S f_n\, d\mu = \int_S f\, d\mu.$$
Bounded Convergence Theorem

The bounded convergence theorem is a corollary of the above theorem. It states that if $f_1, f_2, \ldots$ is a uniformly bounded sequence of measurable functions which converges pointwise on a finite measure space $(S, \Sigma, \mu)$ (i.e. one in which $\mu(S)$ is finite) to a function $f$, then the limit $f$ is an integrable function and

$$\lim_{n \to \infty} \int_S f_n\, d\mu = \int_S f\, d\mu.$$

3.2 The Universal Approximation Theorem

Recall that a feedforward neural network computes a function of the form

$$f(x) = \sum_{j=1}^{k} \beta_j\, \sigma(w_j \cdot x - \theta_j).$$

The right-hand side of this equation is a finite linear combination representing functions of an $n$-dimensional real variable $x \in \mathbb{R}^n$, which can also be written as

$$\sum_{j=1}^{N} \alpha_j\, \sigma(y_j^T x + \theta_j),$$

where $y_j \in \mathbb{R}^n$ and $\alpha_j, \theta_j \in \mathbb{R}$ are fixed ($y^T$ is the transpose of $y$, so that $y^T x$ is the inner product of $y$ and $x$). If $\sigma$ is sigmoidal, i.e.

$$\sigma(t) \to \begin{cases} 1 & \text{as } t \to \infty \\ 0 & \text{as } t \to -\infty, \end{cases}$$

then sums of this form are dense in the space of continuous functions on the unit cube. As a consequence, we can approximate any function in this context to an arbitrary precision.
The simplest nontrivial class of neural network which serves this purpose is the feedforward network with at least one hidden layer, plus an input and an output layer. Such networks enjoy the same kind of universality in approximating continuous functions as filters made from unit delays and constant multipliers. These sums merely generalize approximations by finite Fourier series.
3.2.1 Main Results

Let $I_n$ denote the $n$-dimensional unit cube $[0, 1]^n$. The space of continuous functions on $I_n$ is denoted by $C(I_n)$, and we use $\|f\|$ to denote the uniform norm of an $f \in C(I_n)$; in general we use $\|\cdot\|$ to denote the maximum of a function on its domain. The space of finite, signed, regular Borel measures on $I_n$ is denoted by $M(I_n)$.
The conditions under which sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(y_j^T x + \theta_j)$$

are dense in $C(I_n)$ with respect to the supremum (or uniform) norm are now investigated.

Definition

We say that $\sigma$ is discriminatory if, for a measure $\mu \in M(I_n)$,

$$\int_{I_n} \sigma(y^T x + \theta)\, d\mu(x) = 0$$

for all $y \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$ implies that $\mu = 0$.
Theorem

Let $\sigma$ be any continuous discriminatory function. The finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(y_j^T x + \theta_j)$$

are dense in $C(I_n)$. In other words, given any $f \in C(I_n)$ and $\varepsilon > 0$, there is a sum $G(x)$ of the above form for which

$$|G(x) - f(x)| < \varepsilon \quad \forall x \in I_n.$$

Proof

Let $S \subset C(I_n)$ be the set of functions of the form $G(x)$. Clearly $S$ is a linear subspace of $C(I_n)$. We claim that the closure of $S$ is all of $C(I_n)$.

Assume that the closure of $S$ is not all of $C(I_n)$. Then the closure of $S$, say $R$, is a closed proper subspace of $C(I_n)$. By the Hahn-Banach theorem, there is a bounded linear functional on $C(I_n)$, call it $L$, with the property that $L \neq 0$ but $L(R) = L(S) = 0$.
By the Riesz Representation Theorem, this bounded linear functional $L$ is of the form

$$L(h) = \int_{I_n} h(x)\, d\mu(x)$$

for some $\mu \in M(I_n)$, for all $h \in C(I_n)$.

In particular, since $\sigma(y^T x + \theta)$ is in $R$ for all $y$ and $\theta$, we must have that

$$\int_{I_n} \sigma(y^T x + \theta)\, d\mu(x) = 0$$

for all $y$ and $\theta$.
However, we assumed that $\sigma$ was discriminatory, so this condition implies that $\mu = 0$, contradicting our assumption. Hence, the subspace $S$ must be dense in $C(I_n)$.

This demonstrates that sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(y_j^T x + \theta_j)$$

are dense in $C(I_n)$, provided that $\sigma$ is continuous and discriminatory.

We now show that continuous sigmoidal functions $\sigma$, i.e. functions with

$$\sigma(t) \to \begin{cases} 1 & \text{as } t \to \infty \\ 0 & \text{as } t \to -\infty, \end{cases}$$

are discriminatory. It is worth noting that in neural network applications, continuous sigmoidal activation functions are typically taken to be monotone increasing, but no monotonicity is required in our results.

Lemma 1

Any bounded, measurable sigmoidal function $\sigma$ is discriminatory. In particular, any continuous sigmoidal function is discriminatory.
Proof

To demonstrate this, note that for any $x, y, \theta, \varphi$ we have

$$\sigma(\lambda(y^T x + \theta) + \varphi) \;\begin{cases} \to 1 & \text{if } y^T x + \theta > 0, \text{ as } \lambda \to \infty \\ \to 0 & \text{if } y^T x + \theta < 0, \text{ as } \lambda \to \infty \\ = \sigma(\varphi) & \text{if } y^T x + \theta = 0, \text{ for all } \lambda. \end{cases}$$

Thus the functions $\sigma_\lambda(x) = \sigma(\lambda(y^T x + \theta) + \varphi)$ converge pointwise and boundedly, as $\lambda \to +\infty$, to the function

$$Y(x) = \begin{cases} 1 & \text{for } y^T x + \theta > 0 \\ 0 & \text{for } y^T x + \theta < 0 \\ \sigma(\varphi) & \text{for } y^T x + \theta = 0. \end{cases}$$

Let $\Pi_{y,\theta}$ be the hyperplane defined by $\{x \mid y^T x + \theta = 0\}$ and let $H_{y,\theta}$ be the open half-space defined by $\{x \mid y^T x + \theta > 0\}$. Then, by the Lebesgue Bounded Convergence Theorem, we have that

$$0 = \int_{I_n} \sigma_\lambda(x)\, d\mu(x) = \int_{I_n} Y(x)\, d\mu(x) = \sigma(\varphi)\,\mu(\Pi_{y,\theta}) + \mu(H_{y,\theta})$$

for all $\varphi$, $\theta$, $y$.
We now show that the measure of all half-spaces being 0 implies that the measure $\mu$ itself must be 0. This would be trivial if $\mu$ were a positive measure, but here it is not. Fix $y$. For a bounded measurable function $h$, define the linear functional $F$ according to

$$F(h) = \int_{I_n} h(y^T x)\, d\mu(x)$$

and note that $F$ is a bounded functional on $L^\infty(\mathbb{R})$, since $\mu$ is a finite signed measure. Let $h$ be the indicator function of the interval $[\theta, \infty)$ (that is, $h(u) = 1$ if $u \geq \theta$ and $h(u) = 0$ if $u < \theta$), so that

$$F(h) = \int_{I_n} h(y^T x)\, d\mu(x) = \mu(\Pi_{y,-\theta}) + \mu(H_{y,-\theta}) = 0.$$

Similarly, $F(h) = 0$ if $h$ is the indicator function of the open interval $(\theta, \infty)$. By linearity, $F(h) = 0$ for the indicator function of any interval and hence for any simple function (that is, any sum of indicator functions of intervals). Since simple functions are dense in $L^\infty(\mathbb{R})$, $F = 0$.
In particular, the bounded measurable functions $s(u) = \sin(m \cdot u)$ and $c(u) = \cos(m \cdot u)$ give

$$F(s + ic) = \int_{I_n} \cos(m^T x) + i \sin(m^T x)\, d\mu(x) = \int_{I_n} e^{i m^T x}\, d\mu(x) = 0$$

for all $m$. Thus, the Fourier transform of $\mu$ is 0, and so $\mu$ must be zero as well. Hence, $\sigma$ is discriminatory.
3.2.2 Application Of Results

We now apply the previous results to the case of most interest in neural network theory.
A straightforward combination of Theorem 1 and Lemma 1 shows that networks with one internal layer and a continuous sigmoidal activation function can approximate continuous functions to any desired precision, provided that no constraints are placed on the number of nodes or the size of the weights. This is made precise in the theorem below.

Theorem

Let $\sigma$ be any continuous sigmoidal function. The finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(y_j^T x + \theta_j)$$

are dense in $C(I_n)$. In other words, given any $f \in C(I_n)$ and $\varepsilon > 0$, there is a sum $G(x)$ of the above form for which

$$|G(x) - f(x)| < \varepsilon \quad \forall x \in I_n.$$

Proof

Combine Theorem 1 and Lemma 1, noting that continuous sigmoidal functions satisfy the conditions of the lemma.

3.2.3 Generalization Of Result

Let $\sigma(\cdot)$ be a non-constant, bounded, monotone increasing continuous function. Then, for any continuous function $f(x)$ with $x = \{x_i \in [0, 1],\; i = 1, \ldots, m\}$ and any $\varepsilon > 0$, there exists an integer $M$ and real constants $\{\alpha_j, b_j, w_{jk},\; j = 1, \ldots, M,\; k = 1, \ldots, m\}$ such that

$$F(x_1, \ldots, x_m) = \sum_{j=1}^{M} \alpha_j\, \sigma\!\left(\sum_{k=1}^{m} w_{jk} x_k - b_j\right)$$

is an approximation of $f(\cdot)$, i.e.

$$|F(x_1, \ldots, x_m) - f(x_1, \ldots, x_m)| < \varepsilon$$

for all $x$ that lie in the input space.

This applies to networks with $M$ hidden units, $\sigma$ as the sigmoid activation function, and $w_{jk}$, $b_j$ as the hidden-layer weights and biases respectively.
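As a small numerical illustration, the sketch below fits a one-hidden-layer sum of sigmoids to a target function on [0, 1] by least squares on the output weights only (the hidden weights and biases are fixed at randomly chosen values; the target function, the number of hidden units and the grid are arbitrary illustrative choices, and this is not the constructive procedure of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Target function to approximate on [0, 1] (illustrative choice).
f = lambda x: np.sin(2 * np.pi * x)

M = 40                                    # number of hidden units
w = rng.normal(scale=10.0, size=M)        # fixed random hidden weights w_j
b = rng.uniform(-10.0, 10.0, size=M)      # fixed random hidden biases b_j

x = np.linspace(0.0, 1.0, 200)
Phi = sigmoid(np.outer(x, w) - b)         # hidden-layer outputs, shape (200, M)

# Solve for the output weights alpha_j by least squares.
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
F = Phi @ alpha

print("max |F - f| on the grid:", np.max(np.abs(F - f(x))))
```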

Chapter 4

Practical Applications

4.1 Anna : The Well Behaved Robot

4.1.1 Problem Statement

You have been posted by the National Youth Service Corps (NYSC) to a remote secondary school in Guargwuwuri village in northern Nigeria, an area infested with a new breed of cannibalistic Boko Haram militants. Tipped off by faithful locals, you have first-hand information that your location will come under siege in less than 16 hours from militants posing as school students.
However, you also know that the militants have unusually hairy faces and are slightly taller than normal school kids. As a computer-savvy mathematician, you decide to build a robot named Anna with human-like cognitive features using a neural network, which will warn you of these evil ones and also take the necessary action where applicable: the well behaved robot.
4.1.2 Design

To fool the locals and ignorant militants, we will make Anna look like one of our friends. She will always sit at the main entrance reading through an advanced mathematics textbook. We want Anna to behave well: she will say a cordial greeting in the Hausa language to friendly visitors, while she notifies the control room of anyone suspicious, warns them not to move any further and locks the gate.
If they disobey, she will run away screaming. (She can also wreak havoc when instructed to do so.)

4.1.3 Anna’s Brain

We shall use a feedforward neural network with four neurons in the input layer, four hidden units and two output units.
When Anna sees a visitor she receives input from her two sharp eye cameras, sending the pairs $(u_1, u_2)$ and $(v_1, v_2)$ to the network for the visitor's details and visual position respectively. We define a hypothetical function of perfection

$$\tau : \mathbb{R}^4 \to \mathbb{R}^2$$

which defines the actions to be taken in response to varying inputs. Anna calculates the height of each visitor and detects whether they have unusually hairy faces using feature detection techniques, then sends the numeric pair $(u_1, u_2)$ of relative height and face awkwardness to the network. She also sends $(v_1, v_2)$, the initial and final positions of the visitor, to the network.

Actions defined by τ

Output 1 if:
the relative height and face awkwardness are greater than a threshold;
the final position is less than the initial position.

Output 0 if:
otherwise.

In short, the network receives an input

$$X \in \mathbb{R}^4$$

and computes a function

$$f : \mathbb{R}^4 \to \mathbb{R}^2$$

from the input space to the output space, which we hope will be as close to $\tau$ as possible.

As we are using a multilayered feedforward network, linear separability will not be an issue.

4.1.4 Training Anna

Anna starts with an empty brain, which we train by showing her training samples. We will train her to:

1. say a courteous greeting if the network outputs (0, 0);

2. notify the control room, give a warning and lock the gate if the output is (0, 1);

3. scream and take cover if the output is (1, 1).

We use the back propagation training method.


Let $\iota$ denote the error term, $\eta$ the learning rate, and $x_i$ the inputs. Then we adjust the weights (the neuronal connections) of the network by the following formula:

$$\Delta w_{ij} = -\eta \sum_{n} \iota_j x_i$$

For the output layer (the first layer to be trained, working backwards),

$$\iota_k = g'(a_k)\, \frac{\partial E^n}{\partial y_k},$$

and for the hidden layer,

$$\iota_j = g'(a_j) \sum_{k} w_{kj}\, \iota_k,$$

where $g$ is the sigmoid activation function, given by

$$g(a) = \frac{1}{1 + e^{-a}},$$

whose derivative can be expressed as

$$g'(a) = g(a)\,(1 - g(a)),$$

and $E^n$ is the sum-of-squares error, given by

$$E^n = \frac{1}{2} \sum_{k=1}^{c} (y_k - t_k)^2,$$

with $y_k$ the network output and $t_k$ the desired output.
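A minimal sketch of this training procedure for Anna's 4-4-2 network, assuming sigmoid activations in both layers and the update rules above (the training pairs and learning rate are illustrative placeholders, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda a: 1.0 / (1.0 + np.exp(-a))          # sigmoid activation

# 4 inputs -> 4 hidden -> 2 outputs (Anna's architecture); weights start small.
W1 = rng.normal(scale=0.5, size=(4, 4))         # hidden-layer weights
W2 = rng.normal(scale=0.5, size=(2, 4))         # output-layer weights
eta = 0.5                                        # learning rate (illustrative)

# Illustrative training pairs: (u1, u2, v1, v2) -> desired output (t1, t2).
data = [(np.array([0.1, 0.1, 0.2, 0.8]), np.array([0.0, 0.0])),   # friendly visitor
        (np.array([0.9, 0.8, 0.2, 0.1]), np.array([0.0, 1.0]))]   # suspicious visitor

for epoch in range(5000):
    for x, t in data:
        # forward pass
        a1 = W1 @ x;  h = g(a1)                  # hidden activations
        a2 = W2 @ h;  y = g(a2)                  # network outputs
        # backward pass: output-layer and hidden-layer error terms
        iota2 = g(a2) * (1 - g(a2)) * (y - t)    # iota_k = g'(a_k) dE/dy_k
        iota1 = g(a1) * (1 - g(a1)) * (W2.T @ iota2)
        # weight updates: Delta w_ij = -eta * iota_j * x_i
        W2 -= eta * np.outer(iota2, h)
        W1 -= eta * np.outer(iota1, x)

for x, t in data:
    print(np.round(g(W2 @ g(W1 @ x)), 2), "target", t)
```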

4.1.5 Adding Reality To Anna

To make her more real, we can make her strike up occasional conversations with strangers. We can also make her mutter intelligible words to herself. Here the network topology will need expansion, which is no issue, but we leave it for future work.
Chapter 5

Conclusion And References

5.1 Conclusion

We have taken a deep look at neural networks and feedforward network models. We have also demonstrated the precise mathematical basis and relevance of neural networks and how they can be used to approximate functions with an infinite number of discontinuities to an arbitrary precision. We have also examined learning in neural networks and probabilistic models for it, as well as a real-life application.
Hence we conclude that neural networks can encode mappings and functional representations of unknown functions to give the desired approximations to an arbitrary accuracy.

References

1. David Kriesel (2012), "Neural Networks", http://www.dkriesel.com/en

2. Wikiversity (2012), "Learning And Neural Networks", http://www.wikiversity.com

3. Gaurav Chandalia (2007), "A Gentle Introduction to Measure Theory", SUNY University at Buffalo.

4. George Cybenko (1989), "Approximation by Superpositions of a Sigmoidal Function", Springer-Verlag New York Inc.

5. Martin Anthony (2002), "Uniform Glivenko-Cantelli Theorems and Concentration of Measure in the Mathematical Modelling of Learning", Department of Mathematics, London School of Economics, London.

6. Ivan F. Wilde (2008), "Neural Networks", Mathematics Department, King's College London.

7. Vincent Cheung (2006), "An Introduction to Neural Networks", Signal and Data Compression Laboratory, Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba, Canada.
