MATHEMATICS OF NEURAL NETWORKS
BY
ASHIMI BLESSING
Ashimiblessing@hotmail.com
Acknowledgements
Ashimi Blessing A.
Contents

2 Learning In Neural Networks
2.1 Neural Network Training
2.1.1 The Back Propagation Algorithm
2.2 Probabilistic Model Of Learning
2.2.1 Uniform Convergence Results
2.2.2 Application To Successful Learning
4 Practical Applications
4.1 Anna: The Well Behaved Robot
4.1.1 Problem Statement
4.1.2 Design
4.1.3 Anna's Brain
4.1.4 Training Anna
4.1.5 Adding Reality To Anna
Abstract
Chapter 1
Introduction To Neural Networks
from a cup, give a lecture, or even take a course on neural networks.
Though it is not the brain alone that does these things, it plays an essential role in each process.
As powerful as the brain is, it is considered to consist of a large number of not-so-intelligent but highly connected processing elements known as neurons, whose interconnections form a network. It has been estimated that the human brain consists of about $10^{11}$ neurons (or more), each having as many as $10^4$ interconnections, with each neuron communicating with other neurons via signals. Since an artificial neural network works like the brain, it is very useful in solving a wide array of problems. From robot control to stock prediction, time series analysis, and the creation of computer frameworks that can mimic human thinking, neural networks are useful not only in mathematical analysis but in very real day-to-day applications.
1.1.1 Motivation Of Study
developed by Frank Rosenblatt. The field of neural networks continued to look promising until 1969, when Marvin Minsky and Seymour Papert published a precise mathematical analysis of the perceptron, showing its weakness in many areas and its inability to represent many important mathematical problems. This dealt a huge blow to perceptron research and funding; only a few researchers were left in the field.
1.1.3 The Biological Neuron
or "fire" may be either enhanced or decreased. Thus, an incoming signal can be either excitatory or inhibitory.
The containing wall of the neuron keeps most molecules from passing either in or out of the cell, but there are special channels allowing the passage of ions such as Na⁺, K⁺, Cl⁻, and Ca²⁺. By allowing such ions to pass, a potential is generated and maintained between the inside and the outside of the cell.
When an action potential reaches a synapse, it causes a change in the permeability of the membrane carrying the pulse, which results in an influx of Ca²⁺ ions. This leads to the release of neurotransmitters into the synaptic cleft, which diffuse towards the receptor sites of the receiving cells.
As the neuron continuously receives signals from its input channels, it sums up these inputs in some way. If the end result is greater than a predefined threshold, the neuron is activated and generates an output signal, which it passes along to the nearest neuron.
1.1.4 The Artificial Neuron
$w_{i,j}$ into the network input $u$ that can be further processed by the activation function. Thus, the network input is the result of the propagation function.
The neuron produces a weighted sum, or net sum, given as
$$u = \sum_{i=1}^{n} w_i x_i$$
assuming the inputs add up linearly. The output is then
$$y = \psi(u - \theta)$$
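To make the computation concrete, here is a minimal Python sketch of this neuron, assuming a simple step activation for $\psi$; the weights, inputs, and threshold are illustrative values, not taken from the text:

```python
# A minimal sketch of the artificial neuron described above: the network
# input u is the weighted sum of the inputs, and the output is psi(u - theta).
# The step activation used here is an illustrative assumption; any activation
# function could be substituted for psi.

def neuron_output(inputs, weights, theta):
    """Compute y = psi(u - theta) with u = sum_i w_i * x_i."""
    u = sum(w * x for w, x in zip(weights, inputs))  # net sum
    return 1.0 if u - theta > 0 else 0.0             # step activation psi

# Example: a two-input neuron (values assumed for illustration).
print(neuron_output([0.74, 0.9], [0.8, 0.4], theta=0.5))  # u = 0.952 -> 1.0
```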
1.1.5 The Activation Functions
$$X = \sum_{i=1}^{n} w_i x_i$$
$X$ is the weighted sum of the $n$ inputs to the neuron, $x_1$ to $x_n$, where each input $x_i$ is multiplied by its corresponding weight $w_i$. For example, let us consider a simple neuron that has just two inputs. Each of these inputs has a weight associated with it, as follows:
$$w_1 = 0.8, \quad w_2 = 0.4$$
$$x_1 = 0.74, \quad x_2 = 0.9$$
The weighted sum is then $X = (0.8)(0.74) + (0.4)(0.9) = 0.592 + 0.36 = 0.952$.
The first is the threshold function, given by
$$Y = \begin{cases} 1 & \text{for } x > \theta \\ 0 & \text{for } x \le \theta \end{cases}$$
The second is the sigmoid function, which is used when the data is considered continuous. A sigmoid function is any differentiable function $\Psi(\cdot)$ such that
$$\Psi(v) \to 0 \text{ as } v \to -\infty, \qquad \Psi(v) \to 1 \text{ as } v \to \infty$$
A common choice is the logistic function
$$\Psi(v) = \frac{1}{1 + e^{-\alpha v}}$$
where $\alpha$ is a parameter.
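As a sketch, the two activation functions above can be written in Python as follows; the example reuses the two-input weighted sum computed earlier, and the threshold value $\theta = 0.5$ is an assumption for illustration:

```python
import math

def step(x, theta=0.0):
    """Threshold activation: 1 if x > theta, else 0."""
    return 1 if x > theta else 0

def sigmoid(v, alpha=1.0):
    """Logistic sigmoid: 1 / (1 + exp(-alpha * v))."""
    return 1.0 / (1.0 + math.exp(-alpha * v))

# Weighted sum from the two-input example above:
X = 0.8 * 0.74 + 0.4 * 0.9   # = 0.952
print(step(X, theta=0.5))     # -> 1
print(sigmoid(X))             # -> ~0.7215
```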
1.2 Examples Of Neural Networks
1.2.1 Perceptrons
1.2.2 Feedforward Networks
Figure: A Typical Feedforward Network
$$f(x) = \sum_{j=1}^{k} \beta_j\, \sigma(w_j \cdot x - \theta_j)$$
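A minimal Python sketch of evaluating such a network: a single hidden layer of sigmoid units followed by a linear output. The choice of $k = 2$ hidden units and all numeric values are assumptions for illustration:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def feedforward(x, W, beta, theta):
    """f(x) = sum_j beta_j * sigmoid(w_j . x - theta_j):
    one hidden layer of k sigmoid units, linear output."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(W[j], x)) - theta[j])
              for j in range(len(beta))]
    return sum(b * h for b, h in zip(beta, hidden))

# Example with k = 2 hidden units and 2 inputs (values assumed).
W = [[0.5, -0.3], [1.2, 0.8]]
beta = [0.7, -0.4]
theta = [0.1, 0.2]
print(feedforward([1.0, 2.0], W, beta, theta))
```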
1.3 Mathematical Definitions And Background Information
Measure On A Set
Limit Of A Sequence
$$|x_n - x| < \varepsilon$$
Cauchy sequence
A $\sigma$-algebra $S'$ is a collection of subsets of a set $S$ which is closed under countable set operations, i.e., the complement of a member and the union and intersection of members of $S'$ are also its members.
In formal terms, an algebra $S'$ of subsets of a set $S$ is a $\sigma$-algebra if $S'$ contains the limit of every monotone sequence of its sets.
The pair $(S, S')$ is then known as a measurable space, and sets in $S'$ are said to be measurable.
The Borel algebra of a set is the minimal $\sigma$-algebra that contains all open (or closed) sets on the real line. The elements of the Borel algebra are called Borel sets.
$$\phi(S) \le \sum_{t=1}^{\infty} \phi(S_t)$$
where $S \in S'$ and $S \subseteq \bigcup_{t=1}^{\infty} S_t$.
If there is a countably additive set function $\mu$ defined on a $\sigma$-algebra $S'$ of subsets of the set $S$, then the triplet $(S, S', \mu)$ is a measure space. An example is the Euclidean space with the Lebesgue measure.
The sets in $S'$ are called measurable sets, and the function $\mu$ is called a measure, with the following properties:
• $\mu$ is countably additive
• $\mu(\emptyset) = 0$
• $\mu$ obeys monotonicity
Limit Points
Lebesgue Measure
$$0 \le \mu_e(S) \le (b - a)$$
$$\mu_i(S) = (b - a) - \mu_e(CS)$$
Chapter 2
Learning In Neural Networks
2.1 Neural Network Training
• Supervised Training
• Unsupervised Training
identifies the patterns and differences without any external
assistance.
$$f : A \subset \mathbb{R}^n \to \mathbb{R}^m$$
$$A : \bigcup_{n=1}^{\infty} Z^n \to H$$
which takes a randomly generated training sample of labeled examples (each called a training example) and produces a function
$$h : X \to [0, 1]$$
according to $\mu$; so, if the training sample is of length $n$, then it is generated according to the product probability measure $\mu^n$.)
A loss function is a map
$$\ell : [0, 1] \times Y \to [0, 1]$$
and the loss of $h$ is its expectation,
$$L(h) = \mathbb{E}\,\ell(h(x), y)$$
Examples include the absolute loss, given by
$$\ell(r, s) = |r - s|$$
the square loss, given by
$$\ell(r, s) = (r - s)^2$$
and the discrete loss, given by
$$\ell(r, s) = \begin{cases} 0 & \text{if } r = s \\ 1 & \text{if } r \ne s \end{cases}$$
The best loss one could hope to be near is $L^* = \inf_{h \in H} L(h)$; we want $A(z)$ to have loss close to $L^*$, with high probability, provided the sample size $n$ is large enough.
Definition
$$L(A(z)) \le L^* + \varepsilon$$
$$\lim_{n \to \infty} \varepsilon_0(n, \delta) = 0$$
$$L(A(z)) \le L^* + \varepsilon_0(n, \delta)$$
at least $1 - \delta$, $A$ produces a hypothesis which agrees with the target function with probability at least $1 - \varepsilon$ on a further randomly drawn example.
Borel-Cantelli lemma
Stated thus: if
$$\sum_{n=1}^{\infty} \Pr(E_n) < \infty$$
then, with probability 1, only finitely many of the events $E_n$ occur.
Assumptions
$$\mu_n(f) = n^{-1} \sum_{i=1}^{n} f(z_i)$$
Definition
$$\forall \varepsilon > 0 \quad \lim_{n \to \infty} \sup_{\mu} P\left( \sup_{m \ge n} \sup_{f \in F} |\mu(f) - \mu_m(f)| > \varepsilon \right) = 0$$
For a class to be a uniform Glivenko-Cantelli class, we
must, additionally, be able to bound the rate of convergence
uniformly over all f ∈ F , and over all probability measures
µ.
If $F$ is finite, then it is a uniform Glivenko-Cantelli class. To see this explicitly, we can use Hoeffding's inequality, which tells us that for any $\mu$ and for each $f \in F$,
$$P(|\mu(f) - \mu_m(f)| > \varepsilon) < 2e^{-2\varepsilon^2 m}$$
It follows that
$$P\left( \sup_{f \in F} |\mu(f) - \mu_m(f)| > \varepsilon \right) = P\left( \bigcup_{f \in F} \left\{ |\mu(f) - \mu_m(f)| > \varepsilon \right\} \right)$$
$$\le \sum_{f \in F} P\left( |\mu(f) - \mu_m(f)| > \varepsilon \right) \le 2|F|\, e^{-2\varepsilon^2 m}$$
Since
$$\sum_{m=1}^{\infty} P(|\mu(f) - \mu_m(f)| > \varepsilon) < \sum_{m=1}^{\infty} 2|F|\, e^{-2\varepsilon^2 m} < \infty$$
by the Borel-Cantelli lemma we have
$$\lim_{n \to \infty} \sup_{\mu} P\left( \sup_{m \ge n} \sup_{f \in F} |\mu(f) - \mu_m(f)| > \varepsilon \right) = 0$$
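As a numerical illustration (not part of the proof), the following Python sketch estimates the deviation probability for a small finite class and compares it with the Hoeffding union bound above; the class $F$ and the uniform distribution on $[0,1]$ are assumptions chosen for the demonstration:

```python
import random, math

# Finite class F of functions into [0,1]; z ~ Uniform[0,1].
F = [lambda z: z, lambda z: z * z, lambda z: math.sqrt(z)]
true_means = [0.5, 1.0 / 3.0, 2.0 / 3.0]  # exact values of mu(f) for each f

def deviation_probability(n, eps, trials=2000):
    """Monte Carlo estimate of P(sup_{f in F} |mu(f) - mu_n(f)| > eps)."""
    bad = 0
    for _ in range(trials):
        sample = [random.random() for _ in range(n)]
        emp = [sum(f(z) for z in sample) / n for f in F]
        if max(abs(m - e) for m, e in zip(true_means, emp)) > eps:
            bad += 1
    return bad / trials

n, eps = 200, 0.1
print(deviation_probability(n, eps))             # empirical estimate (near 0)
print(2 * len(F) * math.exp(-2 * eps**2 * n))    # union bound ~ 0.11
```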
$$\ell_H = \{ \ell_h : h \in H \}$$
where $\ell_h(x, y) = \ell(h(x), y)$.
Suppose that $\ell_H$ is a uniform Glivenko-Cantelli class. For $z \in Z^n$, the empirical loss of $h \in H$ on $z$ is defined to be
$$L_z(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
Suppose the algorithm $A$ satisfies
$$L_z(A(z)) < \frac{1}{n} + \inf_{h \in H} L_z(h)$$
Then $A$ is a successful learning algorithm. (In the binary case, the infimum is a minimum, and the $1/n$ is not needed.)
This follows from the fact that if $\varepsilon > 0$ and $\delta > 0$ are given, we can let $h^* \in H$ be such that
$$L(h^*) < L^* + \frac{\varepsilon}{4}$$
Suppose $n > 4/\varepsilon$, so that $\frac{1}{n} < \frac{\varepsilon}{4}$. By the uniform Glivenko-Cantelli property for $\ell_H$, there is $n_0(\frac{\varepsilon}{4}, \delta)$ such that for all $n > n_0$, with probability at least $1 - \delta$,
$$\sup_{h \in H} |L(h) - L_z(h)| < \frac{\varepsilon}{4}$$
So, with probability at least $1 - \delta$,
$$L(A(z)) < L_z(A(z)) + \frac{\varepsilon}{4} < \inf_{h \in H} L_z(h) + \frac{1}{n} + \frac{\varepsilon}{4} < L_z(h^*) + \frac{2\varepsilon}{4}$$
$$< L(h^*) + \frac{\varepsilon}{4} + \frac{\varepsilon}{2} < L^* + \frac{\varepsilon}{4} + \frac{\varepsilon}{4} + \frac{\varepsilon}{2} = L^* + \varepsilon$$
$$\forall \varepsilon > 0 \quad \lim_{n \to \infty} \sup_{\mu} P\left( \sup_{f \in F} |\mu(f) - \mu_n(f)| > \varepsilon \right) = 0$$
Chapter 3
Function Approximation With Neural Networks
3.1.2 Useful Theorems
Theorem (Kolmogorov)
For any continuous function $f : [0, 1]^n \to \mathbb{R}$ (on the $n$-dimensional unit cube), there are continuous functions $h_1, \ldots, h_{2n+1}$ on $\mathbb{R}$ and continuous monotone increasing functions $g_{ij}$ for $1 \le i \le n$ and $1 \le j \le 2n + 1$ such that
$$f(x_1, \ldots, x_n) = \sum_{j=1}^{2n+1} h_j\!\left( \sum_{i=1}^{n} g_{ij}(x_i) \right)$$
The functions $g_{ij}$ do not depend on $f$.
Hahn-Banach Theorem
$$A[f] = \int_0^1 f(x)\, d\alpha(x)$$
where $\alpha(x)$ is a function of bounded variation on $[0, 1]$ and the integral is a Riemann-Stieltjes integral.
Dominated Convergence Theorem
$$|f_n(x)| \le g(x)$$
$$\lim_{n \to \infty} \int_S f_n\, d\mu = \int_S f\, d\mu$$
Bounded Convergence Theorem
$$\lim_{n \to \infty} \int_S f_n\, d\mu = \int_S f\, d\mu$$
$$f(x) = \sum_{j=1}^{k} \beta_j\, \sigma(w_j \cdot x - \theta_j)$$
$$\sum_{j=1}^{N} \alpha_j\, \sigma(y_j^T x + \theta_j)$$
$$\sigma(t) \to \begin{cases} 1 & \text{as } t \to \infty \\ 0 & \text{as } t \to -\infty \end{cases}$$
3.2.1 Main Results
Let $I_n$ denote the $n$-dimensional unit cube $[0, 1]^n$. The space of continuous functions on $I_n$ is denoted by $C(I_n)$, and we use $\|f\|$ to denote the uniform norm of an $f \in C(I_n)$. In general we use $\|\cdot\|$ to denote the maximum of a function on its domain. The space of finite, signed, regular Borel measures on $I_n$ is denoted by $M(I_n)$.
We investigate the conditions under which sums of the form
$$G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(y_j^T x + \theta_j)$$
are dense in $C(I_n)$ with respect to the uniform norm.
Definition
Theorem
$$G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(y_j^T x + \theta_j)$$
Proof
closure of $S$ is all of $C(I_n)$.
$$L(h) = \int_{I_n} h(x)\, d\mu(x)$$
$$\int_{I_n} \sigma(y^T x + \theta)\, d\mu(x) = 0$$
for all $y$ and $\theta$.
However, we assumed that $\sigma$ was discriminatory, so this condition implies that $\mu = 0$, contradicting our assumption. Hence, the subspace $S$ must be dense in $C(I_n)$.
This demonstrates that sums of the form
$$G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(y_j^T x + \theta_j)$$
are dense in $C(I_n)$. Consider now a function $r$ with
$$r(t) \to \begin{cases} 1 & \text{as } t \to \infty \\ 0 & \text{as } t \to -\infty \end{cases}$$
Lemma 1
To demonstrate this, note that for any $x, y, \theta, \varphi$ we have
$$\sigma(\lambda(y^T x + \theta) + \varphi) \;\begin{cases} \to 1 & \text{if } y^T x + \theta > 0, \text{ as } \lambda \to \infty \\ \to 0 & \text{if } y^T x + \theta < 0, \text{ as } \lambda \to \infty \\ = \sigma(\varphi) & \text{if } y^T x + \theta = 0, \text{ for all } \lambda \end{cases}$$
Thus, the functions $\sigma_\lambda(x) = \sigma(\lambda(y^T x + \theta) + \varphi)$ converge pointwise to
$$\gamma(x) = \begin{cases} 1 & \text{for } y^T x + \theta > 0 \\ 0 & \text{for } y^T x + \theta < 0 \\ \sigma(\varphi) & \text{for } y^T x + \theta = 0 \end{cases}$$
as $\lambda \to +\infty$.
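A quick numerical illustration of this pointwise limit; the vector $y$, the offsets $\theta$ and $\varphi$, and the test points are assumptions chosen for the demonstration:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# As lambda grows, sigma(lambda*(y.x + theta) + phi) approaches 1 where
# y.x + theta > 0, approaches 0 where it is < 0, and stays sigma(phi)
# where it is exactly 0 (values assumed for illustration).
y, theta, phi = [1.0, -1.0], 0.0, 0.3
for x in ([2.0, 1.0], [1.0, 2.0], [1.0, 1.0]):   # y.x + theta: >0, <0, =0
    s = sum(yi * xi for yi, xi in zip(y, x)) + theta
    print([sigmoid(lam * s + phi) for lam in (1, 10, 100)])
```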
Let $\Pi_{y,\theta}$ be the hyperplane defined by $\{x \mid y^T x + \theta = 0\}$ and let $H_{y,\theta}$ be the open half-space defined by $\{x \mid y^T x + \theta > 0\}$.
Then by the Lebesgue Bounded Convergence Theorem we have
$$0 = \int_{I_n} \sigma_\lambda(x)\, d\mu(x) = \int_{I_n} \gamma(x)\, d\mu(x) = \sigma(\varphi)\, \mu(\Pi_{y,\theta}) + \mu(H_{y,\theta})$$
for all $\varphi$, $\theta$, $y$.
We now show that the measure of all half-planes being 0 implies that the measure $\mu$ itself must be 0. This would be trivial if $\mu$ were a positive measure, but here it is not. Fix $y$. For a bounded measurable function $h$, define the linear functional $F$ according to
$$F(h) = \int_{I_n} h(y^T x)\, d\mu(x)$$
and note that $F$ is a bounded functional on $L^\infty(\mathbb{R})$ since $\mu$ is a finite signed measure. Let $h$ be the indicator function of the interval $[\theta, \infty)$ (that is, $h(u) = 1$ if $u \ge \theta$ and $h(u) = 0$ if $u < \theta$), so that
$$F(h) = \int_{I_n} h(y^T x)\, d\mu(x) = \mu(\Pi_{y,-\theta}) + \mu(H_{y,-\theta}) = 0$$
Similarly, $F(h) = 0$ if $h$ is the indicator function of the open interval $(\theta, \infty)$. By linearity, $F(h) = 0$ for the indicator function of any interval and hence for any simple function (that is, a sum of indicator functions of intervals). Since simple functions are dense in $L^\infty(\mathbb{R})$, $F = 0$.
In particular, taking $y = m$, the bounded measurable functions $s(u) = \sin(u)$ and $c(u) = \cos(u)$ give
$$F(c + is) = \int_{I_n} \left( \cos(m^T x) + i \sin(m^T x) \right) d\mu(x) = \int_{I_n} e^{i m^T x}\, d\mu(x) = 0$$
for all $m$.
Thus, the Fourier transform of $\mu$ is 0, and so $\mu$ must be zero as well. Hence, $\sigma$ is discriminatory.
3.2.2 Application Of Results
Theorem
for all $x \in I_n$.
Proof
$$F(x_1, \ldots, x_m) = \sum_{j=1}^{M} \alpha_j\, \sigma\!\left( \sum_{k=1}^{m} w_{jk} x_k - b_j \right)$$
is an approximation of $F(\cdot)$, i.e.
Chapter 4
Practical Applications
ing a neural network, which would warn you of these evil ones and take necessary action where applicable:
The Well Behaved Robot
4.1.2 Design
perfection.
$$\tau : \mathbb{R}^4 \to \mathbb{R}^2$$
Actions defined by $\tau$:
Output 1 if:
• the relative height and face awkwardness are greater than a threshold
• the final position is less than the initial
Output 0 if:
• otherwise
In short, the network receives an input
$$X \in \mathbb{R}^4$$
and computes a function
$$f : \mathbb{R}^4 \to \mathbb{R}^2$$
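A minimal sketch of such a network in Python; the single hidden layer of three units and all weight values are assumptions, since the text does not fix Anna's exact architecture:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def anna_forward(x, W1, b1, W2, b2):
    """A sketch of a network computing f: R^4 -> R^2 as described above.
    The hidden layer size of 3 is an assumption for illustration."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) - b)
              for row, b in zip(W1, b1)]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden)) - b)
            for row, b in zip(W2, b2)]

# Illustrative (untrained) weights; x packs the four input features.
W1 = [[0.2, -0.5, 0.1, 0.4], [0.3, 0.2, -0.1, 0.6], [-0.2, 0.4, 0.5, 0.1]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.2, 0.5], [-0.4, 0.6, 0.1]]
b2 = [0.0, 0.0]
print(anna_forward([1.0, 0.5, 0.2, 0.8], W1, b1, W2, b2))
```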
For the first training layer, the error term is
$$\delta_k = g'(a_k) \frac{\partial E^n}{\partial y_k}$$
and for the second, the activation is the logistic function
$$g(a) = \frac{1}{1 + e^{-a}}$$
whose derivative can be expressed as
$$g'(a) = g(a)\,(1 - g(a))$$
The error on the $n$-th training example is
$$E^n = \frac{1}{2} \sum_{k=1}^{c} (y_k - t_k)^2$$
with $y_k$ the network output and $t_k$ the desired output.
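A short sketch of the output-layer step of backpropagation described above, using the logistic activation and the squared error; the pre-activations and targets are illustrative assumptions:

```python
import math

def g(a):                       # logistic activation
    return 1.0 / (1.0 + math.exp(-a))

def g_prime(a):                 # g'(a) = g(a) * (1 - g(a))
    s = g(a)
    return s * (1.0 - s)

def output_deltas(a, t):
    """delta_k = g'(a_k) * dE/dy_k with E = 0.5 * sum (y_k - t_k)^2,
    so dE/dy_k = y_k - t_k, where y_k = g(a_k)."""
    return [g_prime(ak) * (g(ak) - tk) for ak, tk in zip(a, t)]

# Example: pre-activations a and targets t (values assumed).
a = [0.3, -0.7]
t = [1.0, 0.0]
print(output_deltas(a, t))
```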
need expansion, which is no issue, but we leave it for future
work.
Chapter 5
5.1 Conclusion