
Radial Basis Functions Neural Network

Tirtharaj Dash

Course: BITS F312

Till now I

So far we have discussed how BackProp can be used to train an MLP so that it generalizes. We know that the generalization depends on
The size of the training data
The architecture of the network
The physical complexity of the problem at hand – however, this is not under our control
BackProp can be viewed as a tool that computes a stochastic approximation to the solution of the problem at hand, that is, an estimate of the free parameters such as the synaptic weights.
Why did we study the MLP? – Because an SLP cannot solve a problem where the patterns (data) are not linearly separable, whereas an MLP can solve nonlinearly separable problems with the use of one or more hidden layers.

Till now II

An MLP can have more than one hidden layer, which makes it capable of solving nonlinearly separable problems.
Although there are advantages, there are a few limitations of the MLP. One such problem: an MLP can also learn outliers or noise in the training patterns – which will probably lower its robustness to noisy patterns.
Noisy patterns are those patterns which do not follow the underlying behavior of the other samples in the dataset; e.g. a set of samples may follow a particular distribution whereas a few noisy samples keep themselves apart from it.

Radial Basis Functions Networks I

Here, we solve the problem of classifying nonlinearly separable patterns by proceeding in a hybrid manner using two stages:
The first stage transforms the set of nonlinearly separable samples (patterns) into a new set which, under certain conditions, is more likely to be linearly separable in the new sample space (mathematically justified by Cover's paper in 1965).
The second stage completes the solution to the prescribed classification problem by using least-squares estimation.
An RBF network usually has a single hidden layer and an output layer.
The inputs are the source nodes (sensory units).
The hidden units apply a nonlinear transformation from the input space to the hidden (feature) space. [For most applications, the dimensionality of this layer should be high, so as to satisfy Cover's theorem.] (stage 1)
The output layer is linear, designed to supply the response of the network to the activation pattern applied to the input layer. (stage 2)
Radial Basis Functions Networks II

The nonlinear transformation from the input space to the hidden space and the high dimensionality of the hidden space satisfy the only two conditions of Cover's theorem.
We should note that much of the theory developed on RBF networks
builds on the Gaussian function.
The Gaussian function may also be viewed as a kernel – hence the
designation of the two-stage procedure based on the Gaussian
function as a kernel method.

Cover’s theorem – separability of patterns I

The basic idea behind RBF networks is that when an RBF network is used to perform a complex pattern-classification task, the problem is basically solved by first transforming it into a high-dimensional space in a nonlinear manner and then separating the classes in the output layer.
As previously mentioned, the underlying justification is found in
Cover’s theorem on the separability of patterns, which, in qualitative
terms, may be stated as follows (Cover, 1965):
A complex pattern-classification problem, cast in a high-dimensional
space nonlinearly, is more likely to be linearly separable than in a
low-dimensional space, provided that the space is not densely
populated.

Cover’s theorem – separability of patterns II

For clarity: given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some nonlinear transformation.
Why should this work? – Because we know that a linearly separable classification problem is easier to solve than a nonlinearly separable one.
For separating a set of samples or patterns in the high-dimensional space, we need hyper-surface(s).

Cover’s theorem – separability of patterns III

Figure 1: hypersurface in 2D (source: http://web.stanford.edu/)

Cover’s theorem – separability of patterns IV

Consider a family of (hyper-)surfaces where each surface naturally divides an input space into two regions (let us focus on a binary classification problem).
Let H denote a set of N patterns (vectors) x1 , x2 , . . . , xN , each of
which is assigned to one of two classes H1 and H2 .
This dichotomy (meaning, binary partition) of the points is said to be
separable with respect to the family of surfaces if a surface exists in
the family that separates the points in the class H1 from those in the
class H2 .
For each pattern x ∈ H, define a vector made up of a set of
real-valued functions {ϕi (x)|i = 1, 2, . . . , m1 }, as shown by

φ(x) = [ϕ1 (x), ϕ2 (x), . . . , ϕm1 (x)]T (1)

Cover’s theorem – separability of patterns V

Suppose that the pattern x is a vector in an m0 -dimensional input space. So, here the mapping φ(·) maps this m0 -dimensional input space into an m1 -dimensional feature space.
We refer to ϕi (x) as a hidden function, because it plays a role similar to that of a hidden unit in an MLP. Correspondingly, the space spanned by the set of hidden functions {ϕi (x)}, i = 1, 2, . . . , m1 , is referred to as the feature space.
As per (Cover, 1965), a dichotomy {H1 , H2 } of H is said to be ϕ-separable if there exists an m1 -dimensional vector w that satisfies the following conditions:

wT φ(x) > 0, x ∈ H1 (2)

wT φ(x) < 0, x ∈ H2 (3)

Cover’s theorem – separability of patterns VI

So, the hyperplane, which is the separating surface in the φ-space (new feature space), can be defined by the equation

wT φ(x) = 0 (4)

The inverse image of this surface, which defines the decision boundary in the input space, is given as

{x : wT φ(x) = 0} (5)

Since the pattern-classification problem is nonlinearly separable in the input space, we cannot use a linear equation (e.g. wT x = 0) to separate these patterns. Instead, it is sufficient to introduce higher-order terms into the equation, so that we may generate a hyper-surface in the input space which can separate the patterns (corresponding to a hyperplane in the transformed feature space).

Cover’s theorem – separability of patterns VII

Consider a natural class of mappings obtained by using a linear combination of r -wise products of the pattern vector coordinates.
The separating surfaces corresponding to such mappings are referred
to as r th-order rational varieties.
A rational variety of order r in a space of dimension m0 is described by an r th-degree homogeneous equation in the coordinates of the input vector x, as shown by

∑_{0 ≤ i1 ≤ i2 ≤ . . . ≤ ir ≤ m0} a_{i1 i2 ...ir} x_{i1} x_{i2} . . . x_{ir} = 0 (6)

where xi is the ith component of the input vector x and x0 is set to unity in order to express the equation in homogeneous form. An r th-order product of entries xi of x – that is, x_{i1} x_{i2} . . . x_{ir} – is called a monomial.

Cover’s theorem – separability of patterns VIII

For an input space of dimensionality m0 , there are C (m0 , r ) monomials. (Examples of separating surfaces: if r = 1, hyperplanes; if r = 2, quadrics; if r = 2 with linear constraints on the coefficients a, hyperspheres.)

Cover’s theorem – separability of patterns X

Note that if patterns belonging to two classes are linearly separable, they are hyper-spherically separable. But hyper-spherical separability does not imply linear separability (by a hyperplane).
In general, linear separability implies spherical separability, which implies quadric separability; however, the converses are not necessarily true. (Lower to higher is possible; not vice versa.)
If there are N patterns, many different dichotomies can be generated. However, not all of these dichotomies are separable; only some are. Cover studied the probability that a dichotomy is separable.
To do so, he assumed the following:
All possible dichotomies of H = {x1 , x2 , . . . , xN } are equiprobable.

Cover’s theorem – separability of patterns XI

He defined P(N, m1 ) as the probability that a particular dichotomy picked at random is ϕ-separable. (Stated without proof) P(N, m1 ) can be written as

P(N, m1 ) = (1/2)^(N−1) ∑_{m=0}^{m1−1} C (N − 1, m) for N > m1 − 1 (7)

and
P(N, m1 ) = 1 for N ≤ m1 − 1 (8)

From the above equations, one could say that if we maximize this probability, then there is a high chance of obtaining ϕ-separable dichotomies.
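
As an aside, here is a minimal Python sketch (not from the slides; the function name prob_separable is illustrative) that evaluates Eqs. (7)–(8) numerically:

from math import comb

def prob_separable(N: int, m1: int) -> float:
    """Probability that a random dichotomy of N points is phi-separable
    when the hidden functions provide m1 degrees of freedom (Cover, 1965)."""
    if N <= m1 - 1:                      # Eq. (8)
        return 1.0
    # Eq. (7): (1/2)^(N-1) * sum_{m=0}^{m1-1} C(N-1, m)
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(m1))

# Example: with m1 = 5, separability becomes unlikely once N grows well past 2*m1.
for N in (5, 10, 15, 20):
    print(N, round(prob_separable(N, 5), 4))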

XOR Problem I
We know that the XOR problem is not solvable by a perceptron. See Figure 2.

x1 x2 d
0 0 0
0 1 1
1 0 1
1 1 0

Figure 2: XOR problem: the four patterns

XOR Problem II

Consider two neurons in the hidden layer of the RBFNN with centers µ1 = [1, 1] and µ2 = [0, 0].
Please note that here a transformation to a higher dimension is not necessary: these two hidden-layer functions ϕ1 (·) and ϕ2 (·) can already make the problem linearly separable, as shown in Figure 4, based on the values obtained at the hidden layer using the following hidden functions.

ϕ1 (x) = exp(−||x − µ1 ||²) (9)

ϕ2 (x) = exp(−||x − µ2 ||²) (10)
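
As a quick numerical check of Eqs. (9)–(10), here is a small NumPy sketch (not part of the original slides) that computes the two hidden-function values for the four XOR patterns; it reproduces the mapping sketched in Figures 3 and 4.

import numpy as np

centers = np.array([[1.0, 1.0],    # mu_1
                    [0.0, 0.0]])   # mu_2
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

for x in patterns:
    phi = np.exp(-np.sum((x - centers) ** 2, axis=1))   # [phi_1(x), phi_2(x)]
    print(x, np.round(phi, 4))

# In the (phi_1, phi_2) plane, the patterns (0,1) and (1,0) collapse onto a single
# point, and the two classes become separable by a straight line (cf. Figure 4).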

XOR Problem III

Figure 3: Values of the hidden functions (using some distance function)

XOR Problem IV

Figure 4: After kernel transformation in the hidden layer

It should be noted that, unlike the simple XOR problem, for general problems it is quite difficult to intuit a hidden-space dimension that would serve the purpose.

Separating capacity of a surface I
Let a sequence of patterns x1 , x2 , x3 , . . . be given. Let N be a random variable (r.v.) defined as the largest integer such that the sequence x1 , . . . , xN is ϕ-separable, where ϕ has m1 degrees of freedom. Then

Pr (N = n) = P(n, m1 ) − P(n + 1, m1 ) (11)

= (1/2)^n C (n − 1, m1 − 1), n = 0, 1, 2, . . . (12)
This distribution equals the probability that k failures precede the r th success in a long, repeated sequence of Bernoulli trials – trials with a true or false outcome.
Let p and q be the probabilities of success and failure, respectively, so that p + q = 1. The negative binomial distribution is written as

f (k; r , p) = p^r q^k C (r + k − 1, k) (13)
Separating capacity of a surface II
For the special case p = q = 0.5 (success and failure are equiprobable) and k + r = n, we have

f (k; n − k, 0.5) = (1/2)^n C (n − 1, k), n = 0, 1, 2, . . . (14)

With this definition, we now see that the result described in Eq. 12 is just
the negative binomial distribution, shifted by m1 units to the right, and
with parameters m1 and 0.5. Thus, N corresponds to the “waiting time”
for the m1 -th failure in a sequence of tosses of a fair coin. The expectation
of the random variable N and its median are, respectively,

E[N] = 2m1 (15)

and
median[N] = 2m1 (16)
The expected maximum number of randomly assigned patterns (vectors)
that are linearly separable in a space of dimensionality m1 is equal to 2m1.
The interpolation problem I

Following Cover's theorem on the separability of patterns, we have seen that there are real advantages in solving a nonlinear problem by mapping the input space into a new space of high enough dimension.
Basically, a nonlinear mapping is used to transform a nonlinearly separable classification problem into a linearly separable one with high probability.
Similarly, a nonlinear filtering problem could also be transformed into a linear one by the use of this fundamental idea.
Consider then a feedforward network with an input layer, a single
hidden layer, and an output layer consisting of a single unit. The
network is designed to perform a nonlinear mapping from the input
space to the hidden space, followed by a linear mapping from the
hidden space to the output space.

The interpolation problem II

Let m0 denote the dimension of the input space. So, overall, the network maps the m0 -dimensional input space to the one-dimensional output space, written as

s : Rm0 → R1 (17)

As discussed earlier, this type of network transforms each (possibly noisy) input pattern nonlinearly at the hidden layer and then learns from this transformed pattern.
Moreover, the following two points may also be viewed:
The training phase constitutes the optimization of a fitting procedure for some separating surface, based on the available data patterns which have been presented to the network in the form of input-output examples.

The interpolation problem III
The generalization phase is synonymous with interpolation between the data points, with the interpolation being performed along the constrained separating surface generated by the fitting procedure (FFNN) as the optimum approximation to the true surface underlying the patterns.
The theory of multivariable interpolation in high-dimensional space in
its strict sense may be stated as
Given a set of N different points {xi ∈ Rm0 |i = 1, 2, . . . , N} and a
corresponding set of N real numbers (desired outputs)
{di ∈ R1 |i = 1, 2, . . . , N}, find a function F : Rm0 → R1 that satisfies
the interpolation condition:

F (xi ) = di , i = 1, 2, . . . , N (18)

So basically, we are saying that all the data patterns will lie on the interpolating surface F . This is different from an approximation function, which maps the input to the output
The interpolation problem IV
given some data. The former statement makes more sense when we visualize the underlying behavior of the data patterns provided to us.
For strict interpolation as specified here, the interpolating surface
(i.e., function F ) is constrained to pass through all the training data
points for the input-output transformation.
The radial-basis-function (RBF) technique consists of choosing a function F that has the form

F (x) = ∑_{i=1}^{N} wi ϕ(||x − xi ||) (19)

where {ϕ(||x − xi ||) | i = 1, 2, . . . , N} is a set of N arbitrary (generally nonlinear, e.g. Gaussian) functions, known as radial-basis functions, and || · || denotes a norm that is usually Euclidean. The known data
The interpolation problem V

points xi ∈ Rm0 , i = 1, 2, . . . , N, are taken to be the centers of the radial-basis functions.
Now, inserting the interpolation conditions of Eq.(18) into Eq.(19),
we obtain a set of simultaneous linear equations for the unknown
coefficients (weights) of the expansion {wi } given by
    
[ ϕ11 ϕ12 . . . ϕ1N ] [ w1 ]   [ d1 ]
[ ϕ21 ϕ22 . . . ϕ2N ] [ w2 ]   [ d2 ]
[  :    :        :  ] [ :  ] = [ :  ]      (20)
[ ϕN1 ϕN2 . . . ϕNN ] [ wN ]   [ dN ]

where ϕij = ϕ(||xi − xj ||), i, j = 1, 2, . . . , N.

The interpolation problem VI

In general form, we can write

Φw = d (21)

where d = [d1 , d2 , . . . , dN ]T is the desired-response vector and w = [w1 , w2 , . . . , wN ]T is the weight vector. Assuming that Φ is non-singular (i.e. its inverse exists), we may find the value of the parameter set w as

w = Φ−1 d (22)
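
As an illustration of Eqs. (20)–(22), the sketch below (an assumption on our part: a Gaussian radial-basis function with a common width σ; all function names are only illustrative) builds the interpolation matrix Φ and solves Φw = d.

import numpy as np

def gaussian_rbf(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))

def fit_strict_interpolation(X, d, sigma=1.0):
    """X: (N, m0) training patterns, d: (N,) desired outputs -> weight vector w."""
    # Interpolation matrix: Phi_ij = phi(||x_i - x_j||)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Phi = gaussian_rbf(dists, sigma)
    # w = Phi^{-1} d; solving the linear system is preferred over an explicit inverse
    return np.linalg.solve(Phi, d)

def predict(X_train, w, x, sigma=1.0):
    r = np.linalg.norm(x - X_train, axis=1)
    return gaussian_rbf(r, sigma) @ w

# Tiny usage example with the four XOR patterns:
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 0.])
w = fit_strict_interpolation(X, d)
print([round(float(predict(X, w, x)), 3) for x in X])   # reproduces d at the training points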

Can we be sure that the interpolation matrix Φ is non-singular?
The interpolation problem VII

Micchelli (1986) discussed this question for some of the possible radial-basis functions; his theorem is stated as follows:

Let {x1 , x2 , . . . , xN } be a set of distinct points in Rm0 . Then the N-by-N interpolation matrix Φ, whose ij-th element is ϕij = ϕ(||xi − xj ||), is non-singular.

There are a number of radial-basis functions covered by Micchelli's theorem; the following are a few examples:
Multiquadrics:

ϕ(r ) = (r ² + c ²)^(1/2) for some c > 0 and r ∈ R (23)

Inverse multiquadrics:

ϕ(r ) = 1/(r ² + c ²)^(1/2) for some c > 0 and r ∈ R (24)

The interpolation problem VIII

Gaussian functions:

ϕ(r ) = exp(−r ²/(2σ²)) for some σ > 0 and r ∈ R (25)
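
For reference, the three radial-basis functions of Eqs. (23)–(25) can be written as simple Python functions of the radius r (a sketch only; c and σ are free parameters chosen by the user):

import numpy as np

def multiquadric(r, c=1.0):
    return np.sqrt(r**2 + c**2)             # Eq. (23)

def inverse_multiquadric(r, c=1.0):
    return 1.0 / np.sqrt(r**2 + c**2)       # Eq. (24)

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))   # Eq. (25)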

The important condition for the non-singularity of the interpolation matrix Φ is that all the data points have to be distinct.

RBF Networks I

Figure 5: Structure of an RBF network, based on interpolation theory.

RBF Networks II

Input layer: consists of m0 source nodes, where m0 is the dimensionality of the input vector x.
Hidden layer: consists of the same number of computational units as the size of the training sample, i.e. N; each unit is mathematically described by the radial-basis function

ϕj (x) = ϕ(||x − xj ||), j = 1, 2, . . . , N (26)

The jth input data point xj defines the center of the r.b.f., and the
vector x is the signal (pattern/input) applied to the input layer. (So,
the links from the input to the hidden layer have no weights)
Output layer: depending on the number of classes or the type of the problem, the number of neurons in the output layer may vary. It should be noted that the number of neurons in the output layer is smaller than the number of neurons in the hidden layer.

RBF Networks III

Let us focus on the Gaussian r.b.f., so that

ϕ(x) = ϕ(||x − xj ||) = exp(−||x − xj ||²/(2σj²)), j = 1, 2, . . . , N (27)

The parameter σj is the width of the jth Gaussian function with center xj . [Typically, but not always, all the widths are taken to be the same.] – In that case, the only difference among the hidden units is the center xj .
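
A short sketch of the hidden-layer computation in Eq. (27) for a batch of patterns (illustrative names; each hidden unit j has its own center xj and width σj):

import numpy as np

def hidden_layer(X, centers, sigmas):
    """X: (P, m0) input patterns, centers: (N, m0), sigmas: (N,) -> (P, N) activations."""
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigmas[None, :] ** 2))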

What are we doing here? I

The network in Figure 5 follows the interpolation theory we have just discussed, and it is tempting to assume that it works for any set of data patterns. However, it implicitly assumes that the data patterns are not noisy at all (which, of course, may sometimes be true). But why take a chance when we are not sure of anything about the data patterns?
If the data patterns are noisy, then the results we obtain may be misleading. Can we think of a different approach? Note that noise could mean either deviating from the pattern followed by the other data points, or inherent redundancy in terms of some kind of correlation.
Whatever improvement can be made to the RBF network must come from modifying the hidden layer only, because the other two layers (input and output) are fixed beforehand.
Within the hidden layer, the major improvement that can be made is a change in the architecture – i.e. the number of neurons.
What are we doing here? II

Can we fix the number of neurons in the hidden layer to the dimensionality of the input? – Well, we can. But then we would merely be transforming the data instances into a feature space of the same dimension – which does not sit well with Cover's theorem.
Can we fix the number of hidden neurons to the number of training samples (N)? – Well, we can. But again we would be ignoring the inherent noise associated with the available data patterns in terms of correlation. Think of this case: suppose there are 10000 data patterns sampled at random. Can you guarantee that all of these data patterns are unique in some way? I guess we cannot. There could be some data patterns (say 56 samples) which fall into the same category (meaning there is some kind of correlation within these 56 data patterns). Similarly, there could be many more such groups inside the available input data patterns.

What are we doing here? III

For the reasons explained in the above point, it is always a good idea to set the number of neurons in the hidden layer to K , where K < N (see the following figure).

Figure 6: Practical RBF Network with number of hidden neurons smaller than N

What are we doing here? IV

How do we find the centroids of the K hidden neurons?
After finding the centroids, what then? – Use the centroids to compute the hidden functions in the hidden layer. We thus get the values at the hidden neurons, from which the final output y is computed using the synaptic weights.
Use an optimization procedure such as the least-squares-error method to obtain the final set of optimal synaptic weights.

Centroids of Hidden Neurons

There are various ways to find similarity among data points or patterns (or, say, K similar subgroups). The best way is to use a clustering algorithm that does not take the output into account.
The clustering could be done by employing the simple K -means algorithm.

K -means clustering I

Inputs:
– K (number of clusters) – there are various ways to choose a good K , which we will not discuss here
– The set of unlabeled examples {x1 , x2 , . . . , xN } (note that the training set does not contain d); each xi ∈ Rm0

Algorithm:
Randomly initialize K cluster centroids µ1 , µ2 , . . . , µK ∈ Rm0
Repeat {
    for i = 1 to N                /* cluster assignment */
        clustIdi = index (from 1 to K ) of the cluster centroid closest to xi (see Eq. (28))
    for k = 1 to K                /* update the centroids */
        µk = average (mean) of the points assigned to cluster k (see Eq. (29))
}
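
A compact NumPy version of the above loop is sketched below (illustrative only; in practice a library implementation such as scikit-learn's KMeans would usually be preferred):

import numpy as np

def kmeans(X, K, n_iters=100, seed=None):
    """X: (N, m0) unlabeled patterns. Returns (centroids, cluster ids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # random initialization
    for _ in range(n_iters):
        # Cluster assignment, Eq. (28): nearest centroid for each pattern
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        clust_id = np.argmin(dists, axis=1)
        # Centroid update, Eq. (29): mean of the patterns assigned to each cluster
        for k in range(K):
            members = X[clust_id == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids, clust_id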

K -means clustering II

clustIdi = argmin_k ||xi − µk ||² (28)

µk = (1/c) ∑_{j=1}^{c} xj (29)

where c is the number of patterns assigned to cluster k and the sum in Eq. (29) runs over those patterns.

After finding out K , then what to do?

After finding the K centroids, we need to find the optimal weight set (synaptic weights) from the hidden layer to the output layer.
Well, how do we get these optimal weights? – You can employ the least-squares estimation approach.
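
Putting the pieces together, the following sketch (with illustrative names, reusing the kmeans and hidden_layer sketches shown earlier and assuming a common width σ) finds the K centers, builds the hidden-layer design matrix, and estimates the output weights by least squares:

import numpy as np

def train_rbf(X, d, K, sigma=1.0):
    centers, _ = kmeans(X, K)                     # stage 1: K centers via clustering
    sigmas = np.full(K, sigma)                    # common width, for simplicity
    Phi = hidden_layer(X, centers, sigmas)        # (N, K) hidden-layer activations
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)   # least-squares estimate of the weights
    return centers, sigmas, w

def rbf_predict(x, centers, sigmas, w):
    return hidden_layer(np.atleast_2d(x), centers, sigmas) @ w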

