
Radial Basis Functions Neural Network

Tirtharaj Dash

Course: BITS F312

Till now I

So far we have discussed how BackProp can be used to train an MLP so that it generalizes. We know that the generalization depends on
The size of the training data
The architecture of the network
The physical complexity of the problem at hand – however, this is not under our control
BackProp can be viewed as a tool that computes a stochastic approximation to the solution of the problem at hand, that is, an estimate of the free parameters such as the synaptic weights.
Why did we study the MLP? – Because an SLP cannot solve a problem where the patterns (data) are not linearly separable, whereas an MLP can solve nonlinearly separable problems with the use of one or more hidden layers.

Till now II

An MLP can have more than one hidden layer, which makes it capable of solving nonlinearly separable problems.
Although there are advantages, there are a few limitations of the MLP. One such problem: an MLP can also learn outliers or noise in the training patterns – which will probably lower its robustness to noisy patterns.
Noisy patterns are those patterns which do not follow the underlying behavior of the other samples in the dataset; e.g. a set of samples may follow a particular distribution whereas a few noisy samples keep themselves apart from it.

Radial Basis Functions Networks I

Here, we solve the problem of classifying nonlinearly separable patterns by proceeding in a hybrid manner using two stages:
The first stage transforms the set of nonlinearly separable samples (patterns) into a new set which, under certain conditions, is more likely to be linearly separable in the new sample space (mathematically justified by Cover's paper in 1965).
The second stage completes the solution to the prescribed classification problem by using least-squares estimation.
An RBF network usually has a single hidden layer and an output layer.
The inputs are the source nodes (sensory units).
The hidden units apply a nonlinear transformation from the input space to the hidden (feature) space. [For most applications, the dimensionality of this layer should be high, so as to satisfy Cover's theorem.] (stage 1)
The output layer is linear, designed to supply the response of the network to the activation pattern applied to the input layer. (stage 2)
Radial Basis Functions Networks II

The nonlinear transformation from the input space to the hidden space and the high dimensionality of the hidden space satisfy the only two conditions of Cover's theorem.
We should note that much of the theory developed on RBF networks
builds on the Gaussian function.
The Gaussian function may also be viewed as a kernel – hence the
designation of the two-stage procedure based on the Gaussian
function as a kernel method.

Cover’s theorem – separability of patterns I

The basic idea behind RBF networks is that when an RBF network is used to perform a complex pattern-classification task, the problem is basically solved by first transforming it into a high-dimensional space in a nonlinear manner and then separating the classes in the output layer.
As previously mentioned, the underlying justification is found in
Cover’s theorem on the separability of patterns, which, in qualitative
terms, may be stated as follows (Cover, 1965):
A complex pattern-classification problem, cast in a high-dimensional
space nonlinearly, is more likely to be linearly separable than in a
low-dimensional space, provided that the space is not densely
populated.

Cover’s theorem – separability of patterns II

For clarity: given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some nonlinear transformation.
Why should this work? – Because we know that a linearly separable classification problem is easier to solve than a nonlinearly separable one.
For separating a set of samples or patterns in the high-dimensional space, we need hyper-surface(s).

Cover’s theorem – separability of patterns III

Figure 1: hypersurface in 2D (source: http://web.stanford.edu/)

Cover’s theorem – separability of patterns IV

Consider a family of (hyper-)surfaces where each surface naturally divides an input space into two regions (let us focus on a binary classification problem).
Let H denote a set of N patterns (vectors) x1 , x2 , . . . , xN , each of
which is assigned to one of two classes H1 and H2 .
This dichotomy (meaning, binary partition) of the points is said to be
separable with respect to the family of surfaces if a surface exists in
the family that separates the points in the class H1 from those in the
class H2 .
For each pattern x ∈ H, define a vector made up of a set of
real-valued functions {ϕi (x)|i = 1, 2, . . . , m1 }, as shown by

φ(x) = [ϕ1 (x), ϕ2 (x), . . . , ϕm1 (x)]T (1)

Cover’s theorem – separability of patterns V

Suppose that the pattern x is a vector in an m0 -dimensional input space. So, here the mapping φ(·) maps this m0 -dimensional input space into an m1 -dimensional feature space.
We refer to ϕi (x) as a hidden function, because it plays a role similar to that of a hidden unit in an MLP. Correspondingly, the space spanned by the set of hidden functions {ϕi (x)}, i = 1, 2, . . . , m1 , is referred to as the feature space.
As per (Cover, 1965), a dichotomy {H1 , H2 } of H is said to be ϕ-separable if there exists an m1 -dimensional vector w that satisfies the following conditions:

wT φ(x) > 0, x ∈ H1 (2)

wT φ(x) < 0, x ∈ H2 (3)

Cover’s theorem – separability of patterns VI

So, the hyperplane, which is the separating surface in the φ-space (new feature space), can be defined by the equation

wT φ(x) = 0 (4)

The inverse image of this surface, which defines the decision boundary in the input space, is given as

{x : wT φ(x) = 0} (5)

Since the pattern-classification problem is nonlinearly separable in the input space, we cannot use a linear equation (e.g. wT x = 0) to separate these patterns. Instead, it is sufficient to introduce higher-order terms into the equation, so that we may generate a hyper-surface in the input space which can separate the patterns (corresponding to a hyperplane in the transformed feature space).

Cover’s theorem – separability of patterns VII

Consider a natural class of mappings obtained by using a linear combination of r -wise products of the pattern vector coordinates.
The separating surfaces corresponding to such mappings are referred
to as r th-order rational varieties.
A rational variety of order r in a space of dimension m0 is described by an r th-degree homogeneous equation in the coordinates of the input vector x, as shown by

∑_{0 ≤ i1 ≤ i2 ≤ . . . ≤ ir ≤ m0} a_{i1 i2 ...ir} x_{i1} x_{i2} . . . x_{ir} = 0 (6)

where xi is the ith component of the input vector x and x0 is set to unity in order to express the equation in homogeneous form. An r th-order product of entries xi of x – that is, x_{i1} x_{i2} . . . x_{ir} – is called a monomial.

Cover’s theorem – separability of patterns VIII

For an input space of dimensionality m0 , there are C (m0 , r ) monomials. (Examples of separating surfaces: if r = 1, hyperplanes; if r = 2, quadrics; if r = 2 with linear constraints on the coefficients a, hyperspheres.)

Cover’s theorem – separability of patterns X

Note that if patterns belonging to two classes are linearly separable, they are hyper-spherically separable. But hyper-spherical separability does not imply linear separability (by a hyperplane).
In general, linear separability implies spherical separability, which implies quadric separability; however, the converses are not necessarily true. (Lower to higher is possible; not vice versa.)
If there are N patterns, many different dichotomies can be generated. However, not all of these dichotomies are separable; only some are. Cover studied the probability that a dichotomy is separable.
To do so, he assumed the following:
All possible dichotomies of H = {x1 , x2 , . . . , xN } are equiprobable.

Cover’s theorem – separability of patterns XI

He defined P(N, m1 ) as the probability that a particular dichotomy picked at random is ϕ-separable. (Stated without proof) P(N, m1 ) can be written as

P(N, m1 ) = (1/2)^(N−1) ∑_{m=0}^{m1−1} C (N − 1, m) for N > m1 − 1 (7)

and
P(N, m1 ) = 1 for N ≤ m1 − 1 (8)

From the above equations, one could say that if we maximize this probability, then there is a high chance of obtaining ϕ-separable dichotomies.
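
As an aside, here is a minimal Python sketch (not from the slides; the function name prob_separable is illustrative) that evaluates Eqs. (7)–(8) numerically:

from math import comb

def prob_separable(N: int, m1: int) -> float:
    """Probability that a random dichotomy of N points is phi-separable
    when the hidden functions provide m1 degrees of freedom (Cover, 1965)."""
    if N <= m1 - 1:                      # Eq. (8)
        return 1.0
    # Eq. (7): (1/2)^(N-1) * sum_{m=0}^{m1-1} C(N-1, m)
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(m1))

# Example: with m1 = 5, separability becomes unlikely once N grows well past 2*m1.
for N in (5, 10, 15, 20):
    print(N, round(prob_separable(N, 5), 4))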

XOR Problem I
We know that the XOR problem is not solvable by a perceptron. See Figure 2.

x1 x2 d
0 0 0
0 1 1
1 0 1
1 1 0

Figure 2: XOR problem: the four patterns

XOR Problem II

Consider two neurons in the hidden layer of the RBFNN with centers µ1 = [1, 1] and µ2 = [0, 0].
Please note that here a transformation to a higher dimension is not necessary: these two hidden-layer functions ϕ1 (·) and ϕ2 (·) can already make the problem linearly separable, as shown in Figure 4, based on the values obtained at the hidden layer using the following hidden functions.

ϕ1 (x) = exp(−||x − µ1 ||²) (9)

ϕ2 (x) = exp(−||x − µ2 ||²) (10)
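
As a quick numerical check of Eqs. (9)–(10), here is a small NumPy sketch (not part of the original slides) that computes the two hidden-function values for the four XOR patterns; it reproduces the mapping sketched in Figures 3 and 4.

import numpy as np

centers = np.array([[1.0, 1.0],    # mu_1
                    [0.0, 0.0]])   # mu_2
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

for x in patterns:
    phi = np.exp(-np.sum((x - centers) ** 2, axis=1))   # [phi_1(x), phi_2(x)]
    print(x, np.round(phi, 4))

# In the (phi_1, phi_2) plane, the patterns (0,1) and (1,0) collapse onto a single
# point, and the two classes become separable by a straight line (cf. Figure 4).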

XOR Problem III

Figure 3: Values of the hidden functions (using some distance function)

XOR Problem IV

Figure 4: After kernel transformation in the hidden layer

It should be noted that, unlike the simple XOR problem, for general problems it is quite difficult to intuit a hidden-space dimension that would serve the purpose.

Separating capacity of a surface I
Let a sequence of patterns x1 , x2 , x3 , . . . be given. Let N be a random variable (r.v.) defined as the largest integer such that the sequence x1 , . . . , xN is ϕ-separable, where ϕ has m1 degrees of freedom. Then

Pr (N = n) = P(n, m1 ) − P(n + 1, m1 ) (11)

= (1/2)^n C (n − 1, m1 − 1), n = 0, 1, 2, . . . (12)
This distribution equals the probability that k failures precede the r th success in a long, repeated sequence of Bernoulli trials – trials with a true or false outcome.
Let p and q be the probabilities of success and failure, respectively, so that p + q = 1. The negative binomial distribution is written as

f (k; r , p) = p^r q^k C (r + k − 1, k) (13)
Separating capacity of a surface II
For the special case p = q = 0.5 (success and failure are equiprobable) and k + r = n, we have

f (k; n − k, 0.5) = (1/2)^n C (n − 1, k), n = 0, 1, 2, . . . (14)

With this definition, we now see that the result described in Eq. 12 is just
the negative binomial distribution, shifted by m1 units to the right, and
with parameters m1 and 0.5. Thus, N corresponds to the “waiting time”
for the m1 -th failure in a sequence of tosses of a fair coin. The expectation
of the random variable N and its median are, respectively,

E[N] = 2m1 (15)

and
median[N] = 2m1 (16)
The expected maximum number of randomly assigned patterns (vectors)
that are linearly separable in a space of dimensionality m1 is equal to 2m1.
The interpolation problem I

Following Cover's theorem on the separability of patterns, we have seen that there are real advantages in solving a nonlinear problem by mapping the input space into a new space of high enough dimension.
Basically, a nonlinear mapping is used to transform a nonlinearly separable classification problem into a linearly separable one with high probability.
Similarly, a nonlinear filtering problem could also be transformed into a linear one by the use of this fundamental idea.
Consider then a feedforward network with an input layer, a single
hidden layer, and an output layer consisting of a single unit. The
network is designed to perform a nonlinear mapping from the input
space to the hidden space, followed by a linear mapping from the
hidden space to the output space.

The interpolation problem II

Let m0 denote the dimension of the input space. So, overall, the network maps the m0 -dimensional input space to the one-dimensional output space, written as

s : Rm0 → R1 (17)

As discussed earlier, this type of network transforms each (possibly noisy) input pattern nonlinearly at the hidden layer and then learns from this transformed pattern.
Moreover, the following two points may also be viewed:
The training phase constitutes the optimization of a fitting procedure for some separating surface, based on the available data patterns which have been presented to the network in the form of input-output examples.

The interpolation problem III
The generalization phase is synonymous with interpolation between the data points, with the interpolation being performed along the constrained separating surface generated by the fitting procedure (FFNN) as the optimum approximation to the true surface underlying the patterns.
The theory of multivariable interpolation in high-dimensional space in
its strict sense may be stated as
Given a set of N different points {xi ∈ Rm0 |i = 1, 2, . . . , N} and a
corresponding set of N real numbers (desired outputs)
{di ∈ R1 |i = 1, 2, . . . , N}, find a function F : Rm0 → R1 that satisfies
the interpolation condition:

F (xi ) = di , i = 1, 2, . . . , N (18)

So basically, we are saying that all the data patterns will lie on the interpolating surface F . This is different from an approximation function, which maps the input to the output
The interpolation problem IV
given some data. The former statement makes more sense when we visualize the underlying behavior of the data patterns provided to us.
For strict interpolation as specified here, the interpolating surface
(i.e., function F ) is constrained to pass through all the training data
points for the input-output transformation.
The radial-basis-function (RBF) technique consists of choosing a function F that has the form

F (x) = ∑_{i=1}^{N} wi ϕ(||x − xi ||) (19)

where {ϕ(||x − xi ||) | i = 1, 2, . . . , N} is a set of N arbitrary (generally nonlinear, e.g. Gaussian) functions, known as radial-basis functions, and || · || denotes a norm that is usually Euclidean. The known data
The interpolation problem V

points xi ∈ Rm0 , i = 1, 2, . . . , N, are taken to be the centers of the radial-basis functions.
Now, inserting the interpolation conditions of Eq.(18) into Eq.(19),
we obtain a set of simultaneous linear equations for the unknown
coefficients (weights) of the expansion {wi } given by
    
[ ϕ11 ϕ12 . . . ϕ1N ] [ w1 ]   [ d1 ]
[ ϕ21 ϕ22 . . . ϕ2N ] [ w2 ]   [ d2 ]
[  :    :        :  ] [ :  ] = [ :  ]      (20)
[ ϕN1 ϕN2 . . . ϕNN ] [ wN ]   [ dN ]

where ϕij = ϕ(||xi − xj ||), i, j = 1, 2, . . . , N.

The interpolation problem VI

In general form, we can write

Φw = d (21)

where d = [d1 , d2 , . . . , dN ]T is the desired-response vector and w = [w1 , w2 , . . . , wN ]T is the weight vector. Assuming that Φ is non-singular (i.e. its inverse exists), we may find the value of the parameter set w as

w = Φ−1 d (22)
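
As an illustration of Eqs. (20)–(22), the sketch below (an assumption on our part: a Gaussian radial-basis function with a common width σ; all function names are only illustrative) builds the interpolation matrix Φ and solves Φw = d.

import numpy as np

def gaussian_rbf(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))

def fit_strict_interpolation(X, d, sigma=1.0):
    """X: (N, m0) training patterns, d: (N,) desired outputs -> weight vector w."""
    # Interpolation matrix: Phi_ij = phi(||x_i - x_j||)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Phi = gaussian_rbf(dists, sigma)
    # w = Phi^{-1} d; solving the linear system is preferred over an explicit inverse
    return np.linalg.solve(Phi, d)

def predict(X_train, w, x, sigma=1.0):
    r = np.linalg.norm(x - X_train, axis=1)
    return gaussian_rbf(r, sigma) @ w

# Tiny usage example with the four XOR patterns:
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 0.])
w = fit_strict_interpolation(X, d)
print([round(float(predict(X, w, x)), 3) for x in X])   # reproduces d at the training points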

Can we be sure that the interpolation matrix Φ is non-singular?
The interpolation problem VII

Micchelli (1986) discussed this question for some of the possible radial-basis functions; his theorem is stated as follows:

Let {x1 , x2 , . . . , xN } be a set of distinct points in Rm0 . Then the N-by-N interpolation matrix Φ, whose ij-th element is ϕij = ϕ(||xi − xj ||), is non-singular.

There are a number of radial-basis functions covered by Micchelli's theorem; the following are a few examples:
Multiquadrics:

ϕ(r ) = (r ² + c ²)^(1/2) for some c > 0 and r ∈ R (23)

Inverse multiquadrics:

ϕ(r ) = 1/(r ² + c ²)^(1/2) for some c > 0 and r ∈ R (24)

The interpolation problem VIII

Gaussian functions:

ϕ(r ) = exp(−r ²/(2σ²)) for some σ > 0 and r ∈ R (25)
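
For reference, the three radial-basis functions of Eqs. (23)–(25) can be written as simple Python functions of the radius r (a sketch only; c and σ are free parameters chosen by the user):

import numpy as np

def multiquadric(r, c=1.0):
    return np.sqrt(r**2 + c**2)             # Eq. (23)

def inverse_multiquadric(r, c=1.0):
    return 1.0 / np.sqrt(r**2 + c**2)       # Eq. (24)

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))   # Eq. (25)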

The important condition for the non-singularity of the interpolation matrix Φ is that all the data points have to be distinct.

RBF Networks I

Figure 5: Structure of an RBF network, based on interpolation theory.

RBF Networks II

Input layer: consists of m0 source nodes, where m0 is the dimensionality of the input vector x.
Hidden layer: consists of the same number of computational units as the size of the training sample, i.e. N; each unit is mathematically described by the radial-basis function

ϕj (x) = ϕ(||x − xj ||), j = 1, 2, . . . , N (26)

The jth input data point xj defines the center of the r.b.f., and the
vector x is the signal (pattern/input) applied to the input layer. (So,
the links from the input to the hidden layer have no weights)
Output layer: depending on the number of classes or the type of the problem, the number of neurons in the output layer may vary. It should be noted that the number of neurons in the output layer is smaller than the number of neurons in the hidden layer.

RBF Networks III

Let us focus on the Gaussian r.b.f., so that

ϕ(x) = ϕ(||x − xj ||) = exp(−||x − xj ||²/(2σj²)), j = 1, 2, . . . , N (27)

The parameter σj is the width of the jth Gaussian function with center xj . [Typically, but not always, all the widths are taken to be the same.] – In that case, the only difference among the hidden units is the center xj .
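
A short sketch of the hidden-layer computation in Eq. (27) for a batch of patterns (illustrative names; each hidden unit j has its own center xj and width σj):

import numpy as np

def hidden_layer(X, centers, sigmas):
    """X: (P, m0) input patterns, centers: (N, m0), sigmas: (N,) -> (P, N) activations."""
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigmas[None, :] ** 2))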

What are we doing here? I

The network in Figure 5 follows the interpolation theory we have just discussed, and it is tempting to assume that it works for any set of data patterns. However, it implicitly assumes that the data patterns are not noisy at all (which, of course, may sometimes be true). But why take a chance when we are not sure of anything about the data patterns?
If the data patterns are noisy, then the results we obtain may be misleading. Can we think of a different approach? Note that noise could mean either deviating from the pattern followed by the other data points, or inherent redundancy in terms of some kind of correlation.
Whatever improvement can be made to the RBF network must come from modifying the hidden layer only, because the other two layers (input and output) are fixed beforehand.
Within the hidden layer, the major improvement that can be made is a change in the architecture – i.e. the number of neurons.
What are we doing here? II

Can we fix the number of neurons in the hidden layer to the dimensionality of the input? – Well, we can. But then we would merely be transforming the data instances into a feature space of the same dimension – which does not sit well with Cover's theorem.
Can we fix the number of hidden neurons to the number of training samples (N)? – Well, we can. But again we would be ignoring the inherent noise associated with the available data patterns in terms of correlation. Think of this case: suppose there are 10000 data patterns sampled at random. Can you guarantee that all of these data patterns are unique in some way? I guess we cannot. There could be some data patterns (say 56 samples) which fall into the same category (meaning there is some kind of correlation within these 56 data patterns). Similarly, there could be many more such groups inside the available input data patterns.

What are we doing here? III

For the reasons explained in the above point, it is always a good idea to set the number of neurons in the hidden layer to K , where K < N (see the following figure).

Figure 6: Practical RBF Network with number of hidden neurons smaller than N

What are we doing here? IV

How do we find the centroids of the K hidden neurons?
After finding the centroids, what then? – Use the centroids to compute the hidden functions in the hidden layer. We thus get the values at the hidden neurons, from which the final output y is computed using the synaptic weights.
Use an optimization procedure such as the least-squares-error method to obtain the final set of optimal synaptic weights.

Centroids of Hidden Neurons

There are various ways to find similarity among data points or patterns (or, say, K similar subgroups). The best way is to use a clustering algorithm that does not take the output into account.
The clustering could be done by employing the simple K -means algorithm.

K -means clustering I

Inputs:
– K (number of clusters) – there are various ways to choose a good K , which we will not discuss here
– The set of unlabeled examples {x1 , x2 , . . . , xN } (note that the training set does not contain d); each xi ∈ Rm0

Algorithm:
Randomly initialize K cluster centroids µ1 , µ2 , . . . , µK ∈ Rm0
Repeat {
    for i = 1 to N                /* cluster assignment */
        clustIdi = index (from 1 to K ) of the cluster centroid closest to xi (see Eq. (28))
    for k = 1 to K                /* update the centroids */
        µk = average (mean) of the points assigned to cluster k (see Eq. (29))
}
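
A compact NumPy version of the above loop is sketched below (illustrative only; in practice a library implementation such as scikit-learn's KMeans would usually be preferred):

import numpy as np

def kmeans(X, K, n_iters=100, seed=None):
    """X: (N, m0) unlabeled patterns. Returns (centroids, cluster ids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # random initialization
    for _ in range(n_iters):
        # Cluster assignment, Eq. (28): nearest centroid for each pattern
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        clust_id = np.argmin(dists, axis=1)
        # Centroid update, Eq. (29): mean of the patterns assigned to each cluster
        for k in range(K):
            members = X[clust_id == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids, clust_id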

K -means clustering II

clustIdi = argmin_k ||xi − µk ||² (28)

µk = (1/c) ∑_{j=1}^{c} xj (29)

where c is the number of patterns assigned to cluster k and the sum in Eq. (29) runs over those patterns.

After finding out K , then what to do?

After finding the K centroids, we need to find the optimal weight set (synaptic weights) from the hidden layer to the output layer.
Well, how do we get these optimal weights? – You can employ the least-squares estimation approach.
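
Putting the pieces together, the following sketch (with illustrative names, reusing the kmeans and hidden_layer sketches shown earlier and assuming a common width σ) finds the K centers, builds the hidden-layer design matrix, and estimates the output weights by least squares:

import numpy as np

def train_rbf(X, d, K, sigma=1.0):
    centers, _ = kmeans(X, K)                     # stage 1: K centers via clustering
    sigmas = np.full(K, sigma)                    # common width, for simplicity
    Phi = hidden_layer(X, centers, sigmas)        # (N, K) hidden-layer activations
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)   # least-squares estimate of the weights
    return centers, sigmas, w

def rbf_predict(x, centers, sigmas, w):
    return hidden_layer(np.atleast_2d(x), centers, sigmas) @ w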

