Radial Basis Functions Neural Network
Course: BITS F312
Tirtharaj Dash
Till now I
Till now II
The MLP can have more than one hidden layer, which makes it capable of solving nonlinearly separable problems.
Despite these advantages, the MLP has a few limitations. One such problem: the MLP can also learn outliers or noise in the training patterns, which will probably lower its robustness to noisy patterns.
Noisy patterns are those patterns that do not follow the underlying behavior of the other samples in the dataset; e.g., a set of samples may follow a particular distribution whereas a few noisy samples deviate from it.
Radial Basis Functions Networks I
Cover’s theorem – separability of patterns I
The basic idea behind RBF networks is that when an RBF network is used to perform a complex pattern-classification task, the problem is solved by first transforming it into a high-dimensional space in a nonlinear manner and then separating the classes in the output layer.
As previously mentioned, the underlying justification is found in Cover's theorem on the separability of patterns, which, in qualitative terms, may be stated as follows (Cover, 1965):
A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.
Cover’s theorem – separability of patterns II
For clarity: given a set of training data that is not linearly separable, one can, with high probability, transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some nonlinear transformation.
Why should it work? Because we know that a linearly separable classification problem is easier to solve than a nonlinearly separable one.
For separating a set of samples or patterns in the high-dimensional space, we need hypersurface(s).
Cover’s theorem – separability of patterns III
Cover’s theorem – separability of patterns IV
Cover’s theorem – separability of patterns V
Cover’s theorem – separability of patterns VI
$\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}) = 0 \qquad (4)$

The inverse image of this surface, which defines the decision boundary in the input space, is given as

$\{\mathbf{x} : \mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}) = 0\} \qquad (5)$
Cover’s theorem – separability of patterns VII
where $x_i$ is the $i$th component of the input vector $\mathbf{x}$, and $x_0$ is set to unity in order to express the equation in homogeneous form. An $r$th-order product of entries $x_i$ of $\mathbf{x}$ – that is, $x_{i_1} x_{i_2} \cdots x_{i_r}$ – is called a monomial.
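As a quick illustration of why such monomial features help, the sketch below maps the XOR inputs through the feature vector $\boldsymbol{\varphi}(\mathbf{x}) = (x_0, x_1, x_2, x_1 x_2)$ with $x_0 = 1$, and checks that a single weight vector then separates the two classes. The particular feature map and weights are assumptions chosen for illustration, not taken from the slides.

```python
import numpy as np

# XOR inputs and desired classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 1, 1, 0])

# Homogeneous monomial feature map: (x0=1, x1, x2, x1*x2)
Phi = np.column_stack([np.ones(4), X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# A hand-picked weight vector (illustrative assumption): w^T phi(x) > 0
# exactly for the class-1 patterns, so XOR is linearly separable here.
w = np.array([-0.5, 1.0, 1.0, -2.0])
print((Phi @ w > 0).astype(int))  # -> [0 1 1 0], matching d
```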
Cover’s theorem – separability of patterns VIII
Cover’s theorem – separability of patterns IX
Cover’s theorem – separability of patterns X
Cover’s theorem – separability of patterns XI
and

$P(N, m_1) = 1 \quad \text{for } N \le m_1 - 1 \qquad (8)$

From the above equations, one could say that if we maximize this probability, then there is a high chance of getting good $\varphi$-separable dichotomies.
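For reference, the counting form of this probability from Cover (1965) is $P(N, m_1) = \left(\tfrac{1}{2}\right)^{N-1} \sum_{m=0}^{m_1 - 1} \binom{N-1}{m}$. A minimal sketch of evaluating it, showing that the probability is 1 for small $N$ and equals exactly $1/2$ at $N = 2m_1$ (the median result derived later):

```python
from math import comb

def cover_prob(N, m1):
    """P(N, m1): probability that a random dichotomy of N patterns in
    general position is phi-separable with m1 degrees of freedom."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(m1))

for m1 in (2, 5, 10):
    # P = 1 for N <= m1, and exactly 0.5 at N = 2*m1
    print(m1, cover_prob(m1, m1), cover_prob(2 * m1, m1))
```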
XOR Problem I
We know that the XOR problem is not solvable by a (single-layer) perceptron. See Figure 2.

x1  x2 | d
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 0
XOR Problem II
If we consider two neurons in the hidden layer of the RBFNN with centroid
being µ1 = [1, 1] and µ2 = [0, 0].
Please note that here the transformation to a higher dimension is not
necessary. As, these two hidden layer functions ϕ1 (·) and ϕ2 (·) can make
the problem linearly separable as shown in the figure 4 as per the values
obtained at the hidden layers using the following hidden functions.
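A minimal numeric check of this claim, assuming the Gaussian hidden functions $\varphi_j(\mathbf{x}) = \exp(-\|\mathbf{x} - \mu_j\|^2)$ commonly used for this classic example (e.g., in Haykin's text); the unit width is an assumption:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 1, 1, 0])
mu1, mu2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

# Hidden-layer outputs: Gaussian of the distance to each center
phi1 = np.exp(-np.sum((X - mu1) ** 2, axis=1))
phi2 = np.exp(-np.sum((X - mu2) ** 2, axis=1))

for x, p1, p2, t in zip(X, phi1, phi2, d):
    print(x, round(p1, 4), round(p2, 4), t)

# In the (phi1, phi2) plane the class-0 points sit near (0.14, 1) and
# (1, 0.14), while both class-1 points collapse onto (0.37, 0.37), so
# the line phi1 + phi2 = 1 separates the two classes linearly.
print((phi1 + phi2 < 1).astype(int))  # -> [0 1 1 0]
```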
XOR Problem III
XOR Problem IV
Separating capacity of a surface I
Let a sequence of patterns be described as $\{\mathbf{x}_i\}_{i=1}^{N}$. Let $N$ be a random variable (r.v.) defined as the largest integer such that this sequence is $\varphi$-separable, where $\varphi$ has $m_1$ degrees of freedom.

$f(k; r, p) = \binom{r+k-1}{k} p^{r} q^{k} \qquad (13)$
Separating capacity of a surface II
For the special case $p = q = 0.5$ (successes and failures are equiprobable) and $k + r = n$,

$f(k; r, 0.5) = \binom{n-1}{k} \left(\frac{1}{2}\right)^{n} \qquad (14)$

With this definition, we now see that the result described in Eq. 12 is just the negative binomial distribution, shifted by $m_1$ units to the right, with parameters $m_1$ and $0.5$. Thus, $N$ corresponds to the "waiting time" for the $m_1$-th failure in a sequence of tosses of a fair coin. The expectation of the random variable $N$ and its median are, respectively,

$\mathbb{E}[N] = 2m_1 \qquad (15)$

and

$\text{median}[N] = 2m_1 \qquad (16)$

The expected maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality $m_1$ is equal to $2m_1$.
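A small simulation sketch of this waiting-time interpretation, using numpy's negative-binomial sampler; the fair-coin framing is from the slide, while the specific simulation setup is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m1, trials = 10, 200_000

# N = number of fair-coin tosses until the m1-th failure appears:
# m1 failures plus a NegBinomial(m1, 0.5)-distributed count of successes.
N = m1 + rng.negative_binomial(m1, 0.5, size=trials)

print(N.mean())      # close to E[N] = 2*m1 = 20
print(np.median(N))  # close to median[N] = 2*m1 = 20
```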
The interpolation problem I
The interpolation problem II
$s: \mathbb{R}^{m_0} \to \mathbb{R}^{1} \qquad (17)$
The interpolation problem III
The generalization phase is synonymous with interpolation between the data points, with the interpolation being performed along the constrained separating surface generated by the fitting procedure (the FFNN) as the optimum approximation to the true surface underlying the patterns.
The theory of multivariable interpolation in high-dimensional space, in its strict sense, may be stated as follows:
Given a set of $N$ different points $\{\mathbf{x}_i \in \mathbb{R}^{m_0} \mid i = 1, 2, \ldots, N\}$ and a corresponding set of $N$ real numbers (desired outputs) $\{d_i \in \mathbb{R}^{1} \mid i = 1, 2, \ldots, N\}$, find a function $F: \mathbb{R}^{m_0} \to \mathbb{R}^{1}$ that satisfies the interpolation condition:

$F(\mathbf{x}_i) = d_i, \quad i = 1, 2, \ldots, N \qquad (18)$
So basically, we are saying that all the data patterns must lie on the interpolating surface $F$. This is different from an approximation function, which transforms the input to the output only approximately, given some data. The former statement will make more sense when we visualize the underlying behavior among the data patterns provided to us.

The interpolation problem IV
For strict interpolation as specified here, the interpolating surface (i.e., the function $F$) is constrained to pass through all the training data points for the input-output transformation.
The radial-basis functions (RBF) technique consists of choosing a
function F that has the form
$F(\mathbf{x}) = \sum_{i=1}^{N} w_i \, \varphi(\|\mathbf{x} - \mathbf{x}_i\|) \qquad (19)$
The interpolation problem V
The interpolation problem VI
$\Phi \mathbf{w} = \mathbf{d} \qquad (21)$

$\mathbf{w} = \Phi^{-1} \mathbf{d} \qquad (22)$
The interpolation problem VII
Let $\{\mathbf{x}_i\}_{i=1}^{N}$ be a set of distinct points in $\mathbb{R}^{m_0}$. Then the $N$-by-$N$ interpolation matrix $\Phi$, whose $ij$-th element is $\varphi_{ij} = \varphi(\|\mathbf{x}_i - \mathbf{x}_j\|)$, is non-singular.

Inverse multiquadrics:

$\varphi(r) = \dfrac{1}{(r^2 + c^2)^{1/2}} \quad \text{for some } c > 0 \text{ and } r \in \mathbb{R} \qquad (24)$
The interpolation problem VIII
Gaussian functions:

$\varphi(r) = \exp\left(-\dfrac{r^2}{2\sigma^2}\right) \quad \text{for some } \sigma > 0 \text{ and } r \in \mathbb{R} \qquad (25)$
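Putting Eqs. (19), (22), and (25) together, here is a minimal sketch of strict interpolation with Gaussian radial-basis functions; the toy data and the width $\sigma = 1$ are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m0, sigma = 20, 2, 1.0
X = rng.normal(size=(N, m0))  # N distinct data points in R^{m0}
d = rng.normal(size=N)        # desired outputs d_i

# N-by-N interpolation matrix Phi_ij = phi(||x_i - x_j||), Gaussian phi (Eq. 25)
r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
Phi = np.exp(-r**2 / (2 * sigma**2))

# Distinct points => Phi is non-singular (Micchelli), so Eq. (22) applies
w = np.linalg.solve(Phi, d)

print(np.allclose(Phi @ w, d))  # True: F(x_i) = d_i for all i (Eq. 18)
```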
RBF Networks I
RBF Networks II
The $j$th input data point $\mathbf{x}_j$ defines the center of the radial-basis function, and the vector $\mathbf{x}$ is the signal (pattern/input) applied to the input layer. (So, the links from the input layer to the hidden layer carry no weights.)
Output layer: depending on the classes or the type of problem, the number of neurons in the output layer may vary. It should be noted that the number of neurons in the output layer is smaller than the number of neurons in the hidden layer.
RBF Networks III
$\varphi(\mathbf{x}) = \varphi(\|\mathbf{x} - \mathbf{x}_j\|) = \exp\left(-\dfrac{\|\mathbf{x} - \mathbf{x}_j\|^2}{2\sigma_j^2}\right), \quad j = 1, 2, \ldots, N \qquad (27)$

The parameter $\sigma_j$ is the width of the $j$th Gaussian function with center $\mathbf{x}_j$. [Typically, but not always, the widths of the Gaussians are the same.] Given that the widths are the same, the only difference among the hidden units in this case is the center $\mathbf{x}_j$.
What are we doing here? I
The figure given above follows the interpolation theory we have just discussed, and it is reasonable to assume that it works for any data patterns. However, we are implicitly assuming that the data patterns are not noisy at all (which, admittedly, could be true). But why take a chance when we are not sure of anything about the data patterns?
If the data patterns are noisy, then the results we obtain may be misleading. Can we think of a different approach? Moreover, noise could mean either deviation from the pattern that the other data points follow, or inherent redundancy in terms of some kind of correlation.
Whatever improvement can be made to the RBF network lies in modifying the hidden layer only, because the other two layers (input and output) are fixed beforehand.
Within the hidden layer, the major improvement that can be made is a change in the architecture, i.e., the number of neurons.
What are we doing here? II
What are we doing here? III
Figure 6: A practical RBF network with the number of hidden neurons smaller than N
What are we doing here? IV
Centroids of Hidden Neurons
K-means clustering I
Inputs:
– $K$ (number of clusters); there are various ways to choose a wise $K$, which we won't discuss
– The set of unlabeled examples $\{\mathbf{x}_i\}_{i=1}^{N}$ (note that the training set does not contain $d$); each $\mathbf{x} \in \mathbb{R}^{m_0}$

Algorithm (a runnable sketch of this pseudocode is given below):
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^{m_0}$
Repeat {
  for i = 1 to N  /* cluster assignment */
    clustId_i = index (from 1 to K) of the cluster centroid closest to $\mathbf{x}_i$ (see Eq. (28))
  for k = 1 to K  /* update the centroids */
    $\mu_k$ = average (mean) of the points assigned to cluster k (see Eq. (29))
}
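A minimal numpy sketch of the pseudocode above; the helper name and the convergence check are assumptions, and Eqs. (28)-(29) are taken to be the usual nearest-centroid and cluster-mean rules:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means following the pseudocode above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # random initialization
    for _ in range(iters):
        # Cluster assignment: index of the closest centroid (Eq. (28))
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        clust_id = dists.argmin(axis=1)
        # Centroid update: mean of the points assigned to each cluster (Eq. (29))
        new_mu = np.array([X[clust_id == k].mean(axis=0)
                           if np.any(clust_id == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged
            break
        mu = new_mu
    return mu, clust_id

# The centers found this way serve as the centers x_j of the hidden units.
X = np.random.default_rng(1).normal(size=(200, 2))
centers, _ = kmeans(X, K=5)
```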
K-means clustering II
After finding out K, then what to do?
After finding out $K$ (i.e., the $K$ cluster centroids), we need to find the optimal weight set (synaptic weights) from the hidden layer to the output layer.
Well, how do we get these optimal weights? You can employ the least-squares estimation approach.
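A minimal end-to-end sketch of this final step, assuming Gaussian hidden units (Eq. (27)) and using numpy's least-squares solver; the toy target, the width, and the use of a random data subset in place of the K-means centroids are assumptions for self-containment:

```python
import numpy as np

rng = np.random.default_rng(2)
N, m0, K, sigma = 200, 2, 10, 1.0
X = rng.normal(size=(N, m0))
d = np.sin(X[:, 0]) + np.cos(X[:, 1])  # toy target (assumption)

# Centers: a random subset of the data stands in here for the K-means
# centroids of the previous step (assumption, for self-containment).
mu = X[rng.choice(N, size=K, replace=False)]

# N-by-K hidden-layer matrix of Gaussian activations (Eq. (27))
r = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
Phi = np.exp(-r**2 / (2 * sigma**2))

# Least-squares estimate of the hidden-to-output weights
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)
print(np.mean((Phi @ w - d) ** 2))  # training MSE
```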