CHAPTER 4
GABOR AND NEURAL BASED FACE RECOGNITION

4.1 GABOR BASED FACE RECOGNITION


In the previous chapter, a brief description of the theory and experimental results of the PCA and FLDA techniques for face recognition was presented (Duc et al 1999). In this chapter, the theory and experimentation related to Gabor and neural network based face recognition are discussed.

Although the PCA based eigenfaces method works well for reconstruction, it does not necessarily provide the best projection for recognition. Pose, scale and illumination variations are the main problems reported for eigenfaces. The FLDA algorithm, though suitable for illumination variations, is not suitable for large databases. To overcome some of these problems, wavelets are introduced into face recognition.

A wavelet is a waveform of effectively limited duration with an average value of zero. The wavelet transform decomposes a function into components at different scales of frequency and location. By decomposing an image using the wavelet transform, the resolution of the sub-images and the computational complexity are reduced.

4.1.1 Introduction
Generally, a wavelet can be viewed as a continuous wave propagating in different directions (θ) and modulated by a Gaussian envelope with different frequencies (f). Gabor filters are the most popular choice for automatic face recognition systems, motivated by their computational properties and biological relevance. Gabor filters are a powerful tool in image processing, since spatial localization, spatial frequency selectivity and orientation selectivity are their main properties. The main characteristic of Gabor wavelets is their ability to provide a multiresolution analysis of the image in the form of coefficient matrices. Gabor wavelets are applied at the fiducial points on faces in order to take more features into account for better recognition (Baochang Zhang et al 2009). An image can be represented by the Gabor wavelet transform (Ranganath and Arun 1997), allowing the description of both the spatial frequency structure and the spatial relations.
Gabor wavelets seem to be the optimal basis to extract local features for pattern recognition for several reasons:

1) Biological motivation: the shapes of Gabor wavelets are similar to the receptive fields of simple cells in the primary visual cortex.

2) Mathematical motivation: the Gabor wavelets are optimal for measuring local spatial frequencies.

3) Empirical motivation: Gabor wavelets have been found to yield distortion tolerant feature spaces for a number of pattern recognition tasks, including texture segmentation, character recognition, and fingerprint recognition.

4.1.2 Theory and Design of Gabor Wavelets


A Gabor wavelet is a linear filter used in image processing. Gabor wavelets are self-similar, i.e., all filters can be generated from one mother wavelet by dilation and rotation. 2D Gabor functions (Hossein Sahoolizadeh et al 2008) are well suited to enhancing edge contours, as well as valley and ridge contours of the image. This corresponds to enhancing the eye, mouth and nose edges, which are considered the most important points on a face. Moreover, such an approach also enhances moles, dimples, scars, etc. Hence, by using such enhanced points as feature locations, a feature map for each facial image can be obtained and each face can be represented with its own characteristics without any initial constraints. Having feature maps specialized for each face makes it possible to keep overall face information while enhancing local characteristics (Zhang B et al 2007).

Gabor wavelets are used to extract facial appearance changes as a set of multiscale and multi-orientation coefficients. They are shown to be robust against noise and changes in illumination for all facial patterns. The common approach when using Gabor filters (Chengjun Liu 2002) for face recognition is to construct a filter bank with filters of different scales and orientations and to filter the given face image. A well-designed Gabor filter bank can capture the relevant frequency spectrum in all directions.

This method is based on selecting peaks (highly energized points) of the Gabor wavelet responses as feature points. Detected feature points, together with their locations, are stored as feature vectors. The feature vector consists of all useful information extracted from different frequencies, orientations and locations, and is hence very useful for expression recognition. Feature vectors are generated by sampling the wavelet responses of the facial images at specific nodes.

Initially, the image of the face must be converted to wavelets known as jets. These jets are localized wavelets and are focused on a specific area of an image. In this way, individual jets can be created for the eyes, nose, mouth, and other facial features. Varying the orientation θ changes the sensitivity to edge and texture orientations, while varying the frequency f changes the scale at which the image is viewed. Here, the most adequate combinations of θ and f are considered to represent the particular features of a face for the recognition task.

Each of these Gabor filters is convolved with the input image, resulting in forty filtered copies of the face image. To encompass all the features produced by the different Gabor kernels, the resulting Gabor wavelet features are concatenated to derive an augmented Gabor feature vector. Then, in order to reduce the dimensionality of the feature vector, both PCA and FLDA are implemented.
The Gabor wavelet (GW) filter works as a band pass filter for the local spatial frequency distribution, achieving an optimal resolution in both the spatial and frequency domains. The 2D Gabor filter ψ_{f,θ}(x, y) can be represented as a complex sinusoidal signal modulated by a Gaussian kernel function as follows:

ψ_{f,θn}(x, y) = exp( -(1/2) ( x_θn² / σ_x² + y_θn² / σ_y² ) ) exp( j 2π f x_θn )                (4.1)

where x_θn = x cos θn + y sin θn and y_θn = -x sin θn + y cos θn. Here σ_x and σ_y are the standard deviations of the Gaussian envelope along the x and y dimensions, f is the central frequency of the sinusoidal plane wave and θn is the orientation. The rotation of the x-y plane by an angle θn will result in a Gabor filter at the orientation θn. The angle θn is defined by:

θn = π (n - 1) / p,    for n = 1, 2, ..., p and p ∈ N                (4.2)

where p denotes the number of orientations. The design of Gabor filters is accomplished by tuning the filter to a specific band of spatial frequency and orientation by appropriately selecting the filter parameters, namely the spread of the filter (σ_x, σ_y), the radial frequency f and the orientation of the filter θn. The important issue in the design of Gabor filters for face recognition is the choice of these filter parameters. The Gabor representation of a face image is computed by convolving the face image with the Gabor filters. Let f(x, y) be the intensity at the coordinate (x, y) in a gray scale face image; its convolution with a Gabor filter ψ_{f,θ}(x, y) is defined as:

g_{f,θ}(x, y) = f(x, y) * ψ_{f,θ}(x, y)                (4.3)

where * denotes the convolution operator.


The response to each Gabor kernel filter representation is a complex function with a real part R{g_{f,θ}(x, y)} and an imaginary part J{g_{f,θ}(x, y)}. The magnitude response ||g_{f,θ}(x, y)|| is expressed by

||g_{f,θ}(x, y)|| = sqrt( R²{g_{f,θ}(x, y)} + J²{g_{f,θ}(x, y)} )                (4.4)
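As an illustration of equations (4.3) and (4.4), the magnitude response can be obtained by convolving the image with a complex Gabor kernel and taking the modulus. The following Python sketch is only an illustration under assumed array shapes, not the implementation used in this work; the frequency-domain product is used merely as a convenient shortcut for the convolution.

import numpy as np

def gabor_magnitude(image, kernel):
    # Convolution of the face image with one complex Gabor kernel, equation (4.3),
    # followed by the magnitude response of equation (4.4).
    rows = image.shape[0] + kernel.shape[0] - 1
    cols = image.shape[1] + kernel.shape[1] - 1
    g = np.fft.ifft2(np.fft.fft2(image, (rows, cols)) * np.fft.fft2(kernel, (rows, cols)))
    g = g[:image.shape[0], :image.shape[1]]      # crop back to the image size (no centering in this sketch)
    return np.sqrt(g.real ** 2 + g.imag ** 2)    # ||g|| = sqrt(R^2 + J^2)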

This work uses the magnitude response ||g_{f,θ}(x, y)|| to represent the features. To reduce the influence of the lighting conditions, the output of the Gabor filter in each direction has been normalized. This work organizes 40 Gabor channels consisting of eight orientation parameters (θn = 0, π/8, 2π/8, ..., 7π/8, following equation (4.2) with p = 8) and five spatial frequencies. The Gabor wavelets are scale invariant, and the statistics of the image must remain constant as one magnifies any local region of the image.

Figure 4.1 illustrates the convolution result of a face image with a Gabor filter. Here, a 2D Gabor filter is expressed as a Gaussian modulated sinusoid in the spatial domain and as a shifted Gaussian in the frequency domain.

Figure 4.1 Network architecture of Gabor based filter
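To illustrate equations (4.1) and (4.2), the sketch below builds a 40-filter bank of 8 orientations and 5 spatial frequencies. The kernel size, the σ values and the frequency list are assumed values chosen only for illustration, not the parameters used in this work.

import numpy as np

def gabor_kernel(f, theta, sigma_x=4.0, sigma_y=4.0, size=31):
    # 2D Gabor kernel of equation (4.1): a complex sinusoid modulated by a
    # Gaussian envelope, with the coordinates rotated by the orientation theta.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    y_theta = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * (x_theta ** 2 / sigma_x ** 2 + y_theta ** 2 / sigma_y ** 2))
    return envelope * np.exp(2j * np.pi * f * x_theta)

# 40-channel bank: 8 orientations (equation (4.2) with p = 8) and 5 frequencies.
orientations = [np.pi * n / 8 for n in range(8)]
frequencies = [0.05, 0.08, 0.12, 0.18, 0.25]     # example values only
bank = [gabor_kernel(f, th) for f in frequencies for th in orientations]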

4.1.3 Gabor Wavelet Representation of Faces


The feature extraction algorithm for the proposed GW based face recognition has two main steps: feature point localization and feature vector computation (Lee 1996). In the first step, feature vectors are extracted from points with high information content on the face image. In most feature-based methods, the facial features are assumed to be the eyes, nose and mouth (Yousra Ben Jemaa and Sana Khanfir 2009). The number of feature vectors and their locations can vary in order to better represent the diverse facial characteristics of different faces, such as pimples, moles, etc., which are also features that people might use for recognizing faces.

In this work, a face image is convolved with Gabor filters of five spatial frequencies and eight orientations, so that the whole frequency spectrum, both amplitude and phase, is captured, as shown in Figure 4.2.

Figure 4.2 Flowchart of the feature extraction stage of the facial images
From the responses of the face image to the Gabor filters, peaks are found by searching the locations in a window W0 of size (w × w) using the following procedure:
A feature point is located at (x0, y0) if

R_j(x0, y0) = max_{(x, y) ∈ W0} R_j(x, y)                (4.5)

R_j(x0, y0) > (1 / (N1 N2)) Σ_{x=1}^{N1} Σ_{y=1}^{N2} R_j(x, y)                (4.6)

j = 1, ..., 40, where R_j is the response of the face image to the jth Gabor filter, N1 and N2 are the dimensions of the face image, and the center of the window W0 is at (x0, y0). The window size w must be chosen small enough to capture the important features and large enough to avoid redundancy.
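A minimal Python sketch of the peak search of equations (4.5) and (4.6), assuming the Gabor magnitude response is available as a 2D array; the window size is a placeholder value.

import numpy as np

def feature_points(response, w=9):
    # Peak search of equations (4.5) and (4.6): a pixel is a feature point if it is
    # the maximum of the w x w window centred on it and exceeds the mean response.
    n1, n2 = response.shape
    mean_resp = response.mean()
    half = w // 2
    points = []
    for x0 in range(half, n1 - half):
        for y0 in range(half, n2 - half):
            window = response[x0 - half:x0 + half + 1, y0 - half:y0 + half + 1]
            if response[x0, y0] == window.max() and response[x0, y0] > mean_resp:
                points.append((x0, y0))
    return points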
4.1.4 Feature Vector Generation


Feature vectors are generated at the feature points as a composition of Gabor wavelet transform coefficients. Here the kth feature vector of the ith reference face is defined as

v_{i,k} = { x_k, y_k, R_{i,j}(x_k, y_k) }                (4.7)

The first two components in equation (4.7) represent the location of the feature point by storing its (x, y) coordinates. After the feature vectors are constructed from the test image, they are compared to the feature vectors of each reference image in the database.
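A minimal sketch of the feature vector construction of equation (4.7), assuming the 40 magnitude responses and the detected feature points from the previous step are available; the function name is illustrative.

import numpy as np

def feature_vectors(responses, points):
    # Feature vectors of equation (4.7): the (x, y) location of each feature
    # point followed by the 40 Gabor magnitude responses sampled at that point.
    vectors = []
    for (x, y) in points:
        coefficients = [r[x, y] for r in responses]   # one value per Gabor channel
        vectors.append(np.array([x, y] + coefficients))
    return vectors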
4.1.5 Algorithm for Gabor Wavelets


STEP 1: Get the image as the parameter of the function im2vec.
STEP 2: Load the Gabor filters.
STEP 3: Adjust the window histogram.
STEP 4: Find the feature matrix.
STEP 5: Change the matrix into a vector.


Generally, it is difficult to deal with a high dimensional image space. So this GW method is used to reduce the space dimension by down-sampling each G(u, v) and concatenating its rows to form a 1D feature vector. This algorithm is tested for the task of identification using a neural network classifier, as explained in Figure 4.3.
Figure 4.3 Flow chart of Gabor based face recognition (input query → im2vec(image) → load Gabor filters → adjust the window histogram → find the feature matrix → form the image vector → call the neural network program)
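A minimal Python sketch of the vectorization step described above; the down-sampling step size is an assumed value, and the scaling to [-1, 1] follows the requirement on the network input stated in Section 4.1.6.

import numpy as np

def im2vec(responses, step=4):
    # Down-sample each Gabor response G(u, v) and concatenate the rows into a
    # single 1D feature vector, then scale it to [-1, 1] for the network input.
    parts = [np.abs(g)[::step, ::step].ravel() for g in responses]
    vec = np.concatenate(parts)
    return 2 * (vec - vec.min()) / (vec.max() - vec.min() + 1e-12) - 1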


To reduce the dimensionality of the vector space and obtain more useful features for subsequent pattern discrimination and associative recall, the FLDA technique is used here. The results clearly show that Gabor filters improve the performance over raw image data at the operating points corresponding to a high false acceptance ratio (FAR).

4.1.6 Experimentation and Results of Gabor Wavelets


The input of the function is a 50 × 50 window, which is the resized version of the test image whose actual size is 320 × 243. At first the function adjusts the histogram of the window. Then, to convolve the window with the Gabor filters, the window in the frequency domain is multiplied by the Gabor filters. The Gabor filters are loaded and the window histogram is adjusted; the parameters are set by trial and error. The numbers in the input vector of the neural network should be between -1 and 1. For this, a feature matrix of size 45 × 48 is formed. The matrix of the image is then converted into an image vector of size 2160 × 1 by reshaping. The input query image shown in Figure 4.4 is resized into a matrix of size 50 × 50.

Figure 4.4 Query image applied to Gabor wavelets


For this image matrix, the image vector produced by the Gabor filter is

Gabor Vector = [ 0.9561  0.9028  0.8575  ...  0.4252  0.5379  0.7272 ]ᵀ

The Gabor wavelet technique has recently been used not only for face recognition, but also for face tracking and face position estimation. Thus this approach not only reduces the computational complexity, but also improves the performance in the presence of occlusions. For a given input image, the Gabor filters formed are shown in Figure 4.11(c).
4.1.7 Summary
The Gabor wavelet provides an optimized resolution in both the time and frequency domains for time-frequency analysis. It preserves the neighborhood relationships between pixels and performs better than traditional approaches in terms of recognition rate and accuracy. It is also easy to update and is invariant to homogeneous illumination changes, rotation and scale.

Despite the success of Gabor wavelet based face recognition systems, both the feature extraction process and the huge dimension of the extracted Gabor features demand large computation and memory costs, which makes them impractical for real applications. The method is also affected by complex backgrounds. Another limitation of Gabor wavelets is that the time for Gabor feature extraction is very long and the feature dimension is prohibitively large.
4.2 NEURAL NETWORKS (NN)


A neural network is a powerful data modeling tool that is able to capture complex input/output relationships. Neural network technology stemmed from the effort to develop an artificial system that imitates biological neural processing. The network is composed of a large number of highly interconnected processing elements, called neurons, working in parallel to solve a specific problem.

An artificial neural network is a computing system that consists of a collection of artificial neurons connected with each other. An artificial neuron simulates the behaviour of a biological neuron. The essence of the training algorithm is that various patterns are presented to the inputs of a simple neuron. The neural element transforms the input signals into an output signal; the latter is compared with the expected result, and if the real output does not coincide with the expected one, the weights are corrected. The samples are presented one by one until the result is satisfactory.
4.2.1 Introduction
A boosting learning process is used to reduce the feature dimensions and make the Gabor feature extraction process substantially more efficient (Daugman 1988). Combining optimized Gabor features with neural networks (Rowley et al 1996) not only reduces the computation and memory cost of the feature extraction process, but also achieves very accurate recognition performance. In practice, the training process in a neural network does not consist of a single call to a training function. Instead, the network is trained several times on various noisy images (Hutchinson and Welsh 1989).

In the previous chapters, PCA and LDA based face reconstruction and discrimination were carried out effectively. However, the classification of face and non-face images was not performed, and the images are reconstructed even for a non-face input such as a rose, as shown in Figure 4.5. In this work, neural networks (Agui et al 1992) effectively classify face and non-face images using the BPNN algorithm.


Figure 4.5 Image reconstructions by PCA

4.2.2 Theory of NN for Face Recognition


Neural networks are particularly effective for predicting events when they have a large database of previously stored examples. Neural networks can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. Neural networks exhibit the ability of adaptive learning (Hutchinson and Welsh 1989), which is the ability to learn how to do tasks based on the data given for training or initial experience.

To reduce complexity, the neural network (Jahan Zeb et al 2007) is often applied in the face recognition phase rather than in the feature extraction phase. The network is initialized with random weights at first, and the data is then fed into the network. As each data sample is tested, the result is checked. The square of the difference between the expected and actual result is calculated, and this value is used to adjust the weights of each connection accordingly. The accuracy of neural networks is mostly a function of the size of their training set rather than their complexity. The procedure for face recognition using a neural network is shown in Figure 4.6.

Figure 4.6 Face recognition using neural networks


Gradient descent with momentum and adaptive learning, with the Back Propagation Neural Network (BPNN) learning algorithm, has been used to implement supervised learning in such a way that both the inputs and the corresponding outputs are provided at the time of training the network. Thus, inherent clustering and optimized learning of the weights provide efficient and better results.
4.2.3 Back-Propagation Neural Network


A neural network is a good tool for classification purposes. It can approximate almost any regularity between its input and output. The delta rule is often utilized by the most common class of ANNs, called back propagation neural networks. The NN weights are adjusted by a supervised training procedure called back propagation. Back propagation performs a gradient descent within the solution's vector space towards a global minimum. The flow chart of the BPNN algorithm used to identify whether a given image is a face or a non-face is shown in Figure 4.7.
Figure 4.7 Flow chart for neural network based face recognition (load the new neural network → train the network → find the index for the input vector; if the result > 0.1 the given sample is a face and F = 1, otherwise it is a non-face and F = 0; if F = 1 the PCA program is called)
Back propagation is a kind of gradient descent method which searches for an acceptable local minimum in the NN weight space in order to achieve a minimal error. In principle, NNs can compute any computable function, i.e., they can do everything a normal digital computer can do (Kurita et al 2003). Almost any mapping between vector spaces can be approximated to arbitrary precision by feed forward neural networks (Lin Shang-Hung et al 1997).
4.2.4 BPNN Algorithm
STEP 1: Load the new neural network using the MATLAB load function.
STEP 2: Call the function sim, passing the new neural network and the image vector as parameters.
STEP 3: Train the neural network.
STEP 4: Obtain the return variable in result.
STEP 5: If result is greater than 0.1, then print that the given image is a face.
STEP 6: Set F = 1.
STEP 7: Else print that the given image is a non-face.
STEP 8: Set F = 0.
STEP 9: If F = 1, call the PCA program.
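A minimal Python sketch of the face/non-face decision in steps 5 to 8; the 0.1 threshold comes from the text, while the function name is purely illustrative.

def classify(result, threshold=0.1):
    # Face / non-face decision: compare the network output against the
    # fixed threshold and set the flag F accordingly.
    if result > threshold:
        return 1    # F = 1: the given sample is a face
    return 0        # F = 0: the given sample is a non-face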
The BPNN algorithm involves two phases. During the first phase, the input vector is presented and propagated forward through the network to compute the output value o_k for each output unit. This output is compared with its desired value, resulting in an error signal δ for each output unit. The second phase involves a backward pass through the network, during which the error signal is passed to each unit in the network and the appropriate weight changes are calculated.

The learning process in back propagation requires providing pairs of input and target vectors. The output vector o for each input vector is compared with the target vector t. In case of a difference between the two, the weights are adjusted to minimize the difference. Initially, random weights and thresholds are assigned to the network. These weights are updated in every iteration in order to minimize the cost function, i.e. the mean square error between the output vector and the target vector. The BPNN algorithm applied in face recognition is shown in Figure 4.8.

Figure 4.8 Back propagation neural networks algorithm

The input for the hidden layer is given by

net_m = Σ_{z=1}^{n} x_z w_mz                (4.8)

The units of the output vector of the hidden layer, after passing through the activation function, are given by

h_m = 1 / (1 + exp(-net_m))                (4.9)

In the same manner, the input for the output layer is given by

net_k = Σ_{z=1}^{m} h_z w_kz                (4.10)

and the units of the output vector of the output layer are given by

o_k = 1 / (1 + exp(-net_k))                (4.11)

For updating the weights, the error needs to be calculated. This can be done by

E = (1/2) Σ_{i=1}^{l} (o_i - t_i)²                (4.12)

If the error is smaller than a predefined limit, the training process stops; otherwise the weights need to be updated. Each hidden unit sums its delta inputs from the layer above, multiplied by the derivative of its activation function; it also computes its own weight correction term and its bias correction term. Each output unit updates its weights and bias. Each training cycle is called an epoch and the weights are updated in each cycle. It is not analytically possible to determine where the global minimum is. Eventually the algorithm stops at a low point, which may just be a local minimum.
For the weights between the hidden layer and the output layer, the change in weights is given by

Δw_ij = η δ_i h_j                (4.13)

where η is a learning rate coefficient that is restricted to the range [0.01, 1.0]. The learning coefficient η controls the size of a step against the direction of the gradient. If η is too small, learning is slow; if it is too large, the process of error minimization can be oscillatory. Here, h_j is the output of neuron j in the hidden layer and δ_i can be obtained by

δ_i = (t_i - o_i) o_i (1 - o_i)                (4.14)

o_i and t_i represent the actual output and the target output at neuron i in the output layer, respectively. Similarly, the change of the weights between the input layer and the hidden layer is given by

Δw_ij = η δ_Hi x_j                (4.15)

where η is a training rate coefficient that is restricted to the range [0.01, 1.0] and x_j is the output of neuron j in the input layer. A hidden unit receives a delta from each output unit equal to the delta of that output unit weighted by the weight of the connection between those units. δ_Hi can be obtained by

δ_Hi = x_i (1 - x_i) Σ_{j=1}^{k} δ_j w_ij                (4.16)

x_i is the output at neuron i in the input layer, and the summation term represents the weighted sum of all δ values corresponding to the neurons in the output layer. After calculating the weight change in all layers, the weights can simply be updated by

w_ij(new) = w_ij(old) + Δw_ij                (4.17)

The process of updating the hidden units is repeated for each instance in the training set until the error for the entire system is acceptably low, or the predefined number of iterations is reached. The given image is identified as a face or a non-face according to the value of the error E as per equation (4.12).
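A minimal Python sketch of one training step implementing equations (4.8) to (4.17) for a single hidden layer; bias terms are omitted for brevity, and this is an illustration rather than the MATLAB implementation used in this work.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpnn_step(x, t, W_hid, W_out, eta=0.4):
    # One forward/backward pass following equations (4.8)-(4.17).
    net_h = W_hid @ x                             # equation (4.8)
    h = sigmoid(net_h)                            # equation (4.9)
    net_o = W_out @ h                             # equation (4.10)
    o = sigmoid(net_o)                            # equation (4.11)
    E = 0.5 * np.sum((o - t) ** 2)                # equation (4.12)
    delta_o = (t - o) * o * (1 - o)               # equation (4.14)
    delta_h = h * (1 - h) * (W_out.T @ delta_o)   # equation (4.16)
    W_out = W_out + eta * np.outer(delta_o, h)    # equations (4.13) and (4.17)
    W_hid = W_hid + eta * np.outer(delta_h, x)    # equations (4.15) and (4.17)
    return W_hid, W_out, E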
An effective way to increase the learning rate is to modify the delta rule by including a momentum term:

Δw(N+1) = m Δw(N) - η ∇E(w(N))                (4.18)

where m is a positive constant, 0 ≤ m < 0.9, termed the momentum constant; this is called the generalized delta rule. The effect is that if the basic delta rule consistently pushes a weight in the same direction, then it gradually gathers "momentum" in that direction. Including the momentum term smoothes the weight changes and amplifies the learning rate, causing faster convergence and enabling the search to escape from small local minima on the error surface.
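The generalized delta rule of equation (4.18) can be sketched as a one-line update; the default η and m values here are only examples.

def momentum_update(delta_w_prev, grad, eta=0.4, m=0.9):
    # Equation (4.18): the new weight change combines the previous
    # change (momentum) with the current gradient step.
    return m * delta_w_prev - eta * grad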
The feature representation vectors from PCA and LDA are then used to train the weighting factors in the combined neural networks. Gradient descent, also called steepest descent, is one of the algorithms developed for non-linear optimization problems. The BPNN algorithm moves in the space of the weight variables in the direction opposite to the gradient of the minimized function.

A large number of neurons in the hidden layer can give a high generalization error due to over-fitting and high variance. On the other hand, with too few neurons, a high training error and a high generalization error are obtained due to under-fitting and high statistical bias. Over-fitting is the phenomenon that, in most cases, a network gets worse instead of better after a certain point during training when it is trained to as low an error as possible.

4.2.5 Experimental Results of Gabor Based BPNN


In this work, the BPNN algorithm shows a remarkable capacity to learn from the input data. The various parameters assumed for this network are as follows:

No. of input units       = 1 feature vector
No. of hidden neurons    = 70
No. of output units      = 1
Learning rate            = 0.4
No. of epochs            = 400
Optimum value of goal    = 0.01
Momentum                 = 0.9
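A minimal sketch of how these settings could drive the training step sketched in Section 4.2.4; the random data, the initial weight ranges and the stopping rule are placeholders, and the momentum term of equation (4.18) is omitted here.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2160, 70, 1                # 2160 x 1 Gabor vector, 70 hidden neurons, 1 output
W_hid = rng.uniform(-0.1, 0.1, (n_hidden, n_in))   # random initial weights (placeholder range)
W_out = rng.uniform(-0.1, 0.1, (n_out, n_hidden))

goal, epochs, eta = 0.01, 400, 0.4
x = rng.uniform(-1, 1, n_in)                       # placeholder Gabor feature vector in [-1, 1]
t = np.array([1.0])                                # target output: 1 for a face sample

for epoch in range(epochs):
    # bpnn_step is the training-step sketch from Section 4.2.4
    W_hid, W_out, E = bpnn_step(x, t, W_hid, W_out, eta)
    if E < goal:                                   # stop once the goal error is reached
        break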

The output of the Gabor wavelet stage is an image vector of size 2160 × 1. A new neural network is loaded using the MATLAB special function load newnet. This neural network and the input image vector are sent as parameters to the function sim, and the index for this input image vector is found. If this index is positive, the given image is declared a face; if the index value is negative or zero, the given image is declared a non-face. If it is a face, the FLDA program is called.

Face and non-face images are given as in Figures 4.9 and 4.10, and the results are as follows:
TrainDatabasePath = C:\Desktop\proj(2010) 22.4\PCA_pgm\TrainDatabase1
TestDatabasePath = C:\Desktop\proj(2010) 22.4\PCA_pgm\non-face


Figure 4.9 A non-face query image


Result = -0.8227; the given sample image is a non-face.
TrainDatabasePath =C:\Desktop\Proj(2010)22.4\PCA_pgm\TrainDatabase1
TestDatabasePath = C:\Desktop\proj(2010)22.4\PCA_pgm\TestDatabase1

Figure 4.10 A Face query


Result = 1.2707; the given sample image is a face and is identified as 33.jpeg.

Experiments are carried out on different face images of the Yale database using BPNN, and the results are presented in parts (a) to (e) of Figure 4.11.

Figure 4.11 Images of Gabor based neural network (panels (a) to (e))
4.2.6 Advantages, Disadvantages and Applications of Neural Networks
High accuracy (more than 90 % recognition rate), ease of implementation and reduced execution time are the main advantages of neural network based face recognition. Neural networks are more flexible for solving non-linear tasks. As a gradient based method is applied, some inherent problems such as slow convergence and escaping from local minima are encountered here.

In practice, NNs are especially useful for classification and approximation problems when rules such as those that might be used in an expert system cannot easily be applied. NNs are, at least today, difficult to apply successfully to problems that concern the manipulation of symbols and memory.

Comparisons of PCA, FLDA and NN based face recognition on different databases for 400 images are presented in Table 4.1 and Figure 4.12.
Table 4.1 Comparison of recognition rate (%) for PCA, FLDA and BPNN algorithms

No. of images    PCA    FLDA    BPNN
50               89     92      94
100              86     88      90
200              83     86      88
300              80     83      86
400              75     79      82

Figure 4.12 Comparison of recognition rate for PCA, FLDA and BPNN algorithms

4.3 CASCADE CORRELATION NEURAL NETWORKS (CCNN)

The Cascade-Correlation learning network algorithm was developed in an attempt to overcome the problem of time complexity in the popular back-propagation learning algorithm. The CCNN algorithm (Fahlman and Liebiere 1990) not only trains a neural network but also dynamically builds the network architecture. In this network, the number of hidden layers is not assigned in advance, but is determined during the process of learning. This means that the topology of a cascade-correlation neural network depends only on the task being solved and on the nature of the data forwarded to the network inputs.

Cascade-Correlation is a new architecture and supervised learning algorithm for artificial neural networks. Instead of just adjusting the weights in a network of fixed topology, Cascade-Correlation begins with a minimal network, then automatically trains and adds new hidden units one by one, creating a multi-layer structure. Once a new hidden unit has been added to the network, its input-side weights are frozen. This unit then becomes a permanent feature detector in the network, available for producing outputs or for creating other, more complex feature detectors.

The idea behind the cascade-correlation architecture is to build the architecture by adding new neurons together with their connections to all the inputs as well as to the previous hidden neurons, and to train each newly created neuron by fitting its weights so as to minimize the residual error of the network.
4.3.1 Cascade Correlation Network Architecture


A cascade correlation network (Feraud et al 2001) consists of a cascade architecture, in which hidden neurons are added to the network one at a time and do not change after they have been added. It is called a cascade because the output from all neurons already in the network feeds into the new neurons. As new neurons are added to the hidden layer, the learning algorithm attempts to maximize the magnitude of the correlation between the new neuron's output and the residual error of the network, which is to be minimized. A cascade correlation neural network has three layers: input, hidden and output.

Input layer: A vector of predictor variable values (x1 ... xp) of the given image is presented to the input layer. The input neurons perform no action on the values other than distributing them to the neurons in the hidden and output layers. In addition to the predictor variables, there is a constant input of 1.0, called the bias, that is fed to each of the hidden and output neurons. The bias is multiplied by a weight and added to the sum going into the hidden neuron.
Hidden Layer: Arriving at a neuron in the hidden layer, the value
from each input neuron is multiplied by a weight, and the resulting weighted
values are added together producing a combined value. The weighted sum is
fed into a transfer function, which outputs a value. The outputs from the
hidden layer are distributed to the output layer.
Output Layer: Each output neuron receives values from all of the
input neurons and all the hidden layer neurons, with the bias values. Each
value presented to the output neuron is multiplied by a weight, and the
resulting weighted values are added together producing a combined output
value. The weighted sum is fed into a transfer function, which outputs a final
value for classification. For regression problems, a linear transfer function is
used in the output neurons. But for classification problems, there is a neuron
for each category of the target variable and a sigmoid transfer function is used.

4.3.2 Cascade-Correlation Learning Algorithm


Cascade-Correlation (David DeMers and Cottrell 1993) combines two key ideas. The first is the cascade architecture, in which hidden units are added to the network one at a time and do not change after they have been added. The second is the learning algorithm, which creates and installs the new hidden units. For each new hidden unit, an attempt is made to maximize the magnitude of the correlation between the new unit's output and the residual error signal. The training steps for the CCNN algorithm are as follows:
Step 1: Initiate a cascade correlation neural network with only the input and output layer neurons and no hidden layer neurons. Train the initial net until the mean square error E reaches a minimum.

Step 2: A hidden candidate node is installed. Initialize its weights and learning constants.

Step 3: The hidden candidate node is trained. Stop when the correlation between its output and the network output error is maximized.

Step 4: The hidden candidate unit is added to the main net, i.e. its weights are frozen and it is connected to the other hidden units and to the network outputs.

Step 5: The main net, which now includes the hidden unit, is trained. Stop when the minimum mean square error is reached.

Step 6: Another hidden unit is added. Repeat steps 2-5 until the mean square error value is acceptable.


The cascade-correlation learning algorithm exemplifies supervised learning. While learning, it constructs the minimal network, that is, a network with the minimal possible number of hidden layers. Learning starts when the network is minimal, i.e. when there is an input layer, an output layer and no hidden layers. For learning, an algorithm is used that minimizes the value of the network output error E. Every input is connected to every output neuron by a connection with an adjustable weight, as shown in Figure 4.13.

Figure 4.13 CCNN with No hidden units

The input and the output neurons are linked by a weight value.
Values on a vertical line are added together after being multiplied by their
weights. Every input is connected to every output unit by a connection with
an adjustable weight. There is also a bias input, permanently set to +1. The
output units may just produce a linear sum of their weighted inputs, or they
may employ some non-linear activation function. So each output neuron
receives a weighted sum from all of the input neurons including the bias. The
cascade architecture with one hidden unit is shown in Figure 4.14.


Figure 4.14 CCNN with one hidden unit


The output neuron sends this weighted input sum through its transfer function to produce the final output. Even a simple cascade correlation network with no hidden neurons has considerable predictive power. For a fair number of problems, a cascade correlation network with just input and output layers provides excellent predictions. After the addition of the first hidden neuron, the network has the structure shown in Figure 4.14. The input weights of the hidden neuron are shown as square boxes to indicate that they are fixed once the neuron has been added, while the weights of the output neurons, shown as crosses, remain adjustable. The cascade architecture with two hidden units is illustrated in Figure 4.15.

Each new hidden unit receives a connection from each of the network's original inputs and also from every pre-existing hidden unit. The hidden unit's input weights are frozen at the time the unit is added to the net; only the output connections are trained repeatedly. Each new unit therefore adds a new one-unit layer to the network, unless some of its incoming weights happen to be zero. This leads to the creation of very powerful high-order feature detectors; it may also lead to very deep networks and a high fan-in to the hidden units.

Figure 4.15 CCNN with two hidden units


Network learning is considered complete when the convergence of the network is achieved, that is, when the value of the error stops changing, or when the value of the error is sufficiently small and does not exceed a previously set maximal error value. If the error value does not meet these requirements, learning is continued. For this, a new hidden node is added to the network. This node is called a candidate node and its output is not activated in the main network at this stage. After the new hidden node is added, all patterns of the training sample are passed through this node. The candidate node learns, that is, its weights are revised. The aim of the candidate node weight correction is to maximize the value of the correlation C between the output of the candidate node and the network output error:

C = Σ_o | Σ_p (y_p - ȳ)(e_op - ē_o) |                (4.19)

where ȳ and ē_o are the mean values of the candidate output and of the output error at output o over all patterns p of the training sample.
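A minimal sketch of the correlation measure of equation (4.19) for one candidate unit; the array shapes are assumptions made only for illustration.

import numpy as np

def candidate_correlation(v, e):
    # Equation (4.19): v holds the candidate unit's output for every training
    # pattern (shape: n_patterns); e holds the residual network error
    # (shape: n_patterns x n_outputs).
    v_centred = v - v.mean()
    e_centred = e - e.mean(axis=0)
    # sum over outputs of |sum over patterns of (y_p - y_bar)(e_op - e_bar_o)|
    return np.sum(np.abs(v_centred @ e_centred))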
After learning, the candidate node is added to the main net. The weights of this added node are frozen. The output of this node, in its turn, can either be forwarded to the output of the main net or serve as one of the inputs for the hidden units. The hidden nodes added one by one thus form the cascade architecture shown in Figure 4.16.

Figure 4.16 Cascade architecture of CCNN (input units feeding hidden unit 1, hidden unit 2 and the output units)


During the process of cascade-correlation network learning, gradient descent and gradient ascent are used, respectively, to minimize the value of the error E and to maximize the value of the correlation C (Fahlman and Liebiere 1990). For the error E, the gradient is computed as

∂E/∂w_oi = Σ_p e_op X_ip                (4.20)

where e_op = (y_op - t_op) f'_p.


For the correlation C, the gradient looks as

∂C/∂w_i = Σ_p δ_p X_ip                (4.21)

where δ_p = (e_op - ē_o) f'_p.

If ∂E/∂w and ∂C/∂w are both denoted by S, the weight correction formula looks as follows:

w(t+1) = w(t) + Δw(t)                (4.22)

where
Δw(t) = η S(t)                                  if Δw(t-1) = 0,
Δw(t) = Δw(t-1) S(t) / (S(t-1) - S(t))          if Δw(t-1) ≠ 0 and S(t) / (S(t-1) - S(t)) < μ,
Δw(t) = μ Δw(t-1)                               in all other cases.

Here η is the error correction step, used when Δw(t-1) = 0, and μ is the minimal correction step.

Instead of a single candidate unit, it is possible to use a pool of candidate units, each with a different set of random initial weights. All receive the same input signals and see the same residual error for each training pattern. Because they do not interact with one another or affect the active network during training, all of these candidate units can be trained in parallel; whenever no further progress is being made, the candidate whose correlation score is the best is installed.

The use of this pool of candidates is beneficial in two ways: it greatly reduces the chance that a useless unit will be permanently installed because an individual candidate got stuck during training, and it can speed up the training because many parts of weight-space can be explored simultaneously. One final note on the implementation of this algorithm: while the weights in the output layer are being trained, the other weights in the active network are frozen; while the candidate weights are being trained, none of the weights in the active network are changed. In a machine with plenty of memory, it is possible to record the unit values and the output errors for an entire epoch, and then to use these cached values repeatedly during training rather than recomputing them for each training case. This can result in a tremendous speedup, especially for large networks, and a reasonably small net is built automatically.
4.3.3 Advantages and Disadvantages of Cascade Correlation Algorithm
The Cascade-Correlation network is useful for incremental learning, in which new information is added to an already-trained net. Once built, a feature detector is never cannibalized; it is available from that time on for producing outputs or more complex features. Training on a new set of examples may alter a network's output weights, but these are quickly restored on return to the original problem. At any given time, only one layer of weights in the network is trained, while the rest of the network is not changing.

In CCNN, there is no need to guess the size, depth, and connectivity pattern of the network in advance, and it may be possible to build networks with a mixture of nonlinear types. Cascade-Correlation learns fast. In back propagation, the hidden units interact in a complex way before they settle into distinct useful roles; in Cascade-Correlation, each unit sees a fixed problem and can move decisively to solve that problem. The learning time in epochs grows very roughly as N log N, where N is the number of hidden units ultimately needed to solve the problem. Cascade-Correlation can build deep nets (high-order feature detectors) without the dramatic slowdown that is seen in back-propagation networks with more than one or two hidden layers.


Here the error signals are not propagated backwards as in BPNN; instead, a single residual error signal can be broadcast to all candidates. The weighted connections transmit signals in only one direction, eliminating one troublesome difference between back propagation connections and biological synapses. The candidate units do not interact with one another, except to pick a winner.

Cascade-correlation can converge quickly and is less likely to get trapped in local minima than multilayer perceptron networks. Cascade correlation also scales up to handle large problems far better than probabilistic or general regression networks. Training time is very fast in CNN; hence it is suitable for large training sets. Typically, cascade correlation networks are fairly small, often having fewer than a dozen neurons in the hidden layer. Contrast this with probabilistic neural networks, which require a hidden-layer neuron for each training case.

As with all types of models, there are some disadvantages to cascade correlation networks. They have an extreme potential for over-fitting the training data, and over-fitting can also take place in the presence of noisy features. This results in excellent accuracy on the training data but poor accuracy on new, unseen data. Cascade correlation networks are usually less accurate than probabilistic and general regression neural networks on small to medium size problems.
Experimental results and a comparison of BPNN and CNN based on recognition rate and execution time for the ORL database are presented in Tables 4.2 and 4.3 and Figures 4.17 and 4.18 respectively.

Table 4.2 Comparison of recognition rate of BPNN and CNN

                 Recognition rate (%)
No. of images    BPNN    CNN
50               94      95
100              90      92
200              88      89
300              86      88
400              82      84

Figure 4.17 Comparison of recognition rate of BPNN and CNN

Table 4.3 Comparison of execution time of BPNN and CNN

                 Execution time (sec)
No. of images    BPNN+FLDA    CNN+FLDA
50               30.09        25.36
100              39.51        33.04
200              45.26        39.41
300              51.02        45.21
400              60.26        53.34

Figure 4.18 Comparison of execution time of BPNN and CNN

4.3.4 Summary
Neural networks (NN) have found use in a large number of computational disciplines. The well known PCA and FLDA algorithms are applied with BPNN to improve the performance. LDA is a robust algorithm against illumination variance. The performance of LDA with BPNN is discussed here with various databases and in diverse environments.

BPNN enhances the classification, and LDA with BPNN resulted in a recognition rate of more than 90 %. CNN is better for a large number of database images, as it has fast recognition. On average, the execution time of CNN is about 20 % less than that of BPNN.

Neural networks are currently used prominently in voice recognition systems, image recognition systems, industrial robotics, medical imaging, data mining and aerospace applications.
