
2015 International Conference on Pervasive Computing (ICPC)

A Genetic Algorithm-Based 3D Feature Selection for Lip Reading
Sunil Sudam Morade
PhD Student, Department of Electronics Engineering,
SVNIT, Surat, India.
ssm.eltx@gmail.com

Suprava Patnaik
Professor, Department of E and TC Engineering,
Xavier Institute of Engineering, Mahim, Mumbai, India.
suprava_patnaik@yahoo.com

Abstract—In lip reading, the selection of features plays a crucial role. In lip reading applications the database is video, so a three-dimensional transformation is appropriate for extracting lip motion information. State-of-the-art lip reading is based on frame normalization and frame-wise feature extraction. However, this is not appropriate because information may be lost during frame normalization. Also, all frames cannot be treated equally, as they carry varying amounts of motion information. In this paper a 3D transform based method is proposed for feature extraction. These features are the input to a Genetic Algorithm (GA) model for discriminative analysis. The GA is used for dimensionality reduction and to improve the performance of the classifiers at low computational cost. Both the testing and training times of the classifier are reduced by the compact feature size. The CUAVE and Tulips databases are used for experiments on digit utterances. The results obtained are compared with various feature selectors from the WEKA software. In terms of classification accuracy, the proposed method is found to be better than the others.

Keywords—BPNN, Feature Selection, Genetic Algorithm, KNN, SVM, 3D-DWT, 3D-DCT.

I. INTRODUCTION
Visual speech recognition is a technique used to identify speech from lip movement; it is also called lip reading. A skilled lip reader can understand speech from lip movement alone, and hearing-impaired people understand speech in this way. It has long been known that lip movement carries speech information. Visual speech is important in many applications, such as speech in noisy areas, places where one must not speak aloud, and disaster conditions (e.g., an earthquake).

The literature on appearance-based models is extensive; a few noteworthy works are cited here to give a basic understanding of the challenges in lip reading. E. Petajan [1] experimented on lip reading to enhance speech recognition by using visual information. Potamianos et al. [2] used linear image transforms, namely PCA, DWT and DCT. R. Seymour et al. [3] compared image transform features in visual speech recognition of clean and corrupted videos. They evaluated PCA, DCT, Fast Discrete Curvelet Transform (FDCT), and Linear Discriminant Analysis (LDA) methods. Classification performance parameters such as specificity, sensitivity and accuracy are used to test classifiers. Meyor et al. [4] applied the DCT to pixel information for continuous digit recognition and proposed different fusion techniques for audio and video feature data. They found that the Word Error Rate (WER) is higher for continuous digit recognition.

A high-dimensional feature set can negatively affect the performance of pattern or image recognition systems. In other words, too many features sometimes reduce the classification accuracy of the recognition system, since some of the features may be redundant and non-informative [5]. In machine learning, feature selection, which is also called variable selection or variable subset selection, is the process of obtaining a subset of relevant features and is useful in many applications. Many techniques are available for obtaining such subsets, including Principal Component Analysis (PCA), Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) [6]. In recent times, many researchers have employed the Waikato Environment for Knowledge Analysis (WEKA) software for dimensionality reduction. However, WEKA is static in its feature selection approach, as users cannot change the configuration of the feature selectors concerned [7]. GA is known to be a very adaptive and efficient method of feature selection, as reported in [6]. Note that feature selection is inherently a multi-objective problem with two main goals: 1) minimizing the number of features and 2) reducing the classification error.

The framework of the proposed lip reading model is described in Section II. Section III deals with GA-based feature selection. Experimental results and a description of the test corpus are given in Section IV. Finally, Section V presents our conclusion and the scope for future work.

II. PROPOSED LIP READING FRAMEWORK
A typical lip reading system consists of three major stages: video frame normalization and lip localization, feature extraction, and classification. In our proposed model one more stage, feature selection, is added. Fig. 1 shows the major steps used in the proposed lip reading process.

978-1-4799-6272-3/15/$31.00(c)2015 IEEE
Authorized licensed use limited to: KIIT University. Downloaded on July 17,2023 at 04:24:18 UTC from IEEE Xplore. Restrictions apply.
[Fig. 1 block diagram: Normalised frames and lip localisation → Feature extraction by 3D DWT / 3D DCT → Classifiers (SVM, BPNN, KNN)]
Fig. 1. Lip reading process

A. Video Segmentation and Lip Contour Localization
There are large inter- and intra-subject variations in the utterance of a digit, and this results in a different number of frames for each utterance. We used audio analysis with the Praat software to segment the time duration and the associated video frames of each uttered digit. On average, 16 frames are sufficient for the utterance of any digit between 0 and 9. Out of these 16 frames we selected 10 significant frames. We used the Adaboost algorithm for face and mouth detection. A sample result is shown in Fig. 2(a-b).

Fig. 2. (a) Detection of face and lip area for CUAVE database s02m (b) Lip portion

B. Frame Feature Extraction
K. Min et al. used the 3-D DCT for feature extraction and a 3-D HMM as classifier [9]. To capture dynamic features, 3-D DCT features are important. While the 2-D DWT is used for computing the feature vector, the goal is to select only those coefficients that play the dominant role in representing lip motion. In the standard 2-D wavelet decomposition approach, each level of filtering splits the input image into four parts via a pair of low-pass and high-pass filters applied to the column vectors and row vectors of the image matrix. The low spatial-frequency sub-image is then selected for further decomposition. After a few levels of decomposition, the lowest spatial-frequency approximation sub-image is extracted as the feature vector. The 3-D DWT is like a 1-D DWT in three directions. Lip reading is a video processing application; to use the wavelet transform for volume and video processing, a 3-D version of the filter bank is implemented. In the 3-D DWT, the 1-D analysis filter bank is applied in turn to each of the three dimensions [10, 11]. Equations 1 and 2 show the 1-D filter outputs for one direction of the image signal. Filter coefficients are derived using the Dmey wavelet.

$$W_l(n, j) = \sum_{m=0}^{2n} I(m, j-1) \cdot h(2n - m) \qquad (1)$$

$$W_h(n, j) = \sum_{m=0}^{2n} I(m, j-1) \cdot g(2n - m) \qquad (2)$$

where $W(n, j)$ is the wavelet output ($W_l$ for the low-pass branch and $W_h$ for the high-pass branch), $h(n)$ and $g(n)$ are the impulse responses of the low-pass and high-pass filters, $j$ is the current level, $n$ is the current input index and $I(n, j-1)$ is the input signal. The results from [8] motivated us to select the Dmey wavelet, since in a lip reading application the speech information is extracted from visual information.

III. FEATURE SELECTION
A. Genetic Algorithm (GA)
Genetic Algorithm (GA) is an optimization technique: a population-based, algorithmic search heuristic that mimics the process of natural evolution [12, 13]. The operations in a GA are iterative procedures that manipulate one population of chromosomes (solution candidates) to produce a new population through genetic functionals such as crossover and mutation. S. Sivanandam [14] related the terminology of human genetics to that of GA; Table 1 compares the two. The fitness of each solution candidate (chromosome) is evaluated using a function commonly referred to as the objective or fitness function. The formulation of the fitness function depends on the problem being solved. In relation to this article, maximizing classification accuracy is equivalent to minimizing the classification error rate. Fig. 3 shows the selection of 3D-DWT or 3D-DCT features using the Genetic Algorithm.

[Fig. 3 block diagram: features after 3D DWT / 3D DCT (F1, F2, ..., F164) / (F1, F2, ...) → Genetic Algorithm (GA) → number of features ≤ 164/149 (..F2, F5, ..., F8, F21)]
Fig. 3. Selection of 3D-DWT or 3D-DCT features using Genetic Algorithm

The GA operates on a binary search space, as the chromosomes are bit strings. The GA manipulates the finite binary population in a manner similar to natural evolution. To begin with, an initial population is created (mostly at random) and evaluated using a fitness function. For the binary chromosomes employed in this work, a gene value of '1' indicates that the feature indexed by the position of the '1' is selected. Otherwise (i.e. if it is '0'), the feature is not selected for chromosomal evaluation. Using the positional index of the features marked by '1's, the chromosomes are then ranked, and based on the rankings the top n fittest kids (elitism of size n) are selected to survive to the next generation. After the elite kids are pushed automatically to the next generation, the remaining kids (individuals) in the current population pass through the genetic functionals crossover and mutation to form crossover and mutation kids
respectively. The three (3) kinds of kids, viz. elite, crossover and mutation, then form the new population (new generation). Crossover (a genetic functional) is a combination of two individuals (chromosomes) to form a crossover kid. The mutation operator, on the other hand, is used for genetic perturbation of the genes in each chromosome through bit flipping, depending on the mutation probability. Using the steps in Fig. 6, GA-based feature selection is explained in this section [6].

TABLE 1 COMPARATIVE TERMINOLOGY BETWEEN HUMAN GENETICS AND GA

S.No.  Human genetics  GA terminology
1      Chromosomes     Bit strings
2      Genes           Features
3      Allele          Feature value
4      Locus           Bit position
5      Genotype        Encoded string
6      Phenotype       Decoded genotype

TABLE 2 PARAMETERS USED IN GA

Sr. No.  GA Parameter           Value
1        Population size        164
2        Genome length          164
3        Population type        Bit strings
4        Fitness function       KNN-based classifier error
5        No. of generations     100
6        Crossover              Arithmetic
7        Crossover probability  0.8
8        Mutation probability   0.1
9        Selection scheme       Tournament of size 2
10       Elite count            2

B. Selection of Features Using GA
Table 2 shows the configuration of the GA. The seven-step procedure for implementing the GA is shown in Fig. 6. A description of each step is given below.

a. Generation of starting population
The starting population for this algorithm is a matrix of dimension population size × chromosome length, filled with random binary digits.

b. Generation of children for new population
Elite count ≤ population size. With an elite count of 2, the GA automatically picks the top two chromosomes and pushes them to the next generation. The new population is produced by three operations: i) elitism, ii) crossover and iii) mutation. If the total population is 100 and the elite count is 2, then the number of crossover kids is 98 × 0.8 ≈ 78 and the remaining 20 are mutation kids.

c. Fitness evaluation
Equation (3) is used for fitness evaluation:

$$f = \frac{a}{N} + e^{-1/N} \qquad (3)$$

where $a$ is the KNN classification error and $N$ is the cardinality of the feature vector.

d. Selection mechanism
Tournament selection of size 2 is used in this work due to its simplicity, speed, and efficiency [6]. Two chromosomes are selected from the population. Tournament selection is performed iteratively until the new population is filled up.

e. Crossover function
Here an XOR operation is performed on the two parent chromosomes, since they are binary:

$$\mathrm{cokids}(n) = p_1 \oplus p_2 \qquad (4)$$

where $n$ is an index that runs from 1 to the number of kids, $p_1$ is the first parent chromosome and $p_2$ is the second parent chromosome.

f. Mutation function
Mutation is the genetic perturbation of individuals in a population. Mutation ensures genetic diversity and the search of a broader solution space. In this paper uniform mutation is used, and the GA generates a set of random numbers from a uniform distribution.

g. Termination of GA
Once the GA reaches the optimum solution, it stops. Two stopping conditions are applicable: i) Maximum number of generations; here this value is 100. ii) Stall generation limit; its value is 0.000001.

C. WEKA Correlation-based Feature Selection (CFS)
The two WEKA evaluators used are the correlation-based feature selection subset evaluator (CFS-CE) and the ranker, i.e. Information Gain (IG). Like most feature selection algorithms, CFS uses a search algorithm, and it also evaluates the merit of a feature subset. Features are selected on the basis of being correlated with the class yet uncorrelated with each other [7]. The WEKA ranker (IG) is a filter-based technique for reducing the feature size.

Fig. 4. Plot of fitness value vs generation for 3D DWT features (CUAVE database)
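The information-gain ranking used as the filter-based baseline in Section III can be illustrated with a small Python sketch. This is not WEKA's implementation: the function names (`entropy`, `information_gain`, `rank_features`) are hypothetical, and the equal-width binning with `bins=4` is an assumption standing in for whatever discretization a real toolkit applies to numeric attributes.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels, bins=4):
    """IG = H(class) - H(class | binned feature).

    Assumption: equal-width binning is used to discretize the feature.
    """
    edges = np.linspace(feature.min(), feature.max(), bins + 1)
    binned = np.digitize(feature, edges[1:-1])   # bin ids 0 .. bins-1
    conditional = 0.0
    for b in np.unique(binned):
        in_bin = binned == b
        # weight each bin's entropy by the fraction of samples it holds
        conditional += in_bin.mean() * entropy(labels[in_bin])
    return entropy(labels) - conditional

def rank_features(X, y, bins=4):
    """Return feature indices sorted by decreasing information gain."""
    gains = np.array([information_gain(X[:, k], y, bins)
                      for k in range(X.shape[1])])
    return np.argsort(-gains), gains
```

A ranker of this kind scores each feature independently of the others, which is why it is classed as filter-based; keeping the first few indices returned by `rank_features` gives the reduced feature vector.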

Fig. 5. Plot of fitness value vs generation for 3-D DCT (CUAVE database)

IV. CORPUS AND RESULT

A. CUAVE database
CUAVE [15] (Clemson University Audio Visual Experiments) was recorded by E. K. Patterson of the Department of Electrical and Computer Engineering, Clemson University, US. The database was recorded in an isolated sound booth at a resolution of 720 x 480 with the NTSC standard of 29.97 frames per second, using a 1 Megapixel CCD camera. It is a speaker-independent database consisting of connected and continuous digits spoken in different situations. The database consists of two major sections: one of speaker pairs and the other of individuals.

B. Tulips Database
Tulips is a small audiovisual database of 12 subjects saying the first 4 digits in English. The subjects are undergraduate students from the Cognitive Science Program at UCSD. The database was compiled at R. Movellan's laboratory. Tulips contains video files in 100 x 75 pixel, 8-bit gray level format, sampled at 30 frames per second. R. Movellan presented work on a speaker-independent visual speech recognition system that used a simple HMM as classifier and the Tulips database of digits 1 to 4 for testing [16].

C. Results
The results of the GA are compared with two WEKA-based feature reduction techniques. The reduced feature vectors are tested with classifiers such as SVM, BPNN and KNN. Figs. 4 and 5 show the convergence of the GA based on the chosen fitness function. From Figs. 4 and 5, the fitness evaluation value is smaller for 3D DWT than for 3D DCT. This is also reflected in the performance of 3D DWT, which is better than that of 3D DCT. After GA-based feature selection, the features are reduced to 32% of the original features.

TABLE 3 RECOGNITION ACCURACY USING GA AND IG FEATURE SELECTION FOR 3D-DWT FEATURES (CUAVE AND TULIPS DATABASES)

S.No.  Selector+Classifier  CUAVE Reco. Accuracy (%)  Tulips Reco. Accuracy (%)
1      GA+BPNN              78.42                     73.95
2      GA+SVM               73.71                     69.7
3      GA+KNN               70.85                     63.54
4      IG+BPNN              68.28                     72.91
5      IG+SVM               68.14                     67.5
6      IG+KNN               69.42                     61.45
7      CFS+BPNN             73.42                     73.9
8      CFS+SVM              69                        68.5
9      CFS+KNN              70.28                     61.45

TABLE 4 RECOGNITION ACCURACY USING GA AND IG FEATURE SELECTION FOR 3D-DCT FEATURES (CUAVE DATABASE)

S.No.  Selector+Classifier  CUAVE Reco. Accuracy (%)
1      GA+BPNN              74.00
2      GA+SVM               72.85
3      GA+KNN               68.28
4      IG+BPNN              69.14
5      IG+SVM               72.41
6      IG+KNN               70.28
7      CFS+BPNN             70
8      CFS+SVM              71.41
9      CFS+KNN              70.28

V. CONCLUSION
In most cases, the difference in the classification accuracy reported by the two approaches, for the features selected by both methods on the CUAVE and Tulips databases respectively, is very small. Overall, both the GA method and WEKA-CFS, which are wrapper-based feature selectors, produced better classification accuracy than the WEKA ranker (IG), which is filter-based. The main advantage of the method presented here lies in its controllability, as the GA can be fine-tuned to produce better results by changing the fitness function. Of all the features selected, more discriminating features were obtained with the application of GA for dimensionality reduction. After feature selection with GA, the performance of 3D-DWT is better for the BPNN classifier. In future work, a combination of 3D-DWT and 3D-DCT features can be used to obtain the most discriminative features from both.

[Fig. 6 flowchart boxes: Start; Normalised frames for digits; Feature vectors using 3D DWT / DCT; N × 164 chromosomes; Fitness function; Selection of parent chromosomes; Elite kids, Crossover kids, Mutation kids; New population; Fitness evaluation; Termination condition; Best chromosome; Reduced feature vector; End]

Fig. 6. Flowchart for GA based feature selection
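As a rough illustration of the flow in Fig. 6, the following minimal Python sketch walks through steps a to g of the GA-based selection, using the fitness of Equation (3) and the XOR crossover of Equation (4). It is not the authors' implementation. The function names (`knn_error`, `fitness`, `tournament`, `ga_select`) are hypothetical, a leave-one-out 1-nearest-neighbour error is assumed as a stand-in for the KNN-based classifier error, the default population size and generation count are deliberately smaller than those in Table 2, and the stall-based stopping condition is omitted.

```python
import numpy as np

def knn_error(X, y, mask):
    """Leave-one-out 1-NN error on the features where mask == 1 (assumed stand-in)."""
    Xs = X[:, mask.astype(bool)].astype(float)
    d = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)           # a sample may not vote for itself
    return float(np.mean(y[d.argmin(axis=1)] != y))

def fitness(X, y, mask):
    """Equation (3): f = a / N + exp(-1 / N); lower is better."""
    n = int(mask.sum())
    if n == 0:                            # an empty subset cannot be classified
        return np.inf
    return knn_error(X, y, mask) / n + np.exp(-1.0 / n)

def tournament(pop, scores, rng):
    """Tournament of size 2: the fitter of two random chromosomes wins."""
    i, j = rng.integers(len(pop), size=2)
    return pop[i] if scores[i] <= scores[j] else pop[j]

def ga_select(X, y, pop_size=20, generations=30, elite=2,
              crossover_frac=0.8, mutation_prob=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat))        # step a: random bit strings
    n_cross = int(round(crossover_frac * (pop_size - elite)))
    n_mut = pop_size - elite - n_cross
    for _ in range(generations):                             # step g: max generations only
        scores = np.array([fitness(X, y, c) for c in pop])   # step c: fitness evaluation
        order = np.argsort(scores)
        kids = [pop[k].copy() for k in order[:elite]]        # step b: elite kids
        for _ in range(n_cross):                             # steps d and e
            p1 = tournament(pop, scores, rng)
            p2 = tournament(pop, scores, rng)
            kids.append(np.bitwise_xor(p1, p2))              # Equation (4)
        for _ in range(n_mut):                               # step f: uniform bit flips
            parent = tournament(pop, scores, rng)
            flips = rng.random(n_feat) < mutation_prob
            kids.append(np.where(flips, 1 - parent, parent))
        pop = np.array(kids)
    scores = np.array([fitness(X, y, c) for c in pop])
    best = pop[scores.argmin()]
    return best, float(scores.min())
```

Calling `ga_select(X, y)` on a feature matrix `X` (samples × features) and a label vector `y` returns a bit string whose '1' positions mark the selected features, together with its fitness value. Because the elite kids are copied unchanged, the best fitness never gets worse from one generation to the next.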

REFERENCES

[1] E. D. Petajan, “Automatic lip-reading to enhance speech recognition,” Ph.D. Thesis, University of Illinois, 1984.
[2] Potamianos, H. Graf, and E. Cosatto, “An image transform approach for HMM based automatic lip reading,” International Conference on Image Processing, 173–177, 1998.
[3] R. Seymour, D. Stewart, and Ji Ming, “Comparison of image transform-based features for visual speech recognition in clean and corrupted videos,” EURASIP Journal on Video Processing, Vol. 2008, 1-9, 2008.
[4] G. F. Meyor, J. B. Mulligan and S. M. Wuerger, “Continuous audio-visual using N test decision fusion,” Elsevier Journal on Information Fusion, 91-100, 2004.
[5] B. L. Persello, “A novel approach to the selection of robust and invariant features for classification of hyperspectral images,” Department of Information Engineering and Computer Science, University of Trento, 2010.
[6] B. Oluleye, A. Leisa, J. Leng, D. Dean, “A genetic algorithm-based Feature Selection,” IJECCE, Vol. 5, 899-905, 2014.
[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and H. Ian, “The WEKA Data Mining Software: An Update,” SIGKDD Explorations, Vol. 11, 1, 2009.
[8] V. Long and L. Gang, “Selection of the best wavelet base for speech signal,” IEEE Intelligent multimedia, video and speech processing, 2004.
[9] K. Min and M. Fac, “A lip reading method based 3-D DCT and 3-D HMM,” IEEE conf. ICIOE, 115-119, 2012.
[10] M. C. Weeks, “Architectures For The 3-D Discrete Wavelet Transform,” Ph.D. Thesis, University of Southwestern Louisiana, 1998.
[11] S. Morade and S. Patnaik, “Lip reading by using 3-D Discrete Wavelet Transform with Dmey wavelet,” IJIP, Vol. 8, 385-396, 2014.
[12] J. Tian, Q. Hu, X. Ma and M. Ha, “An Improved KPCA/GA-SVM Classification Model for Plant Leaf Disease Recognition,” Journal of Computational Information Systems, 18, 7737-7745, 2012.
[13] Melanie M., “An Introduction to Genetic Algorithms,” A Bradford Book, The MIT Press, 1999.
[14] N. Sivanandam and S. N. Deepa, “Introduction to Genetic Algorithms,” Springer-Verlag, Berlin, Heidelberg, 2008.
[15] E. Patterson, S. Gurbuz, Z. Tufekci and J. Gowdy, “CUAVE: a new audio-visual database for multimodal human computer-interface research,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2017-2020, 2002.
[16] J. R. Movellan, “Visual Speech Recognition with Stochastic Networks,” Advances in Neural Information Processing Systems, MIT Press, Cambridge, Vol. 7, 1995.
