
eNTERFACE 2010 Project Proposal

Title: Audio-visual speech recognition


Principal Investigators: Dr. Hakan Erdogan and Saygin Topkaya
Date: 14/12/2009
Abstract: The project aims to develop real-time audio-visual speech recognition software that showcases recent
developments in audio-visual speech recognition research.

1. INTRODUCTION AND PROJECT OBJECTIVES


Speech recognition is an open research area that requires continuing research effort for further
advancement. Although one can obtain high recognition rates with audio-only speech recognition in controlled
environments, recognition accuracy degrades in noisy environments. For such cases, supporting the audio
information with visual information is a commonly recommended approach in the literature. This approach,
called audio-visual speech recognition, derives the supplemental visual information from video of the lip region
of the subject of interest.
We propose an eNTERFACE 2010 workshop project to build real-time software that showcases current and
emerging audio-visual speech recognition technology. Our group at Sabanci University has an ongoing nationally
supported project on improving the performance of audio-visual speech recognition. In that project, we propose a
tandem classifier fusion approach for combining audio and visual information. In addition, we work on improving
visual feature extraction by computing stable features that correct for head-motion artifacts and normalized
features that correct for speaker and environment variability.
We can list the visual feature extraction aspects that we aim to include in the proposed software as follows:
• Extracting visual features using edge and texture information.
• Using dynamic visual features that account for lip motion in the region of interest.
• Normalization of visual features using warping (a minimal sketch follows this list).
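The warping mentioned in the last bullet can be understood either as geometric warping of the ROI to a canonical shape or as statistical feature warping. As one illustration of the latter interpretation, the short Python sketch below (with parameter choices that are our own assumptions, not the project's final design) maps each feature dimension to a standard-normal target distribution over a sliding window, a normalization technique known from robust speech and speaker recognition.

```python
import numpy as np
from scipy.stats import norm

def warp_features(feats, win=301):
    """Short-time feature warping: map each feature dimension to a
    standard-normal target distribution over a sliding window.

    feats: (T, D) array of per-frame visual (or audio) features.
    Returns a warped array of the same shape.
    """
    T, _ = feats.shape
    half = win // 2
    warped = np.empty_like(feats, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        window = feats[lo:hi]                      # local context around frame t
        n = window.shape[0]
        # Empirical rank of the current frame within the window, per dimension
        rank = (window < feats[t]).sum(axis=0) + 0.5
        # Convert the rank to the corresponding standard-normal quantile
        warped[t] = norm.ppf(rank / n)
    return warped
```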
In addition, we propose to implement the following methods in the core recognition part of the audio-visual
speech recognition system:
• We propose using first-level classifiers (e.g. support vector machines [1] or neural networks [2]) for each
stream and combining their outputs using classifier fusion in a tandem fashion to improve a second-level
structured-output classifier (e.g. a hidden Markov model [3]). This is the method we are promoting, and we
wish to implement it in the system (see the sketch after this list).
• Combination can also be achieved in an asynchronous fashion (e.g. using product or coupled HMMs). If
there is enough interest in our project, we can implement these methods for audio-visual fusion as well.
• Complex structured multi-stream conditional random fields can be employed for recognition. If there is
enough interest from the community, we can develop and implement these methods for audio-visual
fusion.
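To make the tandem idea in the first bullet concrete, the sketch below (using scikit-learn purely as an illustration; the file names and parameters are hypothetical) trains a first-level SVM on frame-level data from one stream and appends its log-posteriors to the raw features. The resulting tandem features would then be fed to the second-level HMM (e.g. trained with HTK); that stage is not shown here.

```python
import numpy as np
from sklearn.svm import SVC

# Frame-level training data for one stream (audio or visual):
# one row per frame, with frame-level phone/viseme labels.
X_train = np.load("audio_frames.npy")         # hypothetical file names
y_train = np.load("audio_frame_labels.npy")

# First-level discriminative classifier with probability outputs
svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

def tandem_features(X):
    """Append log-posteriors of the first-level SVM to the raw features;
    the result is used as the observation stream of the second-level HMM."""
    posteriors = svm.predict_proba(X)          # shape (frames, n_classes)
    log_post = np.log(posteriors + 1e-10)      # avoid log(0)
    return np.hstack([X, log_post])
```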
The proposed software will be user-friendly by default, but it will also offer advanced features for research
analysis and improvement. The software will receive data through a camera and a microphone connected to a PC
and will automatically transcribe spoken utterances in real time.
Although this alone is enough to demonstrate the power of the application, the software can later (beyond the
scope of the workshop) be used as the core module of human-computer interface software that works well even
in noisy environments. An example is hands-free command and control in a car on a highway.
To make the software easily adaptable to future environments, we plan to develop the components to be as
OS-independent as possible and to use open-source third-party components as necessary.
2. BACKGROUND INFORMATION
For researchers who are interested in the theoretical background of our project, we provide a brief summary
and references to selected works in the literature. A speech recognition system can be divided into two parts:
feature extraction and recognition.
2.1. Feature Extraction
i. Audio Features
There is a vast literature on feature extraction from audio and video for utterance recognition. For audio
data, three of the most commonly used feature types are:
1. Mel Frequency Cepstral Coefficients (MFCC) [4]
2. Perceptual Linear Predictive (PLP) Coefficients [5]
3. Advanced Front End (AFE) [6]
MFCC, which was issued as a standard audio feature for speech processing by the European Telecommunications
Standards Institute (ETSI) in 2000, is the most popular of the three.
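As an illustration of a typical MFCC front end, the sketch below computes 13 static coefficients plus delta and delta-delta features with 25 ms windows and a 10 ms hop. It uses the librosa library for brevity, whereas the actual system is expected to rely on HTK-style tools; the parameter choices are our own assumptions.

```python
import numpy as np
import librosa

def mfcc_features(wav_path, sr=16000, n_mfcc=13):
    """Return a (frames, 39) matrix of MFCCs with deltas and delta-deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),        # 25 ms analysis window
        hop_length=int(0.010 * sr))   # 10 ms frame shift
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T
```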
ii. Visual Features
Compared to audio, visual feature extraction is a more complex process. Prior to visual feature extraction, a
region of interest (ROI) has to be obtained, and its quality directly affects the performance of the overall system.
The ROI is typically a rectangle enclosing the mouth and often including the nose tip and the chin. After the ROI
is obtained, visual feature extraction methods can be briefly classified into two major categories, with hybrids of
the two forming a third (an appearance-based sketch follows the list):
1. Region (or appearance) based visual features [7] [8]
2. Lip contour based visual features [9]
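As a minimal illustration of the appearance-based category, the sketch below uses OpenCV to detect the face with a Haar cascade, takes the lower part of the face box as a rough mouth ROI, and describes it with low-order 2-D DCT coefficients. A real system, including our own method [8], would track facial features over time rather than re-detect them in every frame; all parameters here are illustrative assumptions.

```python
import cv2
import numpy as np

# Standard frontal-face Haar cascade shipped with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_roi_features(frame_bgr, n_coef=6):
    """Rough mouth ROI from the lower third of the detected face,
    described by the top-left n_coef x n_coef 2-D DCT coefficients."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                # no face in this frame
    x, y, w, h = faces[0]
    roi = gray[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
    roi = cv2.resize(roi, (64, 32)).astype(np.float32)
    coeffs = cv2.dct(roi)                          # 2-D DCT of the ROI
    return coeffs[:n_coef, :n_coef].flatten()      # appearance feature vector
```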
2.2. Recognition System
Compared to audio-only recognition [3], audio-visual recognition has a more complex structure due to the joint
processing of audio and visual data. Researchers [10] concluded that visual information is in fact complementary
to the acoustic information; this finding has been the primary motivation for audio-visual speech recognition
research and introduces the visual feature extraction and information fusion issues into the problem. The
information fusion problem has been addressed in many works, and the main approaches can be listed as follows:
1. Feature concatenation where the feature vectors from the two streams (i.e. audio and vision) are
concatenated to train a single HMM.
2. Multiple stream HMMs [11] in which the two streams are independently modeled and combined with a
weight (a sketch of this weighted combination follows the list).
3. Asynchronous HMMs [11] including Product HMM (PHMM), Factorial HMM (FHMM) and Coupled
HMM (CHMM).
4. The tandem fusion approach, promoted by us, which builds on the tandem approach [12] and relies on the
idea that class posteriors derived from discriminative classifiers (such as support vector machines) are
more discriminative features for HMMs than the regular features.
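In the multi-stream HMM setting of item 2, the per-state observation log-likelihoods of the audio and visual streams are typically combined with exponent weights. A minimal sketch of that combination step follows; the weights themselves are supplied externally, for example from the SNR estimate described in Section 3.1.

```python
import numpy as np

def combine_stream_loglik(loglik_audio, loglik_video, audio_weight):
    """Weighted per-state log-likelihood combination for a multi-stream HMM.

    loglik_audio, loglik_video: (frames, n_states) observation
    log-likelihoods computed separately for each stream.
    audio_weight: lambda in [0, 1]; the visual stream receives 1 - lambda.
    """
    lam = float(np.clip(audio_weight, 0.0, 1.0))
    return lam * loglik_audio + (1.0 - lam) * loglik_video
```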
For training the tandem fusion system, we plan to use stacked generalization [13] for accurate combiner
training. Stacked generalization divides the training data into subsets and uses each subset in a cross-validation-like
scenario for combiner training. This method has been shown to improve the accuracy of second-level classifiers
(combiners) by improving their generalization ability.
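A minimal sketch of this training scheme is given below, using scikit-learn for illustration: the combiner is trained on out-of-fold posteriors produced by cross_val_predict, so it never sees posteriors computed on the first-level classifiers' own training data. The choice of SVMs and a logistic-regression combiner here is an assumption for the sketch, not the project's final configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_stacked_combiner(X_audio, X_video, y, n_folds=5):
    """Stacked generalization: train the combiner on cross-validated
    first-level posteriors from the audio and visual streams."""
    svm_audio = SVC(probability=True)
    svm_video = SVC(probability=True)
    # Out-of-fold posteriors for each stream (no training-set leakage)
    post_audio = cross_val_predict(svm_audio, X_audio, y,
                                   cv=n_folds, method="predict_proba")
    post_video = cross_val_predict(svm_video, X_video, y,
                                   cv=n_folds, method="predict_proba")
    combiner = LogisticRegression(max_iter=1000)
    combiner.fit(np.hstack([post_audio, post_video]), y)
    # Refit the first-level classifiers on all data for deployment
    svm_audio.fit(X_audio, y)
    svm_video.fit(X_video, y)
    return svm_audio, svm_video, combiner
```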
3. DETAILED TECHNICAL DESCRIPTION
3.1. Technical Description
Proposed work packages in the project are:
1. Audio-visual speech activity detection to start recognition: This is a common step in many speech
recognition systems, motivated by a practical problem: deciding when to 'trigger' the recognizer. We wish to
employ both audio and visual activity to decide when to start interpreting audio-visual data. For example, as a
visual cue, the user may need to face the computer to start the recognizer.
2. Real-time extraction of the visual region of interest: This step extracts the ROI (i.e. a region including the
lips, mouth and chin) from the visual stream while compensating for head motion. It includes detecting the face
and facial features and tracking them over time. For this step we aim to use the facial feature extraction technique
developed by our team [8].
3. Normalization of extracted visual information: This step preprocesses the regions of interest to make all
realizations of a viseme similar across subjects and sessions. We believe this step will reduce the effects of
differences in lip shape, beard, skin color, etc. between persons. Head rotations should also be eliminated so that
all ROI features (lips, chin, etc.) occupy the same regions across all videos.
4. Real-time estimation of the audio SNR to adjust audio-visual weights: Most audio-visual recognition
techniques require weights or coefficients that decide how much to rely on the audio or visual stream. The basic
idea is to trust the audio more when there is less noise, and vice versa; estimating the noise level therefore helps
the system make that decision (a sketch of this mapping follows the list).
5. Robustness to noise: This step includes handling and trying to mitigate the effects of audio and visual
noise on the system.
6. Sequential data recognition using audio and visual information: Many different machine learning
techniques may be utilized in the recognition step. The implementation of each method depends on the
participation level in the workshop. We provided the details about these methods in the previous section.
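As a minimal illustration of work package 4, the sketch below estimates the SNR of a 16 kHz waveform from the gap between its loudest and quietest frames and maps it to an audio stream weight with a logistic function. The thresholds and the mapping are illustrative assumptions, not the project's final design.

```python
import numpy as np

def frame_energies_db(signal, frame_len=400, hop=160):
    """Short-time energies in dB for a 16 kHz signal (25 ms frames, 10 ms hop)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    energy = np.array([np.sum(np.asarray(f, dtype=float) ** 2) + 1e-10
                       for f in frames])
    return 10.0 * np.log10(energy)

def snr_based_audio_weight(signal, slope=0.2, midpoint=10.0):
    """Estimate the SNR and map it to an audio stream weight in (0, 1)."""
    e_db = frame_energies_db(signal)
    noise_db = np.percentile(e_db, 10)    # quietest frames ~ noise floor
    speech_db = np.percentile(e_db, 90)   # loudest frames ~ speech level
    snr_db = speech_db - noise_db
    # Logistic mapping: weight ~0.5 at 'midpoint' dB SNR, higher when cleaner
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint)))
```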
3.2. Resources Needed
Apart from the development computers, the main system requires the following:
i. Hardware:
1 Laptop PC
1 Directional Microphone
1 External Sound Card (if a proper interface for the microphone is not available on the laptop)
1 Video Capture Camera
1 External Port for the camera (again, if not available on the laptop)
ii. Software:
We plan to use free software and development environments for implementation. Although this is subject to
change, we plan to use Qt as the development environment together with open-source libraries (OpenCV, HTK,
libsvm, etc.).
We plan to use SVN for version control.
iii. Team Members:
We plan to build a team of capable programmers and theoreticians. We require people who have experience
in building speech or face recognition systems. We may need help in the user interface design for the software as
well.
The project staff can be grouped in three areas:
- Data acquisition
For this group, we plan to include programmers experienced in accessing auxiliary devices (i.e. camera
and sound card). Since we plan not to stick to a single operating system, programmers experienced in
multi-platform libraries (such as OpenCV or CLAM) are needed.
- Visual signal processing
For this group, we plan to include researchers with interest in computer vision and image processing.
Since data acquisition is the responsibility of the first group, researchers in this group will work on
processing raw (or lightly preprocessed) data and presenting results ready to be used by the classifier /
machine learning group.
As we plan to use some recent methods (such as Active Appearance Models) as well as methods developed
by our group, programmers experienced in these methods or eager to learn ours will be preferred.
- Audio signal processing
For this group, we plan to include researchers with interest in audio and speech processing. Like the visual
signal processing group, researchers in this group will work on processing raw (or lightly preprocessed)
data and presenting results ready to be used by the classifier / machine learning group.
Since we mainly prefer conventional speech processing methods (such as MFCC), we plan to work with
researchers experienced in these areas.
- Classifier / machine learning
For this last group, we wish to include researchers with interest in state-of-the-art classification
techniques. Although the details are still to be decided, the main classifiers have already been identified
(HMMs, SVMs, NNs), so researchers with experience in these methods will be beneficial for our project.
- Other
We plan to develop the whole application in C++. We believe a few primary libraries (Qt, OpenCV, etc.)
can cover the user interface and auxiliary-device interaction needs while satisfying our multi-platform aim,
so we wish to have team members who are fluent in these libraries and in C++.
3.3. Project Management
i. Before the workshop:
• Team member selection and definition of roles for each member
• Continue our ongoing research and provide theoretical results to decide which methods to use during the
workshop
• E-mail discussions among members to share results and information
ii. During the workshop:
• A meeting at the beginning of every week
• Development of the software, primarily the real-time recognition phase
• Preparation of the project documentation
• Preparation of reports
• Software packaging, wrapping of libraries, etc.
iii. After the workshop:
• Publication of the results
• Further joint research between the team members

4. WORKPLAN AND IMPLEMENTATION SCHEDULE


Week 1:
• Implementation of feature extraction modules for visual and audio data
• Implement normalization and noise-handling processors
• Prepare training data and grammars for recognition
Week 2:
• Train classifiers and combiners, using common third-party classifiers (as libraries or software) if necessary
• Train the baseline HMM system
• Implement the speech activity detector, noise level detector and graphical display system
Week 3:
• Train all models in English for the final demonstration
• Extensively test the project modules
• Debugging and improvements
Week 4:
• Preparation of the final demo and presentation
• Writing of the final report
• Decision on the material to be published
5. BENEFITS AND EXPECTED OUTCOMES
Demonstrating research innovations in a real environment is valuable for the public to appreciate the work.
The software will publicize the research on audio-visual speech recognition. It may also help other research
groups in Europe to develop and implement their own improvements on top of the software. We believe the
software can serve as a test-bed for implementing new and innovative ideas in a rapid fashion in this area. The
outputs of the project will be available to all interested researchers.
In addition, if we can make the software reliable enough, it can be used as a module (library) for developing
human-computer interface systems with many different components.
We need to design the software to be flexible, easy to use, and open to new developments. We would like
to begin by demonstrating satisfactory recognition of spoken digits in three different European languages, for
example Turkish, English and French, in real-world environments.

6. TEAM PROFILE
Team Leaders
1. Hakan Erdogan, Assistant Professor, Sabanci University
Dr. Erdogan is going to oversee the development of the software and help with the development of
the software as necessary.
Bio: Hakan Erdogan is an assistant professor at Sabanci University in Istanbul, Turkey. He received
his B.S. degree in Electrical Engineering and Mathematics in 1993 from METU, Ankara and his
M.S. and Ph.D. degrees in Electrical Engineering: Systems from the University of Michigan, Ann
Arbor in 1995 and 1999 respectively. His Ph.D. was on developing algorithms to speed up statistical
image reconstruction methods for PET transmission and emission scans. His work there resulted in
three journal papers which are highly cited. He was with the Human Language Technologies group
at IBM T.J. Watson Research Center, NY between 1999 and 2002 where he worked on various
internally funded and DARPA funded projects. At IBM, he focused on the following problems of
speech recognition: acoustic modeling, language modeling and speech translation. He has been with
Sabanci University since 2002. His research interests are in developing and applying probabilistic
methods and algorithms for multimedia information extraction. Specifically, he is interested in
speech recognition, audio-visual speech recognition and multiple biometrics systems. As of
December 2009, Dr. Erdogan has published 10 journal papers, 2 book chapters and 40+ conference
papers. He has 3 patents. His works have been cited more than 200 times in the science citation
index. He served as co-organizer of "Speech to speech translation workshop" in ACL 02, technical
co-chair of IEEE SIU 2006 conference and DSP-in-cars 2007 workshop. He has been a program
committee member for LREC 2005-2010, ISCIS 2006-2010, SIU 2006-2010 and IEEE ICPS 2007.
He is the finance co-chair of ICPR 2010. He has been a member of IEEE since 1992 and a member of
ISCA since 2003.
2. Saygin Topkaya, Ph.D. student, Sabanci University
Saygin Topkaya has good experience in software development and will be the main leader of the
project. He is already working on the project, which is also his Ph.D. topic.
Research Interests: Computer Vision, Speech Recognition, Object Tracking
Ph.D. (2008 - Cont.) Sabanci University - Electronics Engineering
Fellow Ph.D. Student in TUBITAK Project; Novel Approaches in Audio Visual Speech Recognition
M.Sc. (2005 - 2008) Yildiz Technical University - Mathematical Engineering
Master's Thesis: Face Recognition in Videos
B.Sc. (1998 - 2004) Yildiz Technical University - Mathematical Engineering
Researchers
1. Berkay Yılmaz, Ph.D. student, Sabanci University
Berkay Yilmaz will work on implementing the visual acquisition and feature extraction part of the
project.
Research Interests: 2D/3D computer vision, image processing, machine learning
Ph.D. (2009 - Cont.) Sabanci University - Computer Science
Fellow Ph.D. Student in TUBITAK Project; Novel Approaches in Audio Visual Speech Recognition
M.Sc. (2007 - 2008) Sabanci University - Mechatronics Engineering
Master's Thesis: Statistical Facial Feature Extraction and Lip Segmentation
B.Sc. (2003 - 2007) Bahcesehir University – Computer Engineering
2. Mehmet Umut Sen, MS student, Sabanci University
Mehmet Umut Sen will help develop the recognition system in general.
Research Interests: Statistical Signal Proc., Pattern Recog., Speech and Speaker Recog.
M.Sc. (2009 - Cont.) Sabanci University - Electronics Engineering
Fellow M.Sc. Student in TUBITAK Project; Novel Approaches in Audio Visual Speech Recognition
B.Sc. (2004 - 2009) Sabanci University - Electronics Engineering
3. Murat Saraclar, Professor, Bogazici University
Dr. Murat Saraclar has expressed interest in the project saying that he is willing to contribute as
necessary. We will ask for his valuable advice while developing the software.
4. Other interested researchers:
We seek interested researchers to help in all aspects of the project as listed in the “team members”
part of this proposal.
7. REFERENCES

[1] A. Ganapathiraju, J. Hamaker, and J. Picone, “Hybrid SVM/HMM architectures for speech recognition,” in Speech
Transcription Workshop, 2000.
[2] A. J. Robinson, L. Almeida, J. Boite, H. Bourlard, F. Fallside, M. Hochberg, D. Kershaw, P. Kohn, Y. Konig, N. Morgan,
J. P. Neto, S. Renals, M. Saerens, and C. Wooters, “A neural network based, speaker independent, large vocabulary,
continuous speech recognition system: The WERNICKE project,” in Proc. EUROSPEECH’93, 1993, pp. 1941–1944.
[3] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of
the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[4] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, “The quefrency analysis of time series for echoes: Cepstrum,
pseudo-autocovariance, cross-cepstrum and saphe cracking,” in Proceedings of the Symposium on Time Series Analysis
(M. Rosenblatt, Ed.). New York: Wiley, 1963, ch. 15, pp. 209–243.
[5] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” Journal of the Acoustical Society of America,
vol. 87, pp. 1738–1753, 1990.
[6] ETSI, “Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end
feature extraction algorithm; compression algorithms,” ETSI ES 202 050 Ver. 1.1.3, Nov. 2002.
[7] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, 2001, pp. 511–518.
[8] B. Yilmaz, H. Erdogan, and M. Unel, “Probabilistic facial feature extraction using joint distribution of location and
texture information,” in International Symposium on Visual Computing (ISVC), Las Vegas, USA, Nov. 2009.
[9] M. B. Stegmann, R. Fisker, B. K. Ersbøll, H. H. Thodberg, and L. Hyldstrup, “Active appearance models: Theory and
cases,” in Proc. 9th Danish Conference on Pattern Recognition and Image Analysis, vol. 1, pp. 49–57, AUC, 2000.
[10] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, pp. 746–748, 1976.
[11] A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, “Dynamic Bayesian networks for audio-visual speech recognition,”
EURASIP Journal on Applied Signal Processing, pp. 1–5, Nov. 2002.
[12] H. Hermansky, D. P. W. Ellis, and S. Sharma, “Tandem connectionist feature extraction for conventional HMM
systems,” in Proc. ICASSP, 2000, pp. 1635–1638.
[13] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
