
Research Project

“Classification of Speaker based on their Facial Expression using Artificial Neural Network”

Research Area: Speech Processing combined with Facial Expression and ANN

Name of Co-investigator: Ajit Shivajirao Ghodke

Designation: Professor

Name and Address of the Institute: Tilak Maharashtra Vidyapeeth, Vidyapeeth Bhavan, Gultekadi, Pune 411 037
Phone: 020-24403000, 24261856, 24264699, 24267888
Website: www.tmv.edu.in/

Principal Investigator
Dr. R. R. Deshmukh
Professor, Dept. of Computer Science and Information Technology,
Dr. B. A. M. University, Aurangabad
Contents

1. Introduction to Speech and Speaker Recognition

2. Speech Features

3. Facial Action Coding System

4. Classification of Speakers based on their Facial Expression using ANN

5. Conclusion

References
1. Introduction to Speech and Speaker Recognition

Speech Recognition

Speech recognition is the process by which a computer (or other type of machine) identifies spoken words.
In essence, it means talking to your computer and having it correctly recognize what you are saying [1].

The following definitions are the basics needed for understanding speech recognition technology.
1] Utterance
An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the
computer. Utterances can be a single word, a few words, a sentence, or even multiple sentences.
2] Speaker Dependence
Speaker dependent systems are designed around a specific speaker. They generally are more accurate
for the correct speaker, but much less accurate for other speakers. They assume the speaker will
speak in a consistent voice and tempo. Speaker independent systems are designed for a variety of
speakers. Adaptive systems usually start as speaker independent systems and utilize training techniques
to adapt to the speaker to increase their recognition accuracy.
3] Vocabularies
Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the SR system.
Generally, smaller vocabularies are easier for a computer to recognize, while larger vocabularies are
more difficult. Unlike normal dictionaries, each entry doesn’t have to be a single word. They can be as
long as a sentence or two. Smaller vocabularies can have as few as one or two recognized utterances
(e.g., “Wake Up”), while very large vocabularies can have a hundred thousand or more!
4] Accuracy
The ability of a recognizer can be examined by measuring its accuracy - or how well it recognizes
utterances. This includes not only correctly identifying an utterance but also identifying if the spoken
utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more! The acceptable
accuracy of a system really depends on the application.
5] Training
Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it may
allow training to take place. An ASR system is trained by having the speaker repeat standard or
common phrases and adjusting its comparison algorithms to match that particular speaker. Training a
recognizer usually improves its accuracy. Training can also be used by speakers who have difficulty
speaking or pronouncing certain words. As long as the speaker can consistently repeat an utterance,
ASR systems with training should be able to adapt.

Speaker Recognition
Speech processing is a diverse field with many applications. Speakers can be identified from their
voices. Speaker recognition encompasses both identification and verification of speakers. Speaker verification
is the task of validating whether or not a user is who he/she claims to be. The fundamental assumption made
in any of these systems is that there are quantifiable features in each person’s voice that are unique
among individuals and can therefore be measured.
Automatic Speaker Recognition is generally subdivided into two broad categories: automatic speaker
identification (ASI) and automatic speaker verification (ASV). In ASI there is no a priori identity claim,
and the system decides who the person is, what group the person is a member of, or whether the person is unknown.
Speaker verification is defined as deciding whether a speaker is who he claims to be. This is different from
the identification problem, which is deciding whether a speaker is a specific person or is among a group of persons.
In general, a person can be authenticated in three different ways:
1. Something the person has, e.g. a key or a credit card, account number.
2. Something the person knows, e.g. a PIN number or a password.
3. Something the person is, e.g. signature, fingerprints, voice, facial features.
The first two are traditional authentication methods that have been used for several centuries. However,
they have the shortcoming that the key or credit card can be stolen or lost, and the PIN or password
can be misused or forgotten. For the last class of authentication methods, known as biometric person
authentication [2,3], these problems are less severe. Each person has unique anatomy, physiology and learned
habits that familiar persons use in everyday life to recognize that person. Increased computing power and
decreased microchip size have given impetus to implementing realistic biometric authentication methods. The
interest in biometric authentication has been increasing rapidly in the past few years. Speaker recognition
refers to the task of recognizing people by their voices.
Human speech conveys different types of information. The primary type is the meaning, or the words,
which the speaker tries to convey to the listener. But speech also carries information about the language
being spoken, the emotions of the speaker, and the gender and identity of the speaker. The goal of automatic
speaker recognition is to extract, characterize and recognize the information about speaker identity [4]. Speaker
recognition is usually divided into two different branches, speaker verification and speaker identification.
The speaker verification task is to verify the claimed identity of a person from his voice [5,6]. This process
involves only a binary decision about the claimed identity. In speaker identification there is no identity claim,
and the system decides who the speaking person is.

Speaker identification can be further divided into two branches. Open-set speaker identification
decides to which of the registered speakers an unknown speech sample belongs, or concludes that the
speech sample is unknown. In this work we deal with closed-set speaker identification, which is the
process of deciding which of the registered speakers is most likely the author of the unknown speech
sample. Depending on the algorithm used for identification, the task can also be divided into text-dependent
and text-independent identification. The difference is that in the first case the system knows the text spoken
by the person while in the second case the system must be able to recognize the speaker from any text. This
taxonomy is represented in Figure 1.1.
Figure 1.1 Speech Processing Fundamentals. (Speech processing branches into analysis/synthesis, recognition, and coding; recognition into speech, speaker, and language recognition; speaker recognition into speaker identification, speaker detection, and speaker verification; identification and verification each split further into text-independent and text-dependent tasks.)

Identification and Verification Tasks


In the identification task, or 1:M matching, an unknown speaker is compared against a database of M
known speakers, and the best matching speaker is returned as the recognition decision. The verification task,
or 1:1 matching, consists of making a decision whether a given voice sample is produced by a claimed
speaker. An identity claim (e.g., an ID) is given to the system, and the unknown speaker’s voice sample is
compared against the claimed speaker’s voice template. If the similarity degree between the voice sample and
the template exceeds a predefined decision threshold, the speaker is accepted, and otherwise rejected.
Of the identification and verification tasks, identification is generally considered the more difficult.
This is intuitive: as the number of registered speakers increases, the probability of an incorrect decision
increases [2,7,8]. The performance of the verification task is, at least in theory, not affected by the population
size, since only two speakers are compared.
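
To make the 1:M and 1:1 matching concrete, the following minimal Python sketch contrasts identification and verification over stored speaker templates. The cosine-similarity scoring, the template dictionary, and the 0.8 decision threshold are illustrative assumptions, not details of this project.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Similarity between two fixed-length voice feature vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def identify(sample: np.ndarray, templates: dict) -> str:
        """1:M identification: return the best-matching registered speaker."""
        return max(templates, key=lambda name: cosine_similarity(sample, templates[name]))

    def verify(sample: np.ndarray, claimed_template: np.ndarray, threshold: float = 0.8) -> bool:
        """1:1 verification: accept the claim only if similarity exceeds the threshold."""
        return cosine_similarity(sample, claimed_template) > threshold

    # Toy usage with random "voice templates" (illustrative only).
    rng = np.random.default_rng(0)
    templates = {name: rng.normal(size=16) for name in ("alice", "bob", "carol")}
    probe = templates["bob"] + 0.05 * rng.normal(size=16)
    print(identify(probe, templates))          # -> "bob"
    print(verify(probe, templates["alice"]))   # almost certainly False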

2. Speech Features

There are different ways of extracting speaker-discriminative characteristics from the speech signal.

Introduction

The acoustic speech signal contains different kinds of information about the speaker. This includes “high-level” properties such as dialect, context, speaking style, the emotional state of the speaker and many others. A great amount of work has already been done in trying to develop identification algorithms based on the methods used by humans to identify speakers. But these efforts are mostly impractical because of their complexity and the difficulty of measuring the speaker-discriminative properties used by humans. A more useful approach is based on the “low-level” properties of the speech signal such as pitch (the fundamental frequency of the vocal cord vibrations), intensity, formant frequencies and their bandwidths, spectral correlations, the short-time spectrum and others.

From the point of view of the automatic speaker recognition task, it is useful to think of the speech signal as a sequence of features that characterize both the speaker and the speech. An important step in the recognition process is to extract sufficient information for good discrimination, in a form and size that is amenable to effective modeling. The amount of data generated during speech production is quite large, while the essential characteristics of the speech process change relatively slowly and therefore require less data. Feature extraction is thus a process of reducing data while retaining speaker-discriminative information.

The speech wave is usually analyzed in terms of spectral features. There are two reasons for this. The first is that the speech wave can be reproduced by summing sinusoidal waves with slowly changing amplitudes and phases. The second is that the features critical for speech perception by the human ear are mainly contained in the magnitude information, while the phase information usually does not play a key role.
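
As a small illustration of working with magnitude only, the sketch below computes a short-time magnitude spectrum with NumPy; the 25 ms frame length, 10 ms hop, and Hamming window at 16 kHz are assumed values, not parameters reported in this work.

    import numpy as np

    def magnitude_spectrogram(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
        """Split the signal into overlapping windowed frames and keep only the
        FFT magnitudes; the phase is discarded, as discussed above."""
        window = np.hamming(frame_len)
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                           for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1))   # shape: (n_frames, frame_len // 2 + 1)

    # Example: one second of a synthetic 440 Hz tone sampled at 16 kHz.
    fs = 16000
    t = np.arange(fs) / fs
    spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
    print(spec.shape)   # (98, 201)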

Mel Frequency Cepstral Coefficients

In feature extraction, a set of acoustic vectors is extracted from the digital speech signal. Out of the many possibilities for parametrically representing the speech signal for the speaker recognition task, we choose Mel frequency cepstral coefficients (MFCC). The Mel scale translates regular frequencies to a scale that is more appropriate for speech, since human perception of the frequency content of sounds does not follow a linear scale. The relationship between the Mel scale and the frequency spectrum is approximately given by

Mel(f) = 2595 log10(1 + f / 700)

where f is the frequency in Hz obtained after the FFT. Figure 2.1 shows the calculation of MFCC.

Figure 2.1 Calculation of MFCC.
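
The Mel-scale relation above translates directly into a one-line helper (written here with a base-10 logarithm, which matches the constant 2595); the function name is our own.

    import numpy as np

    def hz_to_mel(f_hz: float) -> float:
        """Mel(f) = 2595 * log10(1 + f / 700), with f in Hz."""
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    print(hz_to_mel(1000.0))   # close to 1000 mel, by construction of the scale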

The mel-warped (or any other) filterbank is applied in the frequency domain before the logarithm and the inverse DFT. The purpose of the mel filterbank is to simulate the critical-band filters of the hearing mechanism. The filters are evenly spaced on the mel scale, and they are usually triangular in shape. The triangular filter outputs Y(i), i = 1, ..., M, are compressed using the logarithm, and the discrete cosine transform (DCT) is applied:

c[n] = Σ_{i=1}^{M} log Y(i) · cos(πn(i − 1/2)/M)

Notice that c[0] represents the log magnitude, and therefore it depends on the intensity. Typically c[0] is excluded for this reason.
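
The compression and DCT step can be sketched as follows, assuming the M triangular filterbank outputs Y(i) have already been computed for one frame; the direct summation mirrors the formula above, and c[0] is dropped as noted. The 26-filter example and the 13 kept coefficients are common but assumed choices.

    import numpy as np

    def cepstral_coefficients(filter_outputs: np.ndarray, num_ceps: int = 13) -> np.ndarray:
        """Log-compress the filterbank outputs and apply the DCT of the text:
            c[n] = sum_{i=1..M} log Y(i) * cos(pi * n * (i - 1/2) / M)
        c[0] carries only the overall log energy, so it is excluded here."""
        log_y = np.log(filter_outputs)
        M = len(filter_outputs)
        i = np.arange(1, M + 1)
        c = np.array([np.sum(log_y * np.cos(np.pi * n * (i - 0.5) / M))
                      for n in range(num_ceps + 1)])
        return c[1:]   # drop c[0]

    # Example with arbitrary positive filterbank outputs (illustrative only).
    mfcc = cepstral_coefficients(np.linspace(1.0, 5.0, 26))
    print(mfcc.shape)   # (13,)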


In pattern matching, the similarity between unknown feature vectors and reference templates is measured. The Mel coefficient and delta Mel coefficient vectors are given as input to obtain acceptance matrices for the speakers. The pattern matching algorithm compares the incoming speech signal to the reference model and scores their difference, called a distance.
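
A minimal sketch of such a distance score, assuming the test and reference MFCC sequences are already frame-aligned and of equal length (practical systems usually rely on DTW, vector quantization, or statistical models instead):

    import numpy as np

    def match_distance(test_mfcc: np.ndarray, reference_mfcc: np.ndarray) -> float:
        """Average per-frame Euclidean distance between two (frames x coefficients)
        MFCC matrices; a smaller distance means a closer match."""
        return float(np.mean(np.linalg.norm(test_mfcc - reference_mfcc, axis=1)))

    def closest_speaker(test_mfcc: np.ndarray, references: dict) -> str:
        """Pick the registered speaker whose reference template is nearest."""
        return min(references, key=lambda name: match_distance(test_mfcc, references[name]))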

3. Facial Action Coding System (FACS)

Ekman and Friesen (1976, 1978) were pioneers in the development of measurement systems for facial
expression. Their system, known as the Facial Action Coding System or FACS, was developed based on a
discrete emotions theoretical perspective and is designed to measure specific facial muscle movements. A
second system, EMFACS, is an abbreviated version of FACS that assesses only those muscle movements
believed to be associated with emotional expressions. In developing these systems, Ekman importantly
distinguishes between two different types of judgments: those made about behavior (measuring sign vehicles) and
those that make inferences about behavior (message judgments). Ekman has argued that measuring specific
facial muscle movements (referred to as action units in FACS) is a descriptive analysis of behavior, whereas
measuring facial expressions such as anger or happiness is an inferential process whereby assumptions about
underlying psychological states are made. It is important to point out, as Ekman does, that any observational
system requires inferences about that which is being measured. Other available systems have been designed to
measure either specific aspects of facial behavior or more generally defined facial expressions[9].
The Facial Expression Coding System (FACES) was developed as a less time-consuming alternative
for measuring facial expression that is aligned with dimensional models of emotion. The system provides
information about the frequency, intensity, valence, and duration of facial expressions. The selection of the variables
included in the system was based on theory and previous empirical studies. Adopting the descriptive style of
Ekman and similar to the work of Notarious and Levenson (1979), an expression is defined as any change in
the face from a neutral display (i.e., no expression) to a non-neutral display and back to a neutral display.
When this activity occurs, a frequency count of expressions is initiated. Next, coders rate the valence (positive
or negative) and the intensity of each expression detected. Notice that this is quite different from assigning an
emotion term to each expression. While FACES requires coders to decide whether an expression is positive
or negative, it does not require the application of specific labels. There is support in the literature for this
approach, often referred to as the cultural informants approach. That is, judgments about emotion, in this case
whether an expression is positive or negative, are made by persons who are considered to be familiar with
emotion in a particular culture. In addition to valence and intensity, coders also record the duration of the
expression. Finally, a global expressiveness rating for each segment is made, and judgements about specific
emotions expressed throughout the segment can also be obtained[10].

How to Use FACES
FACES was initially developed to measure facial expressions in response to five-minute film clips. The
system can be adapted to other applications, however, and attempts to represent the broad applicability of the
system are made throughout the manual. Generally speaking, the system allows for the examination of a
participant's entire record of expressive behavior. When we videotaped participants viewing emotional films,
the soundtrack from the movie was not included on their videotapes. Thus, coders viewed only the participants'
facial reactions to the films. We have typically had two raters coding each participant. As will be discussed
below, reliability for FACES has routinely been very high [11].
Detecting an Expression
While viewing a participant's record, an expression is detected if the face changes from a neutral
display to a non-neutral display and then back to a neutral display. It is important to note, however, that a facial
display may not always return to the original neutral display but may instead return to a display that, although
neutral, does not exactly resemble the prior neutral expression. Additionally, if after a participant displays a shift
from a neutral to non- neutral display and, instead of returning to a neutral display, shows a clear change in
affective expression, this change is counted as an additional expression. For example, if while smiling, a partici-
pant then raises his or her eyebrows and stops smiling, indicating more of a surprised look, two expressions
will be coded.
Duration
Once an expression has been detected, the duration is measured. For convenience, a time-mark in
seconds should be included on participants' videotapes. The duration measurement should start as soon as the
participant changes from a neutral to non-neutral display. This time should be recorded on a coding form. The
duration measurement should stop as soon as the participant changes back from a non-neutral to neutral
display, and the time should be recorded on the coding form. The duration in seconds can then be calculated by
subtracting the beginning time from the end time and then recorded on the coding form. Mean duration can be
calculated by dividing the total duration by the frequency of expressions. Typically this is done separately for
positive and negative expressions.
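
The bookkeeping here is simple arithmetic; a short sketch of how the coded start and end times might be totalled (the function and argument names are ours):

    def mean_duration(start_times: list, end_times: list) -> float:
        """Each duration is end minus start (in seconds); the mean duration is the
        total duration divided by the number of expressions."""
        durations = [end - start for start, end in zip(start_times, end_times)]
        return sum(durations) / len(durations) if durations else 0.0

    # Example: three coded positive expressions.
    print(mean_duration([12.0, 40.5, 63.0], [15.5, 42.0, 70.0]))   # (3.5 + 1.5 + 7.0) / 3 = 4.0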
Valence
Next, the coder must decide whether the expression was positive or negative and make the appropriate
notation on the coding form. If there is doubt as to whether the expression is positive or negative, a
comprehensive list of affect descriptors is presented. Extensive research has established these descriptors as
either positive or negative. They are provided simply as a guide for coders in determining the valence of an
expression. Coders are not asked to supply a descriptor for each expression detected.

Intensity
Intensity ratings for an individual expression range from one to four (1=low, 2=medium, 3=high, 4=very
high). The low rating is given for those expressions that are mild, such as a smile where a participant slightly
raises the corners of his/her mouth but does not show the teeth, and very little movement around the eyes
occurs. The medium rating is given for those expressions where a participant's expression is more moderate
than mild in intensity, such as a smile bordering on a laugh, with the eyebrows slightly raised and the lips apart,
exposing the teeth. The high rating is given for an expression that involves most, if not all, of the face, such as
laughing with an open mouth and raising the eyebrows and cheeks. The very high rating is reserved for those
expressions that are very intense. An example of such an expression is one where a participant is undeniably
laughing, with the mouth completely open and the eyebrows and cheeks substantially raised.
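
One convenient way to hold the FACES variables described above (valence, the 1 to 4 intensity rating, and duration) for each coded expression is a small record type; this structure is our own bookkeeping aid, not part of the published system.

    from dataclasses import dataclass

    @dataclass
    class CodedExpression:
        """A single FACES expression: valence, intensity (1=low .. 4=very high),
        and duration in seconds."""
        valence: str        # "positive" or "negative"
        intensity: int      # 1, 2, 3, or 4
        duration_s: float

    def summarize(expressions: list, valence: str) -> dict:
        """Frequency, mean intensity, and mean duration for one valence."""
        subset = [e for e in expressions if e.valence == valence]
        n = len(subset)
        return {
            "frequency": n,
            "mean_intensity": sum(e.intensity for e in subset) / n if n else 0.0,
            "mean_duration_s": sum(e.duration_s for e in subset) / n if n else 0.0,
        }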

4. Classification of Speakers based on their Facial Expression and Speech
Features

Figure 4.1 shows the experimental setup. Our aim is to classify speakers based on their facial expressions.

Fig. 4.1 Experimental Setup. (Video frames undergo Gabor wavelet decomposition and feature extraction, and a data-driven classifier produces action unit outputs AU 1 through AU 46; in parallel, speech feature extraction produces 16 MFCC coefficients; both feature streams feed an ANN classifier.)

The system automatically detects frontal faces in the video stream and codes each frame with respect
to 20 action units. We present preliminary results on the task of facial action detection in spontaneous
expressions during discourse. At the same time we record the speech and extract 16 MFCC coefficients for
each person. We focus on classifying speakers from these facial and speech features and on comparing the
two feature streams in order to build a good database for this comparison. The system will operate at
24 frames/second on a 3 GHz Pentium IV for 320x240 images.
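
A minimal sketch of the final classification stage, assuming each training example concatenates the detected action-unit activations with the 16 MFCC coefficients into one vector; the use of scikit-learn's MLPClassifier, the hidden-layer size, and the four-speaker label set are our assumptions, not the configuration reported here.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    NUM_AUS, NUM_MFCC = 20, 16        # 20 coded action units + 16 MFCC coefficients

    # Toy training data: each row is [AU activations | MFCCs]; labels are speaker ids.
    rng = np.random.default_rng(1)
    X_train = rng.normal(size=(200, NUM_AUS + NUM_MFCC))
    y_train = rng.integers(0, 4, size=200)    # four enrolled speakers (illustrative)

    # A feed-forward ANN with one hidden layer, trained on the fused feature vectors.
    ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=1)
    ann.fit(X_train, y_train)

    # Classify the fused features of a new frame.
    X_test = rng.normal(size=(1, NUM_AUS + NUM_MFCC))
    print(ann.predict(X_test))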

5. Conclusion

There is a need in India today to work in the field of speech and speaker recognition. India is a
country where language and intonation vary every 50 km. It is very difficult to detect the emotion
of a person based on his facial expression alone. By comparing speech features with facial expressions,
we can classify speakers based on their facial expressions.

References

[1] Stephen C. Cook, “Speech Recognition HOW TO”,2000.


[2] Prabhakar, S., Pankanti, S., and Jain, A. Biometric recognition: security and privacy concerns. IEEE
Security & Privacy Magazine 1 (2003), 33–42.
[3] Kittler, J., and Nixon, M., Eds. 4th International Conference on Audio- and Video-Based Biometric
Person Authentication (AVBPA 2003). Lecture Notes in Computer Science. Springer-Verlag, Berlin, 2003.
[4] D. A. Reynolds, “An Overview of Automatic Speaker Recognition Technology”, ICASSP 2002, pp. 4072–4075.
[5] J.P. Campbell, “Speaker Recognition: A Tutorial”, Proc. of the IEEE, vol. 85, no. 9, Sept 1997, pp.
1437-1462.
[6] J. M. Naik, “Speaker Verification: A Tutorial”, IEEE Communications Magazine, January 1990, pp. 42–48.
[7] Doddington, G. Speaker recognition - identifying people by their voices. Proceedings of the IEEE 73, 11
(1985), 1651–1664.
[8] Furui, S. Digital Speech Processing, Synthesis, and Recognition, second ed. Marcel Dekker, Inc., New
York, 2001.
[9] Duchenne de Boulogne GB. The Mechanism of Human Facial Expression. New York, NY: Cambridge
University Press; 1990.
[10]Schilbach L, Eickhoff SB, Mojzisch A, Vogeley K. What’s in a smile? Neural correlates of facial embodiment
during social interaction. Soc Neurosci. 2008;3:37–50.
[11] Gosselin P, Kirouac G. Decoding facial emotional prototypes. Can J Exp Psychol. 1995;49:313–329.
