
Emotional Expressions in Audiovisual Human Computer Interaction

Lawrence S. Chen* & Thomas S. Huang


Beckman Institute & CSL
Univ. of Illinois at Urbana-Champaign,
Urbana, IL 61801, U.S.A.
{lchen,huang}@ifp.uiuc.edu

Abstract

Visual and auditory modalities are two of the most commonly used media in interactions between humans. In the present paper, we describe a system to continuously monitor the user’s voice and facial motions for recognizing emotional expressions. Such an ability is crucial for intelligent computers that take on a social role such as a tutor or a companion. We outline methods to extract audio and visual features useful for classifying emotions. Audio and visual information must be handled appropriately in single-modal and bimodal situations. We report audio-only and video-only emotion recognition on the same subjects, in person-dependent and person-independent fashions, and outline methods to handle bimodal recognition.

1 Introduction

In human-to-human communications, people often infer emotions from perceived facial expressions and voice tones. This is an important addition to the linguistic content of the exchange. As more and more computers are equipped with auditory and visual input devices, it becomes conceivable that computers may be trained to perform similar inference. In some applications where the computer takes on a social role such as a tutor or a companion, the interaction becomes more natural if the computer is more emotionally intelligent.

Work in psychology has shown that there are distinct expressions associated with different emotions on the face [3] and in the voice [8]. Recent work on quantifying facial expressions has also had much success [1, 4, 6]. However, this does not mean people always express their true emotions externally. Often the true emotion can be masked or even hidden, as in the case of the “poker face.” When people perceive displayed expressions, they often make comments like “You look happy.” or “He sounded angry.” It is in this light that we propose to make computers more aware of such expressions. Depending on the context, the computer may follow up with verification questions or offer help to the user. Thus by emotional expressions, we mean the externally perceivable expressions that may arise from an emotion. What may be recognized are the apparent emotions.

In our previous work [2], we found evidence of complementarity between the audio and visual modalities. In this paper, we describe new methods to extract features from facial motion tracking and audio processing on a new set of data involving native speakers of American English. When only one modality is available, single-modal methods are used. When both auditory and visual information are available, they must be handled and combined appropriately. We propose a system to continuously monitor the user’s face and voice, which requires proper handling and switching between single-modal and bimodal (auditory and visual) processing. We compare the recognition of emotional expressions from the voice and from the facial motions, and outline how they should be handled when both modalities are available.

In the following sections, we describe the data used and the proposed system to recognize emotional expressions from video and audio, and how to handle both. Then we present the results of some experiments.

2 Data of Emotional Expressions

Following Ekman’s conclusion that emotional expressions can be convincingly portrayed [3], we give human subjects a list of emotion categories from which they are asked to pose facial expressions and speak sentences with associated emotions. Three kinds of data are collected: (1) audio-only, where only the emotional audio data is recorded; (2) video-only, where the subjects only pose facial expressions without speaking; and (3) bimodal, where subjects pose facial expressions and speak with the associated vocal emotional expressions for each of the categories.

*Lawrence Chen is now with Eastman Kodak Company, Rochester, NY. Email: lawrence.chen@kodak.com.



3 Proposed System

The scenario is a human user interacting with a computer equipped with a video camera “looking at” the user, and a microphone “listening to” the user. Figure 1 shows the proposed system that performs emotion recognition from the audio and visual modalities. There are three modes of operation: (1) audio-only, (2) video-only, and (3) bimodal. The audio detector and video detector both act as switches (S1 and S2) to inform the system to switch to the appropriate modes. The audio detector detects whether the user is speaking, and the video detector informs the recognition module whether the video tracking is reliable.


Figure 1. Bimodal emotion recognizer.
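
As an illustration of this switching logic, the sketch below shows one way the two detector outputs could drive the mode selection (a simplified sketch; the function and type names are our own and not part of the described system):

from enum import Enum

class Mode(Enum):
    AUDIO_ONLY = 1
    VIDEO_ONLY = 2
    BIMODAL = 3
    IDLE = 4

def select_mode(speech_detected: bool, tracking_reliable: bool) -> Mode:
    """Map the audio detector (S1) and video detector (S2) outputs to a processing mode."""
    if speech_detected and tracking_reliable:
        return Mode.BIMODAL
    if speech_detected:
        return Mode.AUDIO_ONLY
    if tracking_reliable:
        return Mode.VIDEO_ONLY
    return Mode.IDLE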

4 Video-only Mode

Figure 2. Sample frames in the database: (a) disgust; and (b) happiness.

A facial expression can be described with a simple model: neutral-expression-neutral. The transition is usually short compared to the duration of the expressions, therefore each video frame can be labeled as belonging to one of the emotions. Here neutrality is also considered an emotion.

A novel tracking algorithm developed by Tao [9], called Piecewise Bezier Volume Deformation (PBVD) tracking, is used for the facial motion measurement. First, a 3D face model embedded in a Bezier volume is constructed by manual selection of landmark facial features. Then, for each adjacent pair of frames in the video sequence, optical flow is computed. To avoid error accumulation, templates from the previous frame as well as from the first frame are used. From the motion of many points on the face, 3D motions of the head and facial deformations can be recovered using least squares. The tracker uses predefined “action units (AUs)” which describe some basic motions on the face. Facial motions can be thought of as linear combinations of these AUs. The final output of the tracking system is a vector containing the strengths of the AUs. In this work, we use six AUs for the mouth movements, two for eyebrow movements, two for cheek lifting, and two for eyelid motions. This is a good framework for analyzing facial expressions. The tracking assumes the expression is neutral in the first frame, where all AUs have the value of zero.
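
To illustrate the least-squares step, the following sketch recovers AU strengths from tracked point displacements by treating the observed deformation as a linear combination of predefined AU basis displacements (a simplified example with hypothetical names and NumPy; it omits the head-motion terms and is not the PBVD implementation itself):

import numpy as np

def recover_au_strengths(au_basis, observed_displacement):
    """Estimate AU strengths by least squares.

    au_basis: (3*P, K) array; column k holds the 3D displacements of P mesh
        points produced by action unit k at unit strength (predefined).
    observed_displacement: (3*P,) array; displacements of the same points
        measured by the tracker relative to the neutral first frame.
    Returns a length-K vector of AU strengths.
    """
    strengths, _, _, _ = np.linalg.lstsq(au_basis, observed_displacement, rcond=None)
    return strengths

# Toy usage: 12 AUs (6 mouth, 2 eyebrow, 2 cheek, 2 eyelid) and 100 tracked points.
rng = np.random.default_rng(0)
B = rng.normal(size=(300, 12))                 # hypothetical AU displacement basis
true_s = rng.uniform(0, 1, size=12)            # hypothetical ground-truth strengths
d = B @ true_s + 0.01 * rng.normal(size=300)   # noisy observed deformation
print(np.round(recover_au_strengths(B, d), 2))
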
With the features measured, the Sparse Network of Winnows (SNoW) classifier is used to classify emotions. The details of SNoW can be found in [7]. One advantage of the SNoW classifier is that it does not require a large amount of training data. Two configurations of SNoW are used here: one with Winnow output nodes (SNoW), and the other with Naive Bayes output nodes (SNoW-NB). Half of the data are used for training and the other half for testing. Results show that SNoW is a good classifier for the current application. Recognition accuracies are included in the results section. The results are obtained from the face tracker in the off-line mode, but a near real-time rule-based classifier has also been implemented with good results, demonstrating the feasibility of the tracker for real applications.

5 Audio-only Mode

For the audio, prosodic features, including pitch, energy, and rate of speech, carry information related to emotions. Pitch and energy are computed using the ESPS get_f0 command. Then the speech rate can be found using a recursive convex-hull algorithm [5] which treats large peaks in the energy envelope of voiced regions as “syllabic” or “syllable-like” units.
Global features such as the mean, standard deviation, and derivatives of the pitch and energy are computed. The syllabic rate is used as a feature for the speech rate. Figure 3 shows the audio processing. Figure 3(a) shows the original waveform. The computed pitch contour is in Figure 3(b). The curve in Figure 3(c) is the energy of the signal, with rectangular pulses indicating the voiced regions of the speech. Finally, the overlapping rectangles in Figure 3(d) represent the syllabic units.
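
A rough sketch of assembling the global prosodic features is given below (our own illustration; it assumes the frame-level pitch and energy contours and the syllable count have already been computed elsewhere, and it does not reproduce the ESPS tools or the convex-hull segmentation):

import numpy as np

def prosodic_features(pitch, energy, voiced, n_syllables, duration_s):
    """Global prosodic features from frame-level contours.

    pitch, energy: per-frame NumPy arrays; voiced: boolean mask of voiced frames.
    n_syllables: syllabic units found in the energy envelope; duration_s: utterance length in seconds.
    """
    p = pitch[voiced]                                 # use pitch only where speech is voiced
    return {
        "pitch_mean": p.mean(), "pitch_std": p.std(),
        "pitch_slope_mean": np.mean(np.diff(p)),      # derivative statistic
        "energy_mean": energy.mean(), "energy_std": energy.std(),
        "energy_slope_mean": np.mean(np.diff(energy)),
        "speech_rate": n_syllables / duration_s,      # syllabic rate
    }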

Figure 3. Audio processing: (a) original waveform; (b) pitch contour; (c) RMS energy envelope (with square pulses indicating voiced regions); (d) syllabic units (each rectangle contains one syllabic unit).

Once the features are extracted, the classification is accomplished by modeling each class with a Gaussian distribution, then testing the samples using leave-one-out cross-validation. The reason for using a different classifier than for the facial expressions is that the audio features are on a more global scale, usually over a sentence or at least a phrase, while the video features are available at video rate. The amount of audio data is insufficient for SNoW. Sequential forward feature selection is used to select the best features out of the 18 features. Usually the best performance is reached with fewer than 5 features.
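
The classification step can be sketched as follows (a minimal per-class Gaussian model with diagonal covariance, leave-one-out testing, and greedy forward selection; the function names and the small variance floor are our own choices, not taken from the original implementation):

import numpy as np

def loo_gaussian_accuracy(X, y, feat_idx):
    """Leave-one-out accuracy of a per-class Gaussian (diagonal covariance)
    restricted to the feature subset feat_idx. X: (N, D) array, y: (N,) labels."""
    X = X[:, feat_idx]
    correct = 0
    for i in range(len(X)):
        train = np.ones(len(X), dtype=bool); train[i] = False
        scores = {}
        for c in np.unique(y[train]):
            Xc = X[train][y[train] == c]
            mu, var = Xc.mean(0), Xc.var(0) + 1e-6
            # log-likelihood of the held-out sample under class c's Gaussian
            scores[c] = -0.5 * np.sum((X[i] - mu) ** 2 / var + np.log(2 * np.pi * var))
        correct += max(scores, key=scores.get) == y[i]
    return correct / len(X)

def forward_select(X, y, max_feats=5):
    """Greedy sequential forward feature selection driven by LOO accuracy."""
    chosen, best_acc = [], 0.0
    while len(chosen) < max_feats:
        cand = [(loo_gaussian_accuracy(X, y, chosen + [f]), f)
                for f in range(X.shape[1]) if f not in chosen]
        acc, f = max(cand)
        if acc <= best_acc:
            break
        chosen.append(f); best_acc = acc
    return chosen, best_acc
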
6 Bimodal Mode

In this section we discuss how to combine the two modalities. Note that both modalities are not always available; when only one is available, the single-modal methods are used. The bimodal mode applies when the user speaks and the camera also has a good view of the user’s face, so that the facial motion tracking is meaningful. When this happens, the motions around the mouth are mainly for voicing or for producing speech, and may not contribute much to facial expressions. The brow movements provide more information, but sometimes they also move to signal emphasis in the speech. Therefore, we propose a new way of handling the two modalities. When the user is speaking, we use mainly the audio features to detect vocal emotions. Often a pure facial expression accompanies the speech right before or after the sentence, which the video-only mode can handle. Then, with these two happening sequentially in time, the information from the single-modal modes is fused to produce the final recognition result.
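
Since the fusion rule itself is not specified above, the sketch below shows one plausible decision-level combination of the sequential audio-only and video-only results (a hypothetical weighted sum of per-class scores; the function name and weighting are our own assumptions):

def fuse_sequential(audio_scores, video_scores, w_audio=0.5):
    """Decision-level fusion of the audio-only result (from the spoken segment)
    and the video-only result (from the facial expression just before/after it).

    audio_scores, video_scores: dicts mapping emotion label -> score, assumed
    normalized to sum to 1. Returns the fused label and the fused scores."""
    labels = set(audio_scores) | set(video_scores)
    fused = {c: w_audio * audio_scores.get(c, 0.0)
                + (1 - w_audio) * video_scores.get(c, 0.0) for c in labels}
    return max(fused, key=fused.get), fused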


7 Results

We analyzed data from 5 different subjects, each posing several facial expressions and speaking with vocal emotion for six different emotions: Happiness, Sadness, Disgust, Fear, Anger, and Surprise.

Table 1 shows the recognition results of facial expressions. Our framework for facial expression recognition is effective, as the results indicate. For each subject, half of the video data are used as training data, and the other half for testing.

Table 1. Overall accuracy of person-dependent video-only emotion recognition.

Subject | SNoW-NB | SNoW

Table 2 shows the accuracies of audio-only emotion recognition. They are in general lower than the video-only results, but still much higher than chance. The results are obtained from leave-one-out cross-validation.

Table 2. Overall accuracy of person-dependent audio-only emotion recognition.

Subject | Overall Accuracy | Features
1       | 71.43%           | 2
2       | 61.90%           | 2
3       | 66.70%           | 4
4       | 57.14%           | 3
5       | 76.19%           | 2

In addition to person-dependent tests, we also perform person-independent expression recognition for audio-only and video-only. Data from four subjects are used as training data, the data from the remaining subject are used for testing, and this process is repeated five times. Table 3 shows the person-independent results of emotion recognition. In this table, set 1 means the data from subject 1 are used for testing, etc. Even though the overall accuracies are low, two categories, Happiness and Surprise, still maintain above 80% performance in video-only tests.

Table 3. Overall accuracy of person-independent emotion recognition.

Set | Video-only | Audio-only
1   | 40.95%     |
2   | 49.52%     |
3   | 41.14%     | 55.24%
4   | 59.05%     |
5   | 58.73%     | 62.86%

A difference between recognizing facial expressions and vocal emotions is that the data rates are different. In our experiments, each frame of video is treated as a data sample, but for audio there is only one sample per sentence. This means the audio features are more on a global scale, while the video features can be obtained at a much finer scale. Also, since the video data are more redundant, the system can also work at reduced frame rates, as seen in our near real-time implementation, which operates at about 6 frames per second.

8 Conclusions

In this paper, we discussed recognition of emotional expressions on the face and in the voice. We showed that single-modality methods are important when only one modality is available. Prosodic features contain information related to vocal emotions, and facial movements in terms of Action Units can provide information for facial expressions. Then we outlined how to handle both modalities when they are both present. Information from the two modalities may not be available for emotion recognition at all times. Particularly, in the case where the subject is speaking, the facial motions around the mouth are mostly used to produce speech and not directly related to emotion. Thus even though we can track the motions well, it is difficult to use this information to infer emotion. We proposed a new method to integrate both modalities.

The system performs better when it is trained specifically for each person. Comparing video recognition and audio recognition, we think that there are more consistent clues on the face than in the voice.

9 Acknowledgments

This work was supported in part by National Science Foundation Grant CDA 96-24396, in part by the Yamaha Motor Corporation, and in part by a fellowship from the Eastman Kodak Company.

References

[1] M. J. Black and Y. Yacoob. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In Proc. International Conf. Computer Vision, pages 374-381, Cambridge, USA, 1995.
[2] L. S. Chen, H. Tao, T. S. Huang, T. Miyasato, and R. Nakatsu. Emotion recognition from audiovisual information. In Proc. IEEE Workshop on Multimedia Signal Processing, Los Angeles, CA, USA, Dec. 7-9, 1998.
[3] P. Ekman, editor. Emotion in the Human Face. Cambridge University Press, Cambridge, 2nd edition, 1982.
[4] I. A. Essa and A. P. Pentland. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Trans. PAMI, 1997.
[5] P. Mermelstein. Automatic segmentation of speech into syllabic units. J. Acoust. Soc. Am., 58:880-883, October 1975.
[6] T. Otsuka and J. Ohya. Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences. In Proc. Int. Conf. on Image Processing (ICIP-97), pages 546-549, Santa Barbara, CA, USA, Oct. 26-29, 1997.
[7] D. Roth, M.-H. Yang, and N. Ahuja. A SNoW-based face detector. In Neural Information Processing Systems 12, 1999.
[8] K. R. Scherer. Adding the affective dimension: A new look in speech analysis and synthesis. In Proc. International Conf. on Spoken Language Processing 1996, Philadelphia, PA, USA, October 3-6, 1996.
[9] H. Tao. Nonrigid Motion Modeling and Analysis in Video Sequences for Realistic Facial Animation. PhD thesis, University of Illinois at Urbana-Champaign, 1998.