[Figure: system block diagram; recoverable labels include AUDIO DETECTOR and VIDEO DETECTOR]

Figure 2. Sample frames in the database: (a) disgust; and (b) happiness.

4 Video-only Mode
A facial expression can be described with a simple model: neutral-expression-neutral. The transition is usually short compared to the duration of the expression itself, therefore each video frame can be labeled as belonging to one of the emotions. Here neutrality is also considered an emotion.

A novel tracking algorithm developed by Tao [9], called Piecewise Bezier Volume Deformation (PBVD) tracking, is used for the facial motion measurement. First, a 3D face model embedded in a Bezier volume is constructed by manual selection of landmark facial features. Then, for each adjacent pair of frames in the video sequence, optical flow is computed. To avoid error accumulation, templates from the previous frame as well as from the first frame are used. From the motion of many points on the face, the 3D motion of the head and the facial deformations can be recovered using least squares. The tracker uses predefined "action units" (AUs) which describe basic motions on the face, and facial motions can be thought of as linear combinations of these AUs. The final output of the tracking system is a vector containing the strengths of the AUs. In this work, we use six AUs for mouth movements, two for eyebrow movements, two for cheek lifting, and two for eyelid motions. This is a good framework for analyzing facial expressions. The tracking assumes the expression is neutral in the first frame, where all AUs have the value of zero.
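As a rough illustration of this representation (not the actual PBVD implementation), the sketch below recovers AU strengths by least squares from observed point displacements; the basis matrix, point count, and array shapes are all assumed for the example.

```python
import numpy as np

# Hypothetical illustration of the AU-strength recovery step: column k of the
# assumed basis matrix D holds the displacement of N tracked mesh points under
# a unit activation of AU k (6 mouth, 2 eyebrow, 2 cheek, 2 eyelid AUs), and d
# is the observed displacement for one frame with rigid head motion removed.
def au_strengths(D, d):
    """Least-squares fit of facial motion as a linear combination of AUs."""
    a, *_ = np.linalg.lstsq(D, d, rcond=None)
    return a  # length-12 vector of AU strengths

# Toy usage: a random basis and a frame synthesized from known strengths.
rng = np.random.default_rng(0)
D = rng.normal(size=(3 * 50, 12))        # 50 points x 3 coordinates, 12 AUs
true_a = rng.uniform(0.0, 1.0, size=12)
print(np.allclose(au_strengths(D, D @ true_a), true_a))  # True
```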
With the features measured, the Sparse Network of Winnows (SNoW) classifier is used to classify emotions. The details of SNoW can be found in [7]. One advantage of the SNoW classifier is that it does not require a large amount of training data. Two configurations of SNoW are used here: one with Winnow output nodes (SNoW), and the other with Naive Bayes output nodes (SNoW-NB). Half of the data are used for training and the other half for testing. Results show that SNoW is a good classifier for the current application; recognition accuracies are included in the results section. The results are obtained with the face tracker in off-line mode, but a near real-time rule-based classifier has also been implemented with good results, demonstrating the feasibility of the tracker for real applications.
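For intuition only, here is a minimal sketch of the multiplicative Winnow update that SNoW is built on, with one linear unit per emotion over binarized AU features; the quantization scheme, learning rate, and thresholds are assumptions and do not reproduce the actual SNoW configuration of [7].

```python
import numpy as np

EMOTIONS = ["neutral", "happiness", "disgust"]    # example subset of classes

def binarize(au_vector, bins=5):
    """Quantize an AU strength vector (values in [0, 1]) into binary features."""
    au = np.asarray(au_vector, dtype=float)
    idx = np.clip((au * bins).astype(int), 0, bins - 1)
    feats = np.zeros(au.size * bins)
    feats[np.arange(au.size) * bins + idx] = 1.0
    return feats

class WinnowUnit:
    """One target node trained with multiplicative (Winnow) updates."""
    def __init__(self, n_features, alpha=1.5):
        self.w = np.ones(n_features)
        self.alpha = alpha
        self.theta = float(n_features)            # firing threshold

    def score(self, x):
        return float(self.w @ x)

    def update(self, x, is_target):
        fired = self.score(x) >= self.theta
        active = x > 0
        if is_target and not fired:
            self.w[active] *= self.alpha          # promotion
        elif not is_target and fired:
            self.w[active] /= self.alpha          # demotion

def train(samples, labels, n_features, epochs=5):
    """samples: binarized feature vectors; labels: emotion names."""
    units = {e: WinnowUnit(n_features) for e in EMOTIONS}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            for emotion, unit in units.items():
                unit.update(x, emotion == y)
    return units

def predict(units, x):
    return max(units, key=lambda e: units[e].score(x))
```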
5 Audio-only Mode

For the audio, prosodic features, including pitch, energy, and rate of speech, carry information related to emotions. Pitch and energy are computed using the ESPS get_f0 command. The speech rate can then be found using a recursive convex-hull algorithm [5], which treats large peaks in the energy contour as syllabic nuclei.
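The sketch below shows what sentence-level prosodic features of this kind might look like, assuming frame-level pitch and energy contours have already been produced by a pitch tracker; the energy-peak count is only a crude stand-in for the recursive convex-hull syllable segmentation of [5], and the statistics chosen are illustrative rather than the paper's exact feature set.

```python
import numpy as np

def prosodic_features(f0, energy, frame_rate=100.0):
    """Summarize frame-level pitch (f0) and energy contours for one sentence."""
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    voiced = f0[f0 > 0]                        # unvoiced frames carry f0 == 0
    duration_s = len(f0) / frame_rate

    # Crude syllable-nucleus count: local energy maxima above a relative threshold.
    thresh = 0.5 * energy.max()
    peaks = np.sum((energy[1:-1] > energy[:-2]) &
                   (energy[1:-1] >= energy[2:]) &
                   (energy[1:-1] > thresh))

    return {
        "f0_mean": float(voiced.mean()) if voiced.size else 0.0,
        "f0_range": float(voiced.max() - voiced.min()) if voiced.size else 0.0,
        "f0_std": float(voiced.std()) if voiced.size else 0.0,
        "energy_mean": float(energy.mean()),
        "energy_max": float(energy.max()),
        "speech_rate": float(peaks) / duration_s,   # syllables per second
    }
```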
When the subject is speaking, the motions around the mouth are mainly for voicing or for producing speech, and may not contribute much to facial expressions. The brow movements provide more information, but sometimes they also move to signal emphasis in the speech. Therefore, we propose a new way of handling the two modalities. When the user is speaking, we use mainly the audio features to detect vocal emotions. Often a pure facial expression accompanies the sentence right before or after it, which the video-only mode can handle. With these two happening sequentially in time, the information from the single-modal modes is fused to produce the final recognition result.
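As an illustration of this sequential scheme, the sketch below fuses the video-only decisions taken just before and after the utterance with the audio-only decision taken during it; the confidence field and the simple agreement-then-confidence rule are assumptions for the example, not the paper's exact fusion method.

```python
from dataclasses import dataclass

@dataclass
class SegmentResult:
    emotion: str
    confidence: float          # classifier score in [0, 1]

def fuse(video_before: SegmentResult,
         audio_during: SegmentResult,
         video_after: SegmentResult) -> str:
    """Combine single-modality decisions taken sequentially in time."""
    votes = [video_before, audio_during, video_after]
    # If any two segments agree, report that emotion.
    for i in range(len(votes)):
        for j in range(i + 1, len(votes)):
            if votes[i].emotion == votes[j].emotion:
                return votes[i].emotion
    # Otherwise fall back to the most confident single-modality decision.
    return max(votes, key=lambda r: r.confidence).emotion

print(fuse(SegmentResult("happiness", 0.7),
           SegmentResult("neutral", 0.6),
           SegmentResult("happiness", 0.8)))   # -> happiness
```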
Subject    Overall Accuracy    Features
1          71.43%              2
2          61.90%              2
3          66.70%              4
4          57.14%              3
5          76.19%              2

Table 3. Overall accuracy of person-independent emotion recognition.

[Table: recognition accuracy per set, Video-only vs. Audio-only; only some entries are recoverable: 40.95%, 49.52%, 41.14%, 55.24%, 59.05%, 58.73%, 62.86%]

In the experiments, each frame of video is treated as a data sample, but for audio there is only one sample per sentence. This means the audio features are on a more global scale, while the video features can be obtained at a much finer scale. Also, since the video data are more redundant, the system can also work at reduced frame rates, as seen in our near real-time implementation, which operates at about 6 frames per second.
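Since the video classifier therefore produces many frame-level labels per sentence while the audio classifier produces one per sentence, a simple way to put them on the same footing (assumed here for illustration; the paper does not specify this step) is to subsample the frame stream to roughly the 6 fps of the real-time implementation and take a majority vote over the sentence:

```python
from collections import Counter

def subsample(frame_labels, src_fps=30, dst_fps=6):
    """Keep roughly one frame label out of every src_fps/dst_fps."""
    step = max(1, round(src_fps / dst_fps))
    return frame_labels[::step]

def sentence_label(frame_labels):
    """Reduce frame-level emotion labels to one sentence-level label."""
    return Counter(frame_labels).most_common(1)[0][0]

labels = ["neutral"] * 20 + ["happiness"] * 70 + ["neutral"] * 10
print(sentence_label(subsample(labels)))        # -> happiness
```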
8 Conclusions

In this paper, we discussed recognition of emotional expressions on the face and in the voice. We showed that single-modality methods are important when only one modality is available: prosodic features contain information related to vocal emotions, and facial movements in terms of action units can provide information for facial expressions. We then outlined how to handle the two modalities when both are present. Information from the two modalities may not be available for emotion recognition at all times. In particular, when the subject is speaking, the facial motions around the mouth are mostly used to produce speech and are not directly related to emotion. Thus, even though we can track the motions well, it is difficult to use this information to infer emotion. We proposed a new method to integrate both modalities.

The system performs better when it is trained specifically for each person. Comparing video recognition and audio ...

9 Acknowledgments

This work was supported in part by National Science Foundation Grant CDA 96-24396, in part by the Yamaha Motor Corporation, and in part by a fellowship from the Eastman Kodak Company.

References

[1] M. J. Black and Y. Yacoob. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In Proc. International Conf. Computer Vision, pages 374-381, Cambridge, USA, 1995.
[2] L. S. Chen, H. Tao, T. S. Huang, T. Miyasato, and R. Nakatsu. Emotion recognition from audiovisual information. In Proc. IEEE Workshop on Multimedia Signal Processing, Los Angeles, CA, USA, Dec. 7-9, 1998.
[3] P. Ekman, editor. Emotion in the Human Face. Cambridge University Press, Cambridge, 2nd edition, 1982.
[4] I. A. Essa and A. P. Pentland. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Trans. PAMI, 1997.
[5] P. Mermelstein. Automatic segmentation of speech into syllabic units. J. Acoust. Soc. Am., 58:880-883, October 1975.
[6] T. Otsuka and J. Ohya. Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences. In Proc. Int. Conf. on Image Processing (ICIP-97), pages 546-549, Santa Barbara, CA, USA, Oct. 26-29, 1997.
[7] D. Roth, M.-H. Yang, and N. Ahuja. A SNoW-based face detector. In Neural Information Processing Systems-12, 1999.
[8] K. R. Scherer. Adding the affective dimension: A new look in speech analysis and synthesis. In Proc. International Conf. on Spoken Language Processing 1996, Philadelphia, PA, USA, October 3-6, 1996.
[9] H. Tao. Nonrigid Motion Modeling and Analysis in Video Sequences for Realistic Facial Animation. PhD thesis, University of Illinois at Urbana-Champaign, 1998.