
Emotion recognition and estimation from tracked lip features

Ulrik Söderström and Haibo Li


Digital Media Lab, Umeå, Sweden
ulrik.soderstrom@tfe.umu.se, haibo.li@tfe.umu.se

Abstract

We transmit and display information through the visual image of our lips: information about our speech and about our emotions and expressions. By tracking the movement of the lips, computers can make use of the visual part of this information. We focus on how to make use of tracked lip features for emotions. We have found that people are better at interpreting basic emotions displayed through an animation of lips than at interpreting the same emotions displayed through a real video sequence showing the lower part of the face. We have also successfully transferred three basic emotions from visual information into information in another modality, touch, via vibrations.

1 Introduction

Every human being with vision uses lip reading to improve the perception of speech. Speech is multi-modal; it depends on both auditory and visual cues, as McGurk and McDonald [1] showed. Emotions and expressions are reflected on our lips, which implies that we can use human lip movements to improve aspects associated with speech and emotions, such as verbal communication with other humans or with machines, i.e., speech-driven devices. We also use our lips to express emotions and show expressions. Knowing the positions of the lip contours gives access to the visual part of spoken language and emotions. A popular use of visual lip information is speechreading. Tracked lip features can also be used to improve speech perception [2], not only for humans but also for machines. Speech-controlled devices, such as mobile phones, can be made more accurate by adding visual input to the existing audio input. A computer can perform emotion estimation, enabling visually impaired people to understand emotions through a modality other than vision or hearing, e.g., touch. Emotion estimation can also be used by people with vision, to increase their perception of emotions.

Many researchers have proposed the use of lip tracking, or feature extraction of the lips, to improve the performance of speech recognition compared to using only the acoustic speech input. One of the first such systems was built by Petajan et al. [3], who used oral cavity features for visual speech recognition. Mase and Pentland [4] used optical flow analysis to examine the motion of the mouth, while Chiou and Hwang [5] and Stork et al. [6] used neural networks for image sequence recognition in their lipreading systems.

A modest amount of existing research focuses on multimodal communication and combines different modalities into a single system for analyzing human communicative reactions. Examples are Chen et al. [7] and De Silva et al. [8], who studied the effect of combined detection of facial and vocal expressions and emotions. Pantic and Rothkrantz [9] provided a good survey of facial expression analysis done by computers. Fasel and Luettin [10] surveyed research on automatic facial expression analysis and the techniques that can be used for it. Müller et al. [11] used hidden Markov models for recognition of facial expressions, while Sebe et al. [12] used a naive Bayes classifier to recognize emotions.

We have developed an automatic, real-time lip tracker that can track the four lip contours at up to 25 frames per second. The lip tracker was developed to verify and test the different areas in which tracked lip information can be used. We have focused on using the tracked lips for emotion estimation and recognition and will only briefly describe the lip tracking system here. By making use of the tracked lip contours, we can perform multimodal integration and create an animation of the lips. We can thereby improve visual communication and perception, for example of emotions expressed through the face.

This article is organized as follows. Section 2 describes our lip tracking system. Section 3 describes how we have used the tracked lip features for emotion recognition and estimation. Section 4 contains the test results and section 5 concludes the paper.

2 The lip tracking system
In our lip tracking system we used a webcamera attached to a light metal arm, which in turn is mounted on a helmet. The camera was directed so that it captured the face of the person wearing the helmet. The helmet simplified the calculations by removing rotations and occlusions from the algorithm. To track the lips with greater certainty, the nostrils were tracked first, since they are fairly easy to track. Once the nostrils were found, the search area for the lips became significantly smaller. By classifying the pixels according to their color information, we obtained a number of candidates for belonging to a lip contour. Dynamic programming, i.e., the Viterbi algorithm [13], was then used to extract the lip contours from all the candidates. The entire lip contours can be represented efficiently by a number of feature points. We chose to use 16 points that are MPEG-4 compliant, meaning that they can be used with the MPEG-4 facial animation engine [14]. Results from the extraction of feature points are shown in Fig. 1.

Figure 1. Extracted feature points for different mouth shapes.
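As a rough illustration of this kind of pipeline, the sketch below scores pixels with a simple redness measure and then runs a Viterbi-style dynamic program that selects one contour row per image column before sampling 16 feature points. The scoring function, the jump penalty and the column-wise candidate model are illustrative assumptions, not the implementation described above.

```python
# Minimal sketch of contour extraction by dynamic programming, in the spirit of
# the pipeline above (color classification followed by a Viterbi-style search).
# Thresholds and scoring are illustrative assumptions only.
import numpy as np

def lip_color_score(bgr_roi):
    """Score each pixel by how 'lip-like' its color is (red dominance)."""
    b = bgr_roi[..., 0].astype(float)
    g = bgr_roi[..., 1].astype(float)
    r = bgr_roi[..., 2].astype(float)
    return r / (r + g + b + 1e-6)            # higher = more likely lip pixel

def extract_contour(score, jump_penalty=0.2):
    """Pick one row per column so that the summed color score is maximal while
    large vertical jumps between neighbouring columns are penalised
    (dynamic programming / Viterbi over the columns)."""
    rows, cols = score.shape
    cost = np.full((rows, cols), -np.inf)
    back = np.zeros((rows, cols), dtype=int)
    cost[:, 0] = score[:, 0]
    for c in range(1, cols):
        for r in range(rows):
            prev = cost[:, c - 1] - jump_penalty * np.abs(np.arange(rows) - r)
            back[r, c] = int(np.argmax(prev))
            cost[r, c] = score[r, c] + prev[back[r, c]]
    # Backtrack the best path from the last column.
    path = np.zeros(cols, dtype=int)
    path[-1] = int(np.argmax(cost[:, -1]))
    for c in range(cols - 1, 0, -1):
        path[c - 1] = back[path[c], c]
    return path                                # contour row index for every column

def sample_feature_points(path, n_points=16):
    """Reduce the dense contour to a fixed number of (x, y) feature points."""
    cols = np.linspace(0, len(path) - 1, n_points).astype(int)
    return np.stack([cols, path[cols]], axis=1)

# Usage on a hypothetical mouth region of interest (a BGR image patch):
roi = np.random.randint(0, 256, (60, 120, 3), dtype=np.uint8)
points = sample_feature_points(extract_contour(lip_color_score(roi)))
```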
We constructed our own lip animation, based on runout cubic spline interpolation [15]. The animation consists of cubic splines for each of the lip contours, and the areas between the contours are painted with lip color. The animation can be seen in Fig. 2.

Figure 2. Lip animation drawn on top of the facial image. The animation is usually drawn against a white background, but is drawn on top of the facial image here to visualize the result better.
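The sketch below illustrates the idea of turning the feature points of one contour into a smooth curve. SciPy's natural-boundary CubicSpline stands in for the runout spline of [15], the example point coordinates are made up, and the rendering omits the lip-color fill between the contours.

```python
# Minimal sketch of drawing one lip contour as a cubic spline through its
# feature points; the fill step described in the text is omitted here.
import numpy as np
from scipy.interpolate import CubicSpline

def contour_curve(points, samples=200):
    """Interpolate a smooth curve through the (x, y) feature points of one contour."""
    points = points[np.argsort(points[:, 0])]            # sort left-to-right
    spline = CubicSpline(points[:, 0], points[:, 1], bc_type='natural')
    xs = np.linspace(points[0, 0], points[-1, 0], samples)
    return np.stack([xs, spline(xs)], axis=1)

def rasterize(curves, height, width, lip_value=255):
    """Draw the interpolated contours into a blank frame."""
    frame = np.zeros((height, width), dtype=np.uint8)
    for curve in curves:
        xs = np.clip(curve[:, 0].round().astype(int), 0, width - 1)
        ys = np.clip(curve[:, 1].round().astype(int), 0, height - 1)
        frame[ys, xs] = lip_value
    return frame

# Hypothetical upper and lower outer-contour points (x, y) in pixel coordinates:
upper = np.array([[10, 30], [30, 20], [60, 15], [90, 20], [110, 30]], float)
lower = np.array([[10, 30], [30, 42], [60, 48], [90, 42], [110, 30]], float)
animation_frame = rasterize([contour_curve(upper), contour_curve(lower)], 60, 120)
```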
3 Emotion estimation and recognition

The visual information from our lips can be used for multiple purposes. It can be used to create more information, or a different kind of information, making the information available to modalities other than audio. One example is emotion recognition based on visual information, for people with impaired vision. If you cannot see the visual part of emotions, you have no choice but to rely solely on your hearing to interpret emotions. What happens if somebody is quiet and you still need to interpret his or her emotions correctly? A computer can interpret the emotions for you and then present its interpretation, preferably in a form other than visual information. Human-computer interaction is another area where emotion recognition can bring improvement. If the computer can recognize and interpret your emotions correctly, we move closer to a more human-to-human-like interaction between computers and humans.

A. Emotion estimation

Ekman and Friesen [16] proposed that there are six basic emotions: happiness, surprise, sadness, anger, fear and disgust. We chose to limit our first implementation to recognizing only three of these basic emotions, i.e., happiness, sadness and surprise. These three were chosen since they have quite different mouth shapes. To discriminate between all of the basic emotions you would need, at the least, information about the eyes and the eyebrows as well as information from the mouth. The emotion recognition was performed based on the static shape of the mouth. The computer used a number of criteria to characterize and recognize the different emotions: whether the mouth was open or closed, the width and height of the mouth, and the relative positions of certain feature points. We also made some assumptions about the emotions: sadness can only occur when the mouth is closed, and surprise can only occur when the mouth is open. If the mouth was closed, the system could therefore recognize happiness and sadness; if the mouth was open, it could recognize happiness and surprise. If all criteria for an emotion were fulfilled, the emotion was recognized by the system. Examples of detected emotions are shown in Fig. 3.

Figure 3. Emotions detected by the lip tracking system, shown both for real video and lip animation.
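A minimal sketch of this kind of rule-based check is given below. The thresholds and the exact geometric tests are assumptions; only the overall logic (open or closed mouth, width and height, relative corner positions, and the closed-mouth/open-mouth assumptions for sadness and surprise) follows the description above.

```python
# Minimal sketch of a rule-based emotion check over the 16 tracked lip points.
# The numeric thresholds are illustrative assumptions, not the paper's values.
import numpy as np

def classify_emotion(points, neutral_width, open_threshold=4.0, stretch=1.15):
    """points: (16, 2) array of lip feature points, (x, y) in image pixels.
    Returns 'happiness', 'sadness', 'surprise' or None (no emotion recognized)."""
    xs, ys = points[:, 0], points[:, 1]
    width = xs.max() - xs.min()
    height = ys.max() - ys.min()
    mouth_open = height > open_threshold
    # Relative position of the mouth corners versus the contour centre
    # (image y axis points downwards, so 'raised' means a smaller y value):
    left_corner, right_corner = points[np.argmin(xs)], points[np.argmax(xs)]
    corners_raised = (left_corner[1] + right_corner[1]) / 2 < ys.mean()

    if width > stretch * neutral_width and corners_raised:
        return 'happiness'                    # wide mouth, raised corners
    if mouth_open and width < neutral_width and height > 2 * open_threshold:
        return 'surprise'                     # open, rounded mouth
    if not mouth_open and width < neutral_width and not corners_raised:
        return 'sadness'                      # closed, narrow, corners low
    return None                               # not all criteria fulfilled

# Hypothetical tracked points for one frame:
frame_points = np.column_stack([np.linspace(0, 60, 16), np.full(16, 25.0)])
print(classify_emotion(frame_points, neutral_width=50.0))
```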
The recognized emotions can be used to transmit information about emotions to people with impaired vision, people who cannot see visual emotions. This can be done using modalities other than vision, with touch being the best choice. In our system we solved it with a mouse that can vibrate. The different basic emotions were given different vibration patterns, chosen to reflect the emotion: happiness had a high frequency with many short buzzes, while sadness had a low frequency with a single, long buzz. Surprise was something in between the other two. In this way, the user can sense emotions without using vision or hearing.
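The mapping from emotion to vibration pattern can be sketched roughly as below. The numeric values and the vibrate() device call are hypothetical stand-ins, since the actual vibrating-mouse interface is not documented here.

```python
# Minimal sketch of mapping recognized emotions onto vibration patterns,
# in the spirit of the description above; pattern values are assumptions.
import time

# (frequency in Hz, number of buzzes, buzz length in seconds)
VIBRATION_PATTERNS = {
    'happiness': (200, 6, 0.10),   # high frequency, many short buzzes
    'surprise':  (120, 2, 0.40),   # something in between
    'sadness':   (60,  1, 1.00),   # low frequency, one long buzz
}

def play_pattern(emotion, vibrate):
    """Drive a tactile device through a caller-supplied vibrate(freq, secs)."""
    if emotion not in VIBRATION_PATTERNS:
        return
    freq, count, length = VIBRATION_PATTERNS[emotion]
    for _ in range(count):
        vibrate(freq, length)
        time.sleep(length)         # pause between buzzes

# Usage with a stand-in device that just logs the buzzes:
play_pattern('happiness', lambda f, s: print(f'buzz {f} Hz for {s:.2f} s'))
```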

This system was demonstrated at the Scandinavian Technical Fair 2002. Although no results of the demonstration were recorded, it clearly showed that people can recognize and distinguish between emotions through the different tactile vibration patterns. In this way we have transferred visual emotion recognition into the touch modality. Instead of letting the user recognize emotions through vision, we can abstract information with lip tracking, let a computer make the interpretation and present the emotion to the user through touch. At the same time, the computer is made aware of the emotion, since it is the one interpreting it, which makes this a very interesting application. A computer can automatically detect our emotions and may be able to take actions according to them. In a subsequent step the detected emotion can be presented to users in different ways.

B. Emotion recognition for humans

We are not only interested in knowing how a computer can recognize emotions from tracked lip features; how humans perform this task is also very interesting. Understanding the emotions of the person to whom we are speaking is an important factor in good communication.

In the previous section we described how people could recognize computer-detected emotions through vibrations. But how good are people at recognizing visual emotions? Obviously we can understand emotions that are expressed in other people's faces, but can we understand emotions that are presented as an animation? The answer is of course yes; just think about cartoons. Cartoons are usually not very similar to humans, but we can still understand their emotions.

We created a test to verify that humans can understand and interpret basic emotions presented by a simple lip animation. The test consisted of three different types of video sequences: one showing the entire face of a person, one showing the lower part of a person's face and one showing an animation of the lips. Each video sequence contained 8-10 emotions expressed by the person in the video. Examples of the video sequences are shown in Fig. 4.

Figure 4. Example frames from the three video sequences used in the emotion recognition test.

While watching the three video sequences, the subjects were asked to fill in the emotions they could recognize in a chart. The emotions they were asked to recognize were again the three emotions that the system is able to recognize, i.e., happiness, sadness and surprise. The subjects were divided into two groups, beginners and experienced, depending on whether it was their first time using the system or they had used it before. The results from this test are presented in section 4.

Our lip tracking system tracked the lips at 25 frames per second, meaning that an animation of 25 frames per second could be achieved. Our 16 parameters could be transmitted with 8 bits per parameter. This means that the animation parameters could be transferred at a bitrate of 2.5 kbps, without considering any redundancy calculation or compression. With MPEG-4 facial animation, the bitrate needed for an animation of the entire face, and not just the lips, is as low as 5 kbps [17]. Consequently, a high quality animation could replace a low quality video sequence at the same, or a lower, bitrate.

4 Results

The results from the emotion recognition test were very interesting, but the conclusions can only be preliminary due to the simple nature of the test. The results are shown in Fig. 5.

Figure 5. (a) Recognition results for experienced users. (b) Recognition results for beginners.

Both groups had a high recognition rate for the video showing the entire face. Here there is almost a 100% recognition rate for both groups, which can be seen as proof that the basic emotions are easy for everyone to recognize. This is not surprising, since this is something humans are trained to recognize in daily life. The interesting results began to show for the video of the lower face. Here both groups showed a reduced recognition rate compared to the video sequence of the entire face, but there was a difference between the two groups: the experienced users scored about 12.5% better than the beginners. This could be because the experienced users were more used to observing emotions in only the lower part of the face. When recognizing the emotions expressed through the simple animation, the groups differed a lot.
The experienced users scored even better for the animation than for the video of the lower part of the face, while, as expected, the beginners had a low recognition rate. This was a very encouraging result, since it implies that users who are accustomed to the animation can recognize the emotions correctly. An animation of an emotion looks somewhat different from how the real emotion looks in a video, and the users who had seen it before could easily recognize the emotion. This might mean that after a short training session, anyone could understand emotions through this simple animation. It could also mean that the video of the lower face contained too much distracting, unimportant information.

5 Conclusions

The impact of visual information, in the form of tracked lip contours, can be expected to grow in the near future. With the introduction of video cameras in mobile phones, the number of possible applications that make use of information from tracked facial features will increase drastically. Making a computer understand human emotions is an important step towards artificial intelligence and a more intelligent and extended human-computer interaction.

Our test for human emotion recognition showed that, after getting used to observing the animations, users were almost as good at recognizing basic emotions through an animation as through a video showing the entire face. So, by using a simple animation, emotion recognition and understanding could be vastly improved.

Our system for computer-based emotion recognition and tactile display has shown that it is possible to transfer visual emotion information into touch information about emotions by very simple means. With better tactile displays and more tracked facial features, this system can enable both emotion-aware computers and emotion sensation through modalities other than hearing for visually impaired people. These two aspects are very interesting future research areas.

6 Acknowledgement

This work was partly financed through the VITAL project, a part of the European Union Structural Funds.

References

[1] H. McGurk and J. McDonald, "Hearing lips and seeing voices," Nature, vol. 264, pp. 746-748, December 1976.
[2] T. Chen and R. R. Rao, "Audio-visual integration in multimodal communications," Proceedings of the IEEE, vol. 86, no. 5, pp. 837-852, May 1998.
[3] E. Petajan, B. Bischoff, D. Bodoff, and N. Brooke, "An improved automatic lipreading system to enhance speech recognition," ACM SIGCHI, pp. 19-25, 1988.
[4] K. Mase and A. Pentland, "Automatic lipreading by optical-flow analysis," Syst. Comput. Jpn., vol. 22, pp. 67-76, 1991.
[5] G. I. Chiou and J. N. Hwang, "Image sequence classification using a neural network based active contour model and a hidden Markov model," in Proc. IEEE Int. Conf. Image Processing, Austin, TX, Nov. 1994, pp. II-926-II-930.
[6] D. G. Stork, G. Wolff, and E. Levine, "Neural network lipreading system for improved speech recognition," Int. Joint Conf. Neural Networks, pp. 285-295, 1992.
[7] L. S. Chen, T. S. Huang, T. Miyasato, and R. Nakatsu, "Multimodal human emotion/expression recognition," Proc. Int'l Conf. Automatic Face and Gesture Recognition, pp. 366-371, 1998.
[8] L. C. De Silva, T. Miyasato, and R. Nakatsu, "Facial emotion recognition using multimodal information," Proc. Information, Comm., and Signal Processing Conf., pp. 397-401, 1997.
[9] M. Pantic and L. J. M. Rothkrantz, "Automatic analysis of facial expressions: The state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1424-1445, 2000.
[10] B. Fasel and J. Luettin, "Automatic facial expression analysis: A survey," Pattern Recognition, vol. 36, no. 1, pp. 259-275, 2003.
[11] S. Müller, F. Wallhoff, F. Hülsken, and G. Rigoll, "Facial expression recognition using pseudo 3-D hidden Markov models," presented at the 16th Int. Conference on Pattern Recognition (ICPR), Quebec, Canada, August 2002.
[12] N. Sebe, M. Lew, I. Cohen, A. Garg, and T. Huang, "Emotion recognition using a Cauchy naive Bayes classifier," presented at the 16th Int. Conference on Pattern Recognition (ICPR), Quebec, Canada, August 2002.
[13] G. D. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, pp. 268-278, Mar. 1973.
[14] F. Pereira and G. Abrantes, "MPEG-4 facial animation technology: Survey, implementation and results," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 2, pp. 290-305, 1999.
[15] H. Anton and C. Rorres, Elementary Linear Algebra: Applications Version, 8th edition, John Wiley & Sons, Inc., 2000, pp. 565-573.
[16] P. Ekman and W. V. Friesen, "Constants across cultures in the face and emotion," J. Personality and Social Psychology, vol. 17, no. 2, pp. 124-129, 1971.
