Abstract

We transmit and display information through the visual image of our lips: information about our speech, and about our emotions and expressions. By tracking the movement of the lips, computers can make use of the visual part of this information. We focus on how tracked lip features can be used for emotions. We have found that people are better at interpreting basic emotions displayed through an animation of the lips than at interpreting the same emotions displayed through a real video sequence showing the lower part of the face. We have successfully transferred three basic emotions from visual information into information in another modality, touch, via vibrations.

1 Introduction

Every human being with vision uses lip reading to improve the perception of speech. Speech is said to be multimodal, depending on both auditory and visual cues, as McGurk and McDonald [1] showed. Emotions and expressions are reflected on our lips, which implies that we can make use of human lip movements to improve aspects associated with speech and emotions. These aspects include verbal communication with other humans or with machines, i.e., speech-driven devices. We also use our lips to express emotions and show expressions. By knowing the position of the lip contours, you have access to the visual part of the spoken language and of the emotions. A popular area of usage for visual lip information is speechreading. Tracked lip features can also be used to improve speech perception [2], not only for humans but also for machines. Speech-controlled devices, such as mobile phones, can be made more accurate by adding visual input to the existing audio input. A computer can also perform emotion estimation, enabling visually impaired people to understand emotions through a different modality than vision or hearing, e.g., touch. The emotion estimation can also be used for people with vision, to increase their perception of emotions.

Many researchers have proposed the use of lip tracking, or feature extraction of the lips, to improve the performance of speech recognition compared to using only the acoustic input of speech. One of the first such systems was built by Petajan et al. [3], who used oral cavity features for visual speech recognition. Mase and Pentland [4] used analysis of optic flow to examine the motion of the mouth, while Chiou and Hwang [5] and Storks et al. [6] used neural networks for image sequence recognition in their lipreading systems.

A modest amount of existing research focuses on multimodal communication and combines different modalities into a single system for human communicative reaction analysis. Examples are Chen et al. [7] and De Silva et al. [8], who studied the effect of combined detection of facial and vocal expressions and emotions. Pantic et al. [9] provided a good survey of facial expression analysis done by computers.

Fasel and Luettin [10] provided a survey of the research spent on automatic facial expression analysis and the techniques that can be used for it. Müller et al. [11] used hidden Markov models for recognition of facial expressions, while Sebe et al. [12] used Bayesian classification to recognize emotions.

We have developed an automatic, real-time lip tracker that is able to track the four lip contours at up to 25 frames per second. The lip tracker was developed to verify and test the different areas in which tracked lip information can be used. We have focused on using the tracked lips for emotion estimation and recognition, and will only briefly describe the lip tracking system here.

By making use of the tracked lip contours, we can use them for multimodal integration and to create an animation of the lips. We can thereby improve visual communication and perception, such as the emotions that are expressed through the face.

This article is organized as follows. Section 2 describes our lip tracking system. Section 3 describes how we have used the tracked lip features for emotion recognition and estimation. Section 4 contains the test results and section 5 concludes the paper.
2 The lip tracking system

In our lip tracking system we have used a web camera attached to a light metal arm, which in turn is mounted on a helmet. The camera was directed so that it captured the face of the person wearing the helmet. The helmet simplified the calculations by removing rotations and occlusions from the algorithm. To track the lips with greater certainty, the nostrils were tracked first, since they are fairly easy to track. Once the nostrils were found, the search area for the lips became significantly smaller. By classifying the pixels according to their color information, we obtained a number of candidates for being part of a lip contour. Dynamic programming, i.e. the Viterbi algorithm [13], was used to extract the lip contours from all the candidates. The entire lip contours can be represented efficiently by a number of feature points. We chose to use 16 points that are MPEG-4 compliant, meaning that they can be used with the MPEG-4 facial animation engine [14]. Results from the extraction of feature points are shown in Fig. 1.

Figure 1. Extracted feature points for different mouth shapes.
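To make the dynamic programming step concrete, a minimal sketch of a Viterbi-style contour extraction is given below. It assumes that the color classification yields, for every image column, a small set of candidate rows with a likelihood score, and it trades the summed scores against a smoothness penalty between neighbouring columns; this representation and the penalty are illustrative assumptions, not the exact formulation used in our system.

import numpy as np

def extract_contour(candidates, smoothness=2.0):
    # candidates: one list per image column, each entry a (row, score) pair
    # produced by the color classification.  Returns the chosen row per column.
    n_cols = len(candidates)
    best = [np.full(len(col), -np.inf) for col in candidates]     # accumulated scores
    back = [np.zeros(len(col), dtype=int) for col in candidates]  # backpointers

    best[0] = np.array([score for _, score in candidates[0]])

    # Forward pass: best accumulated score for every candidate in every column.
    for c in range(1, n_cols):
        for k, (row, score) in enumerate(candidates[c]):
            for j, (prev_row, _) in enumerate(candidates[c - 1]):
                total = best[c - 1][j] + score - smoothness * abs(row - prev_row)
                if total > best[c][k]:
                    best[c][k] = total
                    back[c][k] = j

    # Backward pass: follow the backpointers from the best final candidate.
    k = int(np.argmax(best[-1]))
    path = [candidates[-1][k][0]]
    for c in range(n_cols - 1, 0, -1):
        k = back[c][k]
        path.append(candidates[c - 1][k][0])
    return list(reversed(path))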
We constructed our own lip animation, based on run-out cubic spline interpolation [15]. The animation consisted of a cubic spline for each of the lip contours, and the areas between the splines were painted with lip color. The animation can be seen in Fig. 2.

Figure 2. Lip animation drawn on top of the facial image. The animation is usually drawn against a white background, but is drawn on top of the facial image here to visualize the result better.
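As an illustration of how one contour is drawn, the sketch below interpolates a handful of hypothetical feature points with a cubic spline and plots the resulting curve; SciPy's natural end condition is used here as a stand-in for the run-out condition of [15], and the coordinates are made up.

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import CubicSpline

# Hypothetical pixel coordinates of feature points along one lip contour.
x = np.array([100.0, 115.0, 130.0, 145.0, 160.0])
y = np.array([200.0, 190.0, 193.0, 190.0, 200.0])

spline = CubicSpline(x, y, bc_type="natural")

# Evaluate the spline densely and draw it; the area between the outer and
# inner contours would then be filled with lip color.
xs = np.linspace(x[0], x[-1], 200)
plt.plot(xs, spline(xs), color="crimson", linewidth=2)
plt.scatter(x, y, color="black", zorder=3)
plt.gca().invert_yaxis()  # image coordinates grow downwards
plt.show()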
3 Emotion estimation and recognition

The visual information from our lips can be used for multiple purposes. The information can be used to create more information, or a different kind of information, making it available to other modalities than audio. One example is emotion recognition based on visual information, for people with impaired vision. If you cannot see the visual part of emotions, you have no choice but to rely solely on your hearing to interpret emotions. What happens if somebody is quiet and you still need to interpret his or her emotions correctly? A computer can interpret the emotions for you and then present its interpretation to you, preferably in a different form than visual information. Human-computer interaction is another area where emotion recognition can bring improvement. If the computer can recognize and interpret your emotions correctly, we are moving closer to a more human-to-human-like interaction between computers and humans.

A. Emotion estimation

Ekman and Friesen [16] proposed that there are six basic emotions: happiness, surprise, sadness, anger, fear and disgust. We chose to limit our first implementation to recognizing only three of these basic emotions, i.e., happiness, sadness and surprise. These three were chosen because their mouth shapes are quite different from each other. To discriminate between all of the basic emotions, you need at least information about the eyes and the eyebrows as well as information from the mouth. The emotion recognition was performed based on the static shape of the mouth. The computer used a number of criteria to characterize and recognize the different emotions: an opened or closed mouth, the width and height of the mouth, and the relative positions of certain feature points. We also made some assumptions about the emotions: sadness can only occur when the mouth is closed and surprise can only occur when the mouth is opened. If the mouth was closed, our system could therefore recognize happiness and sadness; if the mouth was opened, the system could recognize happiness and surprise. If all criteria for an emotion were fulfilled, the emotion was recognized by the system. Examples of detected emotions are shown in Fig. 3.

Figure 3. Emotions detected by the lip tracking system, shown both for real video and for the lip animation.
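A schematic version of these rules is sketched below; the geometric measures and thresholds are illustrative guesses, not the values used in the actual system.

def classify_emotion(mouth_open, width, height, corner_lift,
                     neutral_width, neutral_height):
    # mouth_open: True if the lips are parted.
    # width, height: current mouth dimensions from the tracked feature points.
    # corner_lift: vertical offset of the mouth corners relative to the lip
    #   centre (positive = corners raised).
    # neutral_width, neutral_height: dimensions of the relaxed mouth.
    if mouth_open:
        # Assumption from the text: surprise only occurs with an opened mouth.
        if height > 1.5 * neutral_height and width <= neutral_width:
            return "surprise"
        if corner_lift > 0 and width > 1.2 * neutral_width:
            return "happiness"
    else:
        # Assumption from the text: sadness only occurs with a closed mouth.
        if corner_lift < 0 and width < neutral_width:
            return "sadness"
        if corner_lift > 0 and width > 1.2 * neutral_width:
            return "happiness"
    return None  # no emotion is reported unless all criteria are fulfilled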
The recognized emotions can be used to transmit information about emotions to people with impaired vision, people who cannot see visual emotions. This can be solved by using other modalities than vision, with touch being the best choice. In our system we have solved it with a mouse that can vibrate. The different basic emotions were given different vibration patterns, chosen to reflect the emotion. Happiness had a high frequency with many short buzzes, while sadness had a low frequency with a single, long buzz. Surprise was something in between the other two. In this way, the user can sense emotions without using vision or hearing.
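One possible encoding of these patterns is sketched below; the timings and the vibrate() callback are illustrative, since the interface of the vibrating mouse is not described here.

import time

# (buzz seconds, pause seconds, repetitions) -- values are illustrative.
PATTERNS = {
    "happiness": (0.1, 0.1, 6),   # high frequency, many short buzzes
    "surprise":  (0.3, 0.2, 3),   # in between the other two
    "sadness":   (1.2, 0.0, 1),   # low frequency, a single long buzz
}

def present_emotion(emotion, vibrate):
    # vibrate(seconds) is assumed to drive the vibrating mouse for the given
    # duration; it would be supplied by the actual hardware driver.
    buzz, pause, repetitions = PATTERNS[emotion]
    for _ in range(repetitions):
        vibrate(buzz)
        time.sleep(pause)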
This system was demonstrated at the Scandinavian Technical Fair 2002. Although no results of the demonstration were recorded, it clearly showed that people can recognize and distinguish between emotions through the different tactile vibration patterns. In this way we have transferred the visual emotion recognition into the touch modality. Instead of letting the user recognize emotions through vision, we can abstract information with lip tracking, let a computer make an interpretation, and present the emotion to the user through touch. At the same time, the computer is made aware of the emotion, since it is the one interpreting it, which makes this a very interesting application: a computer can automatically detect our emotions and may be able to take actions according to them. In a subsequent step the detected emotion can be presented to users in different ways.

The test subjects were divided into two groups, beginners and experienced, depending on whether it was their first time using the system or whether they had used the system before. The results from this test are presented in section 4.

Our lip tracking system tracked the lips at 25 frames per second, meaning that an animation of 25 frames per second could be achieved. Our 16 parameters could be transmitted with 8 bits per parameter. This means that the animation parameters could be transferred at a bitrate of 3.2 kbps, without considering any redundancy reduction or compression. With MPEG-4 facial animation, the bitrate needed for an animation of the entire face, and not just the lips, is as low as 5 kbps [17]. Consequently, a high quality animation could replace a low quality video sequence at the same, or a lower, bitrate.
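For reference, the raw animation bitrate follows directly from the figures above:

parameters_per_frame = 16
bits_per_parameter = 8
frames_per_second = 25
bitrate = parameters_per_frame * bits_per_parameter * frames_per_second
print(bitrate)  # 3200 bit/s, i.e. 3.2 kbps before any compression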