The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression
Patrick Lucey¹,², Jeffrey F. Cohn¹,², Takeo Kanade¹, Jason Saragih¹, Zara Ambadar²
¹Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 15213
²Department of Psychology, University of Pittsburgh, Pittsburgh, PA, 15260
patlucey@andrew.cmu.edu, jeffcohn@cs.cmu.edu, tk@cs.cmu.edu, jsaragih@andrew.cmu.edu, ambadar@pitt.edu

Iain Matthews
Disney Research, 4615 Forbes Ave, Pittsburgh, PA 15213
iainm@disneyresearch.com
Abstract

In 2000, the Cohn-Kanade (CK) database was released for the purpose of promoting research into automatically detecting individual facial expressions. Since then, the CK database has become one of the most widely used test-beds for algorithm development and evaluation. During this period, three limitations have become apparent: 1) while AU codes are well validated, emotion labels are not, as they refer to what was requested rather than what was actually performed; 2) a common performance metric against which to evaluate new algorithms is lacking; and 3) standard protocols for common databases have not emerged. As a consequence, the CK database has been used for both AU and emotion detection (even though labels for the latter have not been validated), comparison with benchmark algorithms is missing, and the use of random subsets of the original database makes meta-analyses difficult. To address these and other concerns, we present the Extended Cohn-Kanade (CK+) database. The number of sequences is increased by 22% and the number of subjects by 27%. The target expression for each sequence is fully FACS coded and emotion labels have been revised and validated. In addition to this, non-posed sequences for several types of smiles and their associated metadata have been added. We present baseline results using Active Appearance Models (AAMs) and a linear support vector machine (SVM) classifier using a leave-one-out subject cross-validation for both AU and emotion detection for the posed data. The emotion and AU labels, along with the extended image data and tracked landmarks, will be made available July 2010.

1. Introduction

Automatically detecting facial expressions has become an increasingly important research area. It involves computer vision, machine learning and behavioral sciences, and can be used for many applications such as security [20], human-computer interaction [23], driver safety [24], and health care [17]. Significant advances have been made in the field over the past decade [21, 22, 27], with increasing interest in non-posed facial behavior in naturalistic contexts [4, 17, 25] and posed data recorded from multiple views [12, 19] and 3D imaging [26]. In most cases, several limitations are common. These include:

1. Inconsistent or absent reporting of inter-observer reliability and validity of expression metadata. Emotion labels, for instance, have referred to what expressions were requested rather than what was actually performed. Unless the validity of labels can be quantified, it is not possible to calibrate algorithm performance against manual (human) standards.

2. A lack of common performance metrics with which to evaluate new algorithms for both AU and emotion detection. Published results for established algorithms would provide an essential benchmark against which to compare the performance of new algorithms.

3. A lack of standard protocols for common databases to make quantitative meta-analyses possible.

The cumulative effect of these factors has made benchmarking various systems very difficult or impossible. This is highlighted in the use of the Cohn-Kanade (CK) database [14], which is among the most widely used datasets for
developing and evaluating algorithms for facial expression analysis. In its current distribution, the CK (or DFAT) database contains 486 sequences across 97 subjects. Each of the sequences contains images from onset (neutral frame) to peak expression (last frame). The peak frame was reliably FACS coded for facial action units (AUs). Due to its popularity, most key recent advances in the field have evaluated their improvements on the CK database [16, 22, 25, 2, 15]. However, as highlighted above, some authors employ a leave-one-out cross-validation strategy on the database, while others have chosen some other random train/test set configuration. Other authors have also reported results on the task of broad emotion detection even though no validated emotion labels were distributed with the dataset. The combination of these factors makes it very hard to gauge the current state of the art in the field, as no reliable comparisons have been made. This is a common problem across the many publicly available datasets, such as the MMI [19] and RUFACS [4] databases (see Zeng et al. [27] for a thorough survey of currently available databases).

In this paper, we try to address these three issues by presenting the Extended Cohn-Kanade (CK+) database, which as the name suggests is an extension to the current CK database. We have added another 107 sequences as well as another 26 subjects. The peak expression for each sequence is fully FACS coded, and emotion labels have been revised and validated with reference to the FACS Investigators Guide [9], confirmed by visual inspection by emotion researchers. We propose the use of a leave-one-out subject cross-validation strategy and the area underneath the receiver operator characteristic (ROC) curve for evaluating performance, in addition to an upper-bound error measure. We present baseline results on this using our Active Appearance Model (AAM)/support vector machine (SVM) system.

2. The Extended Cohn-Kanade (CK+) Dataset

2.1.1 Posed Facial Expressions

In the original distribution, CK included 486 FACS-coded sequences from 97 subjects. For the CK+ distribution, we have augmented the dataset further to include 593 sequences from 123 subjects (an additional 107 (22%) sequences and 26 (27%) subjects). The image sequences vary in duration (i.e. 10 to 60 frames) and incorporate the onset (which is also the neutral frame) to the peak formation of the facial expression.

2.1.2 Non-posed Facial Expressions

During the recording of CK, 84 subjects smiled to the experimenter at one or more times between tasks. These smiles were not performed in response to a request. They comprised the initial pool for inclusion in CK+. Criteria for further inclusion were: a) relatively neutral expression at start, b) no indication of the requested directed facial action task, c) absence of facial occlusion prior to smile apex, and d) absence of image artifact (e.g., camera motion). One hundred twenty-two smiles from 66 subjects (91% female) met these criteria. Thirty-two percent were accompanied by brief utterances, which was not unexpected given the social setting and hence not a criterion for exclusion.

2.2. Action Unit Labels

2.2.1 Posed Expressions

For the 593 posed sequences, full FACS coding of peak frames is provided. Approximately fifteen percent of the sequences were comparison coded by a second certified FACS coder. Inter-observer agreement was quantified with coefficient kappa, which is the proportion of agreement above what would be expected to occur by chance [10]. The mean kappas for inter-observer agreement were 0.82 for action units coded at apex and 0.75 for frame-by-frame coding. An inventory of the AUs coded in the CK+ database is given in Table 1. The FACS codes coincide with the peak frames.
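To make the reliability statistic concrete, the following is a minimal sketch (ours, not part of the CK+ distribution) of coefficient kappa for two coders' judgments of a single AU; the function name and the toy labels are illustrative only.

from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Proportion of agreement above what would be expected by chance."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    chance = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() & freq_b.keys()) / n ** 2
    return (observed - chance) / (1 - chance)

# Toy example: presence (1) / absence (0) of AU12 on ten peak frames.
print(cohens_kappa([1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
                   [1, 0, 0, 1, 0, 0, 1, 1, 0, 1]))  # -> 0.8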
Table 2. Criteria for emotion labels in terms of FACS action units.

Emotion    Criteria
Angry      AU23 and AU24 must be present in the AU combination
Disgust    Either AU9 or AU10 must be present
Fear       AU combination of AU1+2+4 must be present, unless AU5 is of intensity E, in which case AU4 can be absent
Happy      AU12 must be present
Sadness    Either AU1+4+15 or AU11 must be present. An exception is AU6+15
Surprise   Either AU1+2 or AU5 must be present, and the intensity of AU5 must not be stronger than B
Contempt   AU14 must be present (either unilateral or bilateral)
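To make these criteria concrete, the following sketch (ours, purely illustrative) expresses the core of the Table 2 checks over a peak frame coded as a set of AU numbers. The intensity qualifications for Fear and Surprise require the FACS intensity letters and are omitted, as is the Sadness AU6+15 exception.

def qualifies(emotion, aus):
    """True if the peak-frame AU set satisfies the Table 2 rule for `emotion`."""
    s = set(aus)
    rules = {
        "Angry":    {23, 24} <= s,
        "Disgust":  9 in s or 10 in s,
        "Fear":     {1, 2, 4} <= s,   # AU4 waivable when AU5 is at intensity E
        "Happy":    12 in s,
        "Sadness":  {1, 4, 15} <= s or 11 in s,  # AU6+15 exception omitted
        "Surprise": {1, 2} <= s or 5 in s,       # AU5 must not exceed intensity B
        "Contempt": 14 in s,
    }
    return rules[emotion]

print(qualifies("Disgust", [4, 9, 17]))  # True: AU9 present
print(qualifies("Angry", [23, 17]))      # False: AU24 is also required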
…Contempt, Disgust, Fear, Happy, Sadness and Surprise. Using these labels as ground truth is highly unreliable, as these impersonations often vary from the stereotypical definition outlined by FACS. This can cause error in the ground truth data, which affects the training of the systems. Consequently, we have labeled CK+ according to the FACS-coded emotion labels. The selection process was in 3 steps:

1. We compared the FACS codes with the Emotion Prediction Table from the FACS manual [9]. The Emotion Prediction Table lists the facial configurations (in terms of AU combinations) of prototypic and major variants of each emotion, except contempt. If a sequence satisfied the criteria for a prototypic or major variant of an emotion, it was provisionally coded as belonging in that emotion category. In this first step, comparison with the Emotion Prediction Table was done by applying the emotion prediction rule strictly. Applying the rule strictly means that the presence of additional AU(s) not listed in the table, or a missing AU, results in exclusion of the clip.

2. After the first pass, a looser comparison was performed. If a sequence included an AU not included in the prototypes or variants, we determined whether it was consistent with the emotion or a spoiler. For instance, AU 4 in a surprise display would be considered inconsistent with the emotion (AU 4 is a component of negative emotion or attention, not surprise), whereas AU 4 in the context of disgust would be considered consistent, as it is a component of negative emotion and may accompany AU 9. Similarly, we evaluated whether any necessary AU was missing. Table 2 lists the qualifying criteria. Other considerations included: AU20 should not be present except for fear, and AU9 or AU10 should not be present except for disgust (subtle AU9 or AU10 can be present in anger).

3. The third step involved perceptual judgment of whether or not the expression resembled the target emotion category. This step is not completely independent from the first two steps, because expressions that included the necessary components of an emotion would be likely to appear as an expression of that emotion. However, the third step was needed because the FACS codes only describe the expression at the peak phase and do not take into account the facial changes that lead to the peak expression. Thus, visual inspection of the clip from onset to peak was necessary to determine whether the expression is a good representation of the emotion.

As a result of this multistep selection process, 327 of the 593 sequences were found to meet the criteria for one of seven discrete emotions. The inventory of this selection process is given in the table below.
Figure 1. Examples of the CK+ database. The images on the top row are taken from the original CK database and those on the bottom are representative of the extended data. In all, 8 emotions and 30 AUs are present in the database. Examples of the emotion and AU labels are: (a) Disgust - AU 1+4+15+17, (b) Happy - AU 6+12+25, (c) Surprise - AU 1+2+5+25+27, (d) Fear - AU 1+4+7+20, (e) Angry - AU 4+5+15+17, (f) Contempt - AU 14, (g) Sadness - AU 1+2+4+15+17, and (h) Neutral - AU0.
Frequency of the validated emotion labels across the 327 selected sequences:

Emotion        N
Angry (An)     45
Contempt (Co)  18
Disgust (Di)   59
Fear (Fe)      25
Happy (Ha)     69
Sadness (Sa)   28
Surprise (Su)  83

…imum modal response. The 25% maximum endorsement for the rival type was used to ensure discreteness of the modal response. By this criterion, 19 were classified as perceived amused, 23 as perceived polite, 11 as perceived embarrassed or nervous, and 1 as other. CK+ includes the modal scores and the ratings for each sequence. For details, and for future work using this portion of the database, please see and cite [1].
Figure 2. Block diagram of our automatic system. The face is tracked using an AAM, and from this we get both similarity-normalized shape (SPTS) and canonical appearance (CAPP) features. Both these features are used for classification using a linear SVM.
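As a sketch of the evaluation protocol only (leave-one-subject-out cross-validation with a linear SVM, scored by the area under the ROC curve), the following uses scikit-learn with placeholder arrays; the actual SPTS/CAPP features come from the AAM tracking stage, which is not reproduced here, and the array shapes are our assumptions.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Placeholder data: one row per sequence; `groups` holds subject IDs so that
# every fold leaves all sequences of one subject out of training.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 68 * 2))      # e.g. similarity-normalized landmarks
y = rng.integers(0, 2, size=200)        # 1 = target AU/emotion present
groups = rng.integers(0, 25, size=200)  # subject ID per sequence

scores, labels = [], []
for train, test in LeaveOneGroupOut().split(X, y, groups):
    clf = SVC(kernel="linear", probability=True).fit(X[train], y[train])
    scores.extend(clf.predict_proba(X[test])[:, 1])
    labels.extend(y[test])

# Area under the ROC curve, pooled over all held-out subjects.
print("AUC:", roc_auc_score(labels, scores))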
…subtracted from all features). We have multiplied the metric by 100 for improved readability of results.

…lie on the AAM 2-D mesh. For AUs 6, 9 and 11, there is a lot of textural change in terms of wrinkles and not so much in terms of contour movement, which would suggest why the CAPP features performed better than the SPTS features for these AUs.

4.3. Emotion Detection Results

The performance of the shape (SPTS) features for emotion detection is given in Table 5, and it can be seen that Disgust, Happiness and Surprise all perform well compared to the other emotions. This result is intuitive, as these are very distinctive emotions causing a lot of deformation within the face. The AUs associated with these emotions also lie on the AAM mesh, so movement of these areas is easily detected by our system. Conversely, other emotions (i.e. Anger, Sadness and Fear) that do not lie on the AAM mesh do not perform as well for the same reason. However, for these emotions textural information seems to be more important. This is highlighted in Table 6. Disgust also improves, as there is a lot of texture information contained in the nose wrinkling (AU9) associated with this prototypic emotion.

Table 5. Confusion matrix of emotion detection for the similarity-normalized shape (SPTS) features - the emotion classified with maximum probability was taken to be the emotion detected.

     An    Di    Fe    Ha    Sa    Su    Co
An   35.0  40.0   0.0   5.0   5.0  15.0   0.0
Di    7.9  68.4   0.0  15.8   5.3   0.0   2.6
Fe    8.7   0.0  21.7  21.7   8.7  26.1  13.0
Ha    0.0   0.0   0.0  98.4   1.6   0.0   0.0
Sa   28.0   4.0  12.0   0.0   4.0  28.0  24.0
Su    0.0   0.0   0.0   0.0   0.0 100.0   0.0
Co   15.6   3.1   6.3   0.0  15.6  34.4  25.0

Table 6. Confusion matrix of emotion detection for the canonical appearance (CAPP) features - the emotion classified with maximum probability was taken to be the emotion detected.

     An    Di    Fe    Ha    Sa    Su    Co
An   70.0   5.0   5.0   0.0  10.0   5.0   5.0
Di    5.3  94.7   0.0   0.0   0.0   0.0   0.0
Fe    8.7   0.0  21.7  21.7   8.7  26.1  13.0
Ha    0.0   0.0   0.0 100.0   0.0   0.0   0.0
Sa   16.0   4.0   8.0   0.0  60.0   4.0   8.0
Su    0.0   0.0   1.3   0.0   0.0  98.7   0.0
Co   12.5  12.5   3.1   0.0  28.1  21.9  21.9

For both the SPTS and CAPP features, Contempt has a very low hit-rate. However, when the outputs from the SPTS and CAPP SVMs are combined (through summing the output probabilities), the detection of this emotion jumps from just over 20% to over 80%, as can be seen in Table 7. An explanation for this is that this emotion is quite subtle and easily gets confused with other, stronger emotions; however, the confusion is not the same in both feature sets. This also appears to have happened for the other subtler emotions, such as Fear and Sadness, with both of these emotions benefitting from the fusion of shape and appearance features.
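The fusion step itself is simple: sum the two classifiers' per-class probability vectors and take the arg-max. A minimal sketch with hypothetical probability values (the real vectors would come from the two multi-class SVMs):

import numpy as np

EMOTIONS = ["An", "Di", "Fe", "Ha", "Sa", "Su", "Co"]

# Hypothetical per-class probabilities from the two multi-class SVMs
# for one test sequence (in practice, rows of predict_proba output).
p_spts = np.array([0.10, 0.05, 0.10, 0.05, 0.15, 0.30, 0.25])
p_capp = np.array([0.15, 0.10, 0.05, 0.05, 0.25, 0.10, 0.30])

# Late fusion: sum the output probabilities, then take the arg-max.
fused = p_spts + p_capp
print(EMOTIONS[int(np.argmax(fused))])  # -> "Co"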
Table 7. Confusion matrix of emotion detection for the combination of features (SPTS+CAPP) - the emotion classified with maximum probability was taken to be the emotion detected. The fusion of the two systems was performed by summing the probabilities output by the multi-class SVMs.

     An    Di    Fe    Ha    Sa    Su    Co
An   75.0   7.5   5.0   0.0   5.0   2.5   5.0
Di    5.3  94.7   0.0   0.0   0.0   0.0   0.0
Fe    4.4   0.0  65.2   8.7   0.0  13.0   8.7
Ha    0.0   0.0   0.0 100.0   0.0   0.0   0.0
Sa   12.0   4.0   4.0   0.0  68.0   4.0   8.0
Su    0.0   0.0   0.0   0.0   4.0  96.0   0.0
Co    3.1   3.1   0.0   6.3   3.1   0.0  84.4

The results given in Table 7 seem to be in line with recent perceptual studies. In a validation study conducted on the Karolinska Directed Emotional Faces (KDEF) database [11], results for the 6 basic emotions (i.e. all emotions in CK+ except Contempt) plus neutral were similar to the ones presented here. That study used 490 images (i.e. 70 per emotion), and the hit rates for each emotion were²: Angry - 78.81% (75.00%), Disgust - 72.17% (94.74%), Fear - 43.03% (65.22%), Happy - 92.65% (100%), Sadness - 76.70% (68.00%), Surprised - 96.00% (77.09%), Neutral - 62.64% (100%)³.

This suggests that an automated system can do just as good a job as, if not better than, a naive human observer, and suffers from the same confusions due to the perceived ambiguity between subtle emotions. However, human observer ratings need to be performed on the CK+ database, and automated results need to be obtained on the KDEF database, to test the validity of these claims.

² Our results are in the brackets next to the KDEF human observer results.
³ As the neutral frame is subtracted, just a simple energy based measure of …

5. Conclusions and Future Work

In this paper, we have described the Extended Cohn-Kanade (CK+) database for those researchers wanting to prototype and benchmark systems for automatic facial expression …