Diane Litman*
University of Pittsburgh
Department of Computer Science
Learning Research and Development Ctr.
Pittsburgh PA, 15260, USA

Kate Forbes
University of Pittsburgh
Learning Research and Development Ctr.
Pittsburgh PA, 15260, USA
ABSTRACT

We investigate the automatic classification of student emotional states in a corpus of human-human spoken tutoring dialogues. We first annotated student turns in this corpus for negative, neutral and positive emotions. We then automatically extracted acoustic and prosodic features from the student speech, and compared the results of a variety of machine learning algorithms that use 8 different feature sets to predict the annotated emotions. Our best results have an accuracy of 80.53% and show 26.28% relative improvement over a baseline. These results suggest that the intelligent tutoring spoken dialogue system we are developing can be enhanced to automatically predict and adapt to student emotional states.

1. INTRODUCTION

In this paper, we investigate the automatic classification of student emotional states in human-human spoken tutoring dialogues. Motivation for this investigation comes from the discrepancy between the performance of human tutors and current machine tutors. While human tutors can respond both to the content of student speech and to the emotions they perceive to be underlying it, most intelligent tutoring dialogue systems cannot detect student emotional states and, furthermore, are text-based [1], which may limit their success at emotion prediction. Building intelligent tutoring spoken dialogue systems thus has great potential benefit. Speech is the most natural and easy-to-use form of natural language interaction, and studies have shown considerable benefits of spoken human tutoring [2]. Furthermore, connections between learning and emotion are well-documented [3], and it has been suggested that the success of computer tutors could be increased by recognizing and responding to student emotion, e.g. reinforcing positive states while rectifying negative states [4, 5]. We are currently building an intelligent tutoring spoken dialogue system with the goal of enhancing it to automatically predict and adapt to student emotional states. The larger question motivating this paper is thus whether, and how, student emotional states can be automatically predicted by our intelligent spoken dialogue tutoring system.

Speech supplies a rich source of acoustic and prosodic information about the speaker's current emotional state. Research in the area of spoken dialogue systems has already shown that acoustic and prosodic features can be extracted from the speech signal and used to develop predictive models of user mental states (c.f. [6, 7, 8, 9, 10, 11, 12]). Some work uses speech read by actors or native speakers as training data (c.f. [6, 8, 12]), but such prototypical emotional speech does not necessarily reflect naturally-occurring speech [13], such as found in tutoring dialogues. Other work uses naturally-occurring speech from a variety of corpora (c.f. [7, 9, 10, 11]), but little work to date addresses emotion detection in computer-based educational settings such as tutoring.

Our methodology builds on and generalizes the results of this prior work while applying them to the new domain of naturally occurring tutoring dialogues. We first annotated student turns in our human-human tutoring corpus for emotion. We then automatically extracted acoustic and prosodic features from these turns, and performed a variety of machine learning experiments using different feature combinations to predict our emotion categorizations. Like [6], we compare the performance of a variety of modern machine learning algorithms using the Weka software package [14]; we also present a variety of evaluation metrics for our results.

Although much of the past work in this area predicts only two emotional classes (e.g. negative/non-negative) [7, 10, 11], our preliminary experiments produced the best predictions using a three-way distinction between negative, neutral, and positive emotional classes. Like [11], we focus here on features that can be computed fully automatically from the student speech and will be available to our intelligent tutoring dialogue system in real-time. We show that by using these features alone or in combination with features identifying specific subjects and tutoring sessions, we can significantly improve a baseline performance for emotion prediction. Our best results show an accuracy of 80.53% and a relative improvement of 26.28% over the baseline error rate. These results suggest that our spoken dialogue tutoring system can be enhanced to automatically predict and adapt to student emotional states.

Section 2 describes ITSPOKE, our intelligent tutoring spoken dialogue system. Section 3 describes machine learning experiments in automatic emotion recognition. We first discuss our emotion annotation in a human-human spoken tutoring corpus that is parallel to the corpus that will be produced by ITSPOKE. We then discuss how acoustic and prosodic features available in real-time to ITSPOKE are computed from these dialogues. We finally compare the performance of various machine learning algorithms, using our annotations and different feature combinations. Section 4 discusses further directions we are pursuing.

2. THE ITSPOKE SYSTEM AND CORPUS

We are developing a spoken dialogue system, called ITSPOKE (Intelligent Tutoring SPOKEn dialogue system), which uses as

* This research is supported by NSF Grant No. 9720359 to the Center for Interdisciplinary Research on Constructive Learning Environments (CIRCLE) at the University of Pittsburgh and Carnegie Mellon University.
3.2. Extracting Features from the Speech Signal

Like [11], we focus in this paper on acoustic and prosodic features of individual turns that can be computed automatically from the speech signal and that will be available in real-time to ITSPOKE. For each of the 202 agreed student turns, we automatically computed the 33 acoustic and prosodic features itemized in Figure 2, which prior studies cited above have shown to be useful for predicting emotions in other domains. Except for turn duration and tempo, which were calculated via the hand-segmented turn boundaries (but will be computed automatically in ITSPOKE via the output of the speech recognizer), the acoustic and prosodic features were calculated automatically from each turn in isolation. f0 and RMS values, representing measures of pitch excursion and loudness, respectively, were computed using Entropic Research Laboratory's pitch tracker, get_f0, with no post-correction. Speaking rate was calculated using the Festival synthesizer's OALD dictionary.

rawspeech: 11 raw f0, RMS and temporal features
rawsubj:   11 raw f0, RMS and temporal features + 3 identifier features
n1speech:  11 n1 f0, RMS and temporal features
n1subj:    11 n1 f0, RMS and temporal features + 3 identifier features
n2speech:  11 n2 f0, RMS and temporal features
n2subj:    11 n2 f0, RMS and temporal features + 3 identifier features
allspeech: 33 f0, RMS and temporal features (raw, n1, n2)
allsubj:   all 36 features

Fig. 3. 8 Feature Sets for Machine Learning Experiments
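The turn-level features above can be sketched with generic stand-ins. The example below is illustrative only: it assumes a mono waveform as a NumPy array, and it substitutes a crude autocorrelation pitch estimate and an energy-threshold silence detector for the Entropic get_f0 tracker and the hand-segmented boundaries actually used here; the frame sizes and thresholds are assumptions, not the settings used in these experiments.

```python
import numpy as np

def frame_rms(wave, frame=512, hop=256):
    """Frame-wise root-mean-square energy (a simple loudness measure)."""
    n = 1 + max(0, (len(wave) - frame) // hop)
    return np.array([np.sqrt(np.mean(wave[i*hop:i*hop+frame]**2)) for i in range(n)])

def frame_f0(wave, sr, frame=1024, hop=256, fmin=75, fmax=500):
    """Crude per-frame f0 via autocorrelation (stand-in for a real pitch tracker)."""
    f0s = []
    for i in range(1 + max(0, (len(wave) - frame) // hop)):
        seg = wave[i*hop:i*hop+frame]
        seg = seg - seg.mean()
        ac = np.correlate(seg, seg, mode="full")[len(seg)-1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        if hi >= len(ac):          # segment too short to search the lag range
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        if ac[lag] > 0.3 * ac[0]:  # simple voicing check
            f0s.append(sr / lag)
    return np.array(f0s)

def turn_features(wave, sr):
    """A small illustrative subset of turn-level acoustic-prosodic features."""
    rms = frame_rms(wave)
    f0 = frame_f0(wave, sr)
    silence = rms < 0.1 * rms.max()   # energy-threshold silence detector
    feats = {
        "turnDur": len(wave) / sr,
        "pctSilence": 100.0 * silence.mean(),
        "minRMS": rms.min(), "maxRMS": rms.max(),
        "meanRMS": rms.mean(), "stdRMS": rms.std(),
    }
    if len(f0):
        feats.update(minf0=f0.min(), maxf0=f0.max(),
                     meanf0=f0.mean(), stdf0=f0.std())
    return feats
```

For a pure 200 Hz tone, `turn_features` reports a mean f0 near 200 and essentially no silence; real turns would of course be noisier.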
3.3. Using Machine Learning to Predict Emotions

We next performed machine learning experiments with our feature sets and our emotion-annotated data, using the Weka machine learning software [14] as in [6]. This software allows us to compare the predictions of some of the most modern machine learning algorithms with respect to a variety of evaluation metrics.

We selected four machine learning algorithms for comparison. First, we chose a C4.5 decision tree learner, called "J48" in Weka; by default this algorithm produces a pruned decision tree with a 0.25 confidence threshold, where leaves correspond to a predicted emotion label if at least two instances of that path are found in the training data. An advantage of decision tree algorithms is that they allow automatic feature selection, and their tree output provides an intuitive way to gain insight into the data. Second, we chose a nearest-neighbor classifier, called "IB1" in Weka; this algorithm uses a distance measure to predict as the class of a test instance the class of the first closest training instance that is found. Also available in Weka is the "IBk" classifier, where the number of nearest neighbors considered (k) can be specified manually or determined automatically using leave-one-out cross-validation. We experimented with alternative values of k but found k=1 to achieve the best results. Third, we chose a "boosting" algorithm, called "AdaBoostM1" in Weka. Boosting algorithms generally enable the accuracy of a "weak" learning algorithm to be improved by repeatedly applying that algorithm to different distributions or weightings of training examples, each time generating a new weak prediction rule, and eventually combining all weak prediction rules into a single prediction [21]. Following [6], we selected "J48" as AdaBoost's weak learning algorithm. Finally, we chose a standard baseline algorithm, called "ZeroR" in Weka; this algorithm simply predicts the majority class ("neutral") in the training data, and thus is used as a performance benchmark for the other algorithms.
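Two of these learners are simple enough to sketch directly. A minimal illustration of the ZeroR baseline and the IB1 nearest-neighbor rule follows; the toy features and labels are hypothetical and not drawn from the tutoring corpus.

```python
import numpy as np
from collections import Counter

def zero_r(train_labels):
    """ZeroR: always predict the majority class of the training data."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority

def ib1(train_X, train_y):
    """IB1: predict the class of the single nearest training instance (Euclidean)."""
    train_X = np.asarray(train_X, dtype=float)
    def predict(x):
        d = np.linalg.norm(train_X - np.asarray(x, dtype=float), axis=1)
        return train_y[int(np.argmin(d))]
    return predict

# Toy data: two "features" per turn, three emotion classes, skewed toward neutral.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]]
y = ["neutral", "neutral", "neutral", "negative", "negative", "positive"]

baseline = zero_r(y)
knn = ib1(X, y)
print(baseline([4.9, 5.0]))   # neutral (majority class, regardless of features)
print(knn([4.9, 5.0]))        # negative (nearest training instance's class)
```

ZeroR ignores the features entirely, which is why it serves only as a benchmark; IB1 memorizes the training set and defers all generalization to the distance measure.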
Table 1 shows the accuracy (percent correct) of these algorithms on each of our 8 feature sets, as compared to the accuracy of AdaBoost. Significances of accuracy differences are automatically computed in Weka using a two-tailed t-test and a 0.05 confidence threshold across 10 runs of 10-fold cross-validation. "*" indicates that the algorithm performed statistically worse than AdaBoost on that feature set. "v" indicates that an algorithm performed statistically better than AdaBoost on that feature set. Lack of either symbol indicates that there was no statistical difference between the performance of the algorithm as compared to AdaBoost.

Feature Set   AdaBoost   J48       IB1       ZeroR
rawspeech                71.43 *   75.98     71.80 *
rawsubj       77.62      75.49     80.53 v   71.80 *
n1speech      79.21      75.94 *   70.61 *   71.80 *
n1subj        78.61      74.01 *   72.84 *   71.80 *
n2speech      70.88      72.09     64.99 *   71.80 v
n2subj        72.36      71.11     68.01 *   71.80
allspeech     77.44      75.00 *   74.57     71.80 *
allsubj       77.96      75.01 *   79.21     71.80 *

Table 1. Percent Correct, 0.05 Confidence (two-tailed)
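The comparison just described can be sketched as a paired t-test over per-fold accuracies. Note two assumptions: the accuracy values below are synthetic, and this is the plain two-tailed paired test, whereas Weka's Experimenter applies a corrected resampled t-test that additionally inflates the variance estimate to account for overlapping training folds.

```python
import numpy as np

def paired_t(a, b):
    """Two-tailed paired t statistic over matched per-fold accuracies."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = len(d)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t, n - 1

# Accuracies of two learners over 10 runs x 10 folds (synthetic numbers).
rng = np.random.default_rng(0)
acc_a = 0.79 + 0.02 * rng.standard_normal(100)
acc_b = acc_a - 0.03 + 0.01 * rng.standard_normal(100)   # b consistently worse

t, df = paired_t(acc_a, acc_b)
T_CRIT = 1.984                 # two-tailed 0.05 critical value for df = 99
print(abs(t) > T_CRIT)         # True: the accuracy difference is significant
```

Because the same folds are reused, the differences are paired rather than independent, which is exactly why the fold-wise test (and Weka's corrected variant of it) is preferred over comparing two overall means.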
As shown, the single best accuracy of 80.53% is achieved by the nearest-neighbor (IB1) algorithm on the "rawsubj" feature set. We see that this is significantly better than the 77.62% correct achieved by the boosted algorithm (AdaBoost), and a separate comparison showed that it is also significantly better than the 75.49% correct achieved by the decision tree algorithm (J48). However, we also see that AdaBoost still performed significantly better than the baseline (ZeroR) on the "rawsubj" feature set, and a separate comparison showed that the performance of J48 was statistically the same as AdaBoost on this feature set.

We also see that overall, AdaBoost produces the most robust performance; on every other feature set AdaBoost performed as well as or better than both J48 and IB1. In particular, AdaBoost significantly outperforms IB1 on the four normalized feature sets, and AdaBoost significantly outperforms J48 on five feature sets: "rawspeech", the "n1" sets, and the "all" sets. AdaBoost's best performance is achieved on the "n1" feature sets; in fact AdaBoost significantly outperforms all the other algorithms on these two feature sets. Moreover, AdaBoost significantly outperforms the baseline on all feature sets other than the "n2" feature sets. Only on the "n2speech" feature set does AdaBoost perform significantly worse than the baseline; however, no algorithm outperformed the baseline on this feature set. In fact, we see that IB1 performed significantly worse than even AdaBoost on this feature set, while a separate comparison showed that J48 performed statistically the same as the baseline on this feature set.

AdaBoost's improvement for each feature set, relative to the baseline error of 28.2%, is shown in Table 2. Except for the "n2" feature sets, where improvement is not statistically significant, AdaBoost's relative improvement averages near 20%.

Feature Set   Error     Relative Improvement
rawsubj       22.38%    20.64%
n1speech      20.79%    26.28%
n1subj        21.39%    24.15%
n2speech      29.12%
n2subj        27.64%
allspeech     22.56%    20.00%
allsubj       22.04%    21.84%

Table 2. Relative Improvement of AdaBoost over Baseline

Other important evaluation metrics to consider include recall, precision, and F-Measure ((2*recall*precision)/(recall+precision)). Table 3 shows AdaBoost's performance on each feature set with respect to these metrics; in Weka, baseline (ZeroR) precision, recall and F-measure are all 0%. Though not shown, for recall and precision, AdaBoost performs statistically better than or equal to both IB1 and J48 on every feature set. For F-measure, AdaBoost performs statistically better than or equal to J48 on every feature set, and there is only one feature set ("n1subj") where IB1 performs significantly better. Overall, AdaBoost produces better precision than recall, but of the three metrics, precision is the most important in the intelligent tutoring domain, because it is better if ITSPOKE predicts that a turn is "neutral" than if ITSPOKE wrongly predicts a turn is positive or negative and reacts accordingly.

Feature Set   Precision   Recall   F-Measure
rawspeech                 0.41     0.40
rawsubj                   0.44     0.45
n1speech                  0.45     0.48
n1subj                    0.45     0.48
n2speech                  0.31     0.33
n2subj        0.39        0.28     0.31
allspeech     0.49        0.43     0.44
allsubj                   0.42     0.44

Table 3. Precision, Recall and F-Measure of AdaBoost

As discussed above, we use AdaBoost to "boost" the J48 de-
cision tree algorithm. Although the output of AdaBoost does not include a decision tree, to get an intuition about how our features are used to predict emotion classes in our domain, we ran the J48 algorithm on the "n1speech" feature set, using leave-one-out cross validation (LOO). We chose "n1speech" because it yields J48's best performance using 10x10 cross-validation, and contains only acoustic and prosodic features, thereby allowing the results to generalize to different speakers and problems. Although J48's accuracy on "n1speech" using 10x10 cross-validation was statistically worse than AdaBoost, a separate comparison showed it was significantly better than both IB1 and the baseline. In Figure 4 we present the final decision tree for J48 (LOO) and "n1speech". In this structure, a colon introduces the emotion label assigned to a leaf. The structure is read by following the (indented) paths through the feature nodes until reaching a leaf.

n1turnDur <= 2.476193
|   n1tempo <= 0.648936
|   |   n1turnDur <= 1.15: negative
|   |   n1turnDur > 1.15
|   |   |   n1maxf0 <= 0.904914: neutral
|   |   |   n1maxf0 > 0.904914: negative
|   n1tempo > 0.648936
|   |   n1tempo <= 2.858068: neutral
|   |   n1tempo > 2.858068
|   |   |   n1turnDur <= 0.360655
|   |   |   |   n1tempo <= 111.176611: neutral
|   |   |   |   n1tempo > 111.176611
|   |   |   |   |   n1turnDur <= 0.020106: neutral
|   |   |   |   |   n1turnDur > 0.020106: negative
|   |   |   n1turnDur > 0.360655: negative
n1turnDur > 2.476193
|   n1tempo <= 0.954545
|   |   n1meanRMS <= 0.221876: neutral
|   |   n1meanRMS > 0.221876
|   |   |   n1stdRMS <= 6.223242: negative
|   |   |   n1stdRMS > 6.223242
|   |   |   |   n1minf0 <= 0.286831
|   |   |   |   |   n1meanf0 <= 0.686086: neutral
|   |   |   |   |   n1meanf0 > 0.686086: negative
|   |   |   |   n1minf0 > 0.286831
|   |   |   |   |   n1turnDur <= 3.952382
|   |   |   |   |   |   n1tempo <= 0.600001: positive
|   |   |   |   |   |   n1tempo > 0.600001: neutral
|   |   |   |   |   n1turnDur > 3.952382
|   |   |   |   |   |   n1minf0 <= 0.38471: positive
|   |   |   |   |   |   n1minf0 > 0.38471
|   |   |   |   |   |   |   n1tempo <= 0.380877: positive
|   |   |   |   |   |   |   n1tempo > 0.380877
|   |   |   |   |   |   |   |   n1maxf0 <= 0.734688: positive
|   |   |   |   |   |   |   |   n1maxf0 > 0.734688: negative
|   n1tempo > 0.954545
|   |   n1%Silence <= 2.026144: neutral
|   |   n1%Silence > 2.026144
|   |   |   n1meanf0 <= 0.888185: negative
|   |   |   n1meanf0 > 0.888185: neutral

Fig. 4. Decision Tree for J48 (LOO) on "n1speech"

We see that the tree uses all 3 of our normalized temporal features: turn duration (n1turnDur), speaking rate (n1tempo), and amount of silence (n1%Silence), to predict our emotion classes; temporal features are also found to be predictive of user mental states in [7, 9, 10]. Of our 4 normalized pitch features, however, only minimum, maximum and mean f0 values (n1minf0, n1maxf0, n1meanf0) are used to predict our emotion classes; as in [10, 11], f0 standard deviation is not used to predict our classes. Of our 4 normalized energy features, only RMS mean and standard deviation (n1meanRMS, n1stdRMS) are used to predict our emotion classes; unlike in [9, 10, 11], RMS maximums and minimums are not used. Based on this tree, longer turns with slower tempos, higher RMS mean and standard deviation, median f0 minimums, and lower f0 maximums will be predicted positive. The tree will predict negative for shorter turns (but not the shortest turns, e.g. groundings) with either slower tempos and higher f0 maximums or with faster tempos. For longer turns with slower tempos and higher RMS means, negative emotions are predicted when RMS standard deviation is lower, or when it is higher and the turn has either a lower f0 minimum and higher f0 mean or a higher f0 minimum and maximum. In longer turns with fast tempos and high % silence, a low f0 mean also predicts negative.

The accuracy (percent correct) of this decision tree is 77.23%. Table 4 shows its performance with respect to precision, recall and F-measure on each of the three emotion classes.

Precision   Recall   F-Measure   Class
0.838       0.848    0.841       negative
0.878       0.897    0.887       neutral
0.273       0.200    0.231       positive

Table 4. J48's (LOO) Precision, Recall and F-Measure

A confusion matrix summarizing J48's (LOO) performance on "n1speech" is shown in Table 5. The matrix shows how many instances of each class have been assigned to each class, where rows correspond to annotator-assigned labels and columns correspond to predicted labels. For example, 23 negatives were correctly predicted, while 13 negatives were incorrectly predicted to be neutral and 6 negatives were incorrectly predicted to be positive.

            negative   neutral   positive
negative    23         13        6
neutral
positive

Table 5. Confusion Matrix for J48 (LOO) on "n1speech"
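The per-class metrics in Tables 4 and 5 can be computed directly from a confusion matrix. In the sketch below, the negative row reuses the 23/13/6 counts quoted in the text, while the neutral and positive rows are hypothetical placeholders, not the corpus counts.

```python
import numpy as np

def prf(cm, labels):
    """Per-class precision, recall and F-measure from a confusion matrix
    (rows = annotated labels, columns = predicted labels)."""
    cm = np.asarray(cm, dtype=float)
    out = {}
    for i, lab in enumerate(labels):
        col, row = cm[:, i].sum(), cm[i].sum()
        p = cm[i, i] / col if col else 0.0     # of predictions for lab, how many correct
        r = cm[i, i] / row if row else 0.0     # of true lab instances, how many found
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        out[lab] = (p, r, f)
    return out

labels = ["negative", "neutral", "positive"]
cm = [[23, 13, 6],     # negative row: counts quoted in the text
      [10, 130, 5],    # neutral row: hypothetical
      [2, 6, 7]]       # positive row: hypothetical
for lab, (p, r, f) in prf(cm, labels).items():
    print(f"{lab}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

With these numbers, negative recall is 23/42, exactly the "23 correct out of 23+13+6 annotated negatives" reading described for Table 5.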
4. CONCLUSIONS AND CURRENT DIRECTIONS

We have addressed the use of machine learning techniques for automatically predicting student emotional states in intelligent tutoring spoken dialogues. Our methodology extends the results of prior research into spoken dialogue systems by applying them to a new domain. Our emotion annotation schema distinguishes negative, neutral and positive emotions, and our inter-annotator agreement is on par with prior emotion annotation in other types of corpora. From our annotated student turns we automatically extracted 33 acoustic and prosodic features that will be available in real-time to ITSPOKE, and added 3 identifier features for student, gender, and problem. We compared the results of a variety of machine learning algorithms in terms of a variety of evaluation metrics; our best results suggest that ITSPOKE can be enhanced to automatically predict student emotional states.
We are exploring the use of other emotion annotation schemas. [22, 23, 8], for example, address how more complex emotional categorizations encompassing multiple emotional dimensions (e.g. "valence" and "activation") can be incorporated into frameworks for automatic prediction. One limiting factor of more complex schemas, however, is the difficulty of keeping inter-annotator agreement high; another is the difficulty of making good predictions about fine-grained emotion categorizations. We are also experimenting with annotating orthogonal tutoring sub-domains. For example, although we've found that most expressions of emotion pertain to the physics material being learned, they can also pertain to the tutoring process itself (e.g. attitudes toward the tutor/being tutored). We are also addressing the incorporation of "borderline" cases in our own schema. For example, we have already found that incorporating "weak" negative and positive labels into the corresponding "strong" category increases inter-annotator agreement but decreases machine learning performance.

We are also expanding our machine-learning investigations. We are extending our set of features to incorporate lexical and dialogue features. [24, 7, 9, 10] have shown that such features can contribute to emotion recognition, and in a pilot study [25] we showed that these features alone predicted emotional states significantly better than the baseline. In addition, we are pursuing further comparisons of alternative machine-learning techniques. Finally, we are in the process of increasing our data set. With more data we can try methods such as down-sampling to better balance the data in each emotion class. When ITSPOKE begins testing in Fall 2003, we will address the same questions for our human-computer tutoring dialogues that we've addressed here for our parallel corpus of human-human tutoring dialogues.
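The down-sampling mentioned above (randomly discarding majority-class turns until every class matches the rarest one) can be sketched as follows; the skewed label distribution is hypothetical.

```python
import random
from collections import Counter

def downsample(examples, seed=0):
    """Randomly drop examples so every class keeps as many as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for feats, label in examples:
        by_class.setdefault(label, []).append((feats, label))
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, n_min))  # keep a random subset per class
    rng.shuffle(balanced)
    return balanced

# Toy corpus: heavily skewed toward "neutral", as in the tutoring data.
data = ([([i], "neutral") for i in range(100)]
        + [([i], "negative") for i in range(30)]
        + [([i], "positive") for i in range(12)])
balanced = downsample(data)
print(Counter(label for _, label in balanced))   # 12 examples of each class
```

The trade-off is that balancing the classes this way discards most of the majority-class data, which is one reason it becomes more attractive as the data set grows.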
5. REFERENCES

[1] C. P. Rosé and V. Aleven, "Proc. of the ITS 2002 workshop on empirical methods for tutorial dialogue systems," Tech. Rep., San Sebastian, Spain, June 2002.

[2] R. Hausmann and M. Chi, "Can a computer interface support self-explaining?," The International Journal of Cognitive Technology, vol. 7, no. 1, 2002.

[3] G. Coles, "Literacy, emotions, and the brain," Reading Online, March 1999.

[4] M. Evens, S. Brandle, R. Chang, R. Freedman, M. Glass, Y. H. Lee, L. S. Shim, C. W. Woo, Y. Zhang, Y. Zhou, J. A. Michael, and A. A. Rovick, "Circsim-tutor: An intelligent tutoring system using natural language dialogue," in Proceedings of the Twelfth Midwest AI and Cognitive Science Conference, MAICS 2001, Oxford, OH, 2001, pp. 16-23.

[5] G. Aist, B. Kort, R. Reilly, J. Mostow, and R. Picard, "Experimentally augmenting an intelligent tutoring system with human-supplied capabilities: Adding human-provided emotional scaffolding to an automated reading tutor that listens," in Proc. of ITS, 2002.

[6] P-Y. Oudeyer, "The production and recognition of emotions in speech: features and algorithms," International Journal of Human Computer Interaction, in press.

[7] J. Ang, R. Dhillon, A. Krupski, E. Shriberg, and A. Stolcke, "Prosody-based automatic detection of annoyance and frustration in human-computer dialog," in Proc. of ICSLP, 2002.

[8] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, pp. 32-80, Jan. 2001.

[9] D. Litman, J. Hirschberg, and M. Swerts, "Predicting user reactions to system error," in Proc. of ACL, 2001.

[10] A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth, "Desperately seeking emotions: Actors, wizards, and human beings," in ISCA Workshop on Speech and Emotion, 2000.

[11] C. M. Lee, S. Narayanan, and R. Pieraccini, "Recognition of negative emotions from the speech signal," in ASRU, 2001.

[12] T. S. Polzin and A. H. Waibel, "Detecting emotions in speech," in Cooperative Multimodal Communication, 1998.

[13] K. Fischer, "Annotating emotional language data," Verbmobil Report 236, December 1999.

[14] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 1999.

[15] K. VanLehn, P. Jordan, C. Rosé, D. Bhembe, M. Böttner, A. Gaydos, M. Makatchev, U. Pappuswamy, M. Ringenberg, A. Roque, S. Siler, R. Srivastava, and R. Wilson, "The architecture of Why2-Atlas: A coach for qualitative physics essay writing," in Proc. of ITS, 2002.

[16] X. D. Huang, F. Alleva, H. W. Hon, M. Y. Hwang, K. F. Lee, and R. Rosenfeld, "The Sphinx-II speech recognition system: An overview," Computer, Speech and Language, 1993.

[17] A. Black and P. Taylor, "Festival speech synthesis system: system documentation (1.1.1)," Human Communication Research Centre Technical Report 83, U. Edinburgh, 1997.

[18] C. P. Rosé, D. Bhembe, A. Roque, and K. VanLehn, "An efficient incremental architecture for robust interpretation," in Proc. Human Language Technology Conference, 2002.

[19] C. P. Rosé, P. Jordan, M. Ringenberg, S. Siler, K. VanLehn, and A. Weinstein, "Interactive conceptual tutoring in Atlas-Andes," in Proc. of AI in Education, 2001, pp. 256-266.

[20] C. P. Rosé, D. Litman, D. Bhembe, K. Forbes, S. Silliman, R. Srivastava, and K. VanLehn, "A comparison of tutor and student behavior in speech versus text based tutoring," in HLT/NAACL Workshop: Building Educational Applications Using NLP, 2003.

[21] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proc. of the International Conference on Machine Learning, 1996.

[22] J. Liscombe, J. Venditti, and J. Hirschberg, "Classifying subject ratings of emotional speech using acoustic features," in Proc. of EuroSpeech, 2003.

[23] H. Holzapfel, C. Fuegen, M. Denecke, and A. Waibel, "Integrating emotional cues into a framework for dialogue management," in Proc. of ICMI, 2002.

[24] C. M. Lee, S. Narayanan, and R. Pieraccini, "Combining acoustic and language information for emotion recognition," in Proc. of ICSLP, 2002.

[25] D. Litman, K. Forbes, and S. Silliman, "Towards emotion prediction in spoken tutoring dialogues," in Proc. of HLT/NAACL, 2003.