Multimodal Learning Analytics: Assessing Learners’ Mental State During the Process of Learning
Sharon Oviatt, Joseph Grafsgaard, Lei Chen, Xavier Ochoa
11.1 Introduction
Today, new technologies are making it possible to assess what students know at
fine levels of detail, and to model human learning in ways that support entirely
new insights about the process of learning. These developments are part of a
general recent trend to understand users’ mental state, which is illustrated in this
volume’s chapters on reliably detecting cognitive load (Chapter 10), depression and
related disorders (Chapter 12), and deception (Chapter 13). Progress in all of these
areas has been directly enabled by new computational methods and strategies for
collecting and analyzing multimodal-multisensor data.
In this chapter, we focus on students’ mental state during the process of learning,
which is a complex activity that can only be evaluated over time. We describe
multimodal learning analytics, an emerging field that includes far more powerful
predictive capabilities than preceding work on learning analytics, which has been
based solely on click-stream and linguistic analysis of typed input.
The achievement of learning fundamentally involves consolidating new perfor-
mance skills (e.g., ability to communicate orally), types of domain expertise (e.g.,
knowledge of math), and metacognitive awareness (e.g., ability to diagnose errors).
In this chapter, we focus mainly on new research involving the first two areas: ana-
lytics for detecting skilled performance and domain expertise. We also briefly discuss
related mental states that are prerequisites for learning, in particular attention, mo-
tivation, and engagement, as well as initial work on the moment of insight during
problem solving. However, research on multimodal learning analytics has yet to in-
vestigate, or attempt to predict, deeper aspects of learning—such as the emergence
of metacognitive awareness, or the ability to transfer learning to new contexts. The
Glossary provides definitions of terminology used throughout this chapter. In ad-
dition, Focus Questions are available to aid readers in understanding important
chapter content.
The purpose of this chapter is not to survey learning technologies more gen-
erally, including first-generation learning analytics based on keyboard-and-mouse
computer interfaces. This chapter also does not cover the development of appli-
cations based on new multimodal learning analytic techniques, which would be
premature, although new applications are discussed in Section 11.6. For a more
in-depth discussion of the modalities typically included in multimodal learning
analytics work, readers are referred to other handbook chapters on speech, writ-
ing, touch, gestures, gaze, image processing, and activity recognition using sen-
sors [Cohen and Oviatt 2017, Hinckley 2017, Katsamanis et al. 2017, Kopp and
Bergmann 2017, Potamianos et al. 2017, Qvarfordt 2017].
Glossary
representational level (e.g., linguistic content), metacognitive level (e.g., intentional use
of face-to-face speech for persuasion), and others. In this sense, collecting speech
provides very rich data beyond simply knowing the content of what students said.
Paralinguistic information, such as speech amplitude, pitch, and accent, can clarify
students’ attitude, motivation, and attentional focus—important preconditions
for learning. Multimodal learning analytics also supports multi-level data collection.
11.2.3 How Does Multimodal Learning Analytics Differ from Earlier Work on Learning Analytics, and What Are Its Primary Advantages?
Initial work on learning analytics aimed to collect and analyze data to better un-
derstand and support learners and educational processes. It focused on examining
students’ input during computer-mediated learning while they used keyboard-and-
mouse interfaces. As a result, analytics were limited to activity patterns and linguis-
tic content present in click-stream data. These analytics typically were summarized
on dashboards for professional educators, such as teachers or school administra-
tors. They mainly have been used to assist with managing educational systems
(e.g., attendance monitoring, work completion). Later, they started to be used to
predict and advise students on educational outcomes, often based on individual
student profiles in comparison with a larger group. For example, the system might
track the time spent by a student in different learning activities, her work comple-
tion, and grades received. This could be used to provide advisory information to the
student on the likelihood that she will meet her learning goals, and what actions
she could take to improve her expected educational outcome. One common aim of
these capabilities, for example, has been to create early warning systems to detect
and reduce student dropouts. Learning analytic techniques have been applied to
a variety of technology-mediated learning environments, including learning man-
agement systems [Arnold and Pistilli 2012], intelligent tutoring systems [Baker et al.
2004], massive open online courses [Kizilcec et al. 2013], educational video games
[Serrano-Laguna and Fernández-Manjón 2014], and so forth. In these contexts,
learning analytic results also have been used to study and improve the impact of
educational technologies.
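To make this workflow concrete, the sketch below shows the kind of profile-based advisory model described above, implemented as a simple logistic regression over hypothetical clickstream features (time logged, work completion, grades). The feature names and synthetic data are assumptions for illustration only; they are not drawn from Course Signals or any other cited system.

```python
# Minimal sketch of a clickstream-based early-warning model of the kind
# described above. Features and data are hypothetical; the cited systems
# use their own features, data, and models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500

# Hypothetical per-student features: hours logged, fraction of work
# completed, and mean grade so far (0-100).
hours_logged = rng.gamma(shape=2.0, scale=5.0, size=n)
work_completed = rng.uniform(0.0, 1.0, size=n)
mean_grade = rng.normal(70, 12, size=n)
X = np.column_stack([hours_logged, work_completed, mean_grade])

# Synthetic "at risk" label loosely tied to the features, for illustration only.
logit = -0.15 * hours_logged - 3.0 * work_completed - 0.04 * (mean_grade - 70)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Per-student risk scores of this kind could drive the advisory dashboards
# and early-warning systems described above.
risk = model.predict_proba(X_test)[:, 1]
print("AUC:", round(roc_auc_score(y_test, risk), 2))
```

In a deployed system, such risk scores would be surfaced on an educator or student dashboard rather than printed.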
Like multimodal learning analytics, earlier work on learning analytics relied
heavily on data resources, which are expensive and time-consuming to collect. Both
areas have been part of the broader data science and analytics movement. However,
major differences between learning analytics and multimodal learning analytics
include the following:
. The modalities described above provide a richer collection of data sources that offer more
powerful predictive capabilities. Since they are based on natural commu-
nication modalities that mediate thought [Luria 1961, Vygotsky 1962], they
provide a particularly informative window on human thinking and learning.
In this regard, multimodal learning analytic data are well suited to the task
of predicting students’ mental state and learning.
. Learning analytics does not produce as large a dataset per unit of time as mul-
timodal learning analytics, because the latter typically records multiple
streams of synchronized data during the same recorded time period (e.g., 12
data streams in the Math Data Corpus, described in Section 11.3.1). Analysis
of these multimodal data streams can support a more comprehensive multi-
level analysis of learning, including how different levels interact to produce
emergent learning phenomena. This can generate a qualitatively different
systems-level view of the process of learning, as illustrated in Section 11.5.2.
. Learning analytics has been limited to analyzing computer-mediated learning.
In contrast, multimodal learning analytic techniques can be used to evaluate
both interpersonal and technology-mediated learning. This is an important
distinction because learning occurs in both contexts. In addition, evaluation
of interpersonal learning outcomes is needed to provide a “ground-truth” for
assessing the impact of different types of technology-mediated learning.
. Learning analytics has been more limited to assessment in stationary settings,
compared with multimodal learning analytics, because it typically has been
used on desktop and larger computers that support keyboard input. In con-
trast, multimodal learning analytic techniques originated to accommodate
mobile devices and more natural interfaces, so their usage context extends
to a wider range of field and mobile learning settings.
. Learning analytics in its original form mainly focused on supporting educational
management systems, student profiling and advising to promote academic suc-
cess, researching and improving educational technologies, and similar aims.
In contrast, multimodal learning analytic techniques aim to provide finer-
grained models of student learners and the learning process, including track-
ing individual students’ mental states while they learn. In this regard, it is
learner-centric, with a particular focus on students’ thoughts and mental
states as learning progresses.
. Learning analytics has only been able to collect and analyze data when students
are interacting with a computer (e.g., active typing); if some students don’t type, little or no data about their learning can be captured.
Figure 11.1 Synchronized views from all five video-cameras at the same moment in time during
data collection. Videos A, B and C show close-up views of the three individual students,
video D a wide-angle view of all students, and video E a top-down view of students’
writing and artifacts on the tabletop. (Used with permission from Oviatt [2013a])
Figure 11.2 Synchronized multimodal data collection of students’ images, writing, and speech as
they solved math problems (left); Sample of students’ handwritten work while solving
math problems (right). (Used with permission from Oviatt [2013a])
equations; (3) each student worked individually, writing equations and calculations
on paper as they attempted to solve the problem; (4) one student proposed the
solution, which the students then discussed together until they understood how the
problem had been solved and agreed on it; (5) the group’s solution was submitted,
after which the computer verified whether it was correct or not; (6) if correct, the
computer randomly called upon one student to explain how they had arrived at the
answer; (7) if not correct, the computer could display a worked example of how the
solution had been calculated. Figure 11.2 (right) illustrates a sample of students’
written data while solving the math problems, which averaged 80% nonlinguistic
content such as numbers, symbols, and diagrams [Oviatt and Cohen 2013].
This data resource includes original data files (high-resolution videos, speech
.wav files, digital ink .jpegs and .csv files), extensive coding of problem segmen-
tation and phases, problem-solving correctness, student expertise ratings, repre-
sentational content of students’ writing, and verbatim speech transcriptions of
lexical content. It also includes data on coding reliabilities, and appendices with
math problem sets and other information required for those interested in conduct-
ing their own data analyses. Detailed documentation on this corpus is available in
Oviatt et al. [2014], and in Table 11.1.
Overall, the Math Data Corpus is a well-structured dataset, with authentic learn-
ing and documented improvement in students’ performance over sessions [Oviatt
2013a]. The ground-truth coding of communication and performance during these
Figure 11.3 Sample recording of a speaker and talk slide (left), Kinect analysis of full-body
movement (middle) and data recording instruments (right) from Educational Testing
Service’s multimodal presentation corpus, which had the same basic features as the
Oral Presentation Corpus infrastructure first developed by Ochoa and colleagues. (From
Leong et al. [2015])
The recording setup is shown in Figure 11.3, in which the PC version of Kinect was used so multiple data streams
could be recognized jointly on a PC. They used the Public Speaking Competence
Rubric (PSCR) [Schreiber et al. 2012], which is well recognized and potentially more
valid for ground-truth assessment [Chen et al. 2014a]. They also adopted more rig-
orous human rating procedures, with multiple raters systematically adjudicating
presentation scores. These represent important methodological improvements for
future researchers interested in collecting similar data.
Figure 11.4 Gaze estimation based on camera capture of students’ head pose during classroom
lecture. (Used with permission from Raca and Dillenbourg [2014])
(a) MRD in the classroom (b) MRD from the student’s point of view
Figure 11.5 Design of Multimodal Recording Device (MRD) for high-fidelity classroom data
collection. (Used with permission from Dominguez et al. [2015])
Figure 11.6 Student facial expressions and upper-body movements captured with a camera during
interaction with an intelligent tutoring system (left), and showing processing of pose
information (right). (From Ezen-Can et al. [2015])
Multimodal learning analytics research employs both empirical/statistical and machine learning techniques, and also new hybrid methods. In the area of statistical analysis, earlier
education and cognitive science research on topics like domain expertise reported
a large array of individual “markers” [Ericsson et al. 2006, Purandare and Litman
2008, Worsley and Blikstein 2010]. We refer readers to Ericsson et al. [2006] for a
summary of this body of work, which preceded and has contributed to multimodal
learning analytics. Markers of expert performance, such as the use of linguistic
terms indicating certainty, typically were identified in empirical studies using cor-
relations or standard statistical significance tests.
More recent multimodal learning analytics research has shifted to adopting
statistical techniques that can contribute to prediction of expertise and learning, ei-
ther alone or in combination with the use of machine learning techniques. When
possible predictors are studied, researchers examine the proportion of variance
each one accounts for, in order to distinguish powerful factors from weak ones
(e.g., a factor accounting for 60% rather than 1% of the data variance). More powerful
predictors then are good candidates to consider including in automated machine
learning feature sets [Oviatt et al. 2016]. Researchers also conduct multiple and
stepwise linear regressions and similar analyses to build models that move beyond
simple correlational analysis toward identifying which factors drive the predicted
outcome [Oviatt et al. 2016]. Other research has extracted information
about clusters of related multimodal features that have predictive power, for exam-
ple using principal components analysis [Chen et al. 2014a]. These are examples
of commonly used empirical/statistical techniques, but by no means an exhaus-
tive list.
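The sketch below illustrates these empirical/statistical steps on synthetic data: computing the proportion of variance each candidate predictor accounts for on its own, fitting a multiple regression over all predictors, and running a principal components analysis over the feature set. The feature names are hypothetical and the data are randomly generated; the sketch shows the analysis pattern, not the cited studies' actual features or results.

```python
# Illustrative sketch of the statistical analyses mentioned above: univariate
# variance accounted for, a multiple regression, and a PCA over a multimodal
# feature set. All data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n = 200

# Hypothetical per-student multimodal features (z-scored).
speech_rate = rng.normal(0, 1, n)
pause_dur = rng.normal(0, 1, n)
stroke_pressure = rng.normal(0, 1, n)
gesture_energy = rng.normal(0, 1, n)

# Synthetic expertise score driven mostly by two of the features.
expertise = (0.8 * speech_rate - 0.6 * pause_dur + 0.1 * gesture_energy
             + rng.normal(0, 0.5, n))

X = np.column_stack([speech_rate, pause_dur, stroke_pressure, gesture_energy])
names = ["speech_rate", "pause_dur", "stroke_pressure", "gesture_energy"]

# Proportion of variance accounted for by each predictor alone (univariate R^2),
# used to separate powerful predictors from weak ones.
for name, col in zip(names, X.T):
    r2 = LinearRegression().fit(col[:, None], expertise).score(col[:, None], expertise)
    print(f"{name}: R^2 = {r2:.2f}")

# Multiple regression over all predictors together.
full = LinearRegression().fit(X, expertise)
print("full-model R^2 =", round(full.score(X, expertise), 2))

# PCA to identify clusters of related features that share variance.
pca = PCA(n_components=2).fit(X)
print("variance explained by first two components:", pca.explained_variance_ratio_)
```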
For a detailed technical overview of signal processing and machine learning
methods currently used to model and analyze multimodal data, readers are referred
to other related chapters in this volume (Chapters 1–4). The development of new
machine learning techniques for handling multimodal data is an active area of
research now, especially for topics like audio-visual data processing. Another ac-
tive research area is the development of hybrid analysis techniques that combine
empirical and machine learning capabilities, which is promising since they have
different strengths and weaknesses.
When analyzing the Math Data Corpus at a transactional level, research de-
monstrates that patterns of overlapped speech can identify (1) phases of rapid
problem-solving progress within a group, and (2) level of expertise of students in
the group [Oviatt et al. 2015]. The frequency and duration of overlapped speech in-
creased by 120–143% during the brief “moment-of-insight” phase of group problem
solving, compared with matched intervals. In addition, more expert students inter-
rupted non-experts 37% more often than non-experts interrupted them, in spite of
the fact that the experts talked less overall. These group dynamics involved speech activity
patterns, with no linguistic content analysis required, although linguistic content
analysis can be combined to improve prediction accuracy [Oviatt et al. 2015].
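As a rough illustration of this kind of speech activity analysis, the sketch below computes overlapped-speech events, their total duration, and per-speaker interruption counts from hypothetical diarized segments. The segment format, the rule that the later starter counts as the interrupter, and the toy data are assumptions, not the procedure used by Oviatt et al. [2015].

```python
# Sketch of speech-activity features like those described above: frequency and
# duration of overlapped speech, and who interrupts whom, computed from
# diarized speaker segments. Data and thresholds are illustrative assumptions.
from itertools import combinations

# Hypothetical diarization output: (speaker, start_sec, end_sec).
segments = [
    ("A", 0.0, 4.2), ("B", 3.8, 6.0),   # B starts before A finishes -> overlap
    ("C", 6.5, 9.0), ("A", 8.7, 12.0),  # A overlaps C
    ("B", 12.5, 14.0),
]

def overlap(seg1, seg2):
    """Overlap duration in seconds between two (speaker, start, end) segments."""
    return max(0.0, min(seg1[2], seg2[2]) - max(seg1[1], seg2[1]))

overlap_events = []
for s1, s2 in combinations(segments, 2):
    if s1[0] != s2[0] and overlap(s1, s2) > 0:
        # Treat the later starter as the interrupter (a simplifying assumption).
        interrupter = s1[0] if s1[1] > s2[1] else s2[0]
        overlap_events.append((interrupter, overlap(s1, s2)))

total_overlap = sum(d for _, d in overlap_events)
print("overlap events:", len(overlap_events))
print("total overlapped speech (s):", round(total_overlap, 2))

# Interruption counts per speaker, usable as a per-student activity feature.
counts = {}
for speaker, _ in overlap_events:
    counts[speaker] = counts.get(speaker, 0) + 1
print("interruptions initiated:", counts)
```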
Based on the Oral Presentation Corpus for predicting skilled performance, stu-
dents’ overall score for quality of presentation was predictable based on paralin-
guistic or signal-level features like pitch, pitch variability, speaking rate, speech
fluency (i.e., minimized pauses), and postural and gestural movement energy [Chen
et al. 2014b]. These features likewise did not require any analysis of speakers’ lan-
guage content. However, presentation scores also could be predicted at the repre-
sentational level based on linguistic features like the use of first person pronouns.
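A minimal sketch of extracting signal-level presentation features of this kind is shown below, using the librosa audio library to estimate a pitch track and a crude pause proxy. The specific feature definitions, thresholds, and file path are simplifying assumptions rather than the cited studies' feature sets; features like these could then be fed to a regressor trained against human presentation scores.

```python
# Sketch of signal-level presentation features of the kind listed above
# (pitch statistics, a pause/fluency proxy), extracted with librosa.
# Feature definitions are simplified and the file path is hypothetical.
import numpy as np
import librosa

def prosodic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)

    # Fundamental frequency track (NaN on unvoiced frames).
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    voiced_f0 = f0[~np.isnan(f0)]

    # Crude pause proxy: fraction of low-energy frames.
    rms = librosa.feature.rms(y=y)[0]
    pause_fraction = float(np.mean(rms < 0.5 * np.median(rms)))

    return {
        "pitch_mean_hz": float(np.mean(voiced_f0)) if voiced_f0.size else 0.0,
        "pitch_std_hz": float(np.std(voiced_f0)) if voiced_f0.size else 0.0,
        "voiced_fraction": float(np.mean(voiced_flag)),
        "pause_fraction": pause_fraction,
    }

# Example usage (path is hypothetical):
# print(prosodic_features("presentation_01.wav"))
```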
Whereas non-experts increased their total writing most between easy and moderate tasks, experts in-
creased their writing most between hard and very hard tasks [Oviatt and Cohen
2014]. Both of these examples demonstrate accurate prediction based on a single
activity pattern metric, with no content analysis required.
As a third example, recent work has shown that domain experts can be distin-
guished from non-experts with 92% accuracy based simply on signal-level writing
dynamics, like the average distance and pressure of their written strokes [Zhou et
al. 2014, Oviatt et al. 2016] (see Section 11.5.2). Again, this prediction performance
was based exclusively on low-level signal features, and did not rely on any content
analysis of what students wrote.
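The sketch below illustrates the general recipe of classifying expertise from signal-level writing dynamics: compute per-session summaries of stroke distance and pressure, then train a standard classifier. The stroke format, synthetic data, and choice of classifier are assumptions for illustration; the sketch does not reproduce the cited 92% result or the cited studies' feature definitions.

```python
# Sketch of expertise classification from signal-level writing dynamics, in the
# spirit of the work cited above. Stroke data are simulated for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

def stroke_features(strokes):
    """strokes: list of arrays with columns (x, y, pressure) sampled over time."""
    dists, pressures = [], []
    for s in strokes:
        xy = s[:, :2]
        dists.append(np.sum(np.linalg.norm(np.diff(xy, axis=0), axis=1)))
        pressures.append(np.mean(s[:, 2]))
    return [np.mean(dists), np.std(dists), np.mean(pressures), np.std(pressures)]

def synthetic_session(expert):
    # Expert strokes are simulated here as shorter and lighter on average,
    # purely to give the classifier a learnable signal.
    strokes = []
    for _ in range(int(rng.integers(20, 40))):
        n_pts = int(rng.integers(10, 30))
        scale = 0.8 if expert else 1.0
        xy = np.cumsum(rng.normal(0, scale, (n_pts, 2)), axis=0)
        pressure = rng.normal(0.5 if expert else 0.65, 0.05, (n_pts, 1))
        strokes.append(np.hstack([xy, pressure]))
    return stroke_features(strokes)

X = np.array([synthetic_session(expert=i % 2 == 0) for i in range(120)])
y = np.array([i % 2 == 0 for i in range(120)], dtype=int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(2))
```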
Also based on the Math Data Corpus, a predictive index of expertise that ana-
lyzed speech activity patterns and linguistic content during interruptions within
a group correctly distinguished expert from non-expert students 95% of the time
within 3 min of interaction [Oviatt et al. 2015], surpassing reliabilities of either sin-
gle data source. Not only did expert students interrupt others more frequently, the
linguistic content of their interruptions more often disagreed, corrected, and di-
rected others on problem-solving actions. In contrast, inexpert students expressed
more uncertainty and apologized more often for errors [Oviatt et al. 2015]. This ex-
ample leverages both speech activity patterns and linguistic content analysis during
small group problem solving.
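A toy version of combining speech activity with linguistic content during interruptions might look like the sketch below, which counts interruptions per speaker and scores simple lexical markers of certainty versus uncertainty. The marker word lists, transcripts, and scoring are illustrative assumptions, not the published predictive index.

```python
# Toy sketch of combining speech activity with linguistic content during
# interruptions, as described above. Marker lists and transcripts are invented.
CERTAINTY_MARKERS = {"no", "wrong", "actually", "should", "must"}
UNCERTAINTY_MARKERS = {"maybe", "sorry", "guess", "think", "perhaps"}

# Hypothetical interruption events: (speaker, transcript of the interruption).
interruptions = [
    ("A", "no that is wrong, you should divide first"),
    ("A", "actually we must use the second equation"),
    ("B", "sorry, maybe I added it wrong"),
]

scores = {}
for speaker, text in interruptions:
    words = {w.strip(",.") for w in text.lower().split()}
    certainty = len(words & CERTAINTY_MARKERS) - len(words & UNCERTAINTY_MARKERS)
    entry = scores.setdefault(speaker, {"interruptions": 0, "certainty": 0})
    entry["interruptions"] += 1
    entry["certainty"] += certainty

# Higher values of both features point toward expertise in this toy index.
print(scores)
```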
The multi-level multimodal results outlined above emphasize that there is an
opportunity to strategically combine information sources to achieve far higher
reliabilities in the future. From a pragmatic perspective, these early findings on
the robustness of multimodal prediction also highlight that this is a fertile area
to begin developing analytics systems. From a theoretical viewpoint, they indicate
that new multi-level multimodal findings potentially can lead to more coherent
systems-level views of the process of learning, as illustrated in Section 11.5.2.
Grafsgaard and colleagues analyzed dialogue content, facial and postural movements, hand gestures,
and task actions as information sources. For these interrelated modeling tasks,
the results showed improvement in prediction accuracy when shifting from uni-
modal to bimodal features, and from a bimodal to trimodal collection of features
[Grafsgaard et al. 2014]. The trimodal model for predicting affect far exceeded uni-
and bimodal models, with an R² of 0.52, compared with a best bimodal R² of 0.14.
Most importantly, the trimodal model for predicting students’ actual learning gains
exceeded both uni- and bimodal models, with an R² of 0.54. This compared with a
best dialogue-only R² of 0.37, and a best bimodal (dialogue and nonverbal behav-
ior) R² of 0.47. Now that multimodal learning analytic techniques enable collecting
larger, richer, and more longitudinal datasets, the community will be in a better
position to: (1) examine interrelations among emotion, engagement, and learn-
ing outcomes in different contexts and (2) strategically select the most informative
modalities to assess in order to achieve high enough reliabilities to develop field-
able applications.
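The sketch below illustrates this kind of unimodal-versus-multimodal comparison on synthetic data, reporting cross-validated R² as dialogue, facial, and gesture feature blocks are added. The data and resulting numbers are synthetic and will not match the values reported by Grafsgaard et al. [2014].

```python
# Synthetic illustration of comparing unimodal, bimodal, and trimodal feature
# sets by cross-validated R^2, in the spirit of the comparison above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 300

dialogue = rng.normal(size=(n, 3))   # hypothetical dialogue features
face = rng.normal(size=(n, 3))       # hypothetical facial-expression features
gesture = rng.normal(size=(n, 3))    # hypothetical gesture features

# Synthetic learning gain influenced by all three modalities plus noise.
gain = (dialogue @ [0.5, 0.2, 0.0] + face @ [0.4, 0.1, 0.0]
        + gesture @ [0.3, 0.0, 0.0] + rng.normal(0, 1.0, n))

feature_sets = {
    "dialogue only": dialogue,
    "dialogue + face": np.hstack([dialogue, face]),
    "dialogue + face + gesture": np.hstack([dialogue, face, gesture]),
}
for name, X in feature_sets.items():
    r2 = cross_val_score(LinearRegression(), X, gain, cv=5, scoring="r2").mean()
    print(f"{name}: R^2 = {r2:.2f}")
```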
In their meta-analysis of multimodal affect detection systems, D’Mello and Kory
[2015] summarized that most existing systems rely on person-dependent mod-
els that fuse audio and visual information to detect acted (rather than natural
spontaneous) expressions of basic emotions. Their meta-analysis of 90 multimodal
systems revealed that multimodal emotion recognition is consistently more accu-
rate than unimodal approaches to identifying emotions, by an average magnitude
of 10%. To further improve their robustness advantage, future multimodal ap-
proaches to emotion recognition need to more carefully strategize which specific
information sources to include, so that their complementarity and ability to dampen
major sources of noise are optimized. It certainly should not be assumed that
the same audio-visual sources will always be the best solution. This is an emerging field,
and recognition of more spontaneous emotions, complex emotional blends, timing
and intensity of emotions, and cultural variability in emotional expression remain
difficult challenges yet to be fully addressed. Two major trends in recent emotion
recognition research have been an increase in: (1) assessment and modeling of
multimodal information sources and (2) recognition of naturalistic emotional ex-
pressions, rather than acted ones [Zeng et al. 2009].
To facilitate research on emotion recognition, a variety of software toolkits
are available, such as the Computer Expression Recognition Toolbox (CERT) for
tracking fine-grained facial movements (e.g., eyebrow raising, eyelid tightening,
mouth dimpling; [Littlewort et al. 2011]). This toolkit has been used to analyze
video recordings of naturalistic learning interactions [Grafsgaard et al. 2013],
and is based on Ekman and Friesen’s [1978] Facial Action Coding System. Other
off-the-shelf computer vision toolkits for analyzing faces include Visage [Visage
Technologies 2016] and OpenFace [Baltrusaitis et al. 2016].
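As an example of how output from such toolkits feeds into learning analytics pipelines, the sketch below summarizes per-frame facial action unit (AU) intensities into per-session features. The column names (e.g., "AU01_r", "success") follow OpenFace's typical per-frame CSV output but are assumptions that may need adjusting for other toolkits or versions, and the file path is hypothetical.

```python
# Sketch of summarizing facial action unit (AU) output from a toolkit such as
# OpenFace [Baltrusaitis et al. 2016] into per-session features.
import pandas as pd

def au_session_features(csv_path):
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]  # headers often carry spaces

    # Keep only frames the tracker marked as successful, if that column exists.
    if "success" in df.columns:
        df = df[df["success"] == 1]

    # Intensity columns conventionally look like "AU01_r", "AU12_r", etc.
    au_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]
    feats = {}
    for c in au_cols:
        feats[f"{c}_mean"] = float(df[c].mean())            # average intensity
        feats[f"{c}_active"] = float((df[c] > 1.0).mean())  # rough activation rate
    return feats

# Example usage (hypothetical file):
# print(au_session_features("student_session_01.csv"))
```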
Limited-resource Theories
The aim of this section is to explain why the multimodal data we observe from
student learners contain certain characteristics when they are novices, but quite
different ones after they develop expertise in a domain, or a higher level of skilled
performance. Cognitive limited-resource theories (Working Memory theory, Cogni-
tive Load theory) have been used to account for the general finding that as people
learn their behavior becomes more fluent, which occurs across modalities (e.g.,
speech, writing). In Cheng’s research, expert mathematicians produced fewer and
shorter pauses when handwriting math formulas, a signal-level adaptation that re-
sulted in greater fluency [Cheng and Rojas-Anaya 2007, 2008]. In Oviatt’s research,
students who were more expert in math also produced fewer disfluencies at the
lexical level, resulting in more fluent communication when speaking and writ-
ing [Oviatt and Cohen 2014]. Limited-resource theories claim these changes occur
through repeated engagement in activity procedures (e.g., writing calculations for
math formulas), which causes people to perceive and group isolated lower-level
units of information into higher-level organized wholes [Baddeley and Hitch 1974].
That is, through repeated engagement in activity procedures, more expert students
learn to apprehend larger patterns and strategies. As a result, domain experts do
not need to retain and retrieve as many units of information from working memory
during a task, which frees up their memory reserves for focusing on more difficult
tasks or acquiring new knowledge.
From a more linguistic perspective, limited-resource theories have character-
ized people as adaptively conserving energy at the signal, lexical, and dialogue
levels. For example, Lindblom’s H & H Theory [1990] maintains that speakers strive
for articulatory economy by using relaxed hypo-clear speech, unless the intelligi-
bility of their communication is threatened. In this case, they will expend more
effort to produce hyper-clear speech. Likewise, at the lexical and dialogue levels
the Theory of Least Collaborative Effort states that people reduce their descrip-
tions as they achieve common ground with a dialogue partner [Clark and Brennan
1991, Clark and Schaefer 1989]. During learning activities, research shows that
expert students likewise adapt by becoming more concise. They use more apt
and succinct technical terms, and more concise noun phrase descriptions with
a lower ratio of words to concepts [Purandare and Litman 2008]. Based on the
Math Data Corpus, experts’ explanations of their math answers were 15% more
concise in number of words, compared with non-experts [Oviatt et al. 2015]. When
writing, expert math students likewise wrote more compact domain-specific sym-
bols [Oviatt et al. 2007], as well as 13% fewer total written representations. One
important impact of this adaptation is to minimize communicators’ cognitive
load, as well as the total energy they expend, so they can focus on more difficult
tasks.
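The sketch below gives a toy version of such conciseness measures: explanation length in words and a crude words-per-technical-term ratio. The term list and example transcripts are invented for illustration and are not the measures used in the cited studies.

```python
# Toy sketch of conciseness features of the kind discussed above: explanation
# length and a crude words-per-math-term ratio. Term list and transcripts are
# illustrative assumptions only.
MATH_TERMS = {"radius", "diameter", "circumference", "ratio", "equation", "pi"}

def conciseness_features(transcript):
    words = transcript.lower().split()
    n_words = len(words)
    n_terms = sum(w.strip(",.") in MATH_TERMS for w in words)
    return {
        "n_words": n_words,
        "math_terms": n_terms,
        "words_per_term": n_words / n_terms if n_terms else float("inf"),
    }

expert = "circumference equals pi times diameter so divide by pi"
novice = ("well I think you take the distance around it and then you sort of "
          "divide it by that number that starts with three point one four")

print("expert:", conciseness_features(expert))
print("novice:", conciseness_features(novice))
```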
During interpersonal or human-computer interactions, the literature demon-
strates that human performance improves when people combine different modal-
ities to express complementary information, for example sketching a Punnett
square diagram while speaking a genetics explanation, because this multimodal in-
formation can be processed in separate brain regions simultaneously. This likewise
effectively circumvents working memory limitations related to any single region.
Furthermore, this multimodal advantage can accrue whether simultaneous infor-
mation processing involves two input streams, an input and output stream, or
two output streams [Oviatt 2017]. For example, numerous education studies have
shown that a multimodal presentation format supports students’ learning more
successfully than unimodal presentation [Mayer and Moreno 1998, Mousavi et al.
1995]. When using a multimodal format, larger performance advantages also have
been demonstrated on more difficult tasks, compared with simpler ones [Tindall-
Ford et al. 1997]. Flexible multimodal interaction and interfaces are effective partly
because they support students’ ability to self-manage their own working memory in
a way that reduces cognitive load [Oviatt et al. 2004]. For example, students prefer
to interact unimodally when working on easy problems, but upshift to interact-
ing multimodally on harder ones. From a limited-resource theory viewpoint, when
students upshift to multimodal processing on hard problems they are distribut-
ing information processing across different brain regions. This frees up working
memory and minimizes their cognitive load so performance can be enhanced on
these harder tasks [Oviatt et al. 2004].
ceptual and multimodal motor neural circuits in the brain [James et al. 2017b,
Nakamura et al. 2012]. During this feedback loop, multisensory perception of an
action (e.g., the visual, haptic, and auditory cues involved in writing a letter shape)
primes motor neurons in the observer’s brain (e.g., corresponding finger move-
ments), which facilitates related comprehension and learning (e.g., letter recog-
nition and reading). Neuroscience findings have confirmed that actively writing
symbolic representations like letter shapes, compared with passively viewing or
typing them, stimulates greater brain activation, more durable long-term sensori-
motor memories in the reading neural circuit, and more accurate letter recognition
[James and Engelhardt 2012, Longcamp et al. 2008].
Both Activity and Embodied Cognition theories support multimodal learning
analytic findings that students’ increased physical and communicative activity fa-
cilitates learning and the consolidation of expertise. While Activity theorists like
Luria and Vygotsky were particularly interested in how speech activity facilitates
self-regulation and cognition [Luria 1961, Vygotsky 1962], subsequent research
has shown that increased activity in all communication modalities stimulates
thought. As tasks become more difficult, speech, gesture, and writing all increase
in frequency, reducing cognitive load and improving performance [Comblain 1994,
Goldin-Meadow et al. 2001, Xiao et al. 2003].
Furthermore, multisensory perception and multimodal action patterns facili-
tate learning to a greater extent than unimodal ones. They stimulate more total neu-
ral activity across a range of modalities, more intense bursts of neural activity, more
widely distributed activity across the brain’s neurological substrates, and longer
distance connections (see [Oviatt 2017]). As discussed above, research has demon-
strated that this results in demonstrably improved performance and learning.
See [Oviatt 2017] for a more extended discussion of how increased multisensory-
multimodal activity influences human brain structure and processing.
Figure 11.7 Pen stroke data for a domain expert (top) vs. non-expert (bottom) based on writing a
formula in two consecutive sessions (left). The expert shows a 32% relative reduction in
total energy expended (TE), compared with the non-expert. The expert also shows a 17%
relative reduction in total energy expended over sessions, which corresponded with a
10% learning gain (right top). (CDE=cumulative domain expertise, based on ground-truth
coding in Oviatt et al. [2014])
memory and increased learning. In short, the emergent outcome of this limited-
resource systems-level process is that experts’ reduced energy baseline enables a
larger energy reserve and greater adaptive range for expending energy on initiating
new learning activities. This initiative further expands their expertise, generating
a self-sustaining process of learning facilitating preparation and engagement in
further learning [Oviatt et al. 2016]. In fact, multimodal learning analytics research
conducted on the Math Data Corpus confirmed that experts were four-fold more
active initiating solutions, worked more on harder problems, and learned more
between sessions than non-experts [Oviatt et al. 2016]. For a more extended dis-
cussion of this limited-resource view of cross-level interactions between expertise
status, energy expended on cognitive processing and related language, total energy
reserves, and emergent initiative to engage in more learning activities, see Oviatt
et al. [2016].
As data sources become more richly multimodal, and learning analytic tools more mobile,
privacy violations risk becoming intrusive, pervasive, and potentially damaging to
basic civil liberties. It will be essential that technology designers, educators, and
citizens engage in participatory design of learning analytic tools to ensure that
policy formation protects the privacy, employment, and human rights of students
and teachers from whom data are collected. In the U.S., California already has
passed legislation limiting collection, use, and sale of students’ data by third-party
groups [Singer 2014]. Among the lessons for designers of learning analytic systems
are that: (1) users must be able to understand and control any system developed,
which requires transparency and continuous involvement; (2) a learning analytic
system must not be gameable, or users will discover contingencies and game it to
their advantage; and (3) users of analytic systems must participate in their design
to protect their own privacy.
This chapter has not focused on multimodal learning analytic applications,
which would be premature given the emergent status of this new field. However,
systems are indeed beginning to be designed and developed to provide real-time
feedback to the learner or instructor about learning processes. As one early exam-
ple, Batrinca et al. [2013] developed a public speaking skill training system based
on advanced multimodal sensing and virtual human technologies. In this work, a
virtual audience of software personas responds to the quality of a speaker’s oral
presentation in real time by providing feedback. Similar systems have been de-
veloped using speech and image processing [Kurihara et al. 2007, Nguyen et al.
2012]. Other researchers are contributing to improving these new systems. For ex-
ample, research has examined how different types of multimodal system feedback
can improve students’ learning of presentation skills [Schneider et al. 2015]. An-
other project is extending this work to applications like personnel hiring [Nguyen et
al. 2014]. Eventually, this type of multimodal technology could transform previous
human-scored assessment of presentations, making them more reliable, objective,
and cost-efficient. A major challenge that lies ahead is the development of other
new systems that use multimodal learning analytic findings to adaptively support
learners in more sensitive and useful ways.
To make progress on all of these topics, cognitive, learning, and linguistic scientists will be espe-
cially indispensable members of future research teams. For further discussion by
a multidisciplinary panel of experts on challenging topics in multimodal learning
analytics and the design of multimodal educational technologies, see [James et al.
2017a].
Focus Questions
11.1. Why is multimodal learning analytics as a field emerging now?
11.2. What are the main objectives of multimodal learning analytics, and what are
its main advantages compared with previous learning analytics?
11.3. How do the Math Data Corpus and Oral Presentation Corpus differ? What
opportunities do they each provide for examining learning with multimodal data?
11.4. What main computational methods and techniques are currently being used
to analyze large multi-level multimodal data sets? What are each of their strengths
and limitations?
11.5. Existing learning analytics systems often are used for educational or class-
room management, rather than for assessing learning behaviors or domain exper-
tise in individual students. Why is multimodal learning analytics a more promising
avenue for evaluating learners’ mental state?
11.6. What different levels of analysis does multimodal learning analytics enable?
In what ways do they provide a comprehensive and powerful window on students’
learning?
11.7. What types of data could be collected and analyzed automatically to evaluate
student learning using existing commercially available technology?
11.9. What evidence exists that experts conserve communication energy? What
are the implications for exploring further predictors of expertise using multimodal
learning analytics? And for producing new theory?
11.10. What specific types of data will be needed to support next-step research on
multimodal learning analytics?
11.11. What are the main privacy concerns involved in using multimodal learning
analytics, and how would you approach managing them?
11.12. Using either the Math Data Corpus or Oral Presentation Corpus, design
a multimodal learning analytics study to investigate a question that you think is
interesting. You can plan data analyses using any modality or level of analysis that
would be supportive of your aims. State your hypotheses, and also the potential
uses or impact of any findings.
References
Anoto. 2016. http://pressreleases.triplepointpr.com/2016/02/07/anoto-announces-
acquisition-of-we-inspire-destiny-wireless-and-pen-generations/. Accessed October
21, 2016. 346
K. E. Arnold and M. D. Pistilli. 2012. Course signals at Purdue: Using learning analytics to
increase student success. Proceedings of the 2nd International Conference on Learning
Analytics and Knowledge, ACM/SOLAR, 267–270. DOI: 10.1145/2330601.2330666. 337
A. Arthur, R. Lunsford, M. Wesson, and S. L. Oviatt. 2006. Prototyping novel collaborative
multimodal systems: Simulation, data collection and analysis tools for the next
decade. In Eighth International Conference on Multimodal Interfaces (ICMI’06), pp.
209–226. ACM, New York. DOI: 10.1145/1180995.1181039. 340
A. D. Baddeley and G. J. Hitch. 1974. Working memory. G. A. Bower, editor, The Psychology
of Learning and Motivation: Advances in Research and Theory, Volume 8, pp. 47–89.
Academic Press, New York. 357
R. S. Baker, S. K. D’Mello, M. M. Rodrigo, and A. C. Graesser. 2010. Better to be frustrated than
bored: The incidence, persistence, and impact of learners’ cognitive-affective states
during interactions with three different computer-based learning environments.
International Journal of Human-Computer Studies, 68(4):223–241. DOI: 10.1016/j.ijhcs
.2009.12.003. 356
R. S. Baker, A. T. Corbett, K. R. Koedinger, and A. Z. Wagner. 2004. Off-task behavior in the
cognitive tutor classroom: When students game the system. Proceedings of the SIGCHI
Conference on Human factors in Computing Systems (CHI), pp. 383–390. ACM, New
York. DOI: 10.1145/985692.985741. 337
T. Baltrusaitis, P. Robinson, and L. Morency. 2016. OpenFace: An open source facial behavior
analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of
Computer Vision (WACV), IEEE, New York. DOI: 10.1109/WACV.2016.7477553. 356
L. Batrinca, G. Stratou, A. Shapiro, L.-P. Morency, and S. Scherer. 2013. Cicero: Towards a
multimodal virtual audience platform for public speaking training. Intelligent Virtual
Agents, 116–128. DOI: 10.1007/978-3-642-40415-3_10. 363
P. Boersma. 2002. Praat, a system for doing phonetics by computer. Glot International,
5(9/10):341–345. 362
L. Chen, G. Feng, J. Joe, C. Leong, C. Kitchen, and C. Lee. 2014a. Toward automated
assessment of public speaking skills using multimodal cues. In Proceedings of the
International Conference on Multimodal Interaction (ICMI), pp. 200–203. ACM, New
York. DOI: 10.1145/2663204.2663265. 345, 347, 349, 351
L. Chen, C. Leong, G. Feng, and C. Lee. 2014b. Using multimodal cues to analyze MLA’14 Oral
Presentation Quality Corpus: Presentation delivery and slides quality. In Proceedings
of the 2014 ACM Grand Challenge Workshop on Multimodal Learning Analytics, pp.
45–52. ACM, New York. DOI: 10.1145/2666633.2666640. 347, 350, 351
L. Chen, C. Leong, G. Feng, C. Lee, and S. Somasundaran. 2015. Utilizing multimodal
cues to automatically evaluate public speaking performance. In Proceedings of the
International Conference on Affective Computing and Intelligent Interaction, pp. 394–
400. DOI: 10.1109/ACII.2015.7344601. 343, 347
P. C. H. Cheng, and H. Rojas-Anaya. 2007. Measuring mathematical formula writing
competence: An application of graphical protocol analysis. In D. S. McNamara
and J. G. Trafton editors, Proceedings of the 29th Conference of the Cognitive Science
Society, pp. 869–874. Cognitive Science Society, Austin, TX. 352, 357
P. C. H. Cheng, and H. Rojas-Anaya. 2008. A graphical chunk production model: Evaluation
using graphical protocol analysis with artificial sentences. In B. C. Love, K. McRae
and V. M. Sloutsky editors, Proceedings of 30th Annual Conference of the Cognitive
Science Society, pp. 1972–1977. Cognitive Science Society, Austin, TX. 352, 357
H. Clark and S. Brennan. 1991. Grounding in communication. In Resnick, Levine and
Teasley, editors, Perspectives on Socially Shared Cognition, pp. 127–149, APA,
Washington. 357
H. Clark, and E. F. Schaefer. 1989. Contributing to discourse. Cognitive Science, 13: 259–294.
DOI: 10.1016/0364-0213(89)90008-6. 357
P. Cohen and S. Oviatt. 2017. Multimodal speech and pen interfaces. In S. Oviatt, B. Schuller,
P. Cohen, D. Sonntag, G. Potamianos and A. Krueger, editors, The Handbook of
Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling and Common
Modality Combinations. Morgan & Claypool Publishers, San Rafael, CA. 332
A. Comblain. 1994. Working memory in Down’s Syndrome: Training the rehearsal strategy.
Down’s Syndrome Research and Practice, 2(3):123–126. DOI: 10.3104/reports.42. 359
S. D’Mello, S. Craig, A. Witherspoon, B. McDaniel, and A. Graesser 2008. Automatic
detection of learner’s affect from conversational cues. User Modeling and User Adapted
Interaction, 18(1-2):45–80. DOI: 10.1007/s11257-007-9037-6. 356
S. K. D’Mello and A. C. Graesser. 2010. Multimodal semi-automated affect detection from
conversational cues, gross body language, and facial features. User Modeling and
User- Adapted Interaction, 20(2):147–187. DOI: 10.1007/s11257-010-9074-4. 356
S. K. D’Mello and J. Kory. 2015. A Review and meta-analysis of multimodal affect detection
systems. ACM Computing Surveys, 47(3), article 43, ACM, New York. DOI: 10.1145/
2682899. 352, 354, 355
P. Ekman and W. V. Friesen. 1978. Facial Action Coding System, Consulting Psychologists
Press, Palo Alto, CA. 355
J. H. Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis,
38(4):367–378. DOI: 10.1016/S0167-9473(01)00065-2. 351
N. Hill and W. Schneider. 2006. Brain changes in the development of expertise: Neu-
roanatomical and neurophysiological evidence about skill-based adaptations. In
Ericsson, Charness, Feltovich & Hoffman, editors, The Cambridge Handbook of
Expertise and Expert Performance. Cambridge University Press, 37:653–682. DOI:
10.1017/CBO9780511816796.037. 360
K. James and L. Engelhardt. 2012. The effects of handwriting experience on functional brain
development in pre-literate children. Trends in Neuroscience and Education, 1: 32–42.
DOI: 10.1016/j.tine.2012.08.001. 359
K. James, J. Lester, S. Oviatt, K. Cheng, and D. Schwartz. 2017a. Perspectives on learning
with multimodal technologies. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag,
G. Potamianos and A. Krueger, editors, The Handbook of Multimodal-Multisensor
Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations,
Chapter 13. Morgan & Claypool Publishers, San Rafael, CA. 365
K. James, S. Vinci-Booher, and F. Munoz-Rubke. 2017b. The impact of multimodal-
multisensory learning on human performance and brain activation patterns. In S.
Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos and A. Krueger, editors, The
Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling,
and Common Modality Combinations, Chapter 2. Morgan & Claypool Publishers, San
Rafael, CA. DOI: 10.1145/3015783.3015787. 359
B. Jee, D. Gentner, K. Forbus, B. Sageman, and D. Uttal. 2009. Drawing on experience: Use
of sketching to evaluate knowledge of spatial scientific concepts. In Proceedings of
the 31st Conference of the Cognitive Science Society, Amsterdam. 350
B. Jee, D. Gentner, D. Uttal, B. Sageman, K. Forbus, C. Manduca, C. Ormand, T. Shipley,
and B. Tikoff. 2014. Drawing on experience: How domain knowledge is reflected in
sketches of scientific structures and processes. Research in Science Education, 44(6),
859–883 Springer, Dordrecht. DOI: 10.1007/s11165-014-9405-2. 350
A. Jones, A. Kendira, D. Lenne, T. Gidel, and C. Moulin. 2011. The tatin-pic project: A multi-
modal collaborative work environment for preliminary design. In Proceedings of the
Conference on Computer Supported Cooperative Work in Design (CSCWD), pp. 154–161.
DOI: 10.1109/CSCWD.2011.5960069. 346
D. Kahneman, P. Slovic, and A. Tversky, editors. 1982. Judgment under Uncertainty: Heuristics
and Biases. Cambridge University Press, New York. 364
M. Kapoor and R. Picard. 2005. Multimodal affect recognition in learning environments. In
Proceedings of the ACM Multimedia Conference, ACM, New York. DOI: 10.1145/1101149
.1101300. 352, 354
A. Katsamanis, V. Pitsikalis, S. Theodorakis, and P. Maragos. 2017. Multimodal gesture
recognition. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos and
A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume
1: Foundations, User Modeling, and Common Modality Combinations, Chapter 11.
Morgan & Claypool Publishers, San Rafael, CA. DOI: 10.1145/3015783.3015796. 332
R. F. Kizilcec, C. Piech, and E. Schneider. 2013. Deconstructing disengagement: analyzing
learner subpopulations in massive open online courses. In Proceedings of the Third
International Conference on Learning Analytics and Knowledge, ACM, New York, pp.
170–179. DOI: 10.1145/2460296.2460330. 337
S. Kopp and K. Bergmann. 2017. Using cognitive models to understand multimodal
processes: The case for speech and gesture production. In S. Oviatt, B. Schuller,
S. Y. Mousavi, R. Low, and J. Sweller. 1995. Reducing cognitive load by mixing auditory and
visual presentation modes. Journal of Educational Psychology, 87(2):319–334. DOI:
10.1037/0022-0663.87.2.319. 358
K. Nakamura, W-J. Kuo, F. Pegado, L. Cohen, O. Tzeng, and S. Dehaene. 2012. Universal
brain systems for recognizing word shapes and handwriting gestures during reading.
In Proceedings of the National Academy of Science, 109(50):20762–20767. DOI: 10.1073/
pnas.1217749109. 359
Y. I. Nakano and R. Ishii. 2010. Estimating user’s engagement from eye-gaze behaviors
in human-agent conversations. In Proceedings of the International Intelligent User
Interface Conference, pp. 139–148. ACM, New York. DOI: 10.1145/1719970.1719990.
354
A.-T. Nguyen, W. Chen, and M. Rauterberg. 2012. Online feedback system for public
speakers. IEEE Symposium on e-Learning, e-Management and e-Services. Citeseer,
Malaysia, 46–51. DOI: 10.1109/IS3e.2012.6414963. 343, 363
L. S. Nguyen, D. Frauendorfer, M. S. Mast, and D. Gatica-Perez. 2014. Hire me: Computational
inference of hirability in employment interviews based on nonverbal behavior. IEEE
Transactions on Multimedia, 16(4):1018–1031. 363
X. Ochoa. Description of the presentation quality dataset and coded documents,
unpublished manuscript. 343
X. Ochoa, K. Chiluiza, G. Méndez, G. Luzardo, B. Guamán, and J. Castells. 2013. Expertise
estimation based on simple multimodal features. In Proceedings of the 15th ACM
International Conference on Multimodal Interaction, pp. 583–590. ACM, New York, NY.
DOI: 10.1145/2522848.2533789. 350
S. L. Oviatt. 2013a. Problem solving, domain expertise, and learning: Ground-truth
performance results for Math Data Corpus. In International Data-Driven Grand
Challenge Workshop on Multimodal Learning Analytics. ACM, New York. DOI: 10.1145/
2522848.2533791. 340, 341
S. Oviatt. 2013b. The Design of Future Educational Interfaces. Routledge Press, New York, NY.
DOI: 10.4324/9780203366202. 333, 356
S. Oviatt. 2017. Theoretical foundations of multimodal interfaces and systems. In S. Oviatt,
B. Schuller, P. Cohen, D. Sonntag, G. Potamianos and A. Krueger, editors, The
Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling
and Common Modality Combinations, Chapter 1. Morgan & Claypool Publishers, San
Rafael, CA. 356, 358, 359
S. L. Oviatt, A. Arthur, Y. Brock, and J. Cohen. 2007. Expressive pen-based interfaces for math
education. In Chinn, Erkens and Puntambekar, editors, Proceedings of the Conference
on Computer-Supported Collaborative Learning, International Society of the Learning
Sciences, 8(2):569–578. 350, 358
S. Oviatt and A. Cohen. 2013. Written and multimodal representations as predictors of
expertise and problem-solving success in mathematics. In Proceedings of the 15th ACM
International Conference on Multimodal Interaction (ICMI). ACM, New York.
Combinations, Chapter 9. Morgan & Claypool Publishers, San Rafael, CA. DOI: 10
.1145/3015783.3015794. 332
M. Raca and P. Dillenbourg. 2013. System for assessing classroom attention. In Proceedings
of the Third International Conference on Learning Analytics and Knowledge, pp. 265–269.
ACM, New York. DOI: 10.1145/2460296.2460351. 345, 354
M. Raca and P. Dillenbourg. 2014. Holistic analysis of the classroom. In Proceedings of the
ACM International Data-Driven Grand Challenge workshop on Multimodal Learning
Analytics, pp. 13–20. ACM, New York. DOI: 10.1145/2666633.2666636. 346
C. Rich, B. Ponsler, A. Holroyd, and C. Sidner. 2010. Recognizing engagement in human-
robot interaction. In Proceedings of the International Conference on Human-Robot
Interaction, pp, 375–382. ACM, New York. 354
J. Sanghvi, A. Pereira, G. Castellano, P. McOwan, I. Leite, and A. Paiva. 2011. Automatic
analysis of affective postures and body motion to detect engagement with a game
companion. In Proceedings of the International Conference on Human-Robot Interaction.
ACM, New York. DOI: 10.1145/1957656.1957781. 354
J. Schneider, D. Börner, P. Van Rosmalen, and M. Specht. 2015. Presentation trainer,
your public speaking multimodal coach. In Proceedings of the ACM on International
Conference on Multimodal Interaction, pp. 539–546. ACM, New York. DOI: 10.1145/
2818346.2830603. 363
L. M. Schreiber, G. D. Paul, and L. R. Shibley. 2012. The development and test of the
public speaking competence rubric. Communication Education. 61(3):205–233. DOI:
10.1080/03634523.2012.670709. 345
A. Serrano-Laguna and B. Fernández-Manjón. 2014. Applying learning analytics to simplify
serious games deployment in the classroom. Global Engineering Education Conference
(EDUCON), pp. 872–877. IEEE, New York. DOI: 10.1109/EDUCON.2014.6826199. 337
N. Singer. 2014. With tech taking over schools, worries rise. New York Times. September 15,
2014. 363
S. Tindall-Ford, P. Chandler, and J. Sweller. 1997. When two sensory modes are better than
one. Journal of Experimental Psychology: Applied, 3(3):257–287. DOI: 10.1037/1076-
898X.3.4.257. 358
Visage Technologies. 2016. http://visagetechnologies.com. Accessed October 22, 2016. 356
L. Vygotsky. 1962. Thought and Language. MIT Press, Cambridge, MA. (Translated by E.
Hanfmann and G. Vakar from 1934 original). 338, 359
B. Woolf, W. Burleson, I. Arroyo, T. Dragon, D. Cooper, and R. Picard. 2009. Affect-aware
tutors: recognising and responding to student affect. International Journal of Learning
Technology, 4(3-4):129–164. DOI: 10.1504/IJLT.2009.028804. 356
M. Worsley and P. Blikstein. 2010. Toward the development of learning analytics: Student
speech as an automatic and natural form of assessment. Annual Educational Research
Association Conference. 349, 350
Biographies
Editors
Philip R. Cohen (Monash University) is Director of the Laboratory for Dialogue Re-
search and Professor of Artificial Intelligence in the Faculty of Information Technol-
ogy at Monash University. His research interests include multimodal interaction,
human-computer dialogue, and multi-agent systems. He is a Fellow of the Amer-
ican Association for Artificial Intelligence, past President of the Association for
Computational Linguistics, recipient (with Hector Levesque) of the Inaugural In-
fluential Paper Award by the International Foundation for Autonomous Agents and
Multi-Agent Systems, and recipient of the 2017 Sustained Achievement Award from
the International Conference on Multimodal Interaction.
He was most recently Chief Scientist in Artificial Intelligence and Senior Vice
President for Advanced Technology at Voicebox Technologies, where he led efforts
on semantic parsing and dialogue. Cohen founded Adapx Inc. to commercialize
multimodal interaction, deploying multimodal systems to civilian and government
organizations. Prior to Adapx, he was Professor and Co-Director of the Center for
Human-Computer Communication in the Computer Science Department at Ore-
gon Health and Science University and Director of Natural Language in the Artificial
Intelligence Center at SRI International. Cohen has published more than 150 arti-
cles and has 5 patents. He co-authored the book The Paradigm Shift to Multimodality
in Contemporary Computer Interfaces (2015, Morgan & Claypool Publishers) with Dr.
Sharon Oviatt. (Contact: philip.cohen@monash.edu)
IUI, and Pervasive Computing. He is also on the Steering Committee of the Inter-
national Conference on Intelligent User Interfaces (IUI) and an Associate Editor of
the journals User Modeling and User-Adapted Interaction and ACM Interactive, Mobile,
Wearable and Ubiquitous Technologies. (Contact: krueger@dfki.de)
Daniel Sonntag (German Research Center for Artificial Intelligence, DFKI) is a Prin-
cipal Researcher and Research Fellow. His research interests include multimodal
and mobile AI-based interfaces, common-sense modeling, and explainable ma-
chine learning methods for cognitive computing and improved usability. He has
published over 130 scientific articles and was the recipient of the German High
Tech Champion Award in 2011 and the AAAI Recognition and IAAI Deployed Appli-
cation Award in 2013. He is the Editor-in-Chief of the German Journal on Artificial
Intelligence (KI) and editor-in-chief of Springer’s Cognitive Technologies book se-
ries. Currently, he leads both national and European projects from the Federal
Ministry of Education and Research, the Federal Ministry for Economic Affairs and
Energy, and Horizon 2020. (Contact: daniel.sonntag@dfki.de)
signal processing. Elisabeth André has served as a General and Program Co-Chair
of major international conferences, including ACM International Conference on
Intelligent User Interfaces (IUI), ACM International Conference on Multimodal
Interfaces (ICMI), and International Conference on Autonomous Agents and Mul-
tiagent Systems (AAMAS). In 2010, Elisabeth André was elected a member of the
prestigious Academy of Europe, the German Academy of Sciences Leopoldina, and
AcademiaNet. To honor her achievements in bringing artificial intelligence tech-
niques to HCI, she was awarded a EurAI (European Coordinating Committee for
Artificial Intelligence) fellowship in 2013. Most recently, she was elected to the CHI
Academy, an honorary group of leaders in the field of human-computer interaction.
Syed Z. Arshad (University of New South Wales) has a Ph.D. in computer science
from the University of New South Wales, Australia. He is a member of IEEE, ACM,
SIGCHI, and SIGKDD. His research interests include cognitive computing, intelli-
gent user interfaces, machine learning, and data visualization techniques.
Samy Bengio (Google) has been a research scientist at Google since 2007. Before
that, he had been a senior researcher in statistical machine learning at IDIAP Re-
search Institute, where he supervised Ph.D. students and postdoctoral fellows. His
research interests span many areas of machine learning such as deep architec-
tures, representation learning, sequence processing, speech recognition, image
understanding, support vector machines, mixture models, large-scale problems,
multi-modal (face and voice) person authentication, brain–computer interfaces,
and document retrieval. He was the program chair of NIPS 2017 and ICLR 2015
and 2016; was the general chair of BayLearn 2012–2015, the Workshops on Machine
Learning for Multimodal Interactions (MLMI) 2004–2006, and the IEEE Workshop
on Neural Networks for Signal Processing (NNSP) in 2002; and served on the pro-
gram committees of several international conferences such as NIPS, ICML, ICLR,
ECML and IJCAI.
Huili Chen (MIT) is a Ph.D. student at the MIT Media Lab. She obtained a Bachelor
of Arts degree in Psychology and earned a Bachelor of Science degree in Computer
Science from the University of Notre Dame in 2016. She is very interested in human-
machine interaction and interactive artificial intelligence.
Matthieu Courgeon (École Nationale d’Ingénieurs de Brest) defended his Ph.D. the-
sis in 2011 on affective computing and interactive autonomous virtual characters.
He is the creator of the Multimodal Affective and Reactive Characters toolkit, used
by several research teams around the world. His research area ranges from interactive expressive artificial humans and robots to collaborative immersive virtual reality. He earned a Best Paper Award at UbiComp 2013 for his work on expressive virtual humans with the Affective Computing team at the MIT Media Lab.
Li Deng (Citadel) has been the Chief Artificial Intelligence Officer of Citadel since
May 2017. Prior to Citadel, he was the Chief Scientist of AI, founder of the Deep Learning Technology Center, and Partner Research Manager at Microsoft (2000–
2017). Prior to Microsoft, he was a tenured full professor at the University of Water-
loo and held teaching and research positions at Massachusetts Institute of Technol-
ogy (1992–1993), Advanced Telecommunications Research Institute (1997–1998),
and HK University of Science and Technology (1995). He has been a Fellow of the
IEEE since 2004, a Fellow of the Acoustical Society of America since 1993, and a
Fellow of the ISCA since 2011. He has also been an Affiliate Professor at Univer-
sity of Washington, Seattle, since 2000. He was elected to the Board of Governors
of the IEEE Signal Processing Society and served as editor-in-chief of the IEEE Sig-
nal Processing Magazine and IEEE/ACM Transactions on Audio, Speech, and Language
Processing from 2008–2014, for which he received the IEEE SPS Meritorious Service
Award. In recognition of his pioneering work on disrupting the speech recognition
industry using large-scale deep learning, he received the 2015 IEEE SPS Technical
Achievement Award for “Outstanding Contributions to Automatic Speech Recogni-
tion and Deep Learning.” He also received numerous best paper and patent awards
for contributions to artificial intelligence, machine learning, information retrieval,
multimedia signal processing, speech processing and recognition, and human lan-
guage technology. He is an author or co-author of six technical books on deep
learning, speech processing, pattern recognition and machine learning, and natu-
ral language processing.
Julien Epps (UNSW Sydney and Data61, CSIRO) is an Associate Professor of Signal
Processing with UNSW Sydney and a Contributed Principal Researcher at Data61,
CSIRO. He has authored or co-authored more than 200 publications and 4 patents,
mainly on topics related to emotion and mental state recognition from speech and behavioral signals. He serves as an Associate Editor for IEEE Transactions on
Affective Computing and Frontiers in ICT (the Human-Media Interaction and Psy-
chology sections), and he recently served as a member of the Advisory Board of
the ACM International Conference on Multimodal Interaction. He has delivered
invited tutorials on topics related to this book for major conferences, including
INTERSPEECH 2014 and 2015 and APSIPA 2010, and invited keynotes for the 4th
International Workshop on Audio-Visual Emotion Challenge (part of ACM Multi-
media 2014) and the 4th International Workshop on Context-Based Affect Recog-
nition (part of AAAC/IEEE Affective Computing and Intelligent Interaction 2017).
Marc Ernst (Ulm University) heads the Department of Applied Cognitive Psychology
at Ulm University. He studied physics in Heidelberg and Frankfurt/Main. In 2000,
he received his Ph.D. from the Eberhard-Karls-University Tübingen for investiga-
tions into the human visuomotor behavior that he conducted at the Max Planck In-
stitute for Biological Cybernetics. For this work, he was awarded the Attempto-Prize
from the University of Tübingen and the Otto-Hahn-Medaille from the Max Planck
Society. He was a research associate at the University of California, Berkeley working
with Prof. Martin Banks on psychophysical experiments and computational mod-
els investigating the integration of visual-haptic information before returning to
the MPI in Tübingen and becoming principal investigator of the Sensorimotor Lab
in the Department of Professor Heinrich Bülthoff. In 2007, he became leader of
the Max Planck Research Group on Human Multisensory Perception and Action,
before joining the University of Bielefeld and the Cognitive Interaction Technology
Center of Excellence (CITEC) in 2011.
Anna Esposito (Wright State University) received her Laurea degree summa cum
laude in Information Technology and Computer Science from the Università di
Salerno in 1989 with the thesis “The Behavior and Learning of a Deterministic Neu-
ral Net” (published in Complex Systems, 6(6), 507–517, 1992). She received her Ph.D.
in Applied Mathematics and Computer Science from the Università di Napoli Fed-
erico II in 1995. Her Ph.D. thesis “Vowel Height and Consonantal Voicing Effects:
Data from Italian” (published in Phonetica, 59(4), 197–231, 2002) was developed at
the MIT Research Laboratory of Electronics (RLE), under the supervision of profes-
sor Kenneth N. Stevens. She completed a postdoctoral program at the International
Institute for Advanced Scientific Studies (IIASS), and was Assistant Professor in the
Department of Physics at Università di Salerno, where she taught classes on cyber-
netics, neural networks, and speech processing (1996–2000). She was a Research
Professor in the Department of Computer Science and Engineering at Wright State
University (WSU) (2000–2002). She is currently a Research Affiliate at WSU and Asso-
ciate Professor in Computer Science in the Department of Psychology at Università
della Campania Luigi Vanvitelli. She has authored more than 170 peer-reviewed
publications in international journals, books, and conference proceedings and
edited or co-edited over 25 international books with Italian, EU, and overseas col-
leagues.
Amr El-Desoky Mousa (Apple Inc.) received his Ph.D. in 2014 from the Chair of
Human Language Technology and Pattern Recognition at RWTH Aachen Univer-
sity, Germany, where his research was focused on automatic speech recognition. In
2014–2015, he worked as a postdoctoral researcher at the Machine Intelligence and
Signal Processing Group at the Technical University of Munich, Germany. In 2016–
2017, he worked as a Research and Teaching Associate at the Chair of Complex
and Intelligent Systems, University of Passau, Germany. In June 2017, he moved
to Apple Inc. to work as a senior Machine Learning Engineer taking part in the re-
search and development related to Siri, one of the most well-known speech-based
Intelligent Assistants in the world. His main research interests include deep learning.
Xavier Ochoa (Escuela Superior Politécnica del Litoral) is currently a Full Profes-
sor in the Faculty of Electrical and Computer Engineering at Escuela Superior
Politécnica del Litoral (ESPOL) in Guayaquil, Ecuador. He directs the Information
Technology Center (CTI) and the research group on Teaching and Learning Tech-
nologies (TEA) at ESPOL. He is currently the Vice President of the Society for Learning Analytics Research (SoLAR), a member of the coordination team of the
Latin American Community on Learning Technologies (LACLO), and president of
the Latin American Open Textbooks Initiative (LATIn). He is editor of the Journal of
Learning Analytics and member of the Editorial Board of the IEEE Transactions on
Learning Technologies. He coordinates several regional and international projects
in the field of learning technologies. His main research interests revolve around
learning analytics, multimodal learning analytics, and data science.
in Artificial Intelligence in 1997. Pantić earned a Ph.D. in 2001 at the Delft Uni-
versity of Technology, with a thesis on facial expression analysis by computational
intelligence techniques.
Olivier Pietquin (Google) completed his Ph.D. in the Faculty of Engineering at the Université de Mons, Belgium, and the University of Sheffield, UK. He then did a
postdoctoral program with Philips in Germany before joining the Ecole Supérieure
d’Electricité, France, in 2005, as an Associate Professor and, later, Professor. He
headed the computer science program of the UMI GeorgiaTech-CNRS (joint lab
in Metz) and also worked with the INSERM IADI team. In 2013, he moved to
the University of Lille, France, as a Full Professor and joined the Inria SequeL
(Sequential Learning) team. Since 2016, Olivier has been on leave with Google, first
with DeepMind in London, and then with Brain in Paris. His current research is
about direct and inverse reinforcement learning, learning from demonstrations,
and applications to human-machine interaction.
techniques for large-scale data fusion with applications in robotics, mining, envi-
ronmental monitoring, and healthcare.
Ognjen (Oggi) Rudovic (MIT) is a Marie Curie Postdoctoral Fellow in the Affective
Computing Group at MIT Media Lab, working on personalized machine learning
for affective robots and analysis of human data. His background is in automatic
control theory, computer vision, artificial intelligence, and machine learning. In
2014, he received a Ph.D. from Imperial College London, UK, where he worked on
machine learning and computer vision models for automated analysis of human
facial behavior. His current research focus is on developing machine-learning-
driven assistive technologies for individuals on the autism spectrum.
Stefan Scherer (Embodied, Inc. and USC) is the CTO of Embodied, Inc. and a Re-
search Assistant Professor in the Department of Computer Science at the University
of Southern California (USC) and the USC Institute for Creative Technologies (ICT),
where he leads research projects funded by the National Science Foundation and the Army Research Laboratory. Stefan Scherer directs the lab for Behavior Analytics
and Machine Learning and is the Associate Director of Neural Information Process-
ing at the ICT. He received the degree of Dr. rer. nat. from the faculty of Engineering
and Computer Science at Ulm University in Germany with the grade summa cum
laude (i.e., with distinction) in 2011. His research aims to automatically identify, characterize, model, and synthesize individuals’ multimodal nonverbal behavior within both human-machine and machine-mediated human-human interaction. His research was recently featured in the Economist, the Atlantic, and the
Guardian, and was awarded a number of best paper awards in renowned interna-
tional conferences. His work is focused on machine learning, multimodal signal
processing, and affective computing. Stefan Scherer serves as an Associate Editor
of the Journal IEEE Transactions on Affective Computing and is the Vice Chair of the
IAPR Technical Committee 9 on Pattern Recognition in Human-Machine Interac-
tion.
Accumulative GSR refers to the summation of GSR values over the task time. If the
GSR is considered as a continuous signal over time, the accumulative GSR is the
integration of GSR values over the task time.
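For illustration only, a minimal Python sketch of this summation, using hypothetical sample values and NumPy's trapezoidal rule:

    import numpy as np

    # Hypothetical GSR samples (microsiemens) recorded at 4 Hz during a task.
    gsr = np.array([2.1, 2.3, 2.2, 2.6, 2.8, 2.7, 2.5])
    sampling_rate_hz = 4.0
    timestamps = np.arange(len(gsr)) / sampling_rate_hz

    # Accumulative GSR as the numerical integral of the signal over the task time.
    accumulative_gsr = np.trapz(gsr, timestamps)
    print(accumulative_gsr)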
Action units and action descriptors are the smallest visually discriminable facial
movements. Action units are movements for which the anatomic basis is known
[Cohn, Ambadar, & Ekman, 2007]. They are represented as either binary events
(presence vs. absence) or with respect to five levels of ordinal intensity. Action
units individually or in combinations can represent nearly all possible facial
expressions.
Active Appearance Model (AAM). An AAM is a statistical model of shape and
grey-level appearance that can generalize to almost any face [Edwards, Taylor,
& Cootes, 1998; Matthews & Baker, 2004]. An AAM seeks to find the model
parameters that can generate a synthetic image as close as possible to a
target image [Cootes & Taylor, 2004]. AAMs are learned from hand-labelled
training data.
Affect. Broad term encompassing constructs such as emotions, moods, and
feelings. It is not the same as personality, motivation, or other related terms.
Affect and social signals can be described as temporal patterns of a multiplicity of
non-verbal behavioral cues which last for a short time [Vinciarelli et al. 2009] and
are expressed as changes in neuromuscular and physiological activity [Vinciarelli
et al. 2008a]. Sometimes, we consciously draw on affect and social signals to alter
the interpretation of a situation, e.g. by saying something in a sarcastic voice
to signal that we actually mean the opposite. At other times, we use them
without being aware of it, e.g., by showing sympathy towards our counterpart
by mimicking his or her verbal and nonverbal expressions. See Chapter 7 of
this volume for an overview of affective and social signals for which automated
recognition approaches have been proposed.
Affect annotation. The process of assigning affective labels (e.g., bored, confused,
aroused) or values (e.g., arousal = 5) to data (e.g., video, audio, text).
Affective computing. Computing techniques and applications involving emotion
or affect.
Affective computing and social signal processing aim at perceiving and interpreting nonverbal behavior with a machine by detecting affect and social signals just as humans do among themselves. This will lead to a new generation of computers that are perceived as more natural, efficacious, and trustworthy [Vinciarelli et al. 2008b, Vinciarelli et al. 2009], making room for more human-like and intuitive interaction [Pantic et al. 2007].
Affective experience-expression link. The relationship between experiencing an
affective state (e.g., feeling confused) and expressing it (e.g., displaying a
furrowed brow).
Affective ground truth. Objective reality involving the “true” affective state. It is a misleading term for psychological constructs like affect.
Alignment A third challenge is to identify the direct relations between (sub)-
elements from two or more different modalities. For example, we may want
to align the steps in a recipe to a video showing the dish being made. To tackle
this challenge we need to measure similarity between different modalities and
deal with possible long-range dependencies and ambiguities.
Classifier: in pattern recognition and machine learning, it is a function that maps
an object of interest (represented through a set of physical measurements called
features) into one of the classes or categories such an object can belong to (the
number of classes or categories is finite).
Bag-of-words (BoW) is a data-driven algorithm for summarizing large volumes of
features. It can be thought of as a histogram whose bins are determined by
partitions (or clusters) of the feature space.
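As a rough Python sketch (the use of k-means to form the partitions, and all values, are assumptions for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical low-level features: one 2-D descriptor per video frame.
    frame_features = rng.normal(size=(500, 2))

    # Partition the feature space into "visual words" (clusters).
    codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(frame_features)

    # Bag-of-words summary of one recording: a histogram of cluster assignments.
    words = codebook.predict(frame_features)
    bow_histogram = np.bincount(words, minlength=8) / len(words)
    print(bow_histogram)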
Canonical Correlation Analysis (CCA) is a tool to infer information based on cross-
covariance matrices. It can be used to identify correlation across heterogeneous
modalities or sensor signals. Let each modality be represented by a feature
vector and let us assume there is correlation across these, such as in
audiovisual speech recognition. Then, CCA will identify linear combinations
of the individual features with maximum correlation amongst each other.
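A minimal Python sketch, assuming scikit-learn and synthetic audio/visual features that share one correlated component:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    # Hypothetical audio and visual feature matrices for the same 200 time frames.
    shared = rng.normal(size=(200, 1))
    audio = np.hstack([shared + 0.1 * rng.normal(size=(200, 1)),
                       rng.normal(size=(200, 3))])
    video = np.hstack([shared + 0.1 * rng.normal(size=(200, 1)),
                       rng.normal(size=(200, 5))])

    # CCA finds linear combinations of each modality with maximum correlation.
    cca = CCA(n_components=1)
    audio_proj, video_proj = cca.fit_transform(audio, video)
    print(np.corrcoef(audio_proj[:, 0], video_proj[:, 0])[0, 1])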
Classifier Combination: in pattern recognition and machine learning, it is a body of
methodologies aimed at jointly using multiple classifiers to achieve a collective decision.
Classifier Diversity: in a set of classifiers that are being combined, the
diversity is the tendency of different classifiers to have different performance in
different regions of the input space.
Congruent expressions of affects. A multimodal combination is said to involve congruent expressions of affects if the combined modalities convey the same affect in terms of the category or the dimension (e.g., the facial expressions express anger and the hand gestures also express anger).
Cooperative learning in machine learning is a combination of active learning and
semi-supervised learning. In the semi-supervised learning part, the machine
learning algorithm labels unlabeled data based on its own previously learned
model. In the active learning part, it identifies which unlabeled data is most
important to be labeled by humans. Usually, cooperative learning tries to
minimize the amount of data to be labeled by humans while maximizing the gain
in accuracy of a learning algorithm. This can be based on confidence measures
such that the machine labels unlabeled data itself as long as it is sufficiently
confident in its decisions. It asks humans for help only where its confidence is
insufficient, but the data seem to be highly informative.
Construct. A conceptual variable that cannot be directly observed (e.g., intelligence,
personality).
Continuous or discrete (i.e., categorical) representations refer to the modeling of a user state or trait. As an example, the age of a user can be modeled as a continuum, such as the age in years. As opposed to this, a discretized representation would use broader age classes such as “young,” “adult,” and “elderly.” In addition, time can be discretized or continuous (in fact, it is always discretized in some respect, at least by the sample rate of the digitized sampling of the sensor signals). However, one would speak of continuous measurement if processing delivers a continuous output stream on a (short) frame-by-frame basis rather than asynchronously processing (larger) segments or chunks of the signal, such as per spoken word or per body gesture.
A Convolutional Neural Network (CNN) is a neural network that contains one or
more convolutional layers. A convolutional layer is a layer that processes an
image (or any other data comprised of points with a notion of distance between
these points, such as an audio signal) by convolving it with a number of kernels.
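A minimal Python sketch of a single convolution step, assuming SciPy and a hand-chosen (rather than learned) kernel; a convolutional layer applies many such kernels and learns their weights from data:

    import numpy as np
    from scipy.signal import convolve2d

    # Hypothetical grayscale image and a single 3x3 edge-detection kernel.
    image = np.random.rand(28, 28)
    kernel = np.array([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]])

    # One convolutional "filter": slide the kernel over the image to get a feature map.
    feature_map = convolve2d(image, kernel, mode='valid')
    print(feature_map.shape)  # (26, 26)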
A correlation is a single number that describes the degree of relationship between
two variables (signals). It most often refers to how close two variables are to
having a linear relationship with each other.
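For example, in Python (hypothetical values):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

    # Pearson correlation coefficient between the two signals (close to 1 here).
    r = np.corrcoef(x, y)[0, 1]
    print(r)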
A dense layer is the basic type of layer in a neural network. The layer takes a
one-dimensional vector as input and transforms it to another one-dimensional
vector by multiplying it by a weight matrix and adding a bias vector.
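A minimal NumPy sketch with assumed dimensions (in practice a nonlinearity usually follows the affine transformation):

    import numpy as np

    # Hypothetical dense layer: 4 inputs, 3 outputs.
    x = np.array([0.5, -1.0, 2.0, 0.1])   # input vector
    W = np.random.randn(3, 4)             # weight matrix
    b = np.zeros(3)                       # bias vector

    y = W @ x + b                         # affine transformation
    print(y.shape)  # (3,)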
Depression refers broadly to the persistence over an extended period of time of sev-
eral of the following symptoms: lowered mood, interest, or pleasure; psychomo-
for each modality, and the outputs of all unimodal models are then combined
to a final prediction based on all modalities.
Encoder-decoder architectures in deep learning start with an encoder neural
network which—based on its input—usually outputs a feature map or vector.
The second part—the decoder—is a further network that—based on the
feature vector from the encoder—provides the closest match either to the
input or an intended output. In most cases, the decoder employs the same network structure but in the opposite orientation. Usually, the training is carried out on unlabeled data, i.e., in an unsupervised manner. The target for learning is
to minimize the reconstruction error, i.e., the delta between the input to the
encoder and the output of the decoder. A typical application is to use encoder-
decoder architectures for sequence-to-sequence mapping, such as in machine
translation where the encoder is trained on sequences (phrases) in one language
and the decoder is trained to map its representation to a sequence (phrase) in
another language.
An ensemble is a set of models and we want the models in the set to differ in their
predictions so that they make different errors. If we consider the space defined
by the three dimensions that define a model as we defined above, the idea is
to sample smartly from that space of learners. We want the individual models
to be as accurate as possible individually, and at the same time, we want them
to complement each other. How these two criteria affect the accuracy of the
ensemble depends on the way we do the combination.
From another perspective, we can view each particular model as one noisy estimate of the real (unknown) underlying problem. For example, in a classification task, each base classifier, depending on its model, hyper-parameters, and input features, learns one noisy estimator of the real discriminant. In
such a perspective, the ensemble approach corresponds to constructing a fi-
nal estimator from these noisy estimators—for example, voting corresponds to
averaging them.
When the different models use inputs in different modalities, there are three
ways in which the predictions of models can be combined, namely, early, late,
and intermediate combination/integration/fusion.
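A minimal Python sketch of combining classifiers by majority voting, with hypothetical predictions from three modality-specific models:

    import numpy as np

    # Hypothetical predicted class labels from three base classifiers on five samples.
    predictions = np.array([
        [0, 1, 1, 0, 2],   # classifier trained on audio features
        [0, 1, 0, 0, 2],   # classifier trained on video features
        [1, 1, 1, 0, 1],   # classifier trained on text features
    ])

    # Majority vote: the ensemble outputs the most frequent label per sample.
    ensemble_labels = np.array([np.bincount(col).argmax() for col in predictions.T])
    print(ensemble_labels)  # [0 1 1 0 2]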
Extraneous load refers to the level of working memory load that a person experi-
ences due to the properties of materials or computer interfaces they are using
[Oviatt 2017].
FACS refers to the Facial Action Coding System [Ekman & Friesen, 1978; Ekman,
Friesen, & Hager, 2002]. FACS describes facial activity in terms of anatomically
based action units (AUs). Depending on the version of FACS, there are 33 to 44
AUs and a large number of additional “action descriptors” and other movements.
Feature-level multimodal fusion. The process of integrating features from different
modalities using diverse methodologies such as concatenating the features
together (early fusion) or combining the models obtained from each modality
at decision level (late fusion).
Fusion A fourth challenge is to join information from two or more modalities to
perform a prediction. For example, for audio-visual speech recognition, the
visual description of the lip motion is fused with the speech signal to predict
spoken words. The information coming from different modalities may have
varying predictive power and noise topology, with possibly missing data in at
least one of the modalities.
Galvanic Skin Response (GSR) is a measure of the conductivity of human skin, and can provide an indication of changes in the human sympathetic nervous system during the cognitive task time.
Gaussian mixture models (GMMs) are probability density functions comprising
a weighted sum of individual Gaussian components, each with their own
mean and covariance. They are commonly employed to compactly characterize
arbitrary distributions (e.g. of features) that are not well-fitted by a single
Gaussian.
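A minimal Python sketch, assuming scikit-learn and synthetic one-dimensional features:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Hypothetical 1-D features drawn from two different regimes.
    features = np.concatenate([rng.normal(-2.0, 0.5, 300),
                               rng.normal(3.0, 1.0, 700)]).reshape(-1, 1)

    # Fit a 2-component GMM; each component has its own mean and covariance.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(features)
    print(gmm.means_.ravel(), gmm.weights_)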
Germane load refers to the level of a person’s effort and activity compatible with
mastering new domain content during learning. It pertains to the cognitive
resources dedicated to constructing new schema in long-term memory.
Gross errors refer to non-Gaussian noise of large magnitude. Gross errors are often abundant in audio-visual data due to incorrect localization and tracking, the presence of partial occlusions, environmental noise, etc.
Hand-crafted features refer to features developed to extract a specific type of
information, usually as part of a hypothesis-driven research study. By contrast,
data-driven features are those extracted automatically from raw signal data by
algorithms (e.g., neural networks), whose physical interpretation often cannot
easily be described.
Incongruent expressions of affects. A multimodal combination is said to involve
incongruent expressions of affects if the combined modalities are conveying
different affects in terms of the category or the dimension (e.g., the facial
expressions express joy, while hand gestures express anger). Such combinations
are also called blends of emotions. They might occur even in a single modality
such as the facial expressions (e.g., the upper part of the face may express a
certain emotion, while the bottom part of the face conveys a different emotion).
Researchers exploring the perception of human (or computer-generated)
multimodal expressions of affects usually consider the following attributes
related to perception and affects that are impacted by or do impact multimodal
perception:
Recognition rate
Reaction time
Affect categories
Affect dimensions
Multimodal integration patterns
Inter-individual differences and personality
Timing: synchrony vs. sequential presentation of the signals in different
modalities
Modality dominance
Context: environment and task related information (e.g., food or
violent scenes, and associated applications for specific users with food
disorders or PTSD)
Task difficulty
Intrinsic load is the inherent difficulty level and related working memory load
associated with the material being processed during a user’s primary task.
Learning analytics involves the collection, analysis, and reporting of data about
learners, including their activities and contexts of learning, in order to under-
stand and provide better support for effective learning. First-generation learning
analytics focused exclusively on computer-mediated learning using keyboard-
and-mouse computer interfaces. This limited analyses to click-stream activity
patterns and linguistic content. Early learning analytic techniques mainly have
been applied to managing educational systems (e.g., attendance tracking, track-
ing work completion), advising students based on their individual profiles (e.g.,
courses taken, grades received, time spent in learning activities), and improving
educational technologies. Learning analytics data typically are summarized on
dashboards for educators, such as teachers or administrators.
Leave-one-out cross validation. Cross validation is the process of dividing a dataset
into batches where one batch is reserved for testing and all the other batches
are used for training a system. Leave-one-out means each batch is formed of a
single instance.
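A minimal Python sketch, assuming scikit-learn and a toy nearest-neighbor classifier:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[0.0], [0.1], [0.9], [1.0], [1.1], [0.2]])
    y = np.array([0, 0, 1, 1, 1, 0])

    correct = 0
    # Each instance is held out once for testing; the rest are used for training.
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

    print(correct / len(y))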
The user (long-term) traits include biological trait primitives (e.g., age, gender,
height, weight), cultural trait primitives in the sense of group/ethnicity mem-
bership (e.g., culture, race, social class, or linguistic concepts such as dialect
or first language), personality traits (e.g., the “OCEAN big five” dimensions
openness, conscientiousness, extraversion, agreeableness, and neuroticism or
likability), and traits that constitute subject idiosyncrasy, i.e., ID.
A longer-term state can subsume (partly self-induced) non-permanent, yet longer-term states (e.g., sleepiness, intoxication, mood such as depression (see also Chapter 12), or the health state such as having the flu), structural (behavioral, interactional, social) signals (e.g., role in dyads and groups, friendship and identity, positive/negative attitude, intimacy, interest, politeness), (non-verbal) social signals (see Chapters 7 and 8), and discrepant signals (e.g., deception (see also Chapter 13), irony, sarcasm, sincerity).
Longitudinal data refers to multiple recordings of the same type from the same
individual at different points in time, between which it is likely that the
individual’s state (e.g., depression score) has changed.
L() is the loss function that measures how far the prediction g(x^t|θ) is from the desired value r^t. The complexity of this optimization problem depends on the particular g() and L(). Different learning algorithms in the machine learning literature differ either in the model they use, the loss function they employ, or how the optimization problem is solved.
The step above optimizes the parameters given a model. Each model has an inductive bias; that is, it comes with a set of assumptions about the data, and the model is accurate if its assumptions match the characteristics of the data. This
implies that we also need a process of model selection where we optimize the
model structure. This model structure depends on dimensions such as (i) the
learning algorithm, (ii) the hyper-parameters of the model (that define model
complexity), and (iii) the input features and representation, or modality. Each
model corresponds to one particular combination of these dimensions.
In machine learning, the learner is a model that takes an input x and learns to produce the correct output y. In pattern recognition, we typically have a classification task where y is a class code; for example, in face recognition, x is the face image and y is the index identifying the person whose face it is.
In building a learner, we start from a data set X = {x^t, r^t}, t = 1, . . . , N, that contains training pairs of instances x^t and the desired output values r^t (e.g., class labels) for them. We assume that there is a dependency between x and r but that it is unknown; if it were known, there would be no need to do any learning and we would just write down the code for the mapping.
Typically, x^t is not enough to uniquely identify r^t; we call x^t the observables, and there may also be unobservables that affect r^t, whose effect we model as noise. This implies that each training pair gives us only a limited amount of information. Another related problem is that in most applications x has a very high dimensionality, and our training set samples this high-dimensional space very sparsely.
Our prediction is given by our predictor g(x^t|θ), where g() is the model and θ is its set of parameters. Learning corresponds to finding the best θ* that makes our predictions as close as possible to the desired values on the training set:

    θ* = arg min_θ Σ_{t=1}^{N} L(r^t, g(x^t|θ))
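As an illustrative sketch (not from the glossary), this minimization can be instantiated with a linear predictor and a squared loss, for which NumPy gives a closed-form least-squares solution; the data below are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    # Training pairs (x^t, r^t): noisy observations of an unknown linear mapping.
    x = rng.uniform(0, 1, 50)
    r = 2.0 * x + 0.5 + 0.1 * rng.normal(size=50)

    # Model g(x | theta) = theta[0] * x + theta[1]; squared loss L(r, g) = (r - g)^2.
    # For this choice, the arg min over theta has a closed-form least-squares solution.
    A = np.column_stack([x, np.ones_like(x)])
    theta_star, *_ = np.linalg.lstsq(A, r, rcond=None)
    print(theta_star)  # approximately [2.0, 0.5]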
Mel frequency cepstral coefficients (MFCCs) are features that compactly represent
the short-term speech spectrum, including formant information, and are widely
used to characterize both spoken content (for automatic speech recognition)
and speaker-specific qualities (for automatic speaker verification). Briefly, a
mel-scale frequency-domain filterbank is applied to the spectrum to obtain mel
filterbank energies, the log of which is transformed to a lower-dimensional
representation using the discrete cosine transform.
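A minimal Python sketch, assuming the librosa library and a hypothetical audio file path:

    import librosa

    # Load an audio file (path is hypothetical) and extract 13 MFCCs per frame.
    signal, sample_rate = librosa.load("speech_sample.wav", sr=16000)
    mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)
    print(mfccs.shape)  # (13, number_of_frames)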
A Recurrent Neural Network (RNN) is a neural network that contains one or more recurrent layers. A recurrent layer is a layer that takes a sequence x indexed by t and processes it element by element, while maintaining a hidden state for each unit in the layer: h_t = RNN(h_{t−1}, x_t), where h_t is the hidden state at step t and RNN is a transition function that computes the next hidden state and depends on the type of hidden layer.
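A minimal NumPy sketch of such a recurrence, with randomly initialized (rather than learned) weights and hypothetical dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    input_dim, hidden_dim, seq_len = 3, 5, 10

    # Hypothetical recurrent layer weights (normally learned from data).
    W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    b_h = np.zeros(hidden_dim)

    x = rng.normal(size=(seq_len, input_dim))   # input sequence
    h = np.zeros(hidden_dim)                    # initial hidden state

    # Simple (Elman-style) transition: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    for x_t in x:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

    print(h.shape)  # (5,)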
Redundancy: tendency of multiple signals or communication channels to carry the
same or widely overlapping information.
Representation A first fundamental challenge is learning how to represent and
summarize multimodal data in a way that exploits the complementarity and
redundancy of multiple modalities. The heterogeneity of multimodal data
makes it challenging to construct such representations. For example, language is
often symbolic while audio and visual modalities will be represented as signals.
A sequence-to-sequence model is a neural network that processes a sequence as
its input and produces another sequence as its output. Examples of such models
include neural machine translation and end-to-end speech recognition models.
Shared hidden layer is a layer within a neural network that is shared within the topology. For example, different modalities, different output classes, or even different databases could be trained in largely separate parts of the network. In the shared hidden layer, however, they share neurons through common connections. This can be an important approach for modeling diverse information types largely independently while providing mutual information exchange at some point in the topology of a neural network.
A short-term state includes the mode (e.g., speaking style and voice quality), emotions, and affects (e.g., confidence, stress, frustration, pain, uncertainty; see also Chapters 6 and 8).
Skilled performance involves the acquisition of skilled action patterns, for example
when learning to become an expert athlete or musician. Skilled performance also
can involve the acquisition of communication skills, as in developing expertise
in writing compositions or giving oral presentations. The role of deliberate
practice has been emphasized in acquiring skilled performance.
Social Signals: constellations of nonverbal behavioural cues aimed at conveying
socially relevant information such as attitudes, personality, intentions, etc.
Social Signal Processing: computing domain aimed at modeling, analysis and
synthesis of social signals in human-human and human-machine interactions.
A softmax layer is a dense layer followed by the softmax nonlinearity. The softmax nonlinearity takes a one-dimensional vector v of real numbers and normalizes it into a probability distribution by applying softmax(v)_j = e^{v_j} / Σ_i e^{v_i}, where the sum is over all coordinates of the vector v.
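A minimal NumPy sketch of this normalization:

    import numpy as np

    def softmax(v):
        # Subtract the maximum for numerical stability; the result is unchanged.
        e = np.exp(v - np.max(v))
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # entries sum to 1.0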
Spatio-temporal features are those which have both a spatial and a time dimension.
For example, the intensities of pixels in a video vary both in terms of their position
within a given frame (spatial dimension) and in terms of the frame number for
a given pixel coordinate (temporal dimension).
Support vector machine (SVM) is a widely used discriminative classification method that defines a separating hyperplane between two classes of features; the hyperplane is defined in terms of particular feature instances that are close to the class boundaries, called support vectors.
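A minimal Python sketch, assuming scikit-learn and toy two-dimensional features:

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical 2-D features for two classes.
    X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # Linear SVM; the separating hyperplane is defined by the support vectors.
    clf = SVC(kernel="linear").fit(X, y)
    print(clf.support_vectors_)
    print(clf.predict([[0.15, 0.15], [1.05, 1.0]]))  # [0 1]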
Support vector regression (SVR) is a commonly used method for multivariate
regression, which concentrates on fitting a model by considering only training
features that are not very close to the model prediction.
Temporal dynamics of facial expression: rather than being like a single snapshot,
facial appearance changes as a function of time. Two main factors affecting
temporal dynamics of facial expression are the speed with which expressions unfold and the changes of their intensity over time.
Transfer learning helps to reuse knowledge gained in one task for another task in machine learning. It can be executed on different levels, such as the feature or model level. For example, a neural network can first be trained on a task related to the task of interest. Then, the actual task of interest is trained “on top” of this pre-training of the network. Likewise, rather than starting to train the target task of interest from a random initialization of a network, related data can be used to provide a better starting point.
Transfer of learning refers to students’ ability to recognize parallels and make
use of learned information in new contexts, for example to apply learned
information outside of the classroom, in contexts not resembling the original
framing of the problem, with different people present, and so forth. This
requires generalizing the learned information beyond its original concrete
context.
Translation A second challenge addresses how to translate (map) data from one
modality to another. Not only is the data heterogeneous, but the relationship
between modalities is often open-ended or subjective. For example, there exist
a number of correct ways to describe an image and one perfect translation may
not exist.
T-unit analysis. The analysis of terminable units of language (T-units), where a T-unit is the smallest group of words that could be considered a grammatical sentence, regardless of how it is punctuated. T-unit analysis is used extensively to measure the overall complexity of both speech and writing samples, and consists mainly of measuring different aspects of their syntactic construction, such as the mean length of the T-units and the number of clauses present in each unit, among others.
User-independent model A model that generalizes to a different set of users beyond
those used to develop the model.
Voice quality refers to the type of phonation during voiced speech. Depending
on the physical movement of the vocal folds during phonation, the perceived
quality of speech can change, even for the same speech sound uttered at the
same pitch. Descriptors such as “creaky” and “breathy” are applied to specific
modes of vocal fold vibration.
Vowel space area is a term given to the two-dimensional area enclosed by lines
connecting pairs of vowels in the formant (F1/F2) space.
Zero-shot learning is a method in machine learning to learn a new task without
any training examples for this task. An example could be recognizing a new type
of object without any visual example but based on a semantic description such
as specific features that describe the object.