Multimodal Learning Analytics: Assessing Learners’ Mental State During the Process of Learning
Sharon Oviatt, Joseph Grafsgaard, Lei Chen, Xavier Ochoa
11.1 Introduction
Today, new technologies are making it possible to assess what students know at
fine levels of detail, and to model human learning in ways that support entirely
new insights about the process of learning. These developments are part of a
general recent trend to understand users’ mental state, which is illustrated in this
volume’s chapters on reliably detecting cognitive load (Chapter 10), depression and
related disorders (Chapter 12), and deception (Chapter 13). Progress in all of these
areas has been directly enabled by new computational methods and strategies for
collecting and analyzing multimodal-multisensor data.
In this chapter, we focus on students’ mental state during the process of learning,
which is a complex activity that can only be evaluated over time. We describe
multimodal learning analytics, an emerging field that includes far more powerful
predictive capabilities than preceding work on learning analytics, which has been
based solely on click-stream and linguistic analysis of typed input.
The achievement of learning fundamentally involves consolidating new perfor-
mance skills (e.g., ability to communicate orally), types of domain expertise (e.g.,
knowledge of math), and metacognitive awareness (e.g., ability to diagnose errors).
In this chapter, we focus mainly on new research involving the first two areas: ana-
lytics for detecting skilled performance and domain expertise. We also briefly discuss
related mental states that are prerequisites for learning, in particular attention, mo-
tivation, and engagement, as well as initial work on the moment of insight during
problem solving. However, research on multimodal learning analytics has yet to in-
vestigate, or attempt to predict, deeper aspects of learning—such as the emergence
of metacognitive awareness, or the ability to transfer learning to new contexts. The
Glossary provides definitions of terminology used throughout this chapter. In ad-
dition, Focus Questions are available to aid readers in understanding important
chapter content.
The purpose of this chapter is not to survey learning technologies more gen-
erally, including first-generation learning analytics based on keyboard-and-mouse
computer interfaces. This chapter also does not cover the development of appli-
cations based on new multimodal learning analytic techniques, which would be
premature, although new applications are discussed in Section 11.6. For a more
in-depth discussion of the modalities typically included in multimodal learning
analytics work, readers are referred to other handbook chapters on speech, writ-
ing, touch, gestures, gaze, image processing, and activity recognition using sen-
sors [Cohen and Oviatt 2017, Hinckley 2017, Katsamanis et al. 2017, Kopp and
Bergmann 2017, Potamianos et al. 2017, Qvarfordt 2017].
Glossary
representational level (e.g., linguistic content), metacognitive level (e.g., intentional use
of face-to-face speech for persuasion), and others. In this sense, collecting speech
provides very rich data beyond simply knowing the content of what students said.
Paralinguistic information, such as speech amplitude, pitch, and accent, can clarify
students’ attitude, motivation, and attentional focus—important preconditions
for learning. Multimodal learning analytics also supports multi-level data collection.
11.2.3 How Does Multimodal Learning Analytics Differ from Earlier Work on Learning Analytics, and What Are Its Primary Advantages?
Initial work on learning analytics aimed to collect and analyze data to better un-
derstand and support learners and educational processes. It focused on examining
students’ input during computer-mediated learning while they used keyboard-and-
mouse interfaces. As a result, analytics were limited to activity patterns and linguis-
tic content present in click-stream data. These analytics typically were summarized
on dashboards for professional educators, such as teachers or school administra-
tors. They mainly have been used to assist with managing educational systems
(e.g., attendance monitoring, work completion). Later, they started to be used to
predict and advise students on educational outcomes, often based on individual
student profiles in comparison with a larger group. For example, the system might
track the time spent by a student in different learning activities, her work comple-
tion, and grades received. This could be used to provide advisory information to the
student on the likelihood that she will meet her learning goals, and what actions
she could take to improve her expected educational outcome. One common aim of
these capabilities, for example, has been to create early warning systems to detect
and reduce student dropouts. Learning analytic techniques have been applied to
a variety of technology-mediated learning environments, including learning man-
agement systems [Arnold and Pistilli 2012], intelligent tutoring systems [Baker et al.
2004], massive open online courses [Kizilcec et al. 2013], educational video games
[Serrano-Laguna and Fernández-Manjón 2014], and so forth. In these contexts,
learning analytic results also have been used to study and improve the impact of
educational technologies.
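To make this workflow concrete, the sketch below shows the kind of profile-based advisory model described above, implemented as a simple logistic regression over hypothetical clickstream features (time logged, work completion, grades). The feature names and synthetic data are assumptions for illustration only; they are not drawn from Course Signals or any other cited system.

```python
# Minimal sketch of a clickstream-based early-warning model of the kind
# described above. Features and data are hypothetical; the cited systems
# use their own features, data, and models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500

# Hypothetical per-student features: hours logged, fraction of work
# completed, and mean grade so far (0-100).
hours_logged = rng.gamma(shape=2.0, scale=5.0, size=n)
work_completed = rng.uniform(0.0, 1.0, size=n)
mean_grade = rng.normal(70, 12, size=n)
X = np.column_stack([hours_logged, work_completed, mean_grade])

# Synthetic "at risk" label loosely tied to the features, for illustration only.
logit = -0.15 * hours_logged - 3.0 * work_completed - 0.04 * (mean_grade - 70)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Per-student risk scores of this kind could drive the advisory dashboards
# and early-warning systems described above.
risk = model.predict_proba(X_test)[:, 1]
print("AUC:", round(roc_auc_score(y_test, risk), 2))
```

In a deployed system, such risk scores would be surfaced on an educator or student dashboard rather than printed.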
Like multimodal learning analytics, earlier work on learning analytics relied
heavily on data resources, which are expensive and time-consuming to collect. Both
areas have been part of the broader data science and analytics movement. However,
major differences between learning analytics and multimodal learning analytics
include the following:
. The modalities described above provide a richer collection of data sources that offer more
powerful predictive capabilities. Since they are based on natural commu-
nication modalities that mediate thought [Luria 1961, Vygotsky 1962], they
provide a particularly informative window on human thinking and learning.
In this regard, multimodal learning analytic data are well suited to the task
of predicting students’ mental state and learning.
. Learning analytics does not produce as large a dataset per unit of time as mul-
timodal learning analytics, because the latter typically records multiple
streams of synchronized data during the same recorded time period (e.g., 12
data streams in the Math Data Corpus, described in Section 11.3.1). Analysis
of these multimodal data streams can support a more comprehensive multi-
level analysis of learning, including how different levels interact to produce
emergent learning phenomena. This can generate a qualitatively different
systems-level view of the process of learning, as illustrated in Section 11.5.2.
. Learning analytics has been limited to analyzing computer-mediated learning.
In contrast, multimodal learning analytic techniques can be used to evaluate
both interpersonal and technology-mediated learning. This is an important
distinction because learning occurs in both contexts. In addition, evaluation
of interpersonal learning outcomes is needed to provide a “ground-truth” for
assessing the impact of different types of technology-mediated learning.
. Learning analytics has been more limited to assessment in stationary settings,
compared with multimodal learning analytics, because it typically has been
used on desktop and larger computers that support keyboard input. In con-
trast, multimodal learning analytic techniques originated to accommodate
mobile devices and more natural interfaces, so their usage context extends
to a wider range of field and mobile learning settings.
. Learning analytics in its original form mainly focused on supporting educational
management systems, student profiling and advising to promote academic suc-
cess, researching and improving educational technologies, and similar aims.
In contrast, multimodal learning analytic techniques aim to provide finer-
grained models of student learners and the learning process, including track-
ing individual students’ mental states while they learn. In this regard, it is
learner-centric, with a particular focus on students’ thoughts and mental
states as learning progresses.
. Learning analytics has only been able to collect and analyze data when students
are interacting with a computer (e.g., active typing); if some students don’t type, little or no data about their learning can be captured.
Figure 11.1 Synchronized views from all five video-cameras at the same moment in time during
data collection. Videos A, B and C show close-up views of the three individual students,
video D a wide-angle view of all students, and video E a top-down view of students’
writing and artifacts on the tabletop. (Used with permission from Oviatt [2013a])
Figure 11.2 Synchronized multimodal data collection of students’ images, writing, and speech as
they solved math problems (left); Sample of students’ handwritten work while solving
math problems (right). (Used with permission from Oviatt [2013a])
equations; (3) each student worked individually, writing equations and calculations
on paper as they attempted to solve the problem; (4) one student proposed the
solution, which the students then discussed together until they understood how the
problem had been solved and agreed on it; (5) the group’s solution was submitted,
after which the computer verified whether it was correct or not; (6) if correct, the
computer randomly called upon one student to explain how they had arrived at the
answer; (7) if not correct, the computer could display a worked example of how the
solution had been calculated. Figure 11.2 (right) illustrates a sample of students’
written data while solving the math problems, which averaged 80% nonlinguistic
content such as numbers, symbols, and diagrams [Oviatt and Cohen 2013].
This data resource includes original data files (high-resolution videos, speech
.wav files, digital ink .jpegs and .csv files), extensive coding of problem segmen-
tation and phases, problem-solving correctness, student expertise ratings, repre-
sentational content of students’ writing, and verbatim speech transcriptions of
lexical content. It also includes data on coding reliabilities, and appendices with
math problem sets and other information required for those interested in conduct-
ing their own data analyses. Detailed documentation on this corpus is available in
Oviatt et al. [2014], and in Table 11.1.
Overall, the Math Data Corpus is a well-structured dataset, with authentic learn-
ing and documented improvement in students’ performance over sessions [Oviatt
2013a]. The ground-truth coding of communication and performance during these
Figure 11.3 Sample recording of a speaker and talk slide (left), Kinect analysis of full-body
movement (middle) and data recording instruments (right) from Educational Testing
Service’s multimodal presentation corpus, which had the same basic features as the
Oral Presentation Corpus infrastructure first developed by Ochoa and colleagues. (From
Leong et al. [2015])
The recording setup is shown in Figure 11.3, in which the PC version of Kinect was used so multiple data streams
could be recognized jointly on a PC. They used the Public Speaking Competence
Rubric (PSCR) [Schreiber et al. 2012], which is well recognized and potentially more
valid for ground-truth assessment [Chen et al. 2014a]. They also adopted more rig-
orous human rating procedures, with multiple raters systematically adjudicating
presentation scores. These represent important methodological improvements for
future researchers interested in collecting similar data.
Figure 11.4 Gaze estimation based on camera capture of students’ head pose during classroom
lecture. (Used with permission from Raca and Dillenbourg [2014])
(a) MRD in the classroom (b) MRD from the student’s point of view
Figure 11.5 Design of Multimodal Recording Device (MRD) for high-fidelity classroom data
collection. (Used with permission from Dominguez et al. [2015])
Figure 11.6 Student facial expressions and upper-body movements captured with a camera during
interaction with an intelligent tutoring system (left), and showing processing of pose
information (right). (From Ezen-Can et al. [2015])
Multimodal learning analytics research employs both empirical/statistical and machine learning techniques, and also new hybrid methods. In the area of statistical analysis, earlier
education and cognitive science research on topics like domain expertise reported
a large array of individual “markers” [Ericsson et al. 2006, Purandare and Litman
2008, Worsley and Blikstein 2010]. We refer readers to Ericsson et al. [2006] for a
summary of this body of work, which preceded and has contributed to multimodal
learning analytics. Markers of expert performance, such as the use of linguistic
terms indicating certainty, typically were identified in empirical studies using cor-
relations or standard statistical significance tests.
More recent multimodal learning analytics research has shifted to adopting
statistical techniques that can contribute to prediction of expertise and learning, ei-
ther alone or in combination with the use of machine learning techniques. When
possible predictors are studied, researchers examine the proportion of variance
each one accounts for, in order to distinguish powerful factors from weak ones
(e.g., a factor accounting for 60% rather than 1% of the data variance). More powerful
predictors then are good candidates to consider including in automated machine
learning feature sets [Oviatt et al. 2016]. Researchers also conduct multiple and
stepwise linear regressions and similar analyses to build models that move beyond
simple correlational analysis toward identifying which factors drive the predicted
outcome [Oviatt et al. 2016]. Other research has extracted information
about clusters of related multimodal features that have predictive power, for exam-
ple using principal components analysis [Chen et al. 2014a]. These are examples
of commonly used empirical/statistical techniques, but by no means an exhaus-
tive list.
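The sketch below illustrates these empirical/statistical steps on synthetic data: computing the proportion of variance each candidate predictor accounts for on its own, fitting a multiple regression over all predictors, and running a principal components analysis over the feature set. The feature names are hypothetical and the data are randomly generated; the sketch shows the analysis pattern, not the cited studies' actual features or results.

```python
# Illustrative sketch of the statistical analyses mentioned above: univariate
# variance accounted for, a multiple regression, and a PCA over a multimodal
# feature set. All data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n = 200

# Hypothetical per-student multimodal features (z-scored).
speech_rate = rng.normal(0, 1, n)
pause_dur = rng.normal(0, 1, n)
stroke_pressure = rng.normal(0, 1, n)
gesture_energy = rng.normal(0, 1, n)

# Synthetic expertise score driven mostly by two of the features.
expertise = (0.8 * speech_rate - 0.6 * pause_dur + 0.1 * gesture_energy
             + rng.normal(0, 0.5, n))

X = np.column_stack([speech_rate, pause_dur, stroke_pressure, gesture_energy])
names = ["speech_rate", "pause_dur", "stroke_pressure", "gesture_energy"]

# Proportion of variance accounted for by each predictor alone (univariate R^2),
# used to separate powerful predictors from weak ones.
for name, col in zip(names, X.T):
    r2 = LinearRegression().fit(col[:, None], expertise).score(col[:, None], expertise)
    print(f"{name}: R^2 = {r2:.2f}")

# Multiple regression over all predictors together.
full = LinearRegression().fit(X, expertise)
print("full-model R^2 =", round(full.score(X, expertise), 2))

# PCA to identify clusters of related features that share variance.
pca = PCA(n_components=2).fit(X)
print("variance explained by first two components:", pca.explained_variance_ratio_)
```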
For a detailed technical overview of signal processing and machine learning
methods currently used to model and analyze multimodal data, readers are referred
to other related chapters in this volume (Chapters 1–4). The development of new
machine learning techniques for handling multimodal data is an active area of
research now, especially for topics like audio-visual data processing. Another ac-
tive research area is the development of hybrid analysis techniques that combine
empirical and machine learning capabilities, which is promising since they have
different strengths and weaknesses.
When analyzing the Math Data Corpus at a transactional level, research de-
monstrates that patterns of overlapped speech can identify (1) phases of rapid
problem-solving progress within a group, and (2) level of expertise of students in
the group [Oviatt et al. 2015]. The frequency and duration of overlapped speech in-
creased by 120–143% during the brief “moment-of-insight” phase of group problem
solving, compared with matched intervals. In addition, more expert students inter-
rupted non-experts 37% more often than non-experts interrupted them, in spite of
the fact that the experts talked less overall. These group dynamics involved speech activity
patterns, with no linguistic content analysis required, although linguistic content
analysis can be combined to improve prediction accuracy [Oviatt et al. 2015].
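As a rough illustration of this kind of speech activity analysis, the sketch below computes overlapped-speech events, their total duration, and per-speaker interruption counts from hypothetical diarized segments. The segment format, the rule that the later starter counts as the interrupter, and the toy data are assumptions, not the procedure used by Oviatt et al. [2015].

```python
# Sketch of speech-activity features like those described above: frequency and
# duration of overlapped speech, and who interrupts whom, computed from
# diarized speaker segments. Data and thresholds are illustrative assumptions.
from itertools import combinations

# Hypothetical diarization output: (speaker, start_sec, end_sec).
segments = [
    ("A", 0.0, 4.2), ("B", 3.8, 6.0),   # B starts before A finishes -> overlap
    ("C", 6.5, 9.0), ("A", 8.7, 12.0),  # A overlaps C
    ("B", 12.5, 14.0),
]

def overlap(seg1, seg2):
    """Overlap duration in seconds between two (speaker, start, end) segments."""
    return max(0.0, min(seg1[2], seg2[2]) - max(seg1[1], seg2[1]))

overlap_events = []
for s1, s2 in combinations(segments, 2):
    if s1[0] != s2[0] and overlap(s1, s2) > 0:
        # Treat the later starter as the interrupter (a simplifying assumption).
        interrupter = s1[0] if s1[1] > s2[1] else s2[0]
        overlap_events.append((interrupter, overlap(s1, s2)))

total_overlap = sum(d for _, d in overlap_events)
print("overlap events:", len(overlap_events))
print("total overlapped speech (s):", round(total_overlap, 2))

# Interruption counts per speaker, usable as a per-student activity feature.
counts = {}
for speaker, _ in overlap_events:
    counts[speaker] = counts.get(speaker, 0) + 1
print("interruptions initiated:", counts)
```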
Based on the Oral Presentation Corpus for predicting skilled performance, stu-
dents’ overall score for quality of presentation was predictable based on paralin-
guistic or signal-level features like pitch, pitch variability, speaking rate, speech
fluency (i.e., minimized pauses), and postural and gestural movement energy [Chen
et al. 2014b]. These features likewise did not require any analysis of speakers’ lan-
guage content. However, presentation scores also could be predicted at the repre-
sentational level based on linguistic features like the use of first person pronouns.
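A minimal sketch of extracting signal-level presentation features of this kind is shown below, using the librosa audio library to estimate a pitch track and a crude pause proxy. The specific feature definitions, thresholds, and file path are simplifying assumptions rather than the cited studies' feature sets; features like these could then be fed to a regressor trained against human presentation scores.

```python
# Sketch of signal-level presentation features of the kind listed above
# (pitch statistics, a pause/fluency proxy), extracted with librosa.
# Feature definitions are simplified and the file path is hypothetical.
import numpy as np
import librosa

def prosodic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)

    # Fundamental frequency track (NaN on unvoiced frames).
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    voiced_f0 = f0[~np.isnan(f0)]

    # Crude pause proxy: fraction of low-energy frames.
    rms = librosa.feature.rms(y=y)[0]
    pause_fraction = float(np.mean(rms < 0.5 * np.median(rms)))

    return {
        "pitch_mean_hz": float(np.mean(voiced_f0)) if voiced_f0.size else 0.0,
        "pitch_std_hz": float(np.std(voiced_f0)) if voiced_f0.size else 0.0,
        "voiced_fraction": float(np.mean(voiced_flag)),
        "pause_fraction": pause_fraction,
    }

# Example usage (path is hypothetical):
# print(prosodic_features("presentation_01.wav"))
```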
Whereas non-experts increased their total writing most between easy and moderate tasks, experts in-
creased their writing most between hard and very hard tasks [Oviatt and Cohen
2014]. Both of these examples demonstrate accurate prediction based on a single
activity pattern metric, with no content analysis required.
As a third example, recent work has shown that domain experts can be distin-
guished from non-experts with 92% accuracy based simply on signal-level writing
dynamics, like the average distance and pressure of their written strokes [Zhou et
al. 2014, Oviatt et al. 2016] (see Section 11.5.2). Again, this prediction performance
was based exclusively on low-level signal features, and did not rely on any content
analysis of what students wrote.
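The sketch below illustrates the general recipe of classifying expertise from signal-level writing dynamics: compute per-session summaries of stroke distance and pressure, then train a standard classifier. The stroke format, synthetic data, and choice of classifier are assumptions for illustration; the sketch does not reproduce the cited 92% result or the cited studies' feature definitions.

```python
# Sketch of expertise classification from signal-level writing dynamics, in the
# spirit of the work cited above. Stroke data are simulated for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

def stroke_features(strokes):
    """strokes: list of arrays with columns (x, y, pressure) sampled over time."""
    dists, pressures = [], []
    for s in strokes:
        xy = s[:, :2]
        dists.append(np.sum(np.linalg.norm(np.diff(xy, axis=0), axis=1)))
        pressures.append(np.mean(s[:, 2]))
    return [np.mean(dists), np.std(dists), np.mean(pressures), np.std(pressures)]

def synthetic_session(expert):
    # Expert strokes are simulated here as shorter and lighter on average,
    # purely to give the classifier a learnable signal.
    strokes = []
    for _ in range(int(rng.integers(20, 40))):
        n_pts = int(rng.integers(10, 30))
        scale = 0.8 if expert else 1.0
        xy = np.cumsum(rng.normal(0, scale, (n_pts, 2)), axis=0)
        pressure = rng.normal(0.5 if expert else 0.65, 0.05, (n_pts, 1))
        strokes.append(np.hstack([xy, pressure]))
    return stroke_features(strokes)

X = np.array([synthetic_session(expert=i % 2 == 0) for i in range(120)])
y = np.array([i % 2 == 0 for i in range(120)], dtype=int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(2))
```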
Also based on the Math Data Corpus, a predictive index of expertise that ana-
lyzed speech activity patterns and linguistic content during interruptions within
a group correctly distinguished expert from non-expert students 95% of the time
within 3 min of interaction [Oviatt et al. 2015], surpassing reliabilities of either sin-
gle data source. Not only did expert students interrupt others more frequently, the
linguistic content of their interruptions more often disagreed, corrected, and di-
rected others on problem-solving actions. In contrast, inexpert students expressed
more uncertainty and apologized more often for errors [Oviatt et al. 2015]. This ex-
ample leverages both speech activity patterns and linguistic content analysis during
small group problem solving.
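A toy version of combining speech activity with linguistic content during interruptions might look like the sketch below, which counts interruptions per speaker and scores simple lexical markers of certainty versus uncertainty. The marker word lists, transcripts, and scoring are illustrative assumptions, not the published predictive index.

```python
# Toy sketch of combining speech activity with linguistic content during
# interruptions, as described above. Marker lists and transcripts are invented.
CERTAINTY_MARKERS = {"no", "wrong", "actually", "should", "must"}
UNCERTAINTY_MARKERS = {"maybe", "sorry", "guess", "think", "perhaps"}

# Hypothetical interruption events: (speaker, transcript of the interruption).
interruptions = [
    ("A", "no that is wrong, you should divide first"),
    ("A", "actually we must use the second equation"),
    ("B", "sorry, maybe I added it wrong"),
]

scores = {}
for speaker, text in interruptions:
    words = {w.strip(",.") for w in text.lower().split()}
    certainty = len(words & CERTAINTY_MARKERS) - len(words & UNCERTAINTY_MARKERS)
    entry = scores.setdefault(speaker, {"interruptions": 0, "certainty": 0})
    entry["interruptions"] += 1
    entry["certainty"] += certainty

# Higher values of both features point toward expertise in this toy index.
print(scores)
```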
The multi-level multimodal results outlined above emphasize that there is an
opportunity to strategically combine information sources to achieve far higher
reliabilities in the future. From a pragmatic perspective, these early findings on
the robustness of multimodal prediction also highlight that this is a fertile area
to begin developing analytics systems. From a theoretical viewpoint, they indicate
that new multi-level multimodal findings potentially can lead to more coherent
systems-level views of the process of learning, as illustrated in Section 11.5.2.
Grafsgaard and colleagues analyzed dialogue content, facial and postural movements, hand gestures,
and task actions as information sources. For these interrelated modeling tasks,
the results showed improvement in prediction accuracy when shifting from uni-
modal to bimodal features, and from a bimodal to trimodal collection of features
[Grafsgaard et al. 2014]. The trimodal model for predicting affect far exceeded uni-
and bimodal models, with an R² of 0.52, compared with a best bimodal R² of 0.14.
Most importantly, the trimodal model for predicting students’ actual learning gains
exceeded both uni- and bimodal models, with an R² of 0.54. This compared with a
best dialogue-only R² of 0.37, and a best bimodal (dialogue and nonverbal behav-
ior) R² of 0.47. Now that multimodal learning analytic techniques enable collecting
larger, richer, and more longitudinal datasets, the community will be in a better
position to: (1) examine interrelations among emotion, engagement, and learn-
ing outcomes in different contexts and (2) strategically select the most informative
modalities to assess in order to achieve high enough reliabilities to develop field-
able applications.
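The sketch below illustrates this kind of unimodal-versus-multimodal comparison on synthetic data, reporting cross-validated R² as dialogue, facial, and gesture feature blocks are added. The data and resulting numbers are synthetic and will not match the values reported by Grafsgaard et al. [2014].

```python
# Synthetic illustration of comparing unimodal, bimodal, and trimodal feature
# sets by cross-validated R^2, in the spirit of the comparison above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 300

dialogue = rng.normal(size=(n, 3))   # hypothetical dialogue features
face = rng.normal(size=(n, 3))       # hypothetical facial-expression features
gesture = rng.normal(size=(n, 3))    # hypothetical gesture features

# Synthetic learning gain influenced by all three modalities plus noise.
gain = (dialogue @ [0.5, 0.2, 0.0] + face @ [0.4, 0.1, 0.0]
        + gesture @ [0.3, 0.0, 0.0] + rng.normal(0, 1.0, n))

feature_sets = {
    "dialogue only": dialogue,
    "dialogue + face": np.hstack([dialogue, face]),
    "dialogue + face + gesture": np.hstack([dialogue, face, gesture]),
}
for name, X in feature_sets.items():
    r2 = cross_val_score(LinearRegression(), X, gain, cv=5, scoring="r2").mean()
    print(f"{name}: R^2 = {r2:.2f}")
```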
In their meta-analysis of multimodal affect detection systems, D’Mello and Kory
[2015] summarized that most existing systems rely on person-dependent mod-
els that fuse audio and visual information to detect acted (rather than natural
spontaneous) expressions of basic emotions. Their meta-analysis of 90 multimodal
systems revealed that multimodal emotion recognition is consistently more accu-
rate than unimodal approaches to identifying emotions, by an average magnitude
of 10%. To further improve their robustness advantage, future multimodal ap-
proaches to emotion recognition need to more carefully strategize which specific
information sources to include, so that their complementarity and ability to dampen
major sources of noise are optimized. It certainly should not be assumed that
the same audio-visual sources will always be the best solution. This is an emerging field,
and recognition of more spontaneous emotions, complex emotional blends, timing
and intensity of emotions, and cultural variability in emotional expression remain
difficult challenges yet to be fully addressed. Two major trends in recent emotion
recognition research have been an increase in: (1) assessment and modeling of
multimodal information sources and (2) recognition of naturalistic emotional ex-
pressions, rather than acted ones [Zeng et al. 2009].
To facilitate research on emotion recognition, a variety of software toolkits
are available, such as the Computer Expression Recognition Toolbox (CERT) for
tracking fine-grained facial movements (e.g., eyebrow raising, eyelid tightening,
mouth dimpling; [Littlewort et al. 2011]). This toolkit has been used to analyze
video recordings of naturalistic learning interactions [Grafsgaard et al. 2013],
and is based on Ekman and Friesen’s [1978] Facial Action Coding System. Other
off-the-shelf computer vision toolkits for analyzing faces include Visage [Visage
Technologies 2016] and OpenFace [Baltrusaitis et al. 2016].
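As an example of how output from such toolkits feeds into learning analytics pipelines, the sketch below summarizes per-frame facial action unit (AU) intensities into per-session features. The column names (e.g., "AU01_r", "success") follow OpenFace's typical per-frame CSV output but are assumptions that may need adjusting for other toolkits or versions, and the file path is hypothetical.

```python
# Sketch of summarizing facial action unit (AU) output from a toolkit such as
# OpenFace [Baltrusaitis et al. 2016] into per-session features.
import pandas as pd

def au_session_features(csv_path):
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]  # headers often carry spaces

    # Keep only frames the tracker marked as successful, if that column exists.
    if "success" in df.columns:
        df = df[df["success"] == 1]

    # Intensity columns conventionally look like "AU01_r", "AU12_r", etc.
    au_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]
    feats = {}
    for c in au_cols:
        feats[f"{c}_mean"] = float(df[c].mean())            # average intensity
        feats[f"{c}_active"] = float((df[c] > 1.0).mean())  # rough activation rate
    return feats

# Example usage (hypothetical file):
# print(au_session_features("student_session_01.csv"))
```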
Limited-resource Theories
The aim of this section is to explain why the multimodal data we observe from
student learners contain certain characteristics when they are novices, but quite
different ones after they develop expertise in a domain, or a higher level of skilled
performance. Cognitive limited-resource theories (Working Memory theory, Cogni-
tive Load theory) have been used to account for the general finding that as people
learn their behavior becomes more fluent, which occurs across modalities (e.g.,
speech, writing). In Cheng’s research, expert mathematicians produced fewer and
shorter pauses when handwriting math formulas, a signal-level adaptation that re-
sulted in greater fluency [Cheng and Rojas-Anaya 2007, 2008]. In Oviatt’s research,
students who were more expert in math also produced fewer disfluencies at the
lexical level, resulting in more fluent communication when speaking and writ-
ing [Oviatt and Cohen 2014]. Limited-resource theories claim these changes occur
through repeated engagement in activity procedures (e.g., writing calculations for
math formulas), which causes people to perceive and group isolated lower-level
units of information into higher-level organized wholes [Baddeley and Hitch 1974].
That is, through repeated engagement in activity procedures, more expert students
learn to apprehend larger patterns and strategies. As a result, domain experts do
not need to retain and retrieve as many units of information from working memory
during a task, which frees up their memory reserves for focusing on more difficult
tasks or acquiring new knowledge.
From a more linguistic perspective, limited-resource theories have character-
ized people as adaptively conserving energy at the signal, lexical, and dialogue
levels. For example, Lindblom’s H & H Theory [1990] maintains that speakers strive
for articulatory economy by using relaxed hypo-clear speech, unless the intelligi-
bility of their communication is threatened. In this case, they will expend more
effort to produce hyper-clear speech. Likewise, at the lexical and dialogue levels
the Theory of Least Collaborative Effort states that people reduce their descrip-
tions as they achieve common ground with a dialogue partner [Clark and Brennan
1991, Clark and Schaefer 1989]. During learning activities, research shows that
expert students likewise adapt by becoming more concise. They use more apt
and succinct technical terms, and more concise noun phrase descriptions with
a lower ratio of words to concepts [Purandare and Litman 2008]. Based on the
Math Data Corpus, experts’ explanations of their math answers were 15% more
concise in number of words, compared with non-experts [Oviatt et al. 2015]. When
writing, expert math students likewise wrote more compact domain-specific sym-
bols [Oviatt et al. 2007], as well as 13% fewer total written representations. One
important impact of this adaptation is to minimize communicators’ cognitive
load, as well as the total energy they expend, so they can focus on more difficult
tasks.
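The sketch below gives a toy version of such conciseness measures: explanation length in words and a crude words-per-technical-term ratio. The term list and example transcripts are invented for illustration and are not the measures used in the cited studies.

```python
# Toy sketch of conciseness features of the kind discussed above: explanation
# length and a crude words-per-math-term ratio. Term list and transcripts are
# illustrative assumptions only.
MATH_TERMS = {"radius", "diameter", "circumference", "ratio", "equation", "pi"}

def conciseness_features(transcript):
    words = transcript.lower().split()
    n_words = len(words)
    n_terms = sum(w.strip(",.") in MATH_TERMS for w in words)
    return {
        "n_words": n_words,
        "math_terms": n_terms,
        "words_per_term": n_words / n_terms if n_terms else float("inf"),
    }

expert = "circumference equals pi times diameter so divide by pi"
novice = ("well I think you take the distance around it and then you sort of "
          "divide it by that number that starts with three point one four")

print("expert:", conciseness_features(expert))
print("novice:", conciseness_features(novice))
```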
During interpersonal or human-computer interactions, the literature demon-
strates that human performance improves when people combine different modal-
ities to express complementary information, for example sketching a Punnett
square diagram while speaking a genetics explanation, because this multimodal in-
formation can be processed in separate brain regions simultaneously. This likewise
effectively circumvents working memory limitations related to any single region.
Furthermore, this multimodal advantage can accrue whether simultaneous infor-
mation processing involves two input streams, an input and output stream, or
two output streams [Oviatt 2017]. For example, numerous education studies have
shown that a multimodal presentation format supports students’ learning more
successfully than unimodal presentation [Mayer and Moreno 1998, Mousavi et al.
1995]. When using a multimodal format, larger performance advantages also have
been demonstrated on more difficult tasks, compared with simpler ones [Tindall-
Ford et al. 1997]. Flexible multimodal interaction and interfaces are effective partly
because they support students’ ability to self-manage their own working memory in
a way that reduces cognitive load [Oviatt et al. 2004]. For example, students prefer
to interact unimodally when working on easy problems, but upshift to interact-
ing multimodally on harder ones. From a limited-resource theory viewpoint, when
students upshift to multimodal processing on hard problems they are distribut-
ing information processing across different brain regions. This frees up working
memory and minimizes their cognitive load so performance can be enhanced on
these harder tasks [Oviatt et al. 2004].
ceptual and multimodal motor neural circuits in the brain [James et al. 2017b,
Nakamura et al. 2012]. During this feedback loop, multisensory perception of an
action (e.g., the visual, haptic, and auditory cues involved in writing a letter shape)
primes motor neurons in the observer’s brain (e.g., corresponding finger move-
ments), which facilitates related comprehension and learning (e.g., letter recog-
nition and reading). Neuroscience findings have confirmed that actively writing
symbolic representations like letter shapes, compared with passively viewing or
typing them, stimulates greater brain activation, more durable long-term sensori-
motor memories in the reading neural circuit, and more accurate letter recognition
[James and Engelhardt 2012, Longcamp et al. 2008].
Both Activity and Embodied Cognition theories support multimodal learning
analytic findings that students’ increased physical and communicative activity fa-
cilitates learning and the consolidation of expertise. While Activity theorists like
Luria and Vygotsky were particularly interested in how speech activity facilitates
self-regulation and cognition [Luria 1961, Vygotsky 1962], subsequent research
has shown that increased activity in all communication modalities stimulates
thought. As tasks become more difficult, speech, gesture, and writing all increase
in frequency, reducing cognitive load and improving performance [Comblain 1994,
Goldin-Meadow et al. 2001, Xiao et al. 2003].
Furthermore, multisensory perception and multimodal action patterns facili-
tate learning to a greater extent than unimodal ones. They stimulate more total neu-
ral activity across a range of modalities, more intense bursts of neural activity, more
widely distributed activity across the brain’s neurological substrates, and longer
distance connections (see [Oviatt 2017]). As discussed above, research has demon-
strated that this results in demonstrably improved performance and learning.
See [Oviatt 2017] for a more extended discussion of how increased multisensory-
multimodal activity influences human brain structure and processing.
Figure 11.7 Pen stroke data for a domain expert (top) vs. non-expert (bottom) based on writing a
formula in two consecutive sessions (left). The expert shows a 32% relative reduction in
total energy expended (TE), compared with the non-expert. The expert also shows a 17%
relative reduction in total energy expended over sessions, which corresponded with a
10% learning gain (right top). (CDE=cumulative domain expertise, based on ground-truth
coding in Oviatt et al. [2014])
memory and increased learning. In short, the emergent outcome of this limited-
resource systems-level process is that experts’ reduced energy baseline enables a
larger energy reserve and greater adaptive range for expending energy on initiating
new learning activities. This initiative further expands their expertise, generating
a self-sustaining process of learning facilitating preparation and engagement in
further learning [Oviatt et al. 2016]. In fact, multimodal learning analytics research
conducted on the Math Data Corpus confirmed that experts were four-fold more
active initiating solutions, worked more on harder problems, and learned more
between sessions than non-experts [Oviatt et al. 2016]. For a more extended dis-
cussion of this limited-resource view of cross-level interactions between expertise
status, energy expended on cognitive processing and related language, total energy
reserves, and emergent initiative to engage in more learning activities, see Oviatt
et al. [2016].
As data sources become more richly multimodal, and learning analytic tools more mobile,
privacy violations risk becoming intrusive, pervasive, and potentially damaging to
basic civil liberties. It will be essential that technology designers, educators, and
citizens engage in participatory design of learning analytic tools to ensure that
policy formation protects the privacy, employment, and human rights of students
and teachers from whom data are collected. In the U.S., California already has
passed legislation limiting collection, use, and sale of students’ data by third-party
groups [Singer 2014]. Among the lessons for designers of learning analytic systems
are that: (1) users must be able to understand and control any system developed,
which requires transparency and continuous involvement; (2) a learning analytic
system must not be gameable, or users will discover contingencies and game it to
their advantage; and (3) users of analytic systems must participate in their design
to protect their own privacy.
This chapter has not focused on multimodal learning analytic applications,
which would be premature given the emergent status of this new field. However,
systems are indeed beginning to be designed and developed to provide real-time
feedback to the learner or instructor about learning processes. As one early exam-
ple, Batrinca et al. [2013] developed a public speaking skill training system based
on advanced multimodal sensing and virtual human technologies. In this work, a
virtual audience of software personas responds to the quality of a speaker’s oral
presentation in real time by providing feedback. Similar systems have been de-
veloped using speech and image processing [Kurihara et al. 2007, Nguyen et al.
2012]. Other researchers are contributing to improving these new systems. For ex-
ample, research has examined how different types of multimodal system feedback
can improve students’ learning of presentation skills [Schneider et al. 2015]. An-
other project is extending this work to applications like personnel hiring [Nguyen et
al. 2014]. Eventually, this type of multimodal technology could transform previous
human-scored assessment of presentations, making them more reliable, objective,
and cost-efficient. A major challenge that lies ahead is the development of other
new systems that use multimodal learning analytic findings to adaptively support
learners in more sensitive and useful ways.
To make progress on all of these topics, cognitive, learning, and linguistic scientists will be espe-
cially indispensable members of future research teams. For further discussion by
a multidisciplinary panel of experts on challenging topics in multimodal learning
analytics and the design of multimodal educational technologies, see [James et al.
2017a].
Focus Questions
11.1. Why is multimodal learning analytics as a field emerging now?
11.2. What are the main objectives of multimodal learning analytics, and what are
its main advantages compared with previous learning analytics?
11.3. How do the Math Data Corpus and Oral Presentation Corpus differ? What
opportunities do they each provide for examining learning with multimodal data?
11.4. What main computational methods and techniques are currently being used
to analyze large multi-level multimodal data sets? What are each of their strengths
and limitations?
11.5. Existing learning analytics systems often are used for educational or class-
room management, rather than for assessing learning behaviors or domain exper-
tise in individual students. Why is multimodal learning analytics a more promising
avenue for evaluating learners’ mental state?
11.6. What different levels of analysis does multimodal learning analytics enable?
In what ways do they provide a comprehensive and powerful window on students’
learning?
11.7. What types of data could be collected and analyzed automatically to evaluate
student learning using existing commercially available technology?
11.9. What evidence exists that experts conserve communication energy? What
are the implications for exploring further predictors of expertise using multimodal
learning analytics? And for producing new theory?
11.10. What specific types of data will be needed to support next-step research on
multimodal learning analytics?
11.11. What are the main privacy concerns involved in using multimodal learning
analytics, and how would you approach managing them?
11.12. Using either the Math Data Corpus or Oral Presentation Corpus, design
a multimodal learning analytics study to investigate a question that you think is
interesting. You can plan data analyses using any modality or level of analysis that
would be supportive of your aims. State your hypotheses, and also the potential
uses or impact of any findings.
References
Anoto. 2016. http://pressreleases.triplepointpr.com/2016/02/07/anoto-announces-
acquisition-of-we-inspire-destiny-wireless-and-pen-generations/. Accessed October
21, 2016. 346
K. E. Arnold and M. D. Pistilli. 2012. Course signals at Purdue: Using learning analytics to
increase student success. Proceedings of the 2nd International Conference on Learning
Analytics and Knowledge, ACM/SOLAR, 267–270. DOI: 10.1145/2330601.2330666. 337
A. Arthur, R. Lunsford, M. Wesson, and S. L. Oviatt. 2006. Prototyping novel collaborative
multimodal systems: Simulation, data collection and analysis tools for the next
decade. In Eighth International Conference on Multimodal Interfaces (ICMI’06), pp.
209–226. ACM, New York. DOI: 10.1145/1180995.1181039. 340
A. D. Baddeley and G. J. Hitch. 1974. Working memory. G. A. Bower, editor, The Psychology
of Learning and Motivation: Advances in Research and Theory, Volume 8, pp. 47–89.
Academic Press, New York. 357
R. S. Baker, S. K. D’Mello, M. M. Rodrigo, and A. C. Graesser. 2010. Better to be frustrated than
bored: The incidence, persistence, and impact of learners’ cognitive-affective states
during interactions with three different computer-based learning environments.
International Journal of Human-Computer Studies, 68(4):223–241. DOI: 10.1016/j.ijhcs
.2009.12.003. 356
R. S. Baker, A. T. Corbett, K. R. Koedinger, and A. Z. Wagner. 2004. Off-task behavior in the
cognitive tutor classroom: When students game the system. Proceedings of the SIGCHI
Conference on Human factors in Computing Systems (CHI), pp. 383–390. ACM, New
York. DOI: 10.1145/985692.985741. 337
T. Baltrusaitis, P. Robinson, and L. Morency. 2016. OpenFace: An open source facial behavior
analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of
Computer Vision (WACV), IEEE, New York. DOI: 10.1109/WACV.2016.7477553. 356
L. Batrinca, G. Stratou, A. Shapiro, L.-P. Morency, and S. Scherer. 2013. Cicero: Towards a
multimodal virtual audience platform for public speaking training. Intelligent Virtual
Agents, 116–128. DOI: 10.1007/978-3-642-40415-3_10. 363
P. Boersma. 2002. Praat, a system for doing phonetics by computer. Glot International,
5(9/10):341–345. 362
L. Chen, G. Feng, J. Joe, C. Leong, C. Kitchen, and C. Lee. 2014a. Toward automated
assessment of public speaking skills using multimodal cues. In Proceedings of the
International Conference on Multimodal Interaction (ICMI), pp. 200–203. ACM, New
York. DOI: 10.1145/2663204.2663265. 345, 347, 349, 351
L. Chen, C. Leong, G. Feng, and C. Lee. 2014b. Using multimodal cues to analyze MLA’14 Oral
Presentation Quality Corpus: Presentation delivery and slides quality. In Proceedings
of the 2014 ACM Grand Challenge Workshop on Multimodal Learning Analytics, pp.
45–52. ACM, New York. DOI: 10.1145/2666633.2666640. 347, 350, 351
L. Chen, C. Leong, G. Feng, C. Lee, and S. Somasundaran. 2015. Utilizing multimodal
cues to automatically evaluate public speaking performance. In Proceedings of the
International Conference on Affective Computing and Intelligent Interaction, pp. 394–
400. DOI: 10.1109/ACII.2015.7344601. 343, 347
P. C. H. Cheng, and H. Rojas-Anaya. 2007. Measuring mathematical formula writing
competence: An application of graphical protocol analysis. In D. S. McNamara
and J. G. Trafton editors, Proceedings of the 29th Conference of the Cognitive Science
Society, pp. 869–874. Cognitive Science Society, Austin, TX. 352, 357
P. C. H. Cheng, and H. Rojas-Anaya. 2008. A graphical chunk production model: Evaluation
using graphical protocol analysis with artificial sentences. In B. C. Love, K. McRae
and V. M. Sloutsky editors, Proceedings of 30th Annual Conference of the Cognitive
Science Society, pp. 1972–1977. Cognitive Science Society, Austin, TX. 352, 357
H. Clark and S. Brennan. 1991. Grounding in communication. In Resnick, Levine and
Teasley, editors, Perspectives on Socially Shared Cognition, pp. 127–149, APA,
Washington. 357
H. Clark, and E. F. Schaefer. 1989. Contributing to discourse. Cognitive Science, 13: 259–294.
DOI: 10.1016/0364-0213(89)90008-6. 357
P. Cohen and S. Oviatt. 2017. Multimodal speech and pen interfaces. In S. Oviatt, B. Schuller,
P. Cohen, D. Sonntag, G. Potamianos and A. Krueger, editors, The Handbook of
Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling and Common
Modality Combinations. Morgan & Claypool Publishers, San Rafael, CA. 332
A. Comblain. 1994. Working memory in Down’s Syndrome: Training the rehearsal strategy.
Down’s Syndrome Research and Practice, 2(3):123–126. DOI: 10.3104/reports.42. 359
S. D’Mello, S. Craig, A. Witherspoon, B. McDaniel, and A. Graesser 2008. Automatic
detection of learner’s affect from conversational cues. User Modeling and User Adapted
Interaction, 18(1-2):45–80. DOI: 10.1007/s11257-007-9037-6. 356
S. K. D’Mello and A. C. Graesser. 2010. Multimodal semi-automated affect detection from
conversational cues, gross body language, and facial features. User Modeling and
User- Adapted Interaction, 20(2):147–187. DOI: 10.1007/s11257-010-9074-4. 356
S. K. D’Mello and J. Kory. 2015. A Review and meta-analysis of multimodal affect detection
systems. ACM Computing Surveys, 47(3), article 43, ACM, New York. DOI: 10.1145/
2682899. 352, 354, 355
P. Ekman and W. V. Friesen. 1978. Facial Action Coding System, Consulting Psychologists
Press, Palo Alto, CA. 355
J. H. Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis,
38(4):367–378. DOI: 10.1016/S0167-9473(01)00065-2. 351
N. Hill and W. Schneider. 2006. Brain changes in the development of expertise: Neu-
roanatomical and neurophysiological evidence about skill-based adaptations. In
Ericsson, Charness, Feltovich & Hoffman, editors, The Cambridge Handbook of
Expertise and Expert Performance. Cambridge University Press, 37:653–682. DOI:
10.1017/CBO9780511816796.037. 360
K. James and L. Engelhardt. 2012. The effects of handwriting experience on functional brain
development in pre-literate children. Trends in Neuroscience and Education, 1: 32–42.
DOI: 10.1016/j.tine.2012.08.001. 359
K. James, J. Lester, S. Oviatt, K. Cheng, and D. Schwartz. 2017a. Perspectives on learning
with multimodal technologies. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag,
G. Potamianos and A. Krueger, editors, The Handbook of Multimodal-Multisensor
Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations,
Chapter 13. Morgan & Claypool Publishers, San Rafael, CA. 365
K. James, S. Vinci-Booher, and F. Munoz-Rubke. 2017b. The impact of multimodal-
multisensory learning on human performance and brain activation patterns. In S.
Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos and A. Krueger, editors, The
Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling,
and Common Modality Combinations, Chapter 2. Morgan & Claypool Publishers, San
Rafael, CA. DOI: 10.1145/3015783.3015787. 359
B. Jee, D. Gentner, K. Forbus, B. Sageman, and D. Uttal. 2009. Drawing on experience: Use
of sketching to evaluate knowledge of spatial scientific concepts. In Proceedings of
the 31st Conference of the Cognitive Science Society, Amsterdam. 350
B. Jee, D. Gentner, D. Uttal, B. Sageman, K. Forbus, C. Manduca, C. Ormand, T. Shipley,
and B. Tikoff. 2014. Drawing on experience: How domain knowledge is reflected in
sketches of scientific structures and processes. Research in Science Education, 44(6),
859–883 Springer, Dordrecht. DOI: 10.1007/s11165-014-9405-2. 350
A. Jones, A. Kendira, D. Lenne, T. Gidel, and C. Moulin. 2011. The tatin-pic project: A multi-
modal collaborative work environment for preliminary design. In Proceedings of the
Conference on Computer Supported Cooperative Work in Design (CSCWD), pp. 154–161.
DOI: 10.1109/CSCWD.2011.5960069. 346
D. Kahneman, P. Slovic, and A. Tversky, editors. 1982. Judgment under Uncertainty: Heuristics
and Biases. Cambridge University Press, New York. 364
M. Kapoor and R. Picard. 2005. Multimodal affect recognition in learning environments. In
Proceedings of the ACM Multimedia Conference, ACM, New York. DOI: 10.1145/1101149
.1101300. 352, 354
A. Katsamanis, V. Pitsikalis, S. Theodorakis, and P. Maragos. 2017. Multimodal gesture
recognition. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos and
A. Krueger, editors, The Handbook of Multimodal-Multisensor Interfaces, Volume
1: Foundations, User Modeling, and Common Modality Combinations, Chapter 11.
Morgan & Claypool Publishers, San Rafael, CA. DOI: 10.1145/3015783.3015796. 332
R. F. Kizilcec, C. Piech, and E. Schneider. 2013. Deconstructing disengagement: analyzing
learner subpopulations in massive open online courses. In Proceedings of the Third
International Conference on Learning Analytics and Knowledge, ACM, New York, pp.
170–179. DOI: 10.1145/2460296.2460330. 337
S. Kopp and K. Bergmann. 2017. Using cognitive models to understand multimodal
processes: The case for speech and gesture production. In S. Oviatt, B. Schuller,
S. Y. Mousavi, R. Low, and J. Sweller. 1995. Reducing cognitive load by mixing auditory and
visual presentation modes. Journal of Educational Psychology, 87(2):319–334. DOI:
10.1037/0022-0663.87.2.319. 358
K. Nakamura, W-J. Kuo, F. Pegado, L. Cohen, O. Tzeng, and S. Dehaene. 2012. Universal
brain systems for recognizing word shapes and handwriting gestures during reading.
In Proceedings of the National Academy of Science, 109(50):20762–20767. DOI: 10.1073/
pnas.1217749109. 359
Y. I. Nakano and R. Ishii. 2010. Estimating user’s engagement from eye-gaze behaviors
in human-agent conversations. In Proceedings of the International Intelligent User
Interface Conference, pp. 139–148. ACM, New York. DOI: 10.1145/1719970.1719990.
354
A.-T. Nguyen, W. Chen, and M. Rauterberg. 2012. Online feedback system for public
speakers. IEEE Symposium on e-Learning, e-Management and e-Services. Citeseer,
Malaysia, 46–51. DOI: 10.1109/IS3e.2012.6414963. 343, 363
L. S. Nguyen, D. Frauendorfer, M. S. Mast, and D. Gatica-Perez. 2014. Hire me: Computational
inference of hirability in employment interviews based on nonverbal behavior. IEEE
Transactions on Multimedia, 16(4):1018–1031. 363
X. Ochoa. Description of the presentation quality dataset and coded documents,
unpublished manuscript. 343
X. Ochoa, K. Chiluiza, G. Méndez, G. Luzardo, B. Guamán, and J. Castells. 2013. Expertise
estimation based on simple multimodal features. In Proceedings of the 15th ACM
International Conference on Multimodal Interaction, pp. 583–590. ACM, New York, NY.
DOI: 10.1145/2522848.2533789. 350
S. L. Oviatt. 2013a. Problem solving, domain expertise, and learning: Ground-truth
performance results for Math Data Corpus. In International Data-Driven Grand
Challenge Workshop on Multimodal Learning Analytics. ACM, New York. DOI: 10.1145/
2522848.2533791. 340, 341
S. Oviatt. 2013b. The Design of Future Educational Interfaces. Routledge Press, New York, NY.
DOI: 10.4324/9780203366202. 333, 356
S. Oviatt. 2017. Theoretical foundations of multimodal interfaces and systems. In S. Oviatt,
B. Schuller, P. Cohen, D. Sonntag, G. Potamianos and A. Krueger, editors, The
Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling
and Common Modality Combinations, Chapter 1. Morgan & Claypool Publishers, San
Rafael, CA. 356, 358, 359
S. L. Oviatt, A. Arthur, Y. Brock, and J. Cohen. 2007. Expressive pen-based interfaces for math
education. In Chinn, Erkens and Puntambekar, editors, Proceedings of the Conference
on Computer-Supported Collaborative Learning, International Society of the Learning
Sciences, 8(2):569–578. 350, 358
S. Oviatt and A. Cohen. 2013. Written and multimodal representations as predictors of
expertise and problem-solving success in mathematics. In Proceedings of the 15th ACM
International Conference on Multimodal Interaction (ICMI). ACM, New York.
Combinations, Chapter 9. Morgan & Claypool Publishers, San Rafael, CA. DOI: 10
.1145/3015783.3015794. 332
M. Raca and P. Dillenbourg. 2013. System for assessing classroom attention. In Proceedings
of the Third International Conference on Learning Analytics and Knowledge, pp. 265–269.
ACM, New York. DOI: 10.1145/2460296.2460351. 345, 354
M. Raca and P. Dillenbourg. 2014. Holistic analysis of the classroom. In Proceedings of the
ACM International Data-Driven Grand Challenge workshop on Multimodal Learning
Analytics, pp. 13–20. ACM, New York. DOI: 10.1145/2666633.2666636. 346
C. Rich, B. Ponsler, A. Holroyd, and C. Sidner. 2010. Recognizing engagement in human-
robot interaction. In Proceedings of the International Conference on Human-Robot
Interaction, pp, 375–382. ACM, New York. 354
J. Sanghvi, A. Pereira, G. Castellano, P. McOwan, I. Leite, and A. Paiva. 2011. Automatic
analysis of affective postures and body motion to detect engagement with a game
companion. In Proceedings of the International Conference on Human-Robot Interaction.
ACM, New York. DOI: 10.1145/1957656.1957781. 354
J. Schneider, D. Börner, P. Van Rosmalen, and M. Specht. 2015. Presentation trainer,
your public speaking multimodal coach. In Proceedings of the ACM on International
Conference on Multimodal Interaction, pp. 539–546. ACM, New York. DOI: 10.1145/
2818346.2830603. 363
L. M. Schreiber, G. D. Paul, and L. R. Shibley. 2012. The development and test of the
public speaking competence rubric. Communication Education. 61(3):205–233. DOI:
10.1080/03634523.2012.670709. 345
A. Serrano-Laguna and B. Fernández-Manjón. 2014. Applying learning analytics to simplify
serious games deployment in the classroom. Global Engineering Education Conference
(EDUCON), pp. 872–877. IEEE, New York. DOI: 10.1109/EDUCON.2014.6826199. 337
N. Singer. 2014. With tech taking over schools, worries rise. New York Times. September 15,
2014. 363
S. Tindall-Ford, P. Chandler, and J. Sweller. 1997. When two sensory modes are better than
one. Journal of Experimental Psychology: Applied, 3(3):257–287. DOI: 10.1037/1076-
898X.3.4.257. 358
Visage Technologies. 2016. http://visagetechnologies.com. Accessed October 22, 2016. 356
L. Vygotsky. 1962. Thought and Language. MIT Press, Cambridge, MA. (Translated by E.
Hanfmann and G. Vakar from 1934 original). 338, 359
B. Woolf, W. Burleson, I. Arroyo, T. Dragon, D. Cooper, and R. Picard. 2009. Affect-aware
tutors: recognising and responding to student affect. International Journal of Learning
Technology, 4(3-4):129–164. DOI: 10.1504/IJLT.2009.028804. 356
M. Worsley and P. Blikstein. 2010. Toward the development of learning analytics: Student
speech as an automatic and natural form of assessment. Annual Educational Research
Association Conference. 349, 350
Biographies
Editors
Philip R. Cohen (Monash University) is Director of the Laboratory for Dialogue Re-
search and Professor of Artificial Intelligence in the Faculty of Information Technol-
ogy at Monash University. His research interests include multimodal interaction,
human-computer dialogue, and multi-agent systems. He is a Fellow of the Amer-
ican Association for Artificial Intelligence, past President of the Association for
Computational Linguistics, recipient (with Hector Levesque) of the Inaugural In-
fluential Paper Award by the International Foundation for Autonomous Agents and
Multi-Agent Systems, and recipient of the 2017 Sustained Achievement Award from
the International Conference on Multimodal Interaction.
He was most recently Chief Scientist in Artificial Intelligence and Senior Vice
President for Advanced Technology at Voicebox Technologies, where he led efforts
on semantic parsing and dialogue. Cohen founded Adapx Inc. to commercialize
multimodal interaction, deploying multimodal systems to civilian and government
organizations. Prior to Adapx, he was Professor and Co-Director of the Center for
Human-Computer Communication in the Computer Science Department at Ore-
gon Health and Science University and Director of Natural Language in the Artificial
Intelligence Center at SRI International. Cohen has published more than 150 arti-
cles and has 5 patents. He co-authored the book The Paradigm Shift to Multimodality
in Contemporary Computer Interfaces (2015, Morgan & Claypool Publishers) with Dr.
Sharon Oviatt. (Contact: philip.cohen@monash.edu)
IUI, and Pervasive Computing. He is also on the Steering Committee of the Inter-
national Conference on Intelligent User Interfaces (IUI) and an Associate Editor of
the journals User Modeling and User-Adapted Interaction and ACM Interactive, Mobile,
Wearable and Ubiquitous Technologies. (Contact: krueger@dfki.de)
Daniel Sonntag (German Research Center for Artificial Intelligence, DFKI) is a Prin-
cipal Researcher and Research Fellow. His research interests include multimodal
and mobile AI-based interfaces, common-sense modeling, and explainable ma-
chine learning methods for cognitive computing and improved usability. He has
published over 130 scientific articles and was the recipient of the German High
Tech Champion Award in 2011 and the AAAI Recognition and IAAI Deployed Appli-
cation Award in 2013. He is the Editor-in-Chief of the German Journal on Artificial
Intelligence (KI) and editor-in-chief of Springer’s Cognitive Technologies book se-
ries. Currently, he leads both national and European projects from the Federal
Ministry of Education and Research, the Federal Ministry for Economic Affairs and
Energy, and Horizon 2020. (Contact: daniel.sonntag@dfki.de)
signal processing. Elisabeth André has served as a General and Program Co-Chair
of major international conferences, including ACM International Conference on
Intelligent User Interfaces (IUI), ACM International Conference on Multimodal
Interfaces (ICMI), and International Conference on Autonomous Agents and Mul-
tiagent Systems (AAMAS). In 2010, Elisabeth André was elected a member of the
prestigious Academy of Europe, the German Academy of Sciences Leopoldina, and
AcademiaNet. To honor her achievements in bringing artificial intelligence tech-
niques to HCI, she was awarded a EurAI (European Coordinating Committee for
Artificial Intelligence) fellowship in 2013. Most recently, she was elected to the CHI
Academy, an honorary group of leaders in the field of human-computer interaction.
Syed Z. Arshad (University of New South Wales) has a Ph.D. in computer science
from the University of New South Wales, Australia. He is a member of IEEE, ACM,
SIGCHI, and SIGKDD. His research interests include cognitive computing, intelli-
gent user interfaces, machine learning, and data visualization techniques.
Samy Bengio (Google) has been a research scientist at Google since 2007. Before
that, he had been a senior researcher in statistical machine learning at IDIAP Re-
search Institute, where he supervised Ph.D. students and postdoctoral fellows. His
research interests span many areas of machine learning such as deep architec-
tures, representation learning, sequence processing, speech recognition, image
understanding, support vector machines, mixture models, large-scale problems,
multi-modal (face and voice) person authentication, brain–computer interfaces,
and document retrieval. He was the program chair of NIPS 2017 and ICLR 2015
and 2016; was the general chair of BayLearn 2012–2015, the Workshops on Machine
Learning for Multimodal Interactions (MLMI) 2004–2006, and the IEEE Workshop
on Neural Networks for Signal Processing (NNSP) in 2002; and served on the pro-
gram committees of several international conferences such as NIPS, ICML, ICLR,
ECML and IJCAI.
Huili Chen (MIT) is a Ph.D. student at the MIT Media Lab. She obtained a Bachelor
of Arts degree in Psychology and earned a Bachelor of Science degree in Computer
Science from the University of Notre Dame in 2016. She is very interested in human-
machine interaction and interactive artificial intelligence.
Matthieu Courgeon (École Nationale d’Ingénieurs de Brest) defended his Ph.D. the-
sis in 2011 on affective computing and interactive autonomous virtual characters.
He is the creator of the Multimodal Affective and Reactive Characters toolkit, used
by several research teams around the world. His research area ranges from interactive expressive artificial humans and robots to collaborative immersive virtual reality. He earned a Best Paper Award at UbiComp 2013 for his work on expressive virtual humans with the Affective Computing team at the MIT Media Lab.
Li Deng (Citadel) has been the Chief Artificial Intelligence Officer of Citadel since
May 2017. Prior to Citadel, he was the Chief Scientist of AI, founder of the Deep Learning Technology Center, and Partner Research Manager at Microsoft (2000–
2017). Prior to Microsoft, he was a tenured full professor at the University of Water-
loo and held teaching and research positions at Massachusetts Institute of Technol-
ogy (1992–1993), Advanced Telecommunications Research Institute (1997–1998),
and HK University of Science and Technology (1995). He has been a Fellow of the
IEEE since 2004, a Fellow of the Acoustical Society of America since 1993, and a
Fellow of the ISCA since 2011. He has also been an Affiliate Professor at Univer-
sity of Washington, Seattle, since 2000. He was elected to the Board of Governors
of the IEEE Signal Processing Society and served as editor-in-chief of the IEEE Sig-
nal Processing Magazine and IEEE/ACM Transactions on Audio, Speech, and Language
Processing from 2008–2014, for which he received the IEEE SPS Meritorious Service
Award. In recognition of his pioneering work on disrupting the speech recognition
industry using large-scale deep learning, he received the 2015 IEEE SPS Technical
Achievement Award for “Outstanding Contributions to Automatic Speech Recogni-
tion and Deep Learning.” He also received numerous best paper and patent awards
for contributions to artificial intelligence, machine learning, information retrieval,
multimedia signal processing, speech processing and recognition, and human lan-
guage technology. He is an author or co-author of six technical books on deep
learning, speech processing, pattern recognition and machine learning, and natu-
ral language processing.
Julien Epps (UNSW Sydney and Data61, CSIRO) is an Associate Professor of Signal
Processing with UNSW Sydney and a Contributed Principal Researcher at Data61,
CSIRO. He has authored or co-authored more than 200 publications and 4 patents,
mainly on topics related to emotion and mental state recognition from speech and behavioral signals. He serves as an Associate Editor for IEEE Transactions on
Affective Computing and Frontiers in ICT (the Human-Media Interaction and Psy-
chology sections), and he recently served as a member of the Advisory Board of
the ACM International Conference on Multimodal Interaction. He has delivered
invited tutorials on topics related to this book for major conferences, including
INTERSPEECH 2014 and 2015 and APSIPA 2010, and invited keynotes for the 4th
International Workshop on Audio-Visual Emotion Challenge (part of ACM Multi-
media 2014) and the 4th International Workshop on Context-Based Affect Recog-
nition (part of AAAC/IEEE Affective Computing and Intelligent Interaction 2017).
Marc Ernst (Ulm University) heads the Department of Applied Cognitive Psychology
at Ulm University. He studied physics in Heidelberg and Frankfurt/Main. In 2000,
he received his Ph.D. from the Eberhard-Karls-University Tübingen for investiga-
tions into the human visuomotor behavior that he conducted at the Max Planck In-
stitute for Biological Cybernetics. For this work, he was awarded the Attempto-Prize
from the University of Tübingen and the Otto-Hahn-Medaille from the Max Planck
Society. He was a research associate at the University of California, Berkeley working
with Prof. Martin Banks on psychophysical experiments and computational mod-
els investigating the integration of visual-haptic information before returning to
the MPI in Tübingen and becoming principal investigator of the Sensorimotor Lab
in the Department of Professor Heinrich Bülthoff. In 2007, he became leader of
the Max Planck Research Group on Human Multisensory Perception and Action,
before joining the University of Bielefeld and the Cognitive Interaction Technology
Center of Excellence (CITEC) in 2011.
Anna Esposito (Wright State University) received her Laurea degree summa cum
laude in Information Technology and Computer Science from the Università di
Salerno in 1989 with the thesis “The Behavior and Learning of a Deterministic Neu-
ral Net” (published in Complex Systems, 6(6), 507–517, 1992). She received her Ph.D.
in Applied Mathematics and Computer Science from the Università di Napoli Fed-
erico II in 1995. Her Ph.D. thesis “Vowel Height and Consonantal Voicing Effects:
Data from Italian” (published in Phonetica, 59(4), 197–231, 2002) was developed at
the MIT Research Laboratory of Electronics (RLE), under the supervision of profes-
sor Kenneth N. Stevens. She completed a postdoctoral program at the International
Institute for Advanced Scientific Studies (IIASS), and was Assistant Professor in the
Department of Physics at Università di Salerno, where she taught classes on cyber-
netics, neural networks, and speech processing (1996–2000). She was a Research
Professor in the Department of Computer Science and Engineering at Wright State
University (WSU) (2000–2002). She is currently a Research Affiliate at WSU and Asso-
ciate Professor in Computer Science in the Department of Psychology at Università
della Campania Luigi Vanvitelli. She has authored more than 170 peer-reviewed
publications in international journals, books, and conference proceedings and
edited or co-edited over 25 international books with Italian, EU, and overseas col-
leagues.
Amr El-Desoky Mousa (Apple Inc.) received his Ph.D. in 2014 from the Chair of
Human Language Technology and Pattern Recognition at RWTH Aachen Univer-
sity, Germany, where his research was focused on automatic speech recognition. In
2014–2015, he worked as a postdoctoral researcher at the Machine Intelligence and
Signal Processing Group at the Technical University of Munich, Germany. In 2016–
2017, he worked as a Research and Teaching Associate at the Chair of Complex
and Intelligent Systems, University of Passau, Germany. In June 2017, he moved
to Apple Inc. to work as a senior Machine Learning Engineer taking part in the re-
search and development related to Siri, one of the most well-known speech-based
Intelligent Assistants in the world. His main research interests include deep learning.
Xavier Ochoa (Escuela Superior Politécnica del Litoral) is currently a Full Profes-
sor in the Faculty of Electrical and Computer Engineering at Escuela Superior
Politécnica del Litoral (ESPOL) in Guayaquil, Ecuador. He directs the Information
Technology Center (CTI) and the research group on Teaching and Learning Tech-
nologies (TEA) at ESPOL. He is currently the Vice President of the Society for Learning Analytics Research (SoLAR), a member of the coordination team of the
Latin American Community on Learning Technologies (LACLO), and president of
the Latin American Open Textbooks Initiative (LATIn). He is editor of the Journal of
Learning Analytics and member of the Editorial Board of the IEEE Transactions on
Learning Technologies. He coordinates several regional and international projects
in the field of learning technologies. His main research interests revolve around
learning analytics, multimodal learning analytics, and data science.
in Artificial Intelligence in 1997. Pantić earned a Ph.D. in 2001 at the Delft Uni-
versity of Technology, with a thesis on facial expression analysis by computational
intelligence techniques.
Olivier Pietquin (Google) completed his Ph.D. in the Faculty of Engineering at the Université de Mons, Belgium, and the University of Sheffield, UK. He then did a
postdoctoral program with Philips in Germany before joining the Ecole Supérieure
d’Electricité, France, in 2005, as an Associate Professor and, later, Professor. He
headed the computer science program of the UMI GeorgiaTech-CNRS (joint lab
in Metz) and also worked with the INSERM IADI team. In 2013, he moved to
the University of Lille, France, as a Full Professor and joined the Inria SequeL
(Sequential Learning) team. Since 2016, Olivier has been on leave with Google, first
with DeepMind in London, and then with Brain in Paris. His current research is
about direct and inverse reinforcement learning, learning from demonstrations,
and applications to human-machine interaction.
techniques for large-scale data fusion with applications in robotics, mining, envi-
ronmental monitoring, and healthcare.
Ognjen (Oggi) Rudovic (MIT) is a Marie Curie Postdoctoral Fellow in the Affective
Computing Group at MIT Media Lab, working on personalized machine learning
for affective robots and analysis of human data. His background is in automatic
control theory, computer vision, artificial intelligence, and machine learning. In
2014, he received a Ph.D. from Imperial College London, UK, where he worked on
machine learning and computer vision models for automated analysis of human
facial behavior. His current research focus is on developing machine-learning-
driven assistive technologies for individuals on the autism spectrum.
Stefan Scherer (Embodied, Inc. and USC) is the CTO of Embodied, Inc. and a Re-
search Assistant Professor in the Department of Computer Science at the University
of Southern California (USC) and the USC Institute for Creative Technologies (ICT),
where he leads research projects funded by the National Science Foundation and the Army Research Laboratory. Stefan Scherer directs the lab for Behavior Analytics
and Machine Learning and is the Associate Director of Neural Information Process-
ing at the ICT. He received the degree of Dr. rer. nat. from the faculty of Engineering
and Computer Science at Ulm University in Germany with the grade summa cum
laude (i.e., with distinction) in 2011. His research aims to automatically identify, characterize, model, and synthesize individuals’ multimodal nonverbal behavior within both human-machine and machine-mediated human-human interaction. His research was recently featured in the Economist, the Atlantic, and the
Guardian, and was awarded a number of best paper awards in renowned interna-
tional conferences. His work is focused on machine learning, multimodal signal
processing, and affective computing. Stefan Scherer serves as an Associate Editor
of the Journal IEEE Transactions on Affective Computing and is the Vice Chair of the
IAPR Technical Committee 9 on Pattern Recognition in Human-Machine Interac-
tion.
Accumulative GSR refers to the summation of GSR values over the task time. If the
GSR is considered as a continuous signal over time, the accumulative GSR is the
integration of GSR values over the task time.
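For illustration only, a minimal Python sketch of this summation, using hypothetical sample values and NumPy's trapezoidal rule:

    import numpy as np

    # Hypothetical GSR samples (microsiemens) recorded at 4 Hz during a task.
    gsr = np.array([2.1, 2.3, 2.2, 2.6, 2.8, 2.7, 2.5])
    sampling_rate_hz = 4.0
    timestamps = np.arange(len(gsr)) / sampling_rate_hz

    # Accumulative GSR as the numerical integral of the signal over the task time.
    accumulative_gsr = np.trapz(gsr, timestamps)
    print(accumulative_gsr)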
Action units and action descriptors are the smallest visually discriminable facial
movements. Action units are movements for which the anatomic basis is known
[Cohn, Ambadar, & Ekman, 2007]. They are represented as either binary events
(presence vs. absence) or with respect to five levels of ordinal intensity. Action
units individually or in combinations can represent nearly all possible facial
expressions.
Active Appearance Model (AAM). An AAM is a statistical model of shape and
grey-level appearance that can generalize to almost any face [Edwards, Taylor,
& Cootes, 1998; Matthews & Baker, 2004]. An AAM seeks to find the model
parameters that can generate a synthetic image as close as possible to a
target image [Cootes & Taylor, 2004]. AAMs are learned from hand-labelled
training data.
Affect. Broad term encompassing constructs such as emotions, moods, and
feelings. It is not the same as personality, motivation, or other related terms.
Affect and social signals can be described as temporal patterns of a multiplicity of
non-verbal behavioral cues which last for a short time [Vinciarelli et al. 2009] and
are expressed as changes in neuromuscular and physiological activity [Vinciarelli
et al. 2008a]. Sometimes, we consciously draw on affect and social signals to alter
the interpretation of a situation, e.g. by saying something in a sarcastic voice
to signal that we actually mean the opposite. At other times, we use them
without being aware of it, e.g., by showing sympathy towards our counterpart
by mimicking his or her verbal and nonverbal expressions. See Chapter 7 of
this volume for an overview of affective and social signals for which automated
recognition approaches have been proposed.
Affect annotation. The process of assigning affective labels (e.g., bored, confused,
aroused) or values (e.g., arousal = 5) to data (e.g., video, audio, text).
Affective computing. Computing techniques and applications involving emotion
or affect.
Affective computing and social signal processing aim at perceiving and interpreting nonverbal behavior with a machine by detecting affect and social signals just as humans do among themselves. This will lead to a new generation of computers that are perceived as more natural, efficacious, and trustworthy [Vinciarelli et al. 2008b, Vinciarelli et al. 2009], making room for more human-like and intuitive interaction [Pantic et al. 2007].
Affective experience-expression link. The relationship between experiencing an
affective state (e.g., feeling confused) and expressing it (e.g., displaying a
furrowed brow).
Affective ground truth. Objective reality involving the “true” affective state. It is a misleading term for psychological constructs like affect.
Alignment A third challenge is to identify the direct relations between (sub)-
elements from two or more different modalities. For example, we may want
to align the steps in a recipe to a video showing the dish being made. To tackle
this challenge we need to measure similarity between different modalities and
deal with possible long-range dependencies and ambiguities.
Classifier: in pattern recognition and machine learning, it is a function that maps
an object of interest (represented through a set of physical measurements called
features) into one of the classes or categories such an object can belong to (the
number of classes or categories is finite).
Bag-of-words (BoW) is a data-driven algorithm for summarizing large volumes of
features. It can be thought of as a histogram whose bins are determined by
partitions (or clusters) of the feature space.
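As a rough Python sketch (the use of k-means to form the partitions, and all values, are assumptions for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical low-level features: one 2-D descriptor per video frame.
    frame_features = rng.normal(size=(500, 2))

    # Partition the feature space into "visual words" (clusters).
    codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(frame_features)

    # Bag-of-words summary of one recording: a histogram of cluster assignments.
    words = codebook.predict(frame_features)
    bow_histogram = np.bincount(words, minlength=8) / len(words)
    print(bow_histogram)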
Canonical Correlation Analysis (CCA) is a tool to infer information based on cross-
covariance matrices. It can be used to identify correlation across heterogeneous
modalities or sensor signals. Let each modality be represented by a feature
vector and let us assume there is correlation across these, such as in
audiovisual speech recognition. Then, CCA will identify linear combinations
of the individual features with maximum correlation amongst each other.
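A minimal Python sketch, assuming scikit-learn and synthetic audio/visual features that share one correlated component:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    # Hypothetical audio and visual feature matrices for the same 200 time frames.
    shared = rng.normal(size=(200, 1))
    audio = np.hstack([shared + 0.1 * rng.normal(size=(200, 1)),
                       rng.normal(size=(200, 3))])
    video = np.hstack([shared + 0.1 * rng.normal(size=(200, 1)),
                       rng.normal(size=(200, 5))])

    # CCA finds linear combinations of each modality with maximum correlation.
    cca = CCA(n_components=1)
    audio_proj, video_proj = cca.fit_transform(audio, video)
    print(np.corrcoef(audio_proj[:, 0], video_proj[:, 0])[0, 1])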
Classifier Combination: in pattern recognition and machine learning, it is a body of
methodologies aimed at jointly using multiple classifiers to achieve a collective decision.
Classifier Diversity: in a set of classifiers that are being combined, the
diversity is the tendency of different classifiers to have different performance in
different regions of the input space.
Congruent expressions of affects. A multimodal combination is said to involve congruent expressions of affects if the combined modalities convey the same affect in terms of the category or the dimension (e.g., the facial expressions express anger and the hand gestures also express anger).
Cooperative learning in machine learning is a combination of active learning and
semi-supervised learning. In the semi-supervised learning part, the machine
learning algorithm labels unlabeled data based on its own previously learned
model. In the active learning part, it identifies which unlabeled data is most
important to be labeled by humans. Usually, cooperative learning tries to
minimize the amount of data to be labeled by humans while maximizing the gain
in accuracy of a learning algorithm. This can be based on confidence measures
such that the machine labels unlabeled data itself as long as it is sufficiently
confident in its decisions. It asks humans for help only where its confidence is
insufficient, but the data seem to be highly informative.
Construct. A conceptual variable that cannot be directly observed (e.g., intelligence,
personality).
Continuous or discrete (i.e., categorical) representations refer to the modeling of a user state or trait. As an example, the age of a user can be modeled as a continuum, such as the age in years. As opposed to this, a discretized representation would use broader age classes such as “young,” “adult,” and “elderly.” In addition, time can be discretized or continuous (in fact, it is always discretized in some respect, at least by the sample rate of the digitized sampling of the sensor signals). However, one would speak of continuous measurement if processing delivers a continuous output stream on a (short) frame-by-frame basis rather than asynchronously processing (larger) segments or chunks of the signal, such as per spoken word or per body gesture.
A Convolutional Neural Network (CNN) is a neural network that contains one or
more convolutional layers. A convolutional layer is a layer that processes an
image (or any other data comprised of points with a notion of distance between
these points, such as an audio signal) by convolving it with a number of kernels.
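A minimal Python sketch of a single convolution step, assuming SciPy and a hand-chosen (rather than learned) kernel; a convolutional layer applies many such kernels and learns their weights from data:

    import numpy as np
    from scipy.signal import convolve2d

    # Hypothetical grayscale image and a single 3x3 edge-detection kernel.
    image = np.random.rand(28, 28)
    kernel = np.array([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]])

    # One convolutional "filter": slide the kernel over the image to get a feature map.
    feature_map = convolve2d(image, kernel, mode='valid')
    print(feature_map.shape)  # (26, 26)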
A correlation is a single number that describes the degree of relationship between
two variables (signals). It most often refers to how close two variables are to
having a linear relationship with each other.
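For example, in Python (hypothetical values):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

    # Pearson correlation coefficient between the two signals (close to 1 here).
    r = np.corrcoef(x, y)[0, 1]
    print(r)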
A dense layer is the basic type of layer in a neural network. The layer takes a
one-dimensional vector as input and transforms it to another one-dimensional
vector by multiplying it by a weight matrix and adding a bias vector.
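A minimal NumPy sketch with assumed dimensions (in practice a nonlinearity usually follows the affine transformation):

    import numpy as np

    # Hypothetical dense layer: 4 inputs, 3 outputs.
    x = np.array([0.5, -1.0, 2.0, 0.1])   # input vector
    W = np.random.randn(3, 4)             # weight matrix
    b = np.zeros(3)                       # bias vector

    y = W @ x + b                         # affine transformation
    print(y.shape)  # (3,)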
Depression refers broadly to the persistence over an extended period of time of sev-
eral of the following symptoms: lowered mood, interest, or pleasure; psychomo-
for each modality, and the outputs of all unimodal models are then combined
to a final prediction based on all modalities.
Encoder-decoder architectures in deep learning start with an encoder neural
network which—based on its input—usually outputs a feature map or vector.
The second part—the decoder—is a further network that—based on the
feature vector from the encoder—provides the closest match either to the
input or an intended output. In most cases, the decoder employs the same network structure but in the opposite orientation. Usually, the training is carried out on unlabeled data, i.e., in an unsupervised manner. The target for learning is
to minimize the reconstruction error, i.e., the delta between the input to the
encoder and the output of the decoder. A typical application is to use encoder-
decoder architectures for sequence-to-sequence mapping, such as in machine
translation where the encoder is trained on sequences (phrases) in one language
and the decoder is trained to map its representation to a sequence (phrase) in
another language.
An ensemble is a set of models and we want the models in the set to differ in their
predictions so that they make different errors. If we consider the space defined
by the three dimensions that define a model as we defined above, the idea is
to sample smartly from that space of learners. We want the individual models
to be as accurate as possible individually, and at the same time, we want them
to complement each other. How these two criteria affect the accuracy of the
ensemble depends on the way we do the combination.
From another perspective, we can view each particular model as one noisy estimate of the real (unknown) underlying problem. For example, in a classification task, each base classifier, depending on its model, hyper-parameters, and input features, learns one noisy estimator of the real discriminant. In
such a perspective, the ensemble approach corresponds to constructing a fi-
nal estimator from these noisy estimators—for example, voting corresponds to
averaging them.
When the different models use inputs in different modalities, there are three
ways in which the predictions of models can be combined, namely, early, late,
and intermediate combination/integration/fusion.
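A minimal Python sketch of combining classifiers by majority voting, with hypothetical predictions from three modality-specific models:

    import numpy as np

    # Hypothetical predicted class labels from three base classifiers on five samples.
    predictions = np.array([
        [0, 1, 1, 0, 2],   # classifier trained on audio features
        [0, 1, 0, 0, 2],   # classifier trained on video features
        [1, 1, 1, 0, 1],   # classifier trained on text features
    ])

    # Majority vote: the ensemble outputs the most frequent label per sample.
    ensemble_labels = np.array([np.bincount(col).argmax() for col in predictions.T])
    print(ensemble_labels)  # [0 1 1 0 2]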
Extraneous load refers to the level of working memory load that a person experi-
ences due to the properties of materials or computer interfaces they are using
[Oviatt 2017].
FACS refers to the Facial Action Coding System [Ekman & Friesen, 1978; Ekman,
Friesen, & Hager, 2002]. FACS describes facial activity in terms of anatomically
based action units (AUs). Depending on the version of FACS, there are 33 to 44
AUs and a large number of additional “action descriptors” and other movements.
Feature-level multimodal fusion. The process of integrating features from different
modalities using diverse methodologies such as concatenating the features
together (early fusion) or combining the models obtained from each modality
at decision level (late fusion).
Fusion A fourth challenge is to join information from two or more modalities to
perform a prediction. For example, for audio-visual speech recognition, the
visual description of the lip motion is fused with the speech signal to predict
spoken words. The information coming from different modalities may have
varying predictive power and noise topology, with possibly missing data in at
least one of the modalities.
Galvanic Skin Response (GSR) is a measure of the conductivity of human skin, and can provide an indication of changes in the human sympathetic nervous system during the cognitive task time.
Gaussian mixture models (GMMs) are probability density functions comprising
a weighted sum of individual Gaussian components, each with their own
mean and covariance. They are commonly employed to compactly characterize
arbitrary distributions (e.g. of features) that are not well-fitted by a single
Gaussian.
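A minimal Python sketch, assuming scikit-learn and synthetic one-dimensional features:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Hypothetical 1-D features drawn from two different regimes.
    features = np.concatenate([rng.normal(-2.0, 0.5, 300),
                               rng.normal(3.0, 1.0, 700)]).reshape(-1, 1)

    # Fit a 2-component GMM; each component has its own mean and covariance.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(features)
    print(gmm.means_.ravel(), gmm.weights_)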
Germane load refers to the level of a person’s effort and activity compatible with
mastering new domain content during learning. It pertains to the cognitive
resources dedicated to constructing new schema in long-term memory.
Gross errors refer to non-Gaussian noise of large magnitude. Gross errors are often abundant in audio-visual data due to incorrect localization and tracking, the presence of partial occlusions, environmental noise, etc.
Hand-crafted features refer to features developed to extract a specific type of
information, usually as part of a hypothesis-driven research study. By contrast,
data-driven features are those extracted automatically from raw signal data by
algorithms (e.g., neural networks), whose physical interpretation often cannot
easily be described.
Incongruent expressions of affects. A multimodal combination is said to involve
incongruent expressions of affects if the combined modalities are conveying
different affects in terms of the category or the dimension (e.g., the facial
expressions express joy, while hand gestures express anger). Such combinations
are also called blends of emotions. They might occur even in a single modality
such as the facial expressions (e.g., the upper part of the face may express a
certain emotion, while the bottom part of the face conveys a different emotion).
Researchers exploring the perception of human (or computer-generated)
multimodal expressions of affects usually consider the following attributes
related to perception and affects that are impacted by or do impact multimodal
perception:
Recognition rate
Reaction time
Affect categories
Affect dimensions
Multimodal integration patterns
Inter-individual differences and personality
Timing: synchrony vs. sequential presentation of the signals in different
modalities
Modality dominance
Context: environment and task related information (e.g., food or
violent scenes, and associated applications for specific users with food
disorders or PTSD)
Task difficulty
Intrinsic load is the inherent difficulty level and related working memory load
associated with the material being processed during a user’s primary task.
Learning analytics involves the collection, analysis, and reporting of data about
learners, including their activities and contexts of learning, in order to under-
stand and provide better support for effective learning. First-generation learning
analytics focused exclusively on computer-mediated learning using keyboard-
and-mouse computer interfaces. This limited analyses to click-stream activity
patterns and linguistic content. Early learning analytic techniques mainly have
been applied to managing educational systems (e.g., attendance tracking, track-
ing work completion), advising students based on their individual profiles (e.g.,
courses taken, grades received, time spent in learning activities), and improving
educational technologies. Learning analytics data typically are summarized on
dashboards for educators, such as teachers or administrators.
Leave-one-out cross validation. Cross validation is the process of dividing a dataset
into batches where one batch is reserved for testing and all the other batches
are used for training a system. Leave-one-out means each batch is formed of a
single instance.
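A minimal Python sketch, assuming scikit-learn and a toy nearest-neighbor classifier:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[0.0], [0.1], [0.9], [1.0], [1.1], [0.2]])
    y = np.array([0, 0, 1, 1, 1, 0])

    correct = 0
    # Each instance is held out once for testing; the rest are used for training.
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

    print(correct / len(y))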
The user (long-term) traits include biological trait primitives (e.g., age, gender,
height, weight), cultural trait primitives in the sense of group/ethnicity mem-
bership (e.g., culture, race, social class, or linguistic concepts such as dialect
or first language), personality traits (e.g., the “OCEAN big five” dimensions
openness, conscientiousness, extraversion, agreeableness, and neuroticism or
likability), and traits that constitute subject idiosyncrasy, i.e., ID.
A longer-term state can subsume (partly self-induced) non-permanent, yet longer-term states (e.g., sleepiness, intoxication, mood such as depression (see also Chapter 12), or the health state such as having the flu), structural (behavioral, interactional, social) signals (e.g., role in dyads and groups, friendship and identity, positive/negative attitude, intimacy, interest, politeness), (non-verbal) social signals (see Chapters 7 and 8), and discrepant signals (e.g., deception (see also Chapter 13), irony, sarcasm, sincerity).
Longitudinal data refers to multiple recordings of the same type from the same
individual at different points in time, between which it is likely that the
individual’s state (e.g., depression score) has changed.
L() is the loss function that measures how far the prediction g(x^t|θ) is from the desired value r^t. The complexity of this optimization problem depends on the particular g() and L(). Different learning algorithms in the machine learning literature differ either in the model they use, the loss function they employ, or how the optimization problem is solved.
The step above optimizes the parameters given a model. Each model has an inductive bias; that is, it comes with a set of assumptions about the data, and the model is accurate if its assumptions match the characteristics of the data. This
implies that we also need a process of model selection where we optimize the
model structure. This model structure depends on dimensions such as (i) the
learning algorithm, (ii) the hyper-parameters of the model (that define model
complexity), and (iii) the input features and representation, or modality. Each
model corresponds to one particular combination of these dimensions.
In machine learning, the learner is a model that takes an input x and learns to produce the correct output y. In pattern recognition, we typically have a classification task where y is a class code; for example, in face recognition, x is the face image and y is the index identifying the person whose face it is.
In building a learner, we start from a data set X = {x^t, r^t}, t = 1, . . . , N, that contains training pairs of instances x^t and the desired output values r^t (e.g., class labels) for them. We assume that there is a dependency between x and r but that it is unknown; if it were known, there would be no need to do any learning and we would just write down the code for the mapping.
Typically, x^t is not enough to uniquely identify r^t; we call x^t the observables, and there may also be unobservables that affect r^t, whose effect we model as noise. This implies that each training pair gives us only a limited amount of information. Another related problem is that in most applications x has a very high dimensionality, and our training set samples this high-dimensional space very sparsely.
Our prediction is given by our predictor g(x^t|θ), where g() is the model and θ is its set of parameters. Learning corresponds to finding the best θ* that makes our predictions as close as possible to the desired values on the training set:

    θ* = arg min_θ Σ_{t=1}^{N} L(r^t, g(x^t|θ))
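As an illustrative sketch (not from the glossary), this minimization can be instantiated with a linear predictor and a squared loss, for which NumPy gives a closed-form least-squares solution; the data below are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    # Training pairs (x^t, r^t): noisy observations of an unknown linear mapping.
    x = rng.uniform(0, 1, 50)
    r = 2.0 * x + 0.5 + 0.1 * rng.normal(size=50)

    # Model g(x | theta) = theta[0] * x + theta[1]; squared loss L(r, g) = (r - g)^2.
    # For this choice, the arg min over theta has a closed-form least-squares solution.
    A = np.column_stack([x, np.ones_like(x)])
    theta_star, *_ = np.linalg.lstsq(A, r, rcond=None)
    print(theta_star)  # approximately [2.0, 0.5]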
Mel frequency cepstral coefficients (MFCCs) are features that compactly represent
the short-term speech spectrum, including formant information, and are widely
used to characterize both spoken content (for automatic speech recognition)
and speaker-specific qualities (for automatic speaker verification). Briefly, a
mel-scale frequency-domain filterbank is applied to the spectrum to obtain mel
filterbank energies, the log of which is transformed to a lower-dimensional
representation using the discrete cosine transform.
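A minimal Python sketch, assuming the librosa library and a hypothetical audio file path:

    import librosa

    # Load an audio file (path is hypothetical) and extract 13 MFCCs per frame.
    signal, sample_rate = librosa.load("speech_sample.wav", sr=16000)
    mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)
    print(mfccs.shape)  # (13, number_of_frames)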
A Recurrent Neural Network (RNN) is a neural network that contains one or more recurrent layers. A recurrent layer is a layer that takes a sequence x indexed by t and processes it element by element, while maintaining a hidden state for each unit in the layer: h_t = RNN(h_{t−1}, x_t), where h_t is the hidden state at step t and RNN is a transition function that computes the next hidden state and depends on the type of hidden layer.
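A minimal NumPy sketch of such a recurrence, with randomly initialized (rather than learned) weights and hypothetical dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    input_dim, hidden_dim, seq_len = 3, 5, 10

    # Hypothetical recurrent layer weights (normally learned from data).
    W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    b_h = np.zeros(hidden_dim)

    x = rng.normal(size=(seq_len, input_dim))   # input sequence
    h = np.zeros(hidden_dim)                    # initial hidden state

    # Simple (Elman-style) transition: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    for x_t in x:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

    print(h.shape)  # (5,)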
Redundancy: tendency of multiple signals or communication channels to carry the
same or widely overlapping information.
Representation A first fundamental challenge is learning how to represent and
summarize multimodal data in a way that exploits the complementarity and
redundancy of multiple modalities. The heterogeneity of multimodal data
makes it challenging to construct such representations. For example, language is
often symbolic while audio and visual modalities will be represented as signals.
A sequence-to-sequence model is a neural network that processes a sequence as
its input and produces another sequence as its output. Examples of such models
include neural machine translation and end-to-end speech recognition models.
Shared hidden layer is a layer within a neural network that is shared within the topology. For example, different modalities, different output classes, or even different databases could be trained in largely separate parts of the network. In the shared hidden layer, however, they share neurons through common connections. This can be an important approach for modeling diverse information types largely independently while providing mutual information exchange at some point in the topology of a neural network.
A short-term state includes the mode (e.g., speaking style and voice quality), emotions, and affects (e.g., confidence, stress, frustration, pain, uncertainty; see also Chapters 6 and 8).
Skilled performance involves the acquisition of skilled action patterns, for example
when learning to become an expert athlete or musician. Skilled performance also
can involve the acquisition of communication skills, as in developing expertise
in writing compositions or giving oral presentations. The role of deliberate
practice has been emphasized in acquiring skilled performance.
Social Signals: constellations of nonverbal behavioural cues aimed at conveying
socially relevant information such as attitudes, personality, intentions, etc.
Social Signal Processing: computing domain aimed at modeling, analysis and
synthesis of social signals in human-human and human-machine interactions.
A softmax layer is a dense layer followed by the softmax nonlinearity. The softmax nonlinearity takes a one-dimensional vector v of real numbers and normalizes it into a probability distribution by applying softmax(v)_j = e^{v_j} / Σ_i e^{v_i}, where the sum is over all coordinates of the vector v.
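A minimal NumPy sketch of this normalization:

    import numpy as np

    def softmax(v):
        # Subtract the maximum for numerical stability; the result is unchanged.
        e = np.exp(v - np.max(v))
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # entries sum to 1.0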
Spatio-temporal features are those which have both a spatial and a time dimension.
For example, the intensities of pixels in a video vary both in terms of their position
within a given frame (spatial dimension) and in terms of the frame number for
a given pixel coordinate (temporal dimension).
Support vector machine (SVM) is a widely used discriminative classification method that defines a separating hyperplane between two classes of features; the hyperplane is defined in terms of particular feature instances that are close to the class boundaries, called support vectors.
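A minimal Python sketch, assuming scikit-learn and toy two-dimensional features:

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical 2-D features for two classes.
    X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # Linear SVM; the separating hyperplane is defined by the support vectors.
    clf = SVC(kernel="linear").fit(X, y)
    print(clf.support_vectors_)
    print(clf.predict([[0.15, 0.15], [1.05, 1.0]]))  # [0 1]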
Support vector regression (SVR) is a commonly used method for multivariate
regression, which concentrates on fitting a model by considering only training
features that are not very close to the model prediction.
Temporal dynamics of facial expression: rather than being like a single snapshot,
facial appearance changes as a function of time. Two main factors affecting
temporal dynamics of facial expression are the speed with which expressions unfold and the changes of their intensity over time.
Transfer learning helps to reuse knowledge gained in one task for another task in machine learning. It can be executed on different levels, such as the feature or model level. For example, a neural network can first be trained on a task related to the task of interest. Then, the actual task of interest is trained “on top” of this pre-training of the network. Likewise, rather than starting to train the target task of interest from a random initialization of a network, related data can be used to provide a better starting point.
Transfer of learning refers to students’ ability to recognize parallels and make
use of learned information in new contexts, for example to apply learned
information outside of the classroom, in contexts not resembling the original
framing of the problem, with different people present, and so forth. This
requires generalizing the learned information beyond its original concrete
context.
Translation A second challenge addresses how to translate (map) data from one
modality to another. Not only is the data heterogeneous, but the relationship
between modalities is often open-ended or subjective. For example, there exist
a number of correct ways to describe an image and one perfect translation may
not exist.
T-unit analysis. The analysis of terminable units of language (T-units), where a T-unit is the smallest group of words that could be considered a grammatical sentence, regardless of how it is punctuated. T-unit analysis is used extensively to measure the overall complexity of both speech and writing samples, and consists mainly of measuring different aspects of their syntactic construction, such as the mean length of the T-units and the number of clauses present in each unit, among others.
User-independent model A model that generalizes to a different set of users beyond
those used to develop the model.
Voice quality refers to the type of phonation during voiced speech. Depending
on the physical movement of the vocal folds during phonation, the perceived
quality of speech can change, even for the same speech sound uttered at the
same pitch. Descriptors such as “creaky” and “breathy” are applied to specific
modes of vocal fold vibration.
Vowel space area is a term given to the two-dimensional area enclosed by lines
connecting pairs of vowels in the formant (F1/F2) space.
Zero-shot learning is a method in machine learning to learn a new task without
any training examples for this task. An example could be recognizing a new type
of object without any visual example but based on a semantic description such
as specific features that describe the object.