
Prediction and Localization of Student Engagement in the Wild


Amanjot Kaur Aamir Mustafa Love Mehta Abhinav Dhall
Learning Affect & Semantic Imaging Analysis Group,
Indian Institute of Technology Ropar
{amanjot.kaur, 2014csb1018, abhinav}@iitrpr.ac.in aamir 281btech14@nitsri.ac.in

Abstract—The digital revolution has transformed traditional teaching procedures; students are increasingly going online to access study materials. It is realised that analysis of student engagement in an e-learning environment would facilitate effective task accomplishment and learning. Well known social cues of engagement/disengagement can be inferred from facial expressions, body movements and gaze patterns. In this paper, students' responses to various stimuli (educational videos) are recorded and cues are extracted to estimate variations in engagement level. We study the association of a subject's behavioral cues with his/her engagement level, as annotated by labelers. We localize engaging/non-engaging parts in the stimuli videos using a deep multiple instance learning based framework, which can give useful insight into designing Massive Open Online Courses (MOOCs) video material. Recognizing the lack of any publicly available dataset in the domain of user engagement, a new 'in the wild' dataset is curated. The dataset, Engagement in the Wild, contains 264 videos captured from 91 subjects, which is approximately 16.5 hours of recording. Detailed baseline results using different classifiers, ranging from traditional machine learning to deep learning based approaches, are evaluated on the database. Subject independent analysis is performed and the task of engagement prediction is modeled as a weakly supervised learning problem. The dataset is manually annotated by different labelers, and correlation studies between the annotated and predicted labels of videos by different classifiers are reported. This dataset creation is an effort to facilitate research in various e-learning environments such as intelligent tutoring systems, MOOCs, and others. Download available at http://iitrpr.ac.in/lasii.

Fig. 1. Frames from our engagement database (with low to high engagement [0-3]) with varied recording environments.
Index Terms—Dataset for Student Engagement, Engagement Detection, Engagement Localization, E-learning Environment

I. INTRODUCTION

Human beings require time and persistent effort to perform a task such as learning. Along the same lines, researchers have argued the importance of continuous effort, or engagement, for accomplishing the learning task [1]. With the advent of digital technologies, traditional classroom teaching has been transformed into advanced learning environments such as Intelligent Tutoring Systems (ITS) and e-learning environments such as Massive Open Online Courses (MOOCs). As a result, student engagement in the e-learning environment has also adapted in its own way.

In the literature, student engagement is defined as a complex structure with multiple dimensions and components, and various works have classified engagement in different ways. Fredricks et al. [2] describe a behavioral engagement component, which explains student involvement in terms of effort, persistence and concentration. Several traditional methods and measures introduced in the literature to assess the level of engagement have their own relevance: self-reports, in which participants answer a set of questions related to their experience, such as level of engagement, interest and so on [3], and observational studies by an external expert, which focus on the behavioral analysis of the students.

Traditional measurement methods are not sufficient to measure engagement in all contexts, so automatic engagement assessment for the digitally transformed learning environment is required. Such techniques analyze facial cues, body posture and other well known social cues of engagement and disengagement, captured automatically using affective computing techniques. These techniques are sensitive to variations in engagement level over time. Moreover, automatic measures can facilitate timely intervention to change the course of fading engagement [4].

It is evident that student engagement plays a key role in the learning process, so its automatic measurement is also important in today's digital learning environment.



The paper proposes an automatic pipeline for predicting users' engagement while they are watching educational videos such as the ones in Massive Open Online Courses (MOOCs). Secondly, a new 'Engagement Detection in the Wild' video-based database (Figure 1) is also introduced. User engagement analysis is of importance to learning environments that involve interaction between humans and machines. Generally, a course instructor is able to assess the engagement of students in a classroom environment based on their facial expressions. Other well known social cues such as yawning, glued eyes, and body posture are also handy in analyzing the behavioral change of the participants in a natural environment. However, this assessment becomes tricky when students are watching study material, such as videos, on their computers and mobile phones in different environments. These different environments can be their room, office, cafeteria, outdoors and so on.

This work is related to the estimation of perceived engagement by an external observer and is appropriate for the e-learning environment. Examples of environments in which user engagement is a vital component to analyze are e-learning, the health sector (for example, autism-related studies) and driver engagement in automated cars.

Automatic engagement prediction can be based on various kinds of data modalities. It is argued that the student response can be used as an indicator of engagement in intelligent tutoring systems [5]. Another approach is based on features extracted from facial movements [6], [7]. Automated measures such as the response time of a student to problems and test quizzes are used in ITS [8]. Physiological and neurological measures requiring specialized sensors, such as electroencephalogram, heart rate, and skin response, have also been used to measure engagement [9]-[11].

Table I summarizes important engagement detection approaches. Xiao et al. have studied the impact of distractions or multitasking in MOOCs through mobile apps [11]. In this work, along with photoplethysmography (PPG) signals recorded after the lecture, self-reports and a quiz related to the video content were administered to the 18 subjects who watched two stimulus lecture videos.

TABLE I
IMPORTANT WORKS FOR AUTOMATIC ENGAGEMENT DETECTION

Name                    Features                        Subj.  Method
D'Mello et al. [12]     FACS & Log                      28     -
Grafsgaard et al. [3]   FACS                            67     Regression
Whitehill et al. [13]   FACS                            34     Regression
Grafsgaard et al. [14]  FACS, Log                       -      HMM
Gupta et al. [15]       Facial Features                 112    DNN
Xiao et al. [11]        PPG                             18     RBF+SVM
Arroyo et al. [16]      Video & Physiological           60     Regression
Bosch et al. [17]       Facial Features, Body Posture   137    Regression

In an interesting work, Whitehill et al. analyzed behavioral engagement and defined four levels of engagement. The analysis is performed on facial features (FACS) with the help of a classifier [13]. In another experiment, affective states pertaining to engagement such as Interest, Confidence, Excitement and Frustration (1-5 scale) were used. In this experiment, video and physiological sensor signals were used to capture different modalities to analyze the affective states of students while solving mathematical problems from the SAT. The dataset was self-annotated and linear regression was used to regress the value of the different affective states [16]. In another study, conducted by Bosch et al. in real world settings, various affective states related to engagement such as boredom, engagement, concentration, confusion, frustration and delight were used. Features were in the form of facial expressions and body movements. Methods such as BayesNet, clustering and regression were used to regress the different dimensions [17].

Surveys such as [18], [19] discuss the important work done in the field of affective computing. Based on the knowledge in affective computing, researchers have recently tried to predict user engagement. El Kaliouby et al. [20] argue that user engagement prediction is a challenging problem and has many applications, for example in autism-related studies. Furthermore, Tawari et al. [21] found that recognition of a car driver's affective state, in terms of the engagement level, provides important cues for detecting drowsiness while driving. Navarathna et al. [22] proposed an interesting pipeline for analyzing a movie's affect by predicting viewers' engagement in the movie. User engagement level recognition is essential in e-learning and is useful for detecting states such as fatigue, lack of interest and difficulty in understanding the content.

The temporally constrained and dynamic nature of engagement as an affective state means that it does not remain the same; rather, it varies between low and high. The pattern of user engagement can give various interesting insights, such as how long users concentrate or at which point in a lecture video the user loses interest. This is useful for successful delivery of educational content and for deciding evaluation criteria for user engagement [23]. The number of datasets related to user engagement in the online learning environment released so far is limited [13], [24], [25]. With the advent of deep learning frameworks, larger databases representing diverse settings are required. The availability of datasets related to user engagement, especially student engagement in online courses, would help in understanding the problems faced by students, such as loss of interest, fatigue and boredom, in the online learning environment.

In this paper, we introduce a database for common benchmarking and development of student engagement assessment methods which work in 'in the wild' conditions. Here, 'in the wild' refers to different backgrounds, occlusion, illumination, head poses etc. The video level engagement prediction task is modeled as a weakly labeled learning problem. Multiple Instance Learning (MIL) is followed as the paradigm for this weakly supervised learning. MIL has been found to be extensively useful for tasks in which the dataset has weak or noisy labels [26], and has recently been used in affect analysis tasks such as facial expression prediction [27] and image emotion prediction [28]. MIL assumes that the data can be represented with the help of a set of sub-instances, or bags of sub-instances, {Xi},
where each bag consists of sub-instances {xij} and the weak labels {yi} are present at the bag level only [29]; here i indexes the instances or bags and j indexes the sub-instances within a particular bag.

Our work is based on a Deep Multi-Instance Learning (DMIL) framework, which combines the complex features that can be learned by deep neural networks with the MIL technique for a weakly supervised problem. Sikka et al. [30] studied pain localization in a video in a weakly supervised fashion using the MIL framework. Tax et al. [27] proposed an MIL based framework that selects concept frames using a clustering technique for multi-instance learning.

The main contribution of this work is to formulate student engagement prediction and localization as a MIL problem and to derive baseline scores based on DMIL. The rationale behind MIL is that labeling engagement at frequent intervals in user videos is expensive and noisy.

The rest of the paper is organized as follows: In Section II, we introduce the dataset and give details about the data collection and annotation. In Section III, we discuss the methodology used for student engagement prediction and localization. In Section IV, we describe the experimental setup and the results obtained.

II. DATA COLLECTION

In this section, we discuss the data recording paradigm used in this work. The experiment has two parts: i) video recordings of the subjects while they watch the stimuli videos (each video around 5 minutes long), and ii) a verbal feedback section, in which the subject is given 15 seconds to speak about the video. In those 15 seconds the subjects are asked to express their views about the video, in particular: did they find the video interesting, would they like to watch the video again, which part of the video was most/least engaging, and any comments/suggestions on how the video could have been made more engaging. Based on their content, we chose purely educational stimuli videos: Learn the Korean Language in 5 Minutes, a pictorial video (Tips to Learn Faster), and How to Write a Research Paper. This was aimed at capturing both focused and enjoyable settings which allow natural variations. In the experiment section, as a baseline, we explore the video signals only.

The dataset contains 91 subjects (27 females and 64 males). The age range of the subjects is 19-27 years. A total of 264 videos are collected, each approximately 5 minutes long. The dataset is collected in unconstrained environments, i.e. at different locations such as computer labs, hostel rooms, open ground etc. In order to cover more uncontrolled real life scenarios, we also include a video conferencing based setup. In this setup, the subject watched the stimuli on their computer (full) screen and, in parallel, a Skype-based video session was set up. We captured the video region of the Skype application at the other end of the Skype video-call. It was made sure that the subject was not disturbed by the video recording. The usefulness of recording over Skype is that frame drops, network latency and disturbance (of the subject) can help in simulating different environments.

The videos are captured at a resolution of 640 x 480 pixels at 30 fps using a Microsoft Lifecam wide-angle F2.0 camera (excluding the Skype-based recordings). The audio signal is recorded using a wideband microphone, which is part of the camera. Apowersoft Screen Recorder software is used to capture the Skype videos.

The dataset and its accompanying baseline code are publicly available to facilitate open research. The dataset also forms part of the engagement prediction in the wild challenge, which is part of the Emotion Recognition in the Wild challenge 2018 (https://sites.google.com/view/emotiw2018). The consent of the subjects has been taken following due process. To the best of our knowledge, this is one of the first publicly available user engagement datasets recorded in diverse conditions.

A. Data Annotation

A team of 5 annotators labelled the engagement intensity of the subjects in the videos. The labelers were instructed to label the videos on the basis of the engagement intensity (from facial expressions), ranging from 0 to 3. Audio for the videos was turned off during labeling. The engagement categories are inspired by the work of Whitehill et al. [13]. The engagement intensity mapping is as follows: Not Engaged means that the subject is completely disengaged, e.g. the subject seems uninterested and looks away from the screen frequently. Less Engaged means barely engaged, e.g. the subject barely opens his/her eyes or moves restlessly in the chair. Engaged means that the subject seems engaged in the content, e.g. the subject seems to like the content and is interacting with the video. Highly Engaged means that the subject seems to be fully engaged, e.g. the subject was glued to the screen and was focused. Note that the labels above were used to make the labellers understand the steps of the engagement scale (0-3).

The Train, Validation and Test partitions contain 147, 48 and 67 videos, respectively. There is a natural progression in the intensity of engagement; therefore, we treat the engagement prediction task as a regression problem. For judging annotator reliability, we computed weighted Cohen's kappa with quadratic weights as the performance metric (similar to [13], [15]), as the labels are not categories but intensities of engagement {0, 1, 2, 3}. Any annotator whose agreement coefficient is less than 0.4 is marked as less reliable and the label is ignored. After denoising, the labels are averaged and rounded off to the nearest integer to give a ground truth engagement rating to the video. Example images for the engagement labels are shown in Figure 1.
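As an illustration of this reliability check, the short sketch below computes the quadratic weighted Cohen's kappa with scikit-learn and applies the 0.4 threshold. Comparing each annotator against the rounded consensus of the remaining annotators is our assumption (the paper does not specify the exact pairing), and the toy ratings are made up.

import numpy as np
from sklearn.metrics import cohen_kappa_score

# ratings: one row per annotator, one column per video, values in {0, 1, 2, 3} (toy data)
ratings = np.array([
    [3, 2, 2, 1, 0, 3],
    [3, 2, 1, 1, 0, 3],
    [2, 2, 2, 1, 1, 3],
])

reliable = []
for a in range(ratings.shape[0]):
    others = np.delete(ratings, a, axis=0)
    consensus = np.rint(others.mean(axis=0)).astype(int)      # consensus of the other annotators
    kappa = cohen_kappa_score(ratings[a], consensus, weights="quadratic")
    print("annotator", a, "weighted kappa:", round(kappa, 3))
    if kappa >= 0.4:                                           # threshold used in the paper
        reliable.append(a)

# ground truth = average of the reliable annotators' labels, rounded to the nearest integer
ground_truth = np.rint(ratings[reliable].mean(axis=0)).astype(int)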
Fig. 2. Proposed deep multi-instance network: LBP-TOP features are extracted from facial video segments. Linear regression using a DNN is employed and instance level responses are ranked. Finally, max and mean pooling are used to predict the subject's engagement intensity.

B. Weighted Labels

The annotation process should introduce as little subjective bias as possible while labeling the videos. In this work, low agreement among the annotators and subjective bias in the annotation process are addressed based on the technique introduced in the work of Joshi et al. [31]. The notion of the credibility of a user is defined in that work as the consistency of the annotator in giving the same engagement label to the same clip shown at different times. We have used the same notion of credibility to calculate the weight of each annotator's rating, based on the deviation in the ratings given to the same clips shown at different time stamps. This gives the weighted labels and serves to penalize raters who are prone to giving more erroneous ratings.

III. PROPOSED FRAMEWORK

Our framework performs the subject's engagement detection and localization. The steps are as follows: the face and facial landmark positions [32] are detected in each frame. The video is then sub-sampled into small segments (instances), as discussed in detail in Section III-A. Features are extracted for a window of frames; the window then slides and the next set of frames is used to compute features. This stage is discussed in Sections III-B and III-C. The engagement prediction and localization are computed with the deep MIL network using mean and top-k pooling for regression. The details are discussed in Section III-D. Another model, which processes the sequence of segments, is constructed using a Long Short Term Memory (LSTM) based network with mean pooling. The details are also discussed in Section III-D.

A. Pre-Processing

Let v = [f1, f2, f3, ..., fn] be a video clip of a subject in response to a stimulus, where f denotes a video frame and n is the total number of frames. The OpenFace library [32] is used to track the facial landmarks first in order to register the facial region, which is then cropped to a final resolution of 112 × 112. In order to speed up the experiments, each video is sampled at a different frame rate (the videos were recorded in varied conditions with different cameras) so that all videos have the same number of frames.

B. LBP-TOP Feature Extraction

We are interested in computing spatio-temporal features to capture the changes in the facial region. Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) [33] is a standard spatio-temporal texture descriptor and is used in our work.

1) Shorter segments: A sliding window vf = [f1, f2, ..., fm] of m frames is taken at a time and LBP-TOP features are extracted. A smaller value of m would fail to efficiently approximate the entire expression, while a larger value of m would yield results with poor frame-wise specificity. We therefore deem m = 50 (approx. 10 seconds in the sampled video) to be the appropriate length of each window for expression recognition. After extracting features for the first m frames, the sliding window is moved by p frames. The value of p is empirically set to 25 (an overlap of 50%), as it was found to provide adequate accuracy.
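A minimal sketch of this sliding-window segmentation (m = 50 frames, stride p = 25, i.e. 50% overlap) is given below; the LBP-TOP call is left as a placeholder, since the descriptor implementation itself is outside the scope of this paper.

import numpy as np

def make_segments(frames, m=50, p=25):
    # Split a face-cropped frame sequence of shape (n, 112, 112) into overlapping windows.
    return [frames[start:start + m] for start in range(0, len(frames) - m + 1, p)]

video = np.random.rand(150, 112, 112)        # toy stand-in for 150 registered face frames
windows = make_segments(video)               # -> 5 windows of 50 frames each
# features = [lbp_top(w) for w in windows]   # placeholder for any LBP-TOP implementation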
C. Eye Gaze and Head Pose Feature Extraction

Student activities give cues for inferring learning centered affective states such as engagement. In this paper, social signals in the form of eye gaze and head movement features are extracted using the OpenFace toolkit.

1) Head Pose (HP): The movement of the head is an indicator of engagement intensity. OpenFace returns 3 values related to head movement in 3D space, which give the location of the head w.r.t. the camera. The dynamics of these values give insights about the head movement of a subject.

2) Eye Gaze (EG): The gaze of the left and right eye is extracted using OpenFace for every frame in the videos. It returns a 3-dimensional direction vector for each eye. The movement of the eyes gives cues about the engagement level of the student; for example, closed eyes or looking away from the screen could be negative indicators of the engagement level.

The challenges involved in constructing a video level representation can be summarized as the limited number of videos and the large decision unit (the label is assigned to the whole video). These challenges highlight the fact that the whole video contains many indicators of engagement in its short temporal segments, which in turn motivates constructing segment level features from frame level features, as mentioned earlier.

Fig. 3. Different approaches discussed in the experiments to regress the engagement level of a video. Approach 'A' is the multiple instance learning approach using K-means clustering, approach 'B' uses a single video level representation, and approach 'C' is the deep multi-instance learning approach.

Each frame is thus represented as a 9-dimensional feature vector. A statistical feature representation is useful to capture the dynamics of the various social cues present in the form of the eye gaze and head movement features. We apply moment functions, in the form of the standard deviation and mean, to the eye gaze and head pose modalities. For each modality, segment level features are constructed as follows. Eye gaze is returned by the OpenFace utility as a 6-dimensional feature vector capturing the eye gaze behavior. Subsequently, the two moment functions, mean and standard deviation, are applied over the set of frames in a segment to construct the segment level features. As a result, the eye gaze movement for each segment is represented using a 2*6 = 12 dimensional feature vector. The head pose modality represents the movement of the head. It is a 3-dimensional vector returned by OpenFace. Following the application of the moment functions, the segment level feature is a 2*3 = 6 dimensional feature vector representing the dynamics of head movement for a segment. These features are concatenated for each segment, which makes an 18-dimensional feature vector, as shown in Figure 3. For dimensionality reduction, Principal Component Analysis (PCA) is applied to the normalized segment level features of all the videos.
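The segment descriptor just described can be sketched as below. The per-frame layout (a frames x 9 matrix of 6 gaze-direction plus 3 head-location values) and the toy shapes are assumptions for illustration; the reduction to 4 dimensions follows the workflow reported later in Section IV-B.

import numpy as np
from sklearn.decomposition import PCA

def segment_feature(seg):
    # seg: (frames, 9) array = 6 gaze-direction + 3 head-location values per frame.
    # Mean and standard deviation per dimension, concatenated -> 18-d segment vector.
    return np.concatenate([seg.mean(axis=0), seg.std(axis=0)])

segments = np.random.rand(450, 50, 9)                      # toy data: 450 segments of 50 frames
feats = np.stack([segment_feature(s) for s in segments])   # (450, 18)

# normalize, then reduce the segment features to 4 dimensions
feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
feats_4d = PCA(n_components=4).fit_transform(feats)        # (450, 4)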
D. Deep Multi-Instance Network for Engagement Localization and Prediction

We design a deep multi-instance network for the task of user engagement level prediction and localization. Figure 2 shows the proposed network architecture, with multiple dense regression layers, one ranking layer and one multi-instance loss layer. Pooling functions play an important role in deep MIL networks for fusing instance-level outputs. Two pooling schemes are employed for combining multiple instances: i) top-k pooling based multi-instance learning, taking only the largest ten elements (k = 10) from the ranking layer, and ii) mean-pooling based multi-instance learning, taking the mean of all the elements. The details of these schemes are given below.

1) Top-l Pooling based Multi-instance Learning: A general MIL framework for classification problems defines two kinds of bags, positive and negative. In MIL, if at least one instance is positive then the bag is assigned a positive label [34]. In our dataset, since the ground truth labels are given for the entire video spanning 5 minutes, we take the l largest instance intensities from the ranking layer, instead of adopting the general multi-instance learning assumption of considering only the largest contributing instance. The hyper-parameter l has been chosen from the set of values l ∈ {1, 5, 10, 20, 30} and is empirically set to 10.

2) Mean Pooling based Multi-instance Learning: The engagement intensity fluctuates among the instances over the span of the video. We therefore adopt another pooling scheme, in which the final predicted intensity for the i-th bag is taken to be the mean of all the intensities from the ranking layer, i.e. (1/M) Σ_{j=1}^{M} x_ij, where x_ij represents the j-th segment of the i-th video.
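The two pooling schemes can be written compactly as below; treating the ranking-layer outputs for one video as a 1-D array and averaging the top-l values is our reading of the top-l scheme.

import numpy as np

def top_l_pool(instance_scores, l=10):
    # Bag prediction taken here as the mean of the l largest instance intensities (l = 10 in the paper).
    return np.sort(instance_scores)[-l:].mean()

def mean_pool(instance_scores):
    # Bag prediction = (1/M) * sum_j x_ij over all M instances.
    return instance_scores.mean()

scores = np.random.rand(150)    # toy ranking-layer outputs for the 150 segments of one video
print(top_l_pool(scores), mean_pool(scores))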
3) Sequence Network: LSTM based frameworks have been successful in sequence based prediction tasks, which motivates us to design a network that leverages insights from the sequence of segments. The intuition behind using an LSTM based network is that the engagement level varies across the segments. The annotators pay attention to different segments of the recorded video and give a label on the basis of the perceived engagement level, which implies that different segments contribute differently to the overall annotation of the video.

Figure 4 presents the proposed architecture of the sequence based network. In this network, the first layer is an LSTM layer with 32 hidden units. The purpose of this layer is to compute an activation value for the sequence of segments of a video; it maps the multi-segment representation of a video to a feature vector which represents the activations for the different segments of the video. It is followed by a flatten layer, which serves to make a single feature vector from the activations of all the segments of the video. This feature vector is passed to another module, which consists of three dense layers with sigmoid activations followed by average pooling. This module is responsible for learning the function which maps the segment level activations to the engagement level of a video.
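Below is a hedged tf.keras sketch of the sequence network as we read the description above (LSTM with 32 hidden units over the segment sequence, a flatten layer, three sigmoid dense layers producing per-segment activations, and an averaging step); the dense layer widths, the input shape and the [0, 1] scaling of the target are our assumptions.

import tensorflow as tf

NUM_SEGMENTS, FEAT_DIM = 150, 4    # segments per video, PCA-reduced feature size (assumed)

inputs = tf.keras.Input(shape=(NUM_SEGMENTS, FEAT_DIM))
x = tf.keras.layers.LSTM(32, return_sequences=True)(inputs)         # one activation per segment
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(128, activation="sigmoid")(x)
x = tf.keras.layers.Dense(64, activation="sigmoid")(x)
x = tf.keras.layers.Dense(NUM_SEGMENTS, activation="sigmoid")(x)    # per-segment engagement score
# average pooling over the segment scores -> one video-level value (labels assumed scaled to [0, 1])
outputs = tf.keras.layers.Lambda(lambda t: tf.reduce_mean(t, axis=1, keepdims=True))(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()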
IV. EXPERIMENTS

The details of the dataset used for experimentation can be found in Section II. We followed a subject independent cross-validation protocol for the dataset split. In this section, we discuss the approaches followed to predict the engagement level of individuals; the result comparisons are shown in Section V.

A. Multiple Instance Learning Approach

This section explains workflow 'A' of Figure 3. In this approach a temporal scanning window is taken and features are extracted for each window segment of the video. The training data is in the form of bags, B = {Xi, yi} for i = 1, ..., N, where Xi = {xij} for j = 1, ..., M and yi ∈ Y; M (the number of instances per video) is equal to 150 and N is the total number of videos in the training dataset.

The segments obtained for each sequence are labeled using the following two approaches:

1) Noisy/Weak Labeling: All the segments in the training data are assigned the same label as that of their corresponding video sequence. This strategy is similar to that employed in previous works [35], [36].
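A toy illustration of this bag construction and noisy/weak labeling (every one of the M = 150 segments inherits its video's label); the array shapes and video ids are made up for illustration.

import numpy as np

M = 150                                     # segments (instances) per video
videos = {"vid_001": 2, "vid_002": 0}       # video id -> ground-truth engagement label

segment_feats, segment_labels = [], []
for vid, y in videos.items():
    X = np.random.rand(M, 18)               # stand-in for the 18-d segment features of one bag
    segment_feats.append(X)
    segment_labels.append(np.full(M, y))    # noisy/weak labeling: every segment gets the bag label

segment_feats = np.vstack(segment_feats)             # (num_videos * M, 18)
segment_labels = np.concatenate(segment_labels)      # one weak label per segment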
2) Relabeling segments using K-means clustering: In this approach, the segments are divided into k clusters by applying K-means clustering to the features. The hyperparameter k is selected from the set of values k ∈ {5, 10, 15, 20, 25, 30, 35, 40}. Segments falling in a cluster are relabeled as follows: i) all the segments in a cluster are relabeled with the mode of the labels in that cluster, or ii) all the segments in the cluster are relabeled with the mean of the labels in that cluster.
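This relabeling step can be sketched as below (mode variant shown, mean variant in the comment); the synthetic features, labels and the choice k = 20 are for illustration only.

import numpy as np
from sklearn.cluster import KMeans

segment_feats = np.random.rand(300, 18)          # synthetic 18-d segment features
segment_labels = np.random.randint(0, 4, 300)    # weak labels inherited from the videos

k = 20                                           # chosen from {5, 10, ..., 40} by validation
clusters = KMeans(n_clusters=k, random_state=0).fit_predict(segment_feats)

relabeled = segment_labels.astype(float).copy()
for c in range(k):
    idx = clusters == c
    if idx.any():
        relabeled[idx] = np.bincount(segment_labels[idx]).argmax()   # mode of the cluster's labels
        # relabeled[idx] = segment_labels[idx].mean()                # mean variant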
3) Regressors: Several state of the art machine learning models are trained for the regression task. i) Support Vector Regression (SVR): this regressor works on the relabeled segments, and the value for a video is regressed as the mean value over all the segments belonging to that particular video. A Gaussian kernel is used, and a grid search is performed over the parameters C and sigma. The MSE is obtained for the best combination of C and sigma for the different segment relabeling techniques mentioned earlier. For the mode analysis, (C, sigma) is (1, 1), whereas for the mean analysis (C, sigma) is (1, 4). ii) A Bayesian Ridge regressor is used to regress the engagement level of a video from the regressed values of its segments, with different parameter settings.
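The SVR step can be sketched with scikit-learn as below; note that scikit-learn parameterizes the RBF kernel width as gamma rather than sigma, and the grid values and synthetic data here are illustrative.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X = np.random.rand(300, 18)                             # relabeled segment features (toy)
y = np.random.randint(0, 4, 300).astype(float)          # relabeled segment intensities (toy)

grid = GridSearchCV(SVR(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 4, 10], "gamma": ["scale", 0.25, 1.0]},
                    scoring="neg_mean_squared_error", cv=3)
grid.fit(X, y)

# video-level prediction = mean of the predictions over that video's 150 segments
video_segments = np.random.rand(150, 18)
video_pred = grid.best_estimator_.predict(video_segments).mean()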
B. Traditional Machine Learning Approach for a Single Video Level Representation

In this paradigm, the video level representation is constructed using the segment level feature representation, explained as workflow 'B' of Figure 3. The 18-dimensional statistical feature representation is extracted using the standard deviation and mean of each segment. It is followed by dimensionality reduction using PCA; as a result, each segment is represented by a normalized 4-dimensional feature vector. The video level feature is constructed as the average of the 4-dimensional feature vectors of all the segments. As a result, each video is represented as a 4-dimensional feature vector, which is passed to a regressor. In our case, a Random Forest (RF) regressor is optimized using cross validation with a grid search over its hyperparameters. Based on the performance metric, the tree depth is selected as 5 and the number of trees is selected as 10.
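A sketch of this final stage of workflow 'B': one 4-d feature vector per video fed to a Random Forest regressor tuned by grid search (the values reported above are a maximum depth of 5 and 10 trees); the data here is synthetic and the fold count is an assumption.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X = np.random.rand(147, 4)                       # one 4-d vector per training video (toy)
y = np.random.randint(0, 4, 147).astype(float)   # video-level engagement labels (toy)

search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid={"max_depth": [3, 5, 7], "n_estimators": [10, 50, 100]},
                      scoring="neg_mean_squared_error",
                      cv=3)                      # the paper uses subject-independent folds
search.fit(X, y)
print(search.best_params_)                      # the paper reports depth 5 and 10 trees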
C. Deep Multi Instance Learning

This section introduces workflow 'C' of Figure 3.

DNN: The architecture for the Deep MIL approach is shown in Figure 2. Here, segment level LBP-TOP features are extracted. These features are then passed to a neural network with four dense layers, obtaining a single feature for each segment. To obtain a single label for the entire video, the segment-wise features are then passed through a Deep MIL loss layer using the two pooling schemes, namely Top-l and Mean, mentioned in Section III-D. We also evaluated the Rectified Linear Unit (ReLU) activation function, f(x) = max(0, x), the most frequently used activation function, as it allows gradient flow for all positive values.

Proposed Method: The architecture for the LSTM based Deep MIL approach is shown in Figure 4. The eye gaze and head pose features of each segment are fused as explained in Section III-C. These features are passed through the LSTM layer, which returns an activation for each segment of the video; the result is passed to the flatten layer, and the flattened feature vector is passed through the network of three dense layers followed by average pooling, which gives the regressed value of the engagement level of a video.

D. Metrics

The performance metrics used in this work are Mean Square Error (MSE) and Pearson Correlation Coefficient (PCC). The analysis performed on the weighted labels is reported using the weighted mean square error (MSE wgt) and weighted Pearson Correlation Coefficient (PCC wgt).

V. RESULTS AND DISCUSSIONS

In this paper, we have presented the efficacy of different approaches based on short temporal segments to regress the engagement level of the videos. There are various challenges involved in the introduced video dataset. Firstly, there is an imbalance in the distribution of training and test samples across the different engagement levels. Secondly, the total number of labeled video samples across the training and test sets is 267, which is very small. Thirdly, the length of every labeled video in the dataset is 5 minutes on average, which is a much larger decision unit. This makes the problem even more complex.
Fig. 4. The proposed deep multi-instance network pipeline. First, eye gaze and head pose features are extracted from facial video segments. Second, the sequence of segments is passed through an LSTM layer followed by three dense layers; finally, an average pooling layer is applied.

Lastly, forming discriminative temporal segment level features from frame level features, and in turn forming a discriminative video level feature from segment level features using statistical or deep learning based methods, is itself difficult.

We have given baseline results for the dataset using different approaches with the fused Eye Gaze and Head Pose features in Table II. The results show better performance for the Random Forest (RF) regressor with the single video level representation constructed from segment level features, given as workflow 'B' in Figure 3. The LSTM based deep network also shows comparable results, whereas the conventional DNN based network fails to capture the dynamics of the hierarchy of features used at the different levels. Another perspective is analyzing the results based on the weighted labels. As given in Table II, the performance metrics clearly improve with the weighted labels approach. We have performed subject independent cross validation and report the average results in all cases.

TABLE II
PERFORMANCE COMPARISON OF DIFFERENT APPROACHES FOR THE FUSED FEATURES OF HEAD POSE AND EYE GAZE, BASED ON NON-WEIGHTED AND WEIGHTED LABELS

Method   MSE    PCC    MSE wgt   PCC wgt
RF       0.09   0.17   0.05      0.26
LSTM     0.09   0.05   0.06      0.10
DNN      0.15   0.01   0.08      0.01

We also conducted experiments on the LBP-TOP features with different regressors, such as SVR, Bayesian Ridge and Random Forest, to predict the engagement level. We also used various relabeling techniques, such as mean, median and mode over the k-means clusters, in combination with these regression algorithms. The corresponding results for SVR are shown in Table III. It is clear from the results that this technique does not work, as it predicts almost the same label for all the videos; the MSE is low but the PCC is nearly 0 for all the combinations.

TABLE III
BASELINE PERFORMANCE COMPARISON OF DIFFERENT ML APPROACHES FOR LBP-TOP. HERE, RELABELING REFERS TO THE METHOD OF SEGMENT-WISE LABELING.

Method   Relabeling Mode   MSE    MSE wgt
SVR      Noisy             0.15   0.09
SVR      Mode              0.15   0.09
SVR      Mean              0.15   0.09

Similar behavior is seen in Table IV, where the different relabeling techniques are applied in combination with different regressors on the fused Eye Gaze and Head Pose features. We observed results similar to the LBP-TOP case: the MSE and MSE wgt metrics are small but the PCC is again almost 0, which implies the failure of this modeling technique. This can be explained by the small number of samples and the imbalanced distribution of the sample labels. Another common observation among the failed techniques is the same label prediction for all the instances.

TABLE IV
BASELINE PERFORMANCE COMPARISON OF DIFFERENT RELABELING TECHNIQUES FOR THE FUSED FEATURE OF HEAD POSE AND EYE GAZE.

Method   Relabeling Mode   MSE    MSE wgt
SVR      Noisy             0.09   0.09
SVR      Mode              0.15   0.09
SVR      Mean              0.15   0.09

VI. CONCLUSION & FUTURE SCOPE

In this paper, we presented an engagement dataset that we have prepared in the wild. We have also provided baseline results on the new dataset using various techniques. In automatic engagement detection, most of the currently used techniques, such as self-reports and teacher introspective evaluation, are cumbersome and lack the ability to localize student engagement. There is a need to come up with an intelligent predictive technique for this problem. An important requirement to achieve that is the availability of a real-world engagement dataset with diverse subjects. In the prior literature, different modalities such as student responses, audience movie ratings, and psychological and neurological sensor readings have been used for engagement prediction. However, we have worked just with the facial and head movement features of the subject and shown that the segment based RF model and the LSTM based deep sequence network give a good baseline performance. We also tried to address the problem of long (approximately 5 minute) labeled videos using various temporal segment labeling techniques. The results show that these techniques do not work well with
either the LBP-TOP or the facial features, as mentioned earlier. The presented dataset can be used for further impactful research work in this area. Also, the proposed predictive methods can be used extensively in designing study material for MOOCs and ITS to substantially decrease dropout rates.

VII. ACKNOWLEDGMENTS

We would like to thank Nvidia for providing the Titan X card for research.

REFERENCES

[1] M. E. Nicholls, K. M. Loveless, N. A. Thomas, T. Loetscher, and O. Churches, "Some participants may be better than others: Sustained attention and motivation are higher early in semester," The Quarterly Journal of Experimental Psychology, vol. 68, no. 1, pp. 10–18, 2015.
[2] J. A. Fredricks, P. C. Blumenfeld, and A. H. Paris, "School engagement: Potential of the concept, state of the evidence," Review of Educational Research, vol. 74, no. 1, pp. 59–109, 2004.
[3] J. F. Grafsgaard, R. M. Fulton, K. E. Boyer, E. N. Wiebe, and J. C. Lester, "Multimodal analysis of the implicit affective channel in computer-mediated textual communication," in Proceedings of the 14th ACM International Conference on Multimodal Interaction. ACM, 2012, pp. 145–152.
[4] S. D'Mello, E. Dieterle, and A. Duckworth, "Advanced, analytic, automated (AAA) measurement of engagement during learning," Educational Psychologist, vol. 52, no. 2, pp. 104–123, 2017.
[5] K. R. Koedinger, J. R. Anderson, W. H. Hadley, and M. A. Mark, "Intelligent tutoring goes to school in the big city," 1997.
[6] S. K. D'Mello and A. Graesser, "Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features," User Modeling and User-Adapted Interaction, vol. 20, no. 2, pp. 147–187, 2010.
[7] X. Xiao, P. Pham, and J. Wang, "Dynamics of affective states during MOOC learning," in International Conference on Artificial Intelligence in Education. Springer, 2017, pp. 586–589.
[8] E. Joseph, "Engagement tracing: using response times to model student disengagement," Artificial Intelligence in Education: Supporting Learning through Intelligent and Socially Informed Technology, vol. 125, p. 88, 2005.
[9] B. S. Goldberg, R. A. Sottilare, K. W. Brawner, and H. K. Holden, "Predicting learner engagement during well-defined and ill-defined computer-based intercultural interactions," in International Conference on Affective Computing and Intelligent Interaction. Springer, 2011, pp. 538–547.
[10] M. Chaouachi, P. Chalfoun, I. Jraidi, and C. Frasson, "Affect and mental engagement: towards adaptability for intelligent systems," in Proceedings of the 23rd International FLAIRS Conference. AAAI Press, 2010.
[11] X. Xiao and J. Wang, "Understanding and detecting divided attention in mobile MOOC learning," in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2017, pp. 2411–2415.
[12] S. K. D'Mello, S. Craig, and A. Graesser, "Multimethod assessment of affective experience and expression during deep learning," International Journal of Learning Technology, vol. 4, no. 3-4, pp. 165–187, 2009.
[13] J. Whitehill, Z. Serpell, Y.-C. Lin, A. Foster, and J. R. Movellan, "The faces of engagement: Automatic recognition of student engagement from facial expressions," IEEE Transactions on Affective Computing, vol. 5, no. 1, pp. 86–98, 2014.
[14] J. F. Grafsgaard, K. E. Boyer, and J. C. Lester, "Predicting facial indicators of confusion with hidden Markov models," in International Conference on Affective Computing and Intelligent Interaction. Springer, 2011, pp. 97–106.
[15] A. Gupta and V. Balasubramanian, "DAiSEE: Towards user engagement recognition in the wild," arXiv preprint arXiv:1609.01885, 2016.
[16] C. R. Beal, R. Walles, I. Arroyo, and B. P. Woolf, "On-line tutoring for math achievement testing: A controlled evaluation," Journal of Interactive Online Learning, vol. 6, no. 1, pp. 43–55, 2007.
[17] N. Bosch, S. K. D'Mello, R. S. Baker, J. Ocumpaugh, V. Shute, M. Ventura, L. Wang, and W. Zhao, "Detecting student emotions in computer-enabled classrooms," in IJCAI, 2016, pp. 4125–4129.
[18] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009.
[19] C. A. Corneanu, M. Oliu, J. F. Cohn, and S. Escalera, "Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: history, trends, and affect-related applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[20] R. El Kaliouby, R. Picard, and S. Baron-Cohen, "Affective computing and autism," Annals of the New York Academy of Sciences, vol. 1093, no. 1, pp. 228–248, 2006.
[21] A. Tawari, S. Sivaraman, M. M. Trivedi, T. Shannon, and M. Tippelhofer, "Looking-in and looking-out vision for urban intelligent assistance: Estimation of driver attentive state and dynamic surround for safe merging and braking," in Intelligent Vehicles Symposium Proceedings, 2014 IEEE. IEEE, 2014, pp. 115–120.
[22] R. Navarathna, P. Carr, P. Lucey, and I. Matthews, "Estimating audience engagement to predict movie ratings," IEEE Transactions on Affective Computing, 2017.
[23] S. Wu, M.-A. Rizoiu, and L. Xie, "Beyond views: Measuring and predicting engagement on YouTube videos," arXiv preprint arXiv:1709.02541, 2017.
[24] S. Attfield, G. Kazai, M. Lalmas, and B. Piwowarski, "Towards a science of user engagement (position paper)," in WSDM Workshop on User Modelling for Web Applications, 2011, pp. 9–12.
[25] B. Sun, Q. Wei, J. He, L. Yu, and X. Zhu, "BNU-LSVED: a multimodal spontaneous expression database in educational environment," in Optics and Photonics for Information Processing X, vol. 9970. International Society for Optics and Photonics, 2016, p. 997016.
[26] J. Amores, "Multiple instance classification: Review, taxonomy and comparative study," Artificial Intelligence, vol. 201, pp. 81–105, 2013.
[27] D. M. Tax, E. Hendriks, M. F. Valstar, and M. Pantic, "The detection of concept frames using clustering multi-instance learning," in Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, pp. 2917–2920.
[28] T. Rao, M. Xu, H. Liu, J. Wang, and I. Burnett, "Multi-scale blocks based image emotion classification using multiple instance learning," in Image Processing (ICIP), 2016 IEEE International Conference on, 2016.
[29] J. Wu, Y. Yu, C. Huang, and K. Yu, "Deep multiple instance learning for image classification and auto-annotation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3460–3469.
[30] K. Sikka, A. Dhall, and M. Bartlett, "Weakly supervised pain localization using multiple instance learning," in Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, 2013, pp. 1–8.
[31] 22nd International Conference on Pattern Recognition, ICPR 2014, Stockholm, Sweden, August 24-28, 2014. IEEE Computer Society, 2014. [Online]. Available: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6966883
[32] T. Baltrušaitis, P. Robinson, and L.-P. Morency, "OpenFace: an open source facial behavior analysis toolkit," in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, 2016, pp. 1–10.
[33] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915–928, 2007.
[34] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles," Artificial Intelligence, vol. 89, no. 1, pp. 31–71, 1997.
[35] A. B. Ashraf, S. Lucey, J. F. Cohn, T. Chen, Z. Ambadar, K. M. Prkachin, and P. E. Solomon, "The painful face - pain expression recognition using active appearance models," Image and Vision Computing, vol. 27, no. 12, pp. 1788–1796, 2009.
[36] P. Lucey, J. Howlett, J. Cohn, S. Lucey, S. Sridharan, and Z. Ambadar, "Improving pain recognition through better utilisation of temporal information," in International Conference on Auditory-Visual Speech Processing, vol. 2008. NIH Public Access, 2008, p. 167.
