Fig. 2. Proposed deep multi-instance network: LBP-TOP features are extracted from facial video segments. Linear regression using DNN is
employed and instance level responses are ranked. Finally, max and mean pooling are used to predict the subject engagement intensity.
introduced in the work of Joshi et al. [31]. The notion of the credibility of a user is defined in that work as the consistency of an annotator in assigning the same engagement label to a clip that is shown at different times. We have used the same notion of credibility to weight the rating of each annotator, based on the deviation in the ratings given to the same clips shown at different time stamps. This yields weighted labels and serves to penalize raters who are prone to giving more erroneous ratings.
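The exact weighting formula is not spelled out above, so the following is only a minimal sketch of the idea, assuming each annotator's weight is inversely proportional to the mean absolute deviation between ratings given to the same clip at different time stamps; the function name and data layout are hypothetical.

```python
import numpy as np

def annotator_weights(repeat_ratings):
    """Illustrative credibility weights from rating deviation on repeated clips.

    repeat_ratings: dict mapping annotator id -> list of (first_rating, repeat_rating)
    pairs for clips that were shown twice at different time stamps.
    Returns a dict of weights normalized to sum to 1.
    """
    deviations = {a: np.mean([abs(x - y) for x, y in pairs])
                  for a, pairs in repeat_ratings.items()}
    # Less deviation on repeated clips -> higher credibility (epsilon avoids /0).
    raw = {a: 1.0 / (d + 1e-6) for a, d in deviations.items()}
    total = sum(raw.values())
    return {a: w / total for a, w in raw.items()}

# Example: annotator "a2" is less consistent, so it receives a lower weight.
weights = annotator_weights({
    "a1": [(3, 3), (2, 2)],
    "a2": [(3, 1), (2, 4)],
})
print(weights)
```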
III. PROPOSED FRAMEWORK

Our framework performs the subject's engagement detection and localization. The steps are as follows. The face and facial landmark positions [32] are detected in each frame. The video is then sub-sampled into small segments (instances), as discussed in detail in Section III-A. Features are extracted for a window of frames; the window then slides and the next set of frames is used to compute features. This stage is discussed in Sections III-B and III-C. The engagement prediction and localization are computed with the deep MIL network using mean and top-k pooling for regression, as detailed in Section III-D. Another model, which processes the sequence of segments, is constructed using a Long Short Term Memory (LSTM) based network with mean pooling; its details are also discussed in Section III-D.
A. Pre-Processing

Let $v = [f_1, f_2, f_3, \cdots, f_n]$ be a video clip of a subject in response to a stimulus, where $f$ denotes a video frame and $n$ is the total number of frames. The OpenFace library [32] is first used to track the facial landmarks in order to register the facial region, which is then cropped to a final resolution of 112 × 112. In order to speed up the experiments, each video is sampled at a different frame rate (the videos were recorded in varied conditions with different cameras) so that all videos have the same number of frames.
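As a minimal illustration of this resampling step, the sketch below picks uniformly spaced frame indices so that every clip ends up with the same number of frames; the target length and the function name are assumptions, not values from the paper.

```python
import numpy as np

def resample_frames(frames, target_len):
    """Uniformly sub-sample a list/array of frames to a fixed length."""
    n = len(frames)
    # Uniformly spaced indices over the original clip (handles n != target_len).
    idx = np.linspace(0, n - 1, num=target_len).round().astype(int)
    return [frames[i] for i in idx]

# Example: a 1200-frame clip and a 900-frame clip both become 600 frames long.
clip_a, clip_b = list(range(1200)), list(range(900))
print(len(resample_frames(clip_a, 600)), len(resample_frames(clip_b, 600)))
```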
B. LBP-TOP Feature Extraction

We are interested in computing spatio-temporal features to capture the changes in the facial region. Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) [33] is a standard spatio-temporal texture descriptor and is used in our work.

1) Shorter segment: A sliding window $v_f = [f_1, f_2, \cdots, f_m]$ of $m$ frames is taken at a time and LBP-TOP features are extracted. A smaller value of $m$ would fail to efficiently approximate the entire expression, while a larger value of $m$ would yield results with poor frame-wise specificity. We therefore deem $m = 50$ (approximately 10 seconds in the sampled video) an appropriate window length for expression recognition. After extracting features for the first $m$ frames, the sliding window is moved by $p$ frames. The value of $p$ is empirically set to 25 (an overlap of 50%), as this was found to provide adequate accuracy.
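A minimal sketch of the overlapping-window segmentation described above (m = 50, p = 25); the LBP-TOP extractor itself is left as a placeholder, since the paper relies on an external implementation of the descriptor.

```python
import numpy as np

def sliding_segments(frames, m=50, p=25):
    """Yield overlapping windows of m frames, shifted by p frames each step."""
    for start in range(0, len(frames) - m + 1, p):
        yield frames[start:start + m]

def extract_lbp_top(segment):
    # Placeholder for an LBP-TOP descriptor of the segment (e.g., a 177-D
    # histogram per block in typical implementations); not part of the paper.
    return np.zeros(177)

video = np.zeros((600, 112, 112))           # 600 registered face frames
features = [extract_lbp_top(seg) for seg in sliding_segments(video)]
print(len(features))                         # (600 - 50) // 25 + 1 = 23 segments
```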
Fig. 3. Different approaches discussed in the experiments to regress the engagement level of a video. Approach 'A' is the multi-instance learning approach using K-means clustering, approach 'B' uses a single video-level representation, and approach 'C' is the deep multi-instance learning approach.
C. Eye Gaze and Head Pose Feature Extraction

Student activities give cues to infer learning-centered affective states such as engagement. In this paper, social signals in the form of eye gaze movement and head movement features are extracted using the OpenFace toolkit.

1) Head Pose (HP): The movement of the head is an indicator of engagement intensity. OpenFace returns 3 points related to head movement in 3D space. These points give the location of the head w.r.t. the camera, and the dynamics of these points give insights into the head movement of a subject.

2) Eye Gaze (EG): The gaze of the left and right eye is extracted using OpenFace for every frame in the videos. It returns a 3-dimensional direction vector for each eye. The movement of the eyes gives cues about the engagement level of the student; for example, closed eyes or looking away from the screen could be negative indicators of the engagement level.

The challenges involved in constructing a video-level representation can be summarized as the limited number of videos and the large decision unit (the label is assigned to the whole video). These challenges highlight the fact that the whole video contains many indicators of engagement in its short temporal segments, which in turn motivates constructing segment-level features from frame-level features, as mentioned earlier. Each frame is thus represented as a 9-dimensional feature vector.

The statistical feature representation is useful to capture the dynamics of the various social cues present in the form of eye gaze and head movement. We apply moment functions, in the form of mean and standard deviation, on the eye gaze and head pose modalities. For each modality, segment-level features are constructed as follows. Eye gaze is returned by the OpenFace utility as a 6-dimensional feature vector capturing the eye gaze behavior. Two moment functions, mean and standard deviation, are applied over the set of frames to construct segment-level features; as a result, the eye gaze movement of each segment is represented by a 2*6=12 dimensional feature vector. The head pose modality represents the movement of the head and is a 3-dimensional vector returned by OpenFace. Following the application of the moment functions, the segment-level feature representing the dynamics of head movement for a segment is a 2*3=6 dimensional vector. These features are concatenated for each segment, which makes an 18-dimensional feature vector, as shown in Figure 3. For dimensionality reduction, Principal Component Analysis (PCA) is applied on the normalized segment-level features of all the videos.
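A minimal sketch of the segment-level statistical features described above: per-frame gaze (6-D) and head-pose (3-D) vectors are reduced to mean and standard deviation per segment, concatenated into an 18-D vector, normalized, and projected with PCA. The array shapes are illustrative; the four retained PCA components follow the experiments section.

```python
import numpy as np
from sklearn.decomposition import PCA

def segment_stats(gaze, head):
    """gaze: (m, 6) per-frame gaze vectors; head: (m, 3) per-frame head locations.
    Returns the 18-D [mean, std] feature of one segment."""
    return np.concatenate([
        gaze.mean(axis=0), gaze.std(axis=0),   # 12-D eye gaze statistics
        head.mean(axis=0), head.std(axis=0),   # 6-D head pose statistics
    ])

# Stack segment features from all videos, normalize, then reduce with PCA.
segments = np.random.randn(2300, 18)            # e.g., 23 segments x 100 videos
normed = (segments - segments.mean(axis=0)) / (segments.std(axis=0) + 1e-8)
reduced = PCA(n_components=4).fit_transform(normed)
print(reduced.shape)                            # (2300, 4)
```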
D. Deep Multi-Instance Network for Engagement Localization and Prediction

We design a deep multi-instance network for the task of user engagement level prediction and localization. Figure 2 shows the proposed network architecture with multiple dense regression layers, one ranking layer and one multi-instance loss layer. Pooling functions play an important role in MIL deep networks to fuse instance-level outputs. Two pooling schemes are employed for combining multiple instances: i) top-k pooling based multi-instance learning, taking only the largest ten elements (k = 10) from the ranking layer, and ii) mean pooling based multi-instance learning, taking the mean of all the elements. The details of these schemes are given below.

1) Top-l Pooling based Multi-instance Learning: A general MIL framework for classification problems defines two kinds of bags, positive and negative; if at least one instance is positive, the bag is assigned a positive label [34]. In our dataset, since the ground truth labels are given for the entire video spanning 5 minutes, we take the first $l$ largest intensities of instances from the ranking layer instead of adopting the general multi-instance learning assumption of considering only the largest contributing instance. The hyper-parameter $l$ is chosen from the set $l \in \{1, 5, 10, 20, 30\}$ and is empirically set to 10.

2) Mean Pooling based Multi-instance Learning: The engagement intensity fluctuates among the various instances over the span of the video. We therefore adopt another pooling scheme in which the final predicted intensity for the $i$th bag is taken to be the mean of all the intensities from the ranking layer, i.e. $\frac{1}{M}\sum_{j=1}^{M} x_{ij}$, where $x_{ij}$ represents the $j$th segment of the $i$th video.
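As a small illustration of the two pooling rules, the sketch below aggregates per-segment intensity predictions into one video-level score with either top-l pooling (l = 10) or plain mean pooling. It operates on plain arrays rather than inside the network's loss layer, and it assumes the top-l intensities are averaged, since the exact combination rule is not spelled out above.

```python
import numpy as np

def top_l_pool(instance_scores, l=10):
    """Average the l largest instance-level intensities of one video (bag)."""
    scores = np.sort(np.asarray(instance_scores))[::-1]
    return float(scores[:l].mean())

def mean_pool(instance_scores):
    """Average all instance-level intensities of one video (bag)."""
    return float(np.mean(instance_scores))

ranking_outputs = np.random.rand(150)   # M = 150 segment intensities for one video
print(top_l_pool(ranking_outputs), mean_pool(ranking_outputs))
```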
3) Sequence Network: LSTM based frameworks are successful in sequence based prediction tasks, which motivates us to design a network that leverages insights from the sequence of segments. The intuition behind using the LSTM based network is that the engagement level varies across the segments: the annotators give attention to different segments of the recorded video and assign a label on the basis of the perceived engagement level, which implies that different segments contribute differently to the overall annotation of the video.

Figure 4 presents the proposed sequence based architecture. In this network the first layer is an LSTM layer with 32 hidden units. The purpose of this layer is to compute an activation value for the sequence of segments of a video; it maps the multi-segment representation of a video to a feature vector that represents the activation for the different segments of the video. It is followed by a flatten layer, which serves to make a single feature vector out of the activations of all the segments of the video. This feature vector is passed to another module consisting of three dense layers with sigmoid activations followed by average pooling. This module is responsible for learning the function that maps the segment-level activations to the engagement level of the video.
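A minimal Keras sketch of the sequence model as described: an LSTM with 32 units returning per-segment activations, a flatten layer, three sigmoid dense layers and average pooling down to a single engagement value. The 150 segments per video and the 4-D segment features follow the paper; the dense layer widths, optimizer and loss are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_SEGMENTS, FEAT_DIM = 150, 4            # segments per video, PCA feature size

inputs = keras.Input(shape=(NUM_SEGMENTS, FEAT_DIM))
x = layers.LSTM(32, return_sequences=True)(inputs)   # per-segment activations
x = layers.Flatten()(x)
x = layers.Dense(64, activation="sigmoid")(x)        # assumed widths
x = layers.Dense(32, activation="sigmoid")(x)
x = layers.Dense(16, activation="sigmoid")(x)
x = layers.Reshape((16, 1))(x)
outputs = layers.GlobalAveragePooling1D()(x)         # average pooling to one value

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")          # assumed training setup
model.summary()
```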
IV. EXPERIMENTS

The details of the dataset used for experimentation can be found in Section II. We followed a subject-independent cross-validation split of the dataset. In this section, we discuss the approaches followed to predict the engagement level of individuals; result comparisons are shown in Section V.
A. Multiple Instance Learning Approach

This section explains workflow 'A' of Figure 3. In this approach a temporal scanning window is taken and features are extracted for each window segment of the video. The training data is in the form of bags, $B = \{X_i, y_i\}_{i=1}^{N}$, where $X_i = \{x_{ij}\}_{j=1}^{M}$, $y_i \in Y$, $M$ (the number of instances per video) is equal to 150, and $N$ is the total number of videos in the training dataset.
The segments obtained for each sequence are labeled using the following two approaches.

1) Noisy/Weak Labeling: All the segments in the training data are assigned the same label as their corresponding video sequence. This strategy is similar to that employed in previous works [35], [36].

2) Relabeling segments using K-means clustering: In this approach, the segments are divided into $k$ clusters by applying K-means clustering over the features. The hyperparameter $k$ is selected from the set $k \in \{5, 10, 15, 20, 25, 30, 35, 40\}$. Segments falling in a cluster are relabeled as follows (see the sketch below): i) all the segments in a cluster are relabeled by the mode of the labels in that cluster, or ii) all the segments in a cluster are relabeled by the mean of the labels in that cluster.
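A minimal sketch of the K-means relabeling step, assuming the segment features sit in a matrix with one (noisy) label per segment inherited from its video; the mode/mean relabeling per cluster follows the two options above, and the example label values are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def relabel_segments(features, noisy_labels, k=20, strategy="mode"):
    """Cluster segment features and overwrite each segment's label with the
    mode or mean of the labels inside its cluster."""
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    new_labels = np.empty(len(noisy_labels), dtype=float)
    for c in range(k):
        members = clusters == c
        if strategy == "mode":
            vals, counts = np.unique(noisy_labels[members], return_counts=True)
            new_labels[members] = vals[counts.argmax()]
        else:                                    # "mean"
            new_labels[members] = noisy_labels[members].mean()
    return new_labels

features = np.random.randn(3000, 18)             # segment-level features
noisy = np.random.choice([0.0, 0.33, 0.66, 1.0], size=3000)  # illustrative labels
print(relabel_segments(features, noisy, k=20, strategy="mean")[:5])
```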
3) Regressors: Several state-of-the-art machine learning models are trained for the regression task. i) Support Vector Regression (SVR): this regressor works on the relabeled segments and regresses the value of the video as the mean value of all the segments belonging to that particular video. A Gaussian kernel is trained by performing a grid search over the parameters C and sigma. The MSE is obtained for the best combination of C and sigma for the different segment relabeling techniques mentioned earlier. For the mode analysis, (C, sigma) is (1, 1), whereas for the mean analysis (C, sigma) is (1, 4).
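A minimal scikit-learn sketch of this SVR step under the assumptions above: an RBF (Gaussian) kernel with a grid search over C and the kernel width (gamma in scikit-learn, standing in for sigma), followed by video-level prediction as the mean of the segment predictions. The grid values and data are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

seg_X = np.random.randn(3000, 18)                   # relabeled segment features
seg_y = np.random.rand(3000)                        # relabeled segment intensities
video_of_segment = np.repeat(np.arange(20), 150)    # 20 videos x 150 segments

grid = GridSearchCV(SVR(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.25, 1, 4]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(seg_X, seg_y)

# Video-level prediction = mean of the predictions of its segments.
seg_pred = grid.best_estimator_.predict(seg_X)
video_pred = np.array([seg_pred[video_of_segment == v].mean() for v in range(20)])
print(grid.best_params_, video_pred[:3])
```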
A BayesianRidge regressor is additionally used to regress the engagement level of a video from the regressed values of its segments, with different parameter settings.
B. Traditional Machine Learning Approach for Single Video Level Representation

In this paradigm, the video-level representation is constructed using the segment-level feature representation, which is explained as workflow 'B' of Figure 3. The 18-dimensional statistical feature representation is extracted using the standard deviation and mean of each segment. This is followed by dimensionality reduction using PCA, as a result of which each segment is represented by a normalized 4-dimensional feature vector. The video-level feature is constructed as the average of the 4-dimensional feature vectors of all the segments; as a result, each video is represented as a 4-dimensional feature vector, which is passed to a regressor. In our case a Random Forest (RF) regressor is optimized using cross-validation based on a grid search over its hyperparameters. Using the metric performance, the depth of the trees is selected as 5 and the number of trees is selected as 10.
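A minimal sketch of workflow 'B' with scikit-learn: PCA reduces the normalized 18-D segment features to 4-D, the segment codes are averaged per video, and a random forest regressor is tuned by grid search over depth and tree count (the paper reports depth 5 and 10 trees). The data and the grid values are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

n_videos, n_segments = 100, 23
seg_feats = np.random.randn(n_videos, n_segments, 18)   # normalized segment features
video_labels = np.random.rand(n_videos)                 # engagement intensities

# PCA over all segments, then average the 4-D segment codes per video.
pca = PCA(n_components=4).fit(seg_feats.reshape(-1, 18))
video_feats = (pca.transform(seg_feats.reshape(-1, 18))
                  .reshape(n_videos, n_segments, 4).mean(axis=1))

grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid={"max_depth": [3, 5, 7], "n_estimators": [10, 50, 100]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(video_feats, video_labels)
print(grid.best_params_)        # the paper settles on max_depth=5, n_estimators=10
```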
C. Deep Multi-Instance Learning

This section introduces workflow 'C' of Figure 3.

DNN: The architecture of the deep MIL approach is shown in Figure 2. Here segment-level LBP-TOP features are extracted. These features are then passed to a neural network with four dense layers, obtaining a single feature for each segment. To obtain a single label for the entire video, the segment-wise features are then passed through a deep MIL loss layer using the two pooling schemes, Top-l and Mean, described in Section III-D. We also evaluated the Rectified Linear Unit (ReLU), $f(x) = \max(0, x)$, the most frequently used activation function, as it allows gradient flow for all positive values.
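A minimal Keras sketch of this DNN branch as described: four dense layers score each segment, and a pooling step over segments (top-l here, mean as an alternative) produces the video-level prediction trained against the video label. The layer widths, feature dimension, optimizer and loss are assumptions, and the paper's ranking and MIL loss layers are reduced to simple pooling.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_SEGMENTS, LBPTOP_DIM, TOP_L = 150, 177, 10    # LBPTOP_DIM is illustrative

def top_l_pool(scores):                            # scores: (batch, segments)
    top = tf.math.top_k(scores, k=TOP_L).values    # l largest instance intensities
    return tf.reduce_mean(top, axis=-1, keepdims=True)

inputs = keras.Input(shape=(NUM_SEGMENTS, LBPTOP_DIM))
x = inputs
for units in (256, 128, 64, 1):                    # four dense layers per segment
    x = layers.TimeDistributed(layers.Dense(units, activation="relu"))(x)
scores = layers.Reshape((NUM_SEGMENTS,))(x)        # one intensity per segment
video_pred = layers.Lambda(top_l_pool)(scores)     # swap for a mean for mean pooling

model = keras.Model(inputs, video_pred)
model.compile(optimizer="adam", loss="mse")        # assumed training setup
model.summary()
```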
Proposed Method: The architecture of the LSTM based deep MIL approach is shown in Figure 4. The eye gaze and head pose features of each segment are fused as explained in Section III-C. These features are passed through the LSTM layer, which returns an activation for each segment of the video; the activations are passed to the flatten layer, and the flattened feature vector is fed to the network of three dense layers followed by average pooling, which gives the regressed value of the engagement level of a video.
D. Metrics

The performance metrics used in this work are the Mean Square Error (MSE) and the Pearson Correlation Coefficient (PCC). The analysis performed on the weighted labels is reported using the weighted mean square error (MSE wgt) and the weighted Pearson Correlation Coefficient (PCC wgt).
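A minimal sketch of these metrics, assuming the weighted variants simply apply per-sample label weights (e.g., from annotator credibility) when averaging the squared error and when computing the correlation; the exact weighting used in the paper is not spelled out here.

```python
import numpy as np

def mse(y_true, y_pred, w=None):
    err = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    return float(np.average(err, weights=w))

def pcc(y_true, y_pred, w=None):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mu_t, mu_p = np.average(y_true, weights=w), np.average(y_pred, weights=w)
    cov = np.average((y_true - mu_t) * (y_pred - mu_p), weights=w)
    var_t = np.average((y_true - mu_t) ** 2, weights=w)
    var_p = np.average((y_pred - mu_p) ** 2, weights=w)
    return float(cov / np.sqrt(var_t * var_p))

y, y_hat = np.random.rand(50), np.random.rand(50)
w = np.random.rand(50)                        # per-video label weights
print(mse(y, y_hat), pcc(y, y_hat), mse(y, y_hat, w), pcc(y, y_hat, w))
```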
V. RESULTS AND DISCUSSIONS

In this paper, we have presented the efficacy of different approaches based on short temporal segments to regress the engagement level of videos. There are various challenges involved in the introduced video dataset. Firstly, there is an imbalance in the distribution of training and test samples across the different engagement levels. Secondly, the total number of labeled video samples across the training and test sets is 267, which is very small. Thirdly, the length of every labeled video in the dataset is 5 minutes on average, which is a much larger decision unit; this makes the problem even more complex. Lastly, it is difficult to form discriminative temporal segment-level features from frame-level features and, in turn, to form discriminative video-level features from segment-level features using statistical or deep learning based methods.
Fig. 4. The proposed deep multi-instance network pipeline. First, eye gaze and head pose features are extracted from facial video segments. Second, the sequence of segments is passed through an LSTM layer followed by 3 dense layers; finally, an average pooling layer is applied.
We have given baseline results for the dataset using different approaches with feature fusion of the eye gaze and head pose movement in Table II. The results show better performance for the Random Forest (RF) regressor with the single video representation constructed from segment-level features, given as workflow 'B' in Figure 3. The LSTM based deep network also shows comparable results, whereas the conventional DNN based network fails to capture the dynamics of the hierarchy of features used at the different levels. Another perspective is to analyze the results based on the weighted labels: as given in Table II, the performance metrics clearly improve in the weighted labels setting. We have performed subject-independent cross-validation and report the average results in all cases.

TABLE II
PERFORMANCE COMPARISON OF DIFFERENT APPROACHES FOR THE FUSED FEATURES OF HEAD POSE AND EYE GAZE, BASED ON NON-WEIGHTED AND WEIGHTED LABELS

Method   MSE    PCC    MSE wgt   PCC wgt
RF       0.09   0.17   0.05      0.26
LSTM     0.09   0.05   0.06      0.10
DNN      0.15   0.01   0.08      0.01
We also conducted experiments on the LBP-TOP features with different regressors, such as SVR, Bayesian Ridge and Random Forest, to predict the engagement level. We also used various relabeling techniques, such as mean, median and mode over the k-means clusters, in combination with these regression algorithms. The corresponding results for SVR are shown in Table III. It is clear from the results that this technique does not work: it predicts almost the same label for all the videos, so the MSE is low but the PCC is nearly 0 for all the combinations.

TABLE III
BASELINE PERFORMANCE COMPARISON OF DIFFERENT ML APPROACHES FOR LBP-TOP. HERE, RELABELING REFERS TO THE METHOD OF SEGMENT-WISE LABELING.

Method   Relabeling Mode   MSE    MSE wgt
SVR      Noisy             0.15   0.09
SVR      Mode              0.15   0.09
SVR      Mean              0.15   0.09
Similar behaviour appears in Table IV, where the different relabeling techniques are applied in combination with the different regressors on the feature fusion of eye gaze and head pose movement. We observed results similar to the LBP-TOP case: the MSE and MSE wgt metrics are small but the PCC is again almost 0, which implies the failure of this modeling technique. This can be explained by the small number of samples and the imbalanced distribution of the sample labels. Another common observation among the failed techniques is the prediction of the same label for all the instances.

TABLE IV
BASELINE PERFORMANCE COMPARISON OF DIFFERENT RELABELING TECHNIQUES FOR THE FUSED FEATURES OF HEAD POSE AND EYE GAZE.

Method   Relabeling Mode   MSE    MSE wgt
SVR      Noisy             0.09   0.09
SVR      Mode              0.15   0.09
SVR      Mean              0.15   0.09
VI. CONCLUSION & FUTURE SCOPE

In this paper, we presented an engagement dataset that we have prepared in in-the-wild settings, and we have provided baseline results on the new dataset using various techniques. In automatic engagement detection, most of the currently used techniques, such as self-reports and teacher introspective evaluation, are cumbersome and lack the ability to localize student engagement, so there is a need for an intelligent predictive technique for this problem. An important requirement to achieve that is the availability of a real-world engagement dataset with diverse subjects. In prior literature, different modalities such as student responses, audience movie ratings, and psychological and neurological sensor readings have been used for engagement prediction. In contrast, we have worked only with facial and head movement features of the subject and shown that the segment based RF model and the LSTM based deep sequence network give a good baseline performance. We also tried to address the problem of long (around 5 minute) labeled videos using various temporal segment labeling techniques; the results show that these techniques do not work well with either the LBP-TOP or the facial features, as mentioned earlier. The presented dataset can be used for further impactful research work in this area. Also, the proposed predictive methods can be used extensively in designing study material for MOOCs and ITS to decrease the dropout rates.
VII. ACKNOWLEDGMENTS

We would like to thank Nvidia for providing the Titan X card for research.

REFERENCES

[1] M. E. Nicholls, K. M. Loveless, N. A. Thomas, T. Loetscher, and O. Churches, "Some participants may be better than others: Sustained attention and motivation are higher early in semester," The Quarterly Journal of Experimental Psychology, vol. 68, no. 1, pp. 10–18, 2015.
[2] J. A. Fredricks, P. C. Blumenfeld, and A. H. Paris, "School engagement: Potential of the concept, state of the evidence," Review of Educational Research, vol. 74, no. 1, pp. 59–109, 2004.
[3] J. F. Grafsgaard, R. M. Fulton, K. E. Boyer, E. N. Wiebe, and J. C. Lester, "Multimodal analysis of the implicit affective channel in computer-mediated textual communication," in Proceedings of the 14th ACM International Conference on Multimodal Interaction. ACM, 2012, pp. 145–152.
[4] S. D'Mello, E. Dieterle, and A. Duckworth, "Advanced, analytic, automated (AAA) measurement of engagement during learning," Educational Psychologist, vol. 52, no. 2, pp. 104–123, 2017.
[5] K. R. Koedinger, J. R. Anderson, W. H. Hadley, and M. A. Mark, "Intelligent tutoring goes to school in the big city," 1997.
[6] S. K. D'Mello and A. Graesser, "Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features," User Modeling and User-Adapted Interaction, vol. 20, no. 2, pp. 147–187, 2010.
[7] X. Xiao, P. Pham, and J. Wang, "Dynamics of affective states during MOOC learning," in International Conference on Artificial Intelligence in Education. Springer, 2017, pp. 586–589.
[8] E. Joseph, "Engagement tracing: Using response times to model student disengagement," Artificial Intelligence in Education: Supporting Learning through Intelligent and Socially Informed Technology, vol. 125, p. 88, 2005.
[9] B. S. Goldberg, R. A. Sottilare, K. W. Brawner, and H. K. Holden, "Predicting learner engagement during well-defined and ill-defined computer-based intercultural interactions," in International Conference on Affective Computing and Intelligent Interaction. Springer, 2011, pp. 538–547.
[10] M. Chaouachi, P. Chalfoun, I. Jraidi, and C. Frasson, "Affect and mental engagement: Towards adaptability for intelligent systems," in Proceedings of the 23rd International FLAIRS Conference. AAAI Press, 2010.
[11] X. Xiao and J. Wang, "Understanding and detecting divided attention in mobile MOOC learning," in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2017, pp. 2411–2415.
[12] S. K. D'Mello, S. Craig, and A. Graesser, "Multimethod assessment of affective experience and expression during deep learning," International Journal of Learning Technology, vol. 4, no. 3-4, pp. 165–187, 2009.
[13] J. Whitehill, Z. Serpell, Y.-C. Lin, A. Foster, and J. R. Movellan, "The faces of engagement: Automatic recognition of student engagement from facial expressions," IEEE Transactions on Affective Computing, vol. 5, no. 1, pp. 86–98, 2014.
[14] J. F. Grafsgaard, K. E. Boyer, and J. C. Lester, "Predicting facial indicators of confusion with hidden Markov models," in International Conference on Affective Computing and Intelligent Interaction. Springer, 2011, pp. 97–106.
[15] A. Gupta and V. Balasubramanian, "DAiSEE: Towards user engagement recognition in the wild," arXiv preprint arXiv:1609.01885, 2016.
[16] C. R. Beal, R. Walles, I. Arroyo, and B. P. Woolf, "On-line tutoring for math achievement testing: A controlled evaluation," Journal of Interactive Online Learning, vol. 6, no. 1, pp. 43–55, 2007.
[17] N. Bosch, S. K. D'Mello, R. S. Baker, J. Ocumpaugh, V. Shute, M. Ventura, L. Wang, and W. Zhao, "Detecting student emotions in computer-enabled classrooms," in IJCAI, 2016, pp. 4125–4129.
[18] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009.
[19] C. A. Corneanu, M. Oliu, J. F. Cohn, and S. Escalera, "Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[20] R. El Kaliouby, R. Picard, and S. Baron-Cohen, "Affective computing and autism," Annals of the New York Academy of Sciences, vol. 1093, no. 1, pp. 228–248, 2006.
[21] A. Tawari, S. Sivaraman, M. M. Trivedi, T. Shannon, and M. Tippelhofer, "Looking-in and looking-out vision for urban intelligent assistance: Estimation of driver attentive state and dynamic surround for safe merging and braking," in Intelligent Vehicles Symposium Proceedings, 2014 IEEE. IEEE, 2014, pp. 115–120.
[22] R. Navarathna, P. Carr, P. Lucey, and I. Matthews, "Estimating audience engagement to predict movie ratings," IEEE Transactions on Affective Computing, 2017.
[23] S. Wu, M.-A. Rizoiu, and L. Xie, "Beyond views: Measuring and predicting engagement on YouTube videos," arXiv preprint arXiv:1709.02541, 2017.
[24] S. Attfield, G. Kazai, M. Lalmas, and B. Piwowarski, "Towards a science of user engagement (position paper)," in WSDM Workshop on User Modelling for Web Applications, 2011, pp. 9–12.
[25] B. Sun, Q. Wei, J. He, L. Yu, and X. Zhu, "BNU-LSVED: A multimodal spontaneous expression database in educational environment," in Optics and Photonics for Information Processing X, vol. 9970. International Society for Optics and Photonics, 2016, p. 997016.
[26] J. Amores, "Multiple instance classification: Review, taxonomy and comparative study," Artificial Intelligence, vol. 201, pp. 81–105, 2013.
[27] D. M. Tax, E. Hendriks, M. F. Valstar, and M. Pantic, "The detection of concept frames using clustering multi-instance learning," in Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, pp. 2917–2920.
[28] T. Rao, M. Xu, H. Liu, J. Wang, and I. Burnett, "Multi-scale blocks based image emotion classification using multiple instance learning," in Image Processing (ICIP), 2016 IEEE International Conference on, 2016.
[29] J. Wu, Y. Yu, C. Huang, and K. Yu, "Deep multiple instance learning for image classification and auto-annotation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3460–3469.
[30] K. Sikka, A. Dhall, and M. Bartlett, "Weakly supervised pain localization using multiple instance learning," in Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, 2013, pp. 1–8.
[31] 22nd International Conference on Pattern Recognition, ICPR 2014, Stockholm, Sweden, August 24–28, 2014. IEEE Computer Society, 2014. [Online]. Available: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6966883
[32] T. Baltrušaitis, P. Robinson, and L.-P. Morency, "OpenFace: An open source facial behavior analysis toolkit," in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, 2016, pp. 1–10.
[33] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915–928, 2007.
[34] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles," Artificial Intelligence, vol. 89, no. 1, pp. 31–71, 1997.
[35] A. B. Ashraf, S. Lucey, J. F. Cohn, T. Chen, Z. Ambadar, K. M. Prkachin, and P. E. Solomon, "The painful face–pain expression recognition using active appearance models," Image and Vision Computing, vol. 27, no. 12, pp. 1788–1796, 2009.
[36] P. Lucey, J. Howlett, J. Cohn, S. Lucey, S. Sridharan, and Z. Ambadar, "Improving pain recognition through better utilisation of temporal information," in International Conference on Auditory-Visual Speech Processing, vol. 2008. NIH Public Access, 2008, p. 167.