
Journal of Intelligent & Fuzzy Systems xx (20xx) x–xx
DOI: 10.3233/JIFS-169514
IOS Press

Recognition of learning-centered emotions using a convolutional neural network

Francisco González-Hernández∗, Ramon Zatarain-Cabada, Maria Lucia Barrón-Estrada and Hector Rodríguez-Rangel
Posgrado en Ciencias de la Computación, Instituto Tecnológico de Culiacán, Culiacán, Sinaloa, México

Abstract. This work presents the application of a convolutional neural network (CNN) used to identify emotions from images taken of students who are learning the Java language with an Intelligent Learning Environment. The CNN contains three convolutional layers, three max-pooling layers, and three fully connected neural network layers with intermediate dropout connections. The CNN was trained using different emotional databases. One of them is a posed database (RaFD) and two of them are spontaneous databases created by us with content focused on learning-centered emotions. The results show a comparison among three emotion recognition systems: one applying a local binary pattern approach with facial patches, another applying a geometry-based method, and the last one applying the convolutional network. The analysis produced satisfactory results; the CNN obtained a 95% accuracy for the RaFD database, an 88% accuracy for a learning-centered emotion database and a 74% accuracy for a second learning-centered emotion database. Results are compared against the classifiers support vector machine, k-nearest neighbors, and artificial neural network.

Keywords: Convolutional neural network, educational emotion recognition, face expression database, machine learning, feature extraction

1. Introduction

Emotional recognition is the process of predicting affective content from low-level signals. These signals are manifested through physical expressions, and such expressions have important features that help us differentiate one emotion from another. For example, in speech we have features like loudness and pitch. In body expressions, we have features like the position of the body. Even heart rate and brain signals, although not directly visible, express emotions as well [1]. One of the most expressive parts of the human body is the face, being one of the main channels of communication to express emotions. Emotional recognition through the face is an issue that has been extensively addressed by researchers in the field of affective computing; this issue is usually named facial expression recognition. To implement our facial expression recognizer we developed three different processes [2], explained below. The first step is the extraction of features, where the facial image receives a set of operations; these operations generate a set of features expressed as a vector, a matrix or some other computational data structure. The operators are based on the appearance (filters) or the geometry (distances) of the image and are usually handcrafted. The second stage is the selection of features, where frequently repeated features are discarded because they make the images have little difference among them, confusing the classifier. In addition, this step also serves to decrease the dimensionality of the feature vectors. The third step is the building or use of a classifier.

∗Corresponding author. Francisco González-Hernández, Posgrado en Ciencias de la Computación, Instituto Tecnológico de Culiacán, Culiacán, Sinaloa, México. E-mail: rzatarain@itculiacan.edu.mx.


Classifiers are usually designed for classifying discrete values that are represented by labels or classes. The most used approaches to create facial expression recognizers are based on Ekman's theory of emotions [2]. This theory explains how human beings express basic emotions. The basic emotions are feelings of short duration with clearly visible and well-defined expressions. Although the number of emotions may vary according to the author, the selected emotions are usually anger, joy, sadness, fear, contempt, disgust and surprise. In current research papers, databases of basic emotions are used in the three mentioned stages for building and testing an emotion recognizer.

In a previous work [3], we presented an intelligent learning environment (ILE) for the Java programming language. Intelligent learning environments provide personalized instruction in a particular domain. In addition, we added to the environment an emotion recognizer to provide the ability to use emotions as part of pedagogical strategies. In this paper, we present the design and implementation of a convolutional neural network to give our ILE the ability to recognize emotions with another type of classifier. It is worth mentioning that the work of creating an emotion recognizer was addressed twice before [4, 5]. The difference and main contribution of this work is that the convolutional neural network automatically performs the feature selection process, avoiding the need to create a handmade process, for which it is not known with precision whether it works properly for educational domains. In addition, we created a database with spontaneous facial images that represent emotions focused on learning. These emotions have a longer duration and occur when performing intellectual activities [6]. Some of the most common emotions of this type are frustration, engagement, excitement, boredom, and relaxation. Also in previous works, we built two classifiers to recognize learning-oriented emotions. All this work is explained further on.

There have been research works in the area of machine learning based on the deep learning approach. One type of deep learning approach is the convolutional neural network, which consists of a set of interconnected filters where an image is transformed while retaining its most outstanding features [7]. These networks transform the images in each convolutional layer so that the images retain their most outstanding features. With this, the need to define an extractor and a feature selector is avoided. This gives the possibility of building and testing these types of networks on learning-oriented emotions and in this way finding out their effectiveness in classifying these emotions.

To our knowledge, there are no works dealing with convolutional neural networks and educational issues together. This work presents a convolutional neural network architecture, tested with a database of basic emotions and with two databases of learning-centered emotions designed and created by the authors of this work. In addition, we present a comparison of the results when using the local binary pattern and geometry-based approaches. This paper presents an approach based on convolutional neural networks to the recognition and use of learning-centered emotions and a comparison against other approaches that we have tested for facing this topic.

The paper is structured as follows: Section 2 reviews the work related to facial expression databases and the techniques used for facial expression recognition. Section 3 presents the three recognition methods we have built, Section 4 describes the process followed to create our databases, Section 5 shows the test results of the recognizers as well as the discussion about those results, and finally Section 6 presents conclusions and future work.

2. Related works

Related work is divided into two parts. The first part is about facial expression databases and the second part is about emotional recognition in facial expressions using different techniques and approaches. The database section includes posed and spontaneous expressions as well as basic and non-basic emotions. The emotional recognition section includes topics about appearance-based, geometric-based, and deep learning approaches.

2.1. Facial expression databases

Face expression databases are sets of images that express an emotion, a situation, or an experience. There are several available databases. Some of them contain posed faces to represent specific emotions; others contain spontaneous emotions where the face represents the facial reaction to a situation. Next, the most important databases are presented and explained. Cohn-Kanade (CK) [8] and CK plus (CK+) [9] are databases that represent 6 basic emotions and they include Action Units (AU) annotations. The images represent the image sequence inside the Facial Action Coding System (FACS).

Table 1
Facial Expression Datasets

Work | Description | Contents
CK and CK+ [8, 9] | Two databases that represent facial expressions using FACS action units. | Images of 6 basic emotions.
RaFD [10] | Database of posed facial expressions in 3 different face positions with 3 gaze directions. | Images of 8 basic emotions.
SEMAINE [11] | Database of facial expressions of sporadic basic emotions obtained from the interaction with a conversational agent. | Images of 8 basic emotions with 4 dimensions: valence, activation, power, and expectation.
MMI [12] | Web-based database with image sequences from frontal and side-face perspectives. | Images and videos of 8 basic emotions.
GEMEP [13] | Database of posed facial expressions performed by actors. | 8 basic emotions and 10 non-basic emotions stored in videos and images.

Each expression begins as a neutral expression and then moves to a peak expression (the most intense expression). Each expression can receive an emotion label. In the plus version, spontaneous expressions were added by recording 84 novel subjects while they were distracted between photo sessions. The Radboud Faces Database (RaFD) [10] also includes photos of eight basic emotions. The photos were taken using Caucasian Dutch adults and children. Participants showed the facial expressions with three gaze directions and five camera angles. In addition, they complied with requisites such as wearing a specific type of shirt and having no hair on the face. In other works like SEMAINE [11], in addition to expressing six basic emotions, the database contains four dimensions of an emotion, which are Valence, Activation, Power, and Anticipation/Expectation. In addition, spontaneous emotions were added by taking photos while participants talked to an agent system. M&M Initiative (MMI) [12] contains image sequences of faces in frontal and profile view. MMI contains more than 1500 samples and the database is available through a web-based direct-manipulation application. Two FACS coders labeled the images and videos. The Geneva Multimodal Emotion Portrayals Core Set (GEMEP) includes an important set of images [13]. In total, it contains 18 portrayed discrete emotions labeled using FACS. The databases were built using 10 professional French-speaking theater actors who were trained by a professional director. The corpus is comprised of over 7000 audio-visual emotion representations.

Table 1 shows 6 of the most popular datasets for facial expressions. We can see that most of the datasets only contain images of facial expressions representing the basic emotions. In addition, most of these emotions are not spontaneous (they are acted emotions). The work of building our own database fills the gap of not having a set of data with emotions related to education, which is necessary for the training and construction of a new recognizer for learning-centered emotions.

2.2. Facial expression recognition methods

Appearance-based techniques apply operators and filters over the pixels of the image in order to obtain a set of representative features of the face. Local Binary Pattern (LBP) is a method that takes the pixel value of the image center as the threshold [14]. Each pixel value is compared against the threshold; if the threshold is bigger than the pixel value then the result is zero, otherwise it is one. This technique was applied to identify face expressions in [15], and the results were satisfactory. Local Phase Quantization (LPQ) uses blur-insensitive texture classification through a local Fourier transformation neighborhood by computing its local Zernike moments. The process generates LPQ codes and collects them into a histogram. This descriptor is well suited for blurred images. Some works have proven LPQ can be used for expression recognition with FACS [16]. Histograms can reach up to 25,000 features, which indicates that LPQ covers an extensive area of the face. The Gabor representation [17] is obtained by convolving an input image with a set of Gabor filters with various scales and orientations. Gabor filters encode componential information and, depending on the registration scheme, the overall representation may implicitly convey configural information. This technique can be used with simple dimensionality reduction techniques such as min, max and mean grouping. The representation is robust to registration errors to an extent, as the filters are smooth and the magnitudes of the filtered images are robust to small translations and rotations. The feature amount can reach up to 165,000 values.
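To make the appearance-based family concrete, the minimal sketch below (ours, not taken from the cited works) implements the basic LBP operator described above: the 8 neighbors of each pixel are thresholded against the central value, the bits are packed into a code, and a 256-bin histogram of the codes serves as the descriptor.

```python
import numpy as np

def lbp_code(patch3x3):
    """LBP code of the central pixel of a 3x3 neighborhood.

    Neighbors greater than or equal to the center contribute a 1-bit,
    the rest a 0-bit; the 8 bits are packed into an integer in [0, 255].
    """
    center = patch3x3[1, 1]
    # Clockwise order starting at the top-left neighbor.
    neighbors = [patch3x3[0, 0], patch3x3[0, 1], patch3x3[0, 2],
                 patch3x3[1, 2], patch3x3[2, 2], patch3x3[2, 1],
                 patch3x3[2, 0], patch3x3[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbors]
    return sum(bit << i for i, bit in enumerate(bits))

def lbp_image(gray):
    """Apply the operator to every interior pixel and return the LBP map."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y - 1, x - 1] = lbp_code(gray[y - 1:y + 2, x - 1:x + 2])
    return out

# A 256-bin histogram of the LBP map is the appearance descriptor:
# hist, _ = np.histogram(lbp_image(face), bins=256, range=(0, 256))
```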

Geometric-based techniques [18] frequently represent faces as a set of facial points. These points describe a face by a concatenation of the X and Y coordinates of fiducial points. To represent a face, these techniques use two types of models. The first is the free model, which detects feature points individually by performing a local search; the located points are called facial landmarks. The second is model-based and focuses on measuring distances between the real face and a template; the template represents in most cases a neutral expression. Conditions such as illumination variations are not an issue because the intensity of the pixels is ignored, unlike in appearance-based techniques. Most research works complement their feature extraction techniques by adding facial points as additional data for improving the recognition. Majumder et al. [19] present a model of emotional recognition using a Kohonen self-organizing map (KSOM) which is trained with a 26-dimensional geometric feature vector. The vectors are built from feature points on eyes, lips, and eyebrows. The nose is the central reference for the measurements. The facial movements are measured using the neutral expression as a reference. Some features are measured using the calculation of the area of the opening of the eyes, the distance of the opening of the mouth from lip to lip, and the distance between the corners of the lip and the edges of the nose. Salmam et al. [20] focus on introducing a new feature extraction technique using a geometry-based approach. They used the Supervised Descent Method (SDM) for nonlinear least squares (NLS) problems. In their extraction technique of facial points, they obtain up to 80 feature points of the face, eyes, lips, eyebrows, mouth, and nose. After the distances are measured, they use three types of formulas on the measurements: Euclidean, Manhattan and Minkowski.

A convolutional neural network is composed of multiple processing layers which are used to learn data representations with multiple levels of abstraction. The method does not necessarily perform feature extraction or feature selection. A proposal for identification of high-level features is presented in [21]. The authors introduce their new deep learning approach which consists of adding a new layer named Deep hidden IDentity features (DeepID), which identifies a large number of classes using def-pooling. The work follows a normal configuration of a convolutional network with the difference that the DeepID layer is located between the last convolutional layer and the soft-max layer. In [22], the authors present a method to reduce the complexity of the problem domain by removing confounding factors. The authors used the feature extraction method local binary pattern (LBP). They preprocessed images by transforming them into grayscale images and cropping the region of the face. LBP codes are mapped to a 3D space by applying multi-dimensional scaling over code-to-code dissimilarity scores based on an approximation to the Earth Mover's Distance. Kim in [23] presents an interesting analysis of convolutional neural networks. They present a new pattern recognition framework. The framework consists of a set of deep CNNs that are interconnected with various committee machines (also known as classifier ensembles). Each CNN is independently configured; this means that each CNN is an individual member inside the framework; also, each CNN was trained using different datasets, where each dataset was created using a distinct preprocessing of the original image dataset. In the work in [24] the authors presented a method to recognize static facial expressions; they use three techniques to detect faces in the SFEW 2.0 dataset: joint cascade detection and alignment, a Deep-CNN-based detector, and mixtures of trees. They applied a pre-processing over the images; each one is resized to 48 × 48 and transformed to grayscale. They propose a CNN architecture of five convolutional layers but, instead of adding pooling layers in each connection among convolutional layers, they use stochastic pooling because it has proven to give a good performance with limited training data. The techniques used for building the CNN are the generation of randomized perturbations in the dataset, the modification of the loss function to consider the perturbation, a pre-training of the CNN using the FER dataset, a fine-tuning of the CNN using the SFEW dataset, and multiple networks for learning.

As we can see, the previous works are oriented to the recognition of basic emotions, predominantly the emotions of the Ekman model. On the other hand, although these works have tried to improve the algorithms for convolutional neural networks, an architecture for an educational domain has not been designed and implemented, which is the main point of our work.

3. Facial expression recognition

Next, we present three methods for recognizing facial expressions. The first two methods were previously reported in other works. These methods are explained to give a proper context of how they work. The third method is a CNN, and we describe the features and parameters used in its architecture.

3.1. Local binary pattern

In the work reported in [4] we described how the pattern recognizer was created using LBP. This method is based on the work of Happy [15]. The method detects the nose, mouth, eyebrows, and eyes as separate objects. Those objects are transformed into six separate images. For each image, the following filters are applied (in the order they appear): Gaussian blur, Sobel, Otsu's threshold, binary dilation, and removal of small objects. Then, the last pixels on the left and right ends of the eyebrows are established as key points. In the case of the nose and eyes, their central positions are established as key points. Using each key point on the face, facial patches are calculated. These facial patches have a proportion of one sixteenth of the face width. A uniform LBP operator is applied to each facial patch. The operator has a configuration of 9 neighbors with a radius of 2. The LBP operator is applied to each pixel in the facial patch. This action generates a binary number by comparing each pixel value against the center pixel value: when the pixel value is less than the center pixel value the result is zero, otherwise it is one. The histograms obtained from the LBP images are utilized as feature descriptors. Each histogram is generated with 256 bins. The histograms are concatenated and normalized into a vector. A support vector machine (SVM) classifier receives the histogram and uses a one-vs-the-rest scheme to take multi-class decisions. Figure 1 shows the left-to-right process for extracting features using LBP.

Fig. 1. Process to extract features using LBP operator and facial patches.
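A simplified sketch of this pipeline is given below using scikit-image's local_binary_pattern and scikit-learn's one-vs-rest SVM. It assumes the facial patches have already been cropped, and it uses the classic 8-neighbor operator so that the codes fit the 256-bin histograms mentioned above; the helper names are ours, not the original implementation.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def patch_descriptor(patches):
    """Concatenate normalized 256-bin LBP histograms of the facial patches.

    `patches` is a list of small grayscale arrays cropped around the key
    points (eyebrow ends, eye centers, nose). The paper reports a uniform
    operator with 9 neighbors and radius 2; this sketch simplifies to the
    classic 8-neighbor operator so codes stay in the range 0..255.
    """
    histograms = []
    for patch in patches:
        codes = local_binary_pattern(patch, P=8, R=2, method="default")
        hist, _ = np.histogram(codes, bins=256, range=(0, 256))
        histograms.append(hist / max(hist.sum(), 1))  # normalize each histogram
    return np.concatenate(histograms)

# X: one concatenated histogram vector per training face, y: emotion labels.
# classifier = OneVsRestClassifier(SVC()).fit(X, y)  # one-vs-the-rest SVM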

3.2. Geometric-based method

Our work reported in [5] explains how the geometry-based recognizer was developed. First, 68 landmark points are located on the face. These points are located using a template previously trained by the dlib software [25]. The points are related to areas of the human face that express an emotion; these areas are the lips, eyes, eyebrows, and nose. The face landmarks are a part of all the face features (X and Y coordinate values). However, one problem is that coordinate values may change depending on where the face is located in the photo. To solve that problem, the average value of both axes (X and Y) is calculated, so the center of gravity of all face landmarks is obtained. Those values represent the position of all points relative to the central point. The distances from the center to every landmark point are obtained. Each line has a magnitude (the distance between both points) and a direction whose value is an angle relative to the image, where 0° is the value of a horizontal line. Another issue to consider is that of tilted faces. It is normal that users move their necks during computing activities. The rotations are corrected by offsetting all calculated angles by the angle of the nose bridge. This rotates the set of features so that tilted faces become similar to non-tilted faces with the same expression. In this case, the angle is calculated with the arctangent function and, depending on whether the nose bridge is perpendicular to the horizontal plane, a compensation value (90 degrees) is added or subtracted. Coordinates, distances, and angles are concatenated as input to a support vector machine. Figure 2 shows the feature extraction procedure in this method.

Fig. 2. Process to extract features using the geometric-based approach.
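The feature construction can be sketched as follows, assuming dlib's standard 68-landmark predictor file is available locally; the centroid normalization and nose-bridge rotation compensation follow our reading of the description above and are not the authors' exact code.

```python
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
# Path to the standard 68-landmark model distributed with dlib's examples.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def geometric_features(gray):
    """Return [x, y, distance, angle] features for the 68 landmarks of one face."""
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], dtype=float)

    center = pts.mean(axis=0)                     # center of gravity of the landmarks
    rel = pts - center                            # coordinates relative to that center
    dist = np.hypot(rel[:, 0], rel[:, 1])         # magnitude of every center-to-point line
    ang = np.degrees(np.arctan2(rel[:, 1], rel[:, 0]))  # direction, 0 degrees = horizontal

    # Tilt compensation: offset every angle by the nose-bridge angle
    # (landmarks 27 and 30 in the 68-point scheme) so rotated faces line up.
    bridge = pts[30] - pts[27]
    ang -= np.degrees(np.arctan2(bridge[1], bridge[0])) - 90.0

    return np.concatenate([rel.ravel(), dist, ang])  # SVM input vector
```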
3.3. Convolutional neural network

Convolutional Neural Networks (CNNs) are organized in a multi-layer architecture. Each layer has a specialized function. The convolutional layers have the goal of extracting features and patterns from the images. The pooling layers have the goal of decreasing the number of final features and reducing bias problems. The neural network layers have the goal of classifying the data obtained from the previous adjacent layers. We tried several architectures based on a similar work (LeNet, which uses 2 convolutional layers, 2 max-pooling layers, and 3 fully connected layers). The architecture that showed the best performance consists of nine layers (excluding dropout connections); each convolutional layer contains 64 filters. Figure 3 shows the designed architecture, which consists of three convolutional layers, 3 max-pooling layers and 3 fully connected neural network layers.

Fig. 3. Convolutional neural network architecture.

3.3.1. Preprocessing

Preprocessing is not a part of the architecture. However, a CNN has the inconvenience of needing powerful hardware. The use of filters on a large number of images with multiple dimensions causes an important workload on the CPU. Applying a preprocessing step helps the CNN model to converge adequately. The process consists of locating a region of interest (ROI) in every facial image. The Viola-Jones method [26] and the OpenCV software were used. After the ROI is located, it is extracted from the image and transformed to a size of 75 × 75 pixels. At the end, the image is converted to a grayscale image and saved into a new database.
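A compact version of this preprocessing step, using the Haar-cascade implementation of the Viola-Jones detector that ships with OpenCV, might look as follows; the file paths are placeholders and detection is done on the grayscale image for simplicity.

```python
import cv2

# The Haar cascade shipped with OpenCV implements the Viola-Jones detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess(image_path, size=75):
    """Locate the face ROI, crop it, resize it to 75x75 and keep it in grayscale."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                      # no face found, discard the photo
    x, y, w, h = faces[0]                # keep the first detected face
    roi = gray[y:y + h, x:x + w]
    return cv2.resize(roi, (size, size))

# processed = preprocess("student_0001.png")
# if processed is not None:
#     cv2.imwrite("preprocessed/student_0001.png", processed)
```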
3.3.2. The convolutional layer

As mentioned above, a convolutional layer applies the mathematical convolution operation, which involves 3 elements. One is the data input, which is usually expressed as a multidimensional array of data. Another one is the kernel, which is a multidimensional array of parameters that are adapted by the learning algorithm. The last element is the output, which is known as the feature map. Multidimensional arrays are called tensors. The main idea behind the network is that the kernel can identify the visual patterns that come from the input (edges, lines, colors, etc.) and thus be able to differentiate the visual patterns of different objects. In the convolutional process, the kernel is overlapped on the input and then a cross-correlation operation is performed, which is equivalent to a convolutional operation. Each value of the input is multiplied by the value at the same position in the kernel and the resulting values are summed and placed in the output. The process is as follows: first, the kernel is overlapped on top of the input image; second, the product between each number in the kernel and each number in the overlapped input is computed; third, a single number is obtained by summing these products together; fourth, the obtained number is set in the convolutional output; fifth and last, the kernel is moved to the next section of the input image. An example of the convolution process for a part of the face is shown in Fig. 4, where the numbers are simplified for the example. The patterns we need to detect are related to the shapes rather than to the colors, so the images are preprocessed to black and white. This also helps to decrease the dimensionality of both the inputs and the kernel. In addition, a ReLU (Rectified Linear Units) function was added, which modifies the values in the convolutional layer without affecting its important properties. The function takes the values of the output and, in case they are less than zero, sets them to zero. One of the characteristics of the ReLU function is that it has a non-linear property. We consider the ReLU function as part of the activation function of the convolutional layers. The pixel values of the face are interpreted as an array and the kernel is overlapped to produce an output. In the architecture we placed 3 convolutional layers with the ReLU activation function. The configuration used is a kernel of size 3 × 3, 64 filters for each layer, and a stride of size 1.

Fig. 4. Example of the convolutional process on a part of the face.

3.3.3. Max-pooling layers

A typical convolutional neural network architecture consists of three stages for feature extraction. In the first stage, one or more convolutional layers perform in parallel a series of linear activations. In the second stage, each convolutional layer executes a non-linear activation function called ReLU. These first two stages are sometimes called the detection stage (similar to the feature extraction process of other methods). In the third stage, a pooling function is used to modify the result obtained from the previous steps. Pooling is a grouping function that replaces the values obtained from the convolutional layers. Max-pooling is the most used grouping function, which keeps the maximum values as output. Other common pooling functions include the average of a rectangular neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel. A pooling layer has the utility of reducing the spatial dimension of a convolutional layer before sending the data to the next convolutional layer (or any other type of layer). The operation performed by this layer leads to information loss and is referred to as down-sampling. The operation used in the designed architecture is max-pooling with a window of size 2 × 2. The max-pooling process selects the maximum value of a selected area (window). Figure 5 shows an example of the application of max-pooling on a data entry. The numbers shown in the figure are a simplified example.

Fig. 5. Max-pooling process on an input data.
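The two operations just described, a single-channel 3 × 3 convolution with ReLU followed by 2 × 2 max-pooling, can be sketched in a few lines of NumPy. This is a didactic illustration with our own helper names, not the library code used in the experiments.

```python
import numpy as np

def conv2d_relu(image, kernel):
    """Slide a kernel over the image with stride 1 and apply ReLU to the result."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            window = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(window * kernel)   # multiply-and-sum step
    return np.maximum(out, 0.0)                   # ReLU: negative values become zero

def max_pool(feature_map, size=2):
    """Keep the maximum of every non-overlapping size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size             # drop a border row/column if needed
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy example in the spirit of Figs. 4 and 5: a 5x5 "image" and a 3x3 kernel.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                    # simple averaging kernel
print(max_pool(conv2d_relu(image, kernel)))       # 3x3 feature map pooled down to 1x1
```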
3.3.4. Classification layers

The architecture has three fully connected neural network layers. A fully connected layer takes all the units in the previous layer (no matter what type of layer it is). Fully connected layers are not spatially located anymore, so it is not possible to have convolutional layers after a fully connected layer. The first two layers use ReLU as the activation in their outputs, but the third layer (the classification layer) uses Softmax as its activation function. Another feature of the first two dense layers in the CNN architecture is that they have a Dropout connection. The intention is to reduce the saturation of data between the layers of the neural network and thus avoid bias in the data that affects the classification process. The Dropout connection randomly selects a part of the input data and sets the values of that part to zero. This connection performs the dropout task in each training-stage iteration. The fraction of the selected data is 50%.

3.3.5. Configuration of the CNN architecture

Table 2 shows the configuration of the CNN architecture. It includes the name of the layer, the type of layer, and the sizes of the input and output dimensions for each layer. Conv2D means a convolutional layer of two dimensions; MaxPooling2D represents a max-pooling layer of two dimensions; Flatten indicates a layer that transforms a matrix into a one-dimensional vector; Dense specifies a densely connected neural network layer; and Dropout denotes a layer that performs a dropout operation. Convolutional layers have different input and output dimensions because a kernel of 3 × 3 is overlapped on them; this generates a reduction of data in each layer. Max-pooling uses a stride of 2 × 2, causing a reduction of up to almost half of the data in each layer. To improve the classification, we decided to flatten the data before giving it as input to the first dense layer. Dense layers do not modify the data dimension, except the final dense layer, which reduces the data to 15 units.

Table 2
Description of the CNN architecture

Name | Layer Type | Input | Output
conv2d_1_input | InputLayer | (75,75,3) | (75,75,3)
conv2d_1 | Conv2D | (75,75,3) | (73,73,64)
max_pooling2d_1 | MaxPooling2D | (73,73,64) | (36,36,64)
conv2d_2 | Conv2D | (36,36,64) | (34,34,64)
max_pooling2d_2 | MaxPooling2D | (34,34,64) | (17,17,64)
conv2d_3 | Conv2D | (17,17,64) | (15,15,64)
max_pooling2d_3 | MaxPooling2D | (15,15,64) | (7,7,64)
flatten_1 | Flatten | (7,7,64) | (3136)
dense_1 | Dense | (3136) | (500)
dropout_1 | Dropout | (500) | (500)
dense_2 | Dense | (500) | (500)
dropout_2 | Dropout | (500) | (500)
dense_3 | Dense | (500) | (15)
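Since the layer names in Table 2 follow Keras conventions, the architecture can be reconstructed approximately as follows. This is our reconstruction from the table and the text (75 × 75 input, 64 filters of size 3 × 3, 2 × 2 pooling, 500-unit dense layers with 50% dropout, 15 output classes); the optimizer and loss in the compile step are common defaults and are not specified in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(75, 75, 3), num_classes=15):
    """Rebuild the nine-layer architecture of Table 2 (dropout connections excluded
    from the layer count): 3x (Conv2D + MaxPooling2D) followed by 3 Dense layers."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.5),                      # 50% of the units are zeroed per step
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With a 75 × 75 input, the successive valid 3 × 3 convolutions and 2 × 2 poolings reproduce the shapes listed in Table 2, ending with a 7 × 7 × 64 map flattened to 3136 units.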

Fig. 6. Photos of a part of three databases.


4. Process of creating facial expression databases

An essential part of any recognition system is the database used for training (Fig. 6). Databases contain relevant information so that a recognition system is able to discriminate the important data to classify. We proposed a new method to build face expression databases using two EEG-based Brain-Computer Interface (BCI) systems: Emotiv Epoc and Emotiv Insight. They are interface systems that capture brain activity and give information about the emotion that the student is feeling. Next, we describe the devices and methods used.

4.1. EEG, Emotiv EPOC and Emotiv Insight

EEG is a technique for monitoring the brain's encephalographic signals. It is a non-invasive technique where electrodes are placed on the scalp. Emotiv EPOC is a device built by the bio-informatics and technology company EMOTIV Inc. [27]. The set of tools of Emotiv comprises a wireless neuroheadset which works with Bluetooth signals, an SDK to develop applications to gather and analyze data, a suite of desktop applications for the Emotiv EPOC, and a suite of mobile applications for the Emotiv Insight.

4.2. Protocol for building and filtering the facial expression database

We looked for a method to capture expressions during an educational context. In addition, we looked for an activity related to the domain of the intelligent tutoring system that uses the facial recognition system. The protocol followed for the creation of the database was as follows. The data was captured with 38 students from the Instituto Tecnológico de Culiacán, 28 men and 10 women. The participants were between 18 and 47 years old. The students wrote, compiled and executed programs in Java while wearing the Emotiv headset, which obtained their emotional state during the coding of the program. In most of the works for building facial expression databases, experts in judging emotions participate in the annotation process to tag the captured images of students or users. In our work, the labeling is carried out automatically by an application and with the help of Emotiv. Figure 7 shows the method used; it is explained as follows (a code sketch of the capture loop is given at the end of this subsection):

1. The student codes a Java program; meanwhile, the Emotiv device captures brain activity and the webcam takes a photograph every 5 seconds.
2. Every photograph is labeled by the system with the user emotion obtained at that moment from the Emotiv device. The annotation is made by an application that takes the emotion from the Emotiv device and labels the photograph with that emotion.
3. The photograph previously labeled is saved into the facial expression database.
4. After the previous steps are finished, a group of experts evaluates whether there is a match between the emotional label and the expression in the photo. If so, the photo is kept; otherwise it is discarded.

Fig. 7. Method to take photographs for the face expression database.

We obtained two databases, one for each Emotiv device. The Emotiv Epoc database stored a total of 7,019 photographs. However, several photos did not have a proper match with their labeled emotion. We proceeded to filter the database, eliminating incorrectly labeled photos and obtaining a database of 730 photographs. This debugging process gave the database an additional validation from human judges, so it contains properly labeled facial expressions. The Emotiv Insight database obtained a total of 5,560 labeled images; this database has not received a filtering process, so it keeps its original size. Figure 6 shows parts of both facial expression databases.
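The automatic labeling application of steps 1 to 3 can be sketched as below. The webcam capture uses OpenCV; read_current_emotion() is only a placeholder for whatever call the Emotiv SDK exposes to query the current affective state, and the output folder and timing parameters are ours.

```python
import time
import cv2

def read_current_emotion():
    """Placeholder for the Emotiv SDK query that returns the current emotion label."""
    raise NotImplementedError("replace with the Emotiv EPOC/Insight SDK call")

def capture_session(student_id, duration_s=3600, period_s=5, out_dir="captures"):
    """Take a webcam photo every 5 seconds and name it with the Emotiv emotion label."""
    camera = cv2.VideoCapture(0)                  # default webcam
    start = time.time()
    shot = 0
    while time.time() - start < duration_s:
        ok, frame = camera.read()
        if ok:
            emotion = read_current_emotion()      # step 2: label from the headset
            filename = f"{out_dir}/{student_id}_{shot:05d}_{emotion}.png"
            cv2.imwrite(filename, frame)          # step 3: save the labeled photo
            shot += 1
        time.sleep(period_s)                      # step 1: one photograph every 5 s
    camera.release()
```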
5. Tests and discussion

We show and explain the tests performed on the recognizers that were trained and tested with different databases. The tests consisted of measuring the accuracy of the recognizers using the RaFD database and the two databases built with Emotiv Epoc (dbE) and Emotiv Insight (dbI). In the case of RaFD we decided to use 6 of the 8 basic emotions because they are the most common emotions in other research works, so we could make comparisons of our results to better validate our emotion recognition. Table 3 describes the contents of the three databases.

Table 3
Description of the databases

Database | RaFD | Emotiv Epoc | Emotiv Insight
Face number | 1146 | 730 | 5056
Labels (classes) | 6 | 4 | 6

The RaFD database [10] contains a total of 1146 photos; 191 photographs for each basic emotion class. The Emotiv Epoc and Emotiv Insight databases contain images of learning-oriented facial expressions and have three emotion labels in common. The distribution of classes of the databases built with Emotiv is shown in Table 4.

Table 4
Class distribution for the databases built with Emotiv Epoc and Insight

Emotion | Emotiv Epoc | Emotiv Insight
Boredom | 17 | 1040
Engagement | 519 | 1955
Excitement | 91 | 1661
Frustration | 104 | –
Focus | – | 222
Relax | – | 28
Interesting | – | 150

The test consists of a cross-validation with k = 10. This means that the recognizers were trained and tested 10 times using a different segment of the input data in each training step. Data were divided into 90% for training and 10% for precision testing at each iteration. The data selection was random for both the training part and the testing part. The division of the data did not take into account whether a person existed in both parts of the data, since the objective is the general recognition of an emotion in people and not in a particular individual. The features obtained from the LBP and geometric-based techniques were used to train three classification algorithms: Support Vector Machine (SVM), Artificial Neural Network (ANN) and K-Nearest Neighbors (KNN). In addition, a test was added where the convolutional filters (CF) are applied to the images; the results of these filters were used as input features for the three classifiers mentioned above. The accuracy obtained on the three databases in combination with the classifiers and extraction techniques is shown below.

With the RaFD database, most of the methods did not obtain significant results. Only two combinations obtained results over 85% accuracy. SVM with the geometric-based method obtained a good result, but the CNN obtained a value close to 100%. Table 5 shows the results for the tests with the RaFD database.

Table 5
Accuracy obtained with the RaFD database

Classifier/Feature Extractor | LBP | Geometric-based | CF
KNN | 55% | 70% | 30%
ANN | 88% | 60% | 31%
SVM | 66% | 92% | 73%
CNN | – | – | 95%

With the dbE database, most methods obtained an accuracy between 80% and 85%. The accuracy values show how a filtered database can help classifiers to get a better accuracy. Table 6 shows the test data for the dbE database.

Table 6
Precision obtained with the database dbE

Classifier/Feature Extractor | LBP | Geometric-based | CF
KNN | 85% | 84% | 85%
ANN | 85% | 80% | 85%
SVM | 85% | 84% | 77%
CNN | – | – | 88%

With the dbI database, we obtained lower results than with the other two databases (RaFD and dbE). One reason for the lower results is that this database contains an amount of photos five times greater than RaFD or dbE, and they have not received a filtering process. The CNN obtained an average accuracy of 74%, which we consider a good result that should improve considerably once the dbI database goes through a filtering process. Table 7 shows the results for the dbI database.

Table 7
Precision obtained with the database dbI

Classifier/Feature Extractor | LBP | Geometric-based | CF
KNN | 70% | 61% | 65%
ANN | 74% | 51% | 61%
SVM | 69% | 61% | 63%
CNN | – | – | 74%

In the case of the RaFD database and using the CNN architecture we obtained a precision of 95%. Only the combination of the geometric-based method with SVM came close to what was obtained with this architecture. This clarifies that in the case of a database of basic emotions with discrete and acted emotions the architecture has no problem. In addition, if we contrast this result against other works that have been tested with different databases of basic emotions, we find that we obtained satisfactory results. For example, in Ilbeygi's work [28], they used a technique that combines traditional feature extraction with fuzzy sets and obtained 93.95% accuracy, and Recio [29] averaged less than 90% in all experiments. RaFD has not yet been tested with convolutional neural networks. However, we can compare our work with [25], which uses convolutional neural networks that were tested with two
databases of basic emotions (CK [9] and Geneva [13]), obtaining a maximum value of 98% accuracy.

In the case of dbE, the precision obtained had a value of 88% using the CNN architecture, which is superior to the other combinations of recognition techniques tested. In previous work [4, 5], in the tests with this database we reached a maximum value of 86% accuracy, clarifying that there were some notable differences between both tests. In those reported works the database was reduced because the recognizer could not identify all the parts of the face in several photos, which is fundamental for the feature extractor; for this reason only 20% of the faces were detected. Comparing this work with other research is complicated. To the best of our knowledge, the problem of recognizing non-basic emotions using convolutional neural networks has not been addressed. In [30], we found an emotion recognition work using the Acted Facial Expressions in the Wild database, a database that collects images from movies to capture more realistic expressions. This work obtained an average below 70%, where six of the seven emotions analyzed obtained a percentage below 70%. With this, we can conclude that our recognizer has high accuracy, since it identifies non-basic emotions obtained from a real programming context. There have also been works that have made a comparison with non-basic emotions but without using convolutional networks. The work of Bosch [31] is an example of this, where they obtained a precision of less than 70%.

In the case of dbI, lower percentages were obtained compared to the previous databases. By joining the LBP and ANN methods, we obtained precisions similar to the CNN architecture. This helps us to understand that the database still requires work similar to dbE. A filtering job has not yet been performed on this database, and it has an unbalanced class problem. Even so, we consider our results satisfactory, because the previous comparisons with [31] and [30] give a clear idea of how complex it is to work with a database of spontaneous expressions.

6. Conclusion and future work

This work presents an architecture of a convolutional neural network for the recognition of learning-centered emotions. The proposed architecture consists of 3 convolutional layers, each followed by a max-pooling layer, and finally 3 layers of fully-connected neural networks. The CNN was tested with 3 facial expression databases: one database contains posed or acted basic emotions and two databases contain spontaneous learning-centered emotions. The evaluation suggests that the architecture can reliably detect facial expressions of acted basic emotions, having similar or superior results to other popular methods when detecting similar emotions. Evidence also shows that learning-centered emotions can be successfully recognized provided that the database is validated and filtered. Our architecture is the first one to be proven with this type of emotions and expressions. Many of the tests with this type of architecture have been done with databases of acted expressions or of spontaneous expressions that have no relation with the learning process. In addition, we validate the importance of two new databases built by us and their effectiveness in the training of different types of classifiers. As future work we have to perform tasks such as increasing some classes (emotions) that are unbalanced in the dbI database, trying new architectures with other pooling layers, and performing different filtering and preprocessing methods on our databases dbE and dbI.

References

[1] R.E. Kaliouby, R. Picard and S. Baron-Cohen, Affective Computing and Autism, Ann N Y Acad Sci 1093(1) (2006), 228–248.
[2] P. Ekman, An argument for basic emotions, Cogn Emot 6(3) (1992), 169–200.
[3] R.Z. Cabada, M.L.B. Estrada, F.G. Hernandez and R.O. Bustillos, An Affective Learning Environment for Java, in 2015 IEEE 15th International Conference on Advanced Learning Technologies (2015), pp. 350–354.
[4] R. Zatarain-Cabada, et al., Building a Corpus and a Local Binary Pattern Recognizer for Learning-Centered Emotions, Adv Artif Intell Its Appl (2016).
[5] R. Zatarain-Cabada, M.L. Barrón-Estrada, F. González-Hernández and H. Rodriguez-Rangel, Building a Face Expression Recognizer and a Face Expression Database for an Intelligent Tutoring System, in Advanced Learning Technologies (ICALT), 2017 IEEE 17th International Conference on (2017), pp. 391–393.
[6] S. D'Mello and A. Graesser, Dynamics of affective states during complex learning, Learn Instr 22(2) (2012), 145–157.
[7] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature 521(7553) (2015), 436–444.
[8] T. Kanade, J. Cohn and Y. Tian, Comprehensive database for facial expression analysis, in Automatic Face and Gesture Recognition (2000), pp. 46–53.
[9] P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar and I. Matthews, The extended Cohn-Kanade dataset (CK+): A complete facial expression dataset for action unit and emotion-specified expression, in CVPRW (2010), pp. 94–101.
[10] O. Langner, R. Dotsch, G. Bijlstra, D.H.J. Wigboldus, S.T. Hawk and A. van Knippenberg, Presentation and validation of the Radboud Faces Database, Cogn Emot 24(8) (2010), 1377–1388.
[11] G. McKeown, M. Valstar, R. Cowie, M. Pantic and M. Schröder, The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent, IEEE Trans Affect Comput 3(1) (2012), 5–17.
[12] M.F. Valstar and M. Pantic, Induced Disgust, Happiness and Surprise: An Addition to the MMI Facial Expression Database, in Proceedings of Int'l Conf. Language Resources and Evaluation, Workshop on EMOTION (2010), pp. 65–70.
[13] T. Bänziger, M. Mortillaro and K.R. Scherer, Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception, Emotion 12(5) (2012), 1161–1179.
[14] T. Ojala, M. Pietikäinen and T. Mäenpää, Gray Scale and Rotation Invariant Texture Classification with Local Binary Patterns, IEEE Trans Pattern Anal Mach Intell 24(7) (2000), 404–420.
[15] S.L. Happy and A. Routray, Automatic facial expression recognition using features of salient facial patches, IEEE Trans Affect Comput 6(1) (2015), 1–12.
[16] B. Jiang, M. Valstar, B. Martinez and M. Pantic, A dynamic appearance descriptor approach to facial actions temporal modeling, IEEE Trans Cybern 44(2) (2014), 161–174.
[17] T. Wu, N.J. Butko, P. Ruvolo, J. Whitehill, M.S. Bartlett and J.R. Movellan, Action unit recognition transfer across datasets, in 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, FG 2011 (2011), pp. 889–896.
[18] K. Huang, S. Huang and Y. Kuo, Emotion Recognition Based on a Novel Triangular Facial Feature, in Neural Networks (IJCNN), The 2010 International Joint Conference on (2010), pp. 18–23.
[19] A. Majumder, L. Behera and V.K. Subramanian, Emotion recognition from geometric facial features using self-organizing map, Pattern Recognit 47(3) (2014), 1282–1293.
[20] F.Z. Salmam, A. Madani and M. Kissi, Facial Expression Recognition Using Decision Trees, in 2016 13th International Conference on Computer Graphics, Imaging and Visualization (CGiV) (2016), pp. 125–130.
[21] Y. Sun, X. Wang and X. Tang, Deep learning face representation from predicting 10,000 classes, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 1891–1898.
[22] G. Levi, Emotion recognition in the wild via convolutional neural networks and mapped binary patterns, in ICMI (2015), pp. 503–510.
[23] B.-K. Kim, H. Lee, J. Roh and S.-Y. Lee, Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition, in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (2015), pp. 427–434.
[24] Z. Yu, Image based Static Facial Expression Recognition with Multiple Deep Network Learning, in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (2015), pp. 435–442.
[25] D.E. King, Dlib-ml: A Machine Learning Toolkit, J Mach Learn Res 10 (2009), 1755–1758.
[26] P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, 1 (2001), pp. I-511–I-518.
[27] Emotiv, Emotiv Insight, Web Page, 2016.
[28] M. Ilbeygi and H. Shah-Hosseini, A novel fuzzy facial expression recognition system based on facial feature extraction from color face images, Eng Appl Artif Intell 25(1) (2012), 130–146.
[29] G. Recio, A. Schacht and W. Sommer, Recognizing dynamic facial expressions of emotion: Specificity and intensity effects in event-related brain potentials, Biol Psychol 96 (2014), 111–125.
[30] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang and X. Chen, Combining Multiple Kernel Methods on Riemannian Manifold for Emotion Recognition in the Wild, in Proceedings of the 16th International Conference on Multimodal Interaction - ICMI '14 (2014), pp. 494–501.
[31] N. Bosch et al., Automatic Detection of Learning-Centered Affective States in the Wild, in Proceedings of the 20th International Conference on Intelligent User Interfaces - IUI '15 (2015), pp. 379–388.
