
Journal of Intelligent & Fuzzy Systems xx (20xx) x–xx
DOI: 10.3233/JIFS-169514
IOS Press

Recognition of learning-centered emotions using a convolutional neural network

Francisco González-Hernández∗, Ramon Zatarain-Cabada, Maria Lucia Barrón-Estrada and Hector Rodríguez-Rangel
Posgrado en Ciencias de la Computación, Instituto Tecnológico de Culiacán, Culiacán, Sinaloa, México

Abstract. This work presents the application of a convolutional neural network (CNN) used to identify emotions from images taken of students who are learning the Java language with an Intelligent Learning Environment. The CNN contains three convolutional layers, three max-pooling layers, and three fully connected neural network layers with intermediate dropout connections. The CNN was trained using different emotional databases. One of them is a posed database (RaFD) and two of them are spontaneous databases created by us with content focused on learning-centered emotions. The results show a comparison among three emotion recognition systems: one applying a local binary pattern approach with facial patches, another applying a geometry-based method, and the last one applying the convolutional network. The analysis produced satisfactory results; the CNN obtained a 95% accuracy for the RaFD database, an 88% accuracy for a learning-centered emotion database and a 74% accuracy for a second learning-centered emotion database. Results are compared against the classifiers support vector machine, k-nearest neighbors, and artificial neural network.

Keywords: Convolutional neural network, educational emotion recognition, face expression database, machine learning, feature extraction

1. Introduction

Emotional recognition is the process of predicting affective content from low-level signals. These signals are manifested through physical expressions, and such expressions have important features that help us differentiate one emotion from another. For example, in speech we have features like loudness and pitch. In body expressions, we have features like the position of the body. Even heart rate and brain signals, although not directly visible, express emotions as well [1]. One of the most expressive parts of the human body is the face, being one of the main channels of communication to express emotions. Emotional recognition through the face is an issue that has been extensively addressed by researchers in the field of affective computing; this issue is usually named facial expression recognition. To implement our facial expression recognizer we developed three different processes [2], explained below. The first step is the extraction of features, where the facial image receives a set of operations; these operations generate a set of features expressed as a vector, a matrix or some other computational data structure. The operators are based on the appearance (filters) or the geometry (distances) of the image and are usually handcrafted. The second stage is the selection of features, where frequently repeated features are discarded because they make the images have little difference among them, confusing the classifier. In addition, this step also serves to decrease the dimensionality of the feature vectors. The third step is the building or use of a classifier.

∗Corresponding author. Francisco González-Hernández, Posgrado en Ciencias de la Computación, Instituto Tecnológico de Culiacán, Culiacán, Sinaloa, México. E-mail: rzatarain@itculiacan.edu.mx.


Classifiers are usually designed for classifying discrete values that are represented by labels or classes. The most used approaches to create facial expression recognizers are based on Ekman's theory of emotions [2]. This theory explains how human beings express basic emotions. The basic emotions are feelings of short duration with clearly visible and well-defined expressions. Although the number of emotions may vary according to the author, the selected emotions are usually anger, joy, sadness, fear, contempt, disgust and surprise. In current research papers, databases of basic emotions are used in the three mentioned stages for building and testing an emotion recognizer.

In a previous work [3], we presented an intelligent learning environment (ILE) for the Java programming language. Intelligent learning environments provide personalized instruction in a particular domain. In addition, we added to the environment an emotion recognizer to provide the ability to use emotions as part of pedagogical strategies. In this paper, we present the design and implementation of a convolutional neural network to give our ILE the ability to recognize emotions with another type of classifier. It is worth mentioning that the work of creating an emotion recognizer was addressed twice before [4, 5]. The difference and main contribution of this work is that the convolutional neural network automatically performs the feature selection process, avoiding the need to create a handmade process, for which it is not known with precision whether it works properly for educational domains. In addition, we created a database with spontaneous facial images that represent emotions focused on learning. These emotions have a longer duration and occur when performing intellectual activities [6]. Some of the most common emotions of this type are frustration, engagement, excitement, boredom, and relaxation. Also in previous works, we built two classifiers to recognize learning-oriented emotions. All this work is explained further on.

There have been research works in the area of machine learning based on the deep learning approach. One type of deep learning approach is the convolutional neural network, which consists of a set of interconnected filters where an image is transformed while retaining its most outstanding features [7]. These networks transform the images in each convolutional layer so that the images retain their most outstanding features. With this, the need to define an extractor and a feature selector is avoided. This gives the possibility of building and testing these types of networks on learning-oriented emotions and in this way finding out their effectiveness in classifying these emotions.

To our knowledge, there are no works dealing with convolutional neural networks and educational issues together. This work presents a convolutional neural network architecture, tested with a database of basic emotions and with two databases of learning-centered emotions designed and created by the authors of this work. In addition, we present a comparison of the results when using the local binary pattern and geometry-based approaches. This paper presents an approach based on convolutional neural networks to the recognition and use of learning-centered emotions and a comparison against other approaches that we have tested for facing this topic.

The paper is structured as follows: Section 2 reviews the work related to facial expression databases and the techniques used for facial expression recognition. Section 3 presents the three recognition methods we have built, Section 4 describes the process followed to create our databases, Section 5 shows the test results of the recognizers as well as the discussion about those results, and finally Section 6 presents conclusions and future work.

2. Related works

Related work is divided into two parts. The first part is about facial expression databases and the second part is about emotional recognition in facial expressions using different techniques and approaches. The database section includes posed and spontaneous expressions as well as basic and non-basic emotions. The emotional recognition section includes topics about appearance-based, geometric-based, and deep learning approaches.

2.1. Facial expression databases

Face expression databases are sets of images that express an emotion, a situation, or an experience. There are several available databases. Some of them contain posed faces to represent specific emotions; others contain spontaneous emotions where the face represents the facial reaction to a situation. Next, the most important databases are presented and explained. Cohn-Kanade (CK) [8] and CK plus (CK+) [9] are databases that represent 6 basic emotions and they include Action Units (AU) annotations. The images represent the image sequence inside the Facial Action Coding System (FACS).

Table 1
Facial Expression Datasets

Work | Description | Contents
CK and CK+ [8, 9] | Two databases that represent facial expressions using FACS action units. | Images of 6 basic emotions.
RaFD [10] | Database of posed facial expressions in 3 different face positions with 3 gaze directions. | Images of 8 basic emotions.
SEMAINE [11] | Database of facial expressions of sporadic basic emotions obtained from the interaction with a conversational agent. | Images of 8 basic emotions with 4 dimensions: valence, activation, power, and expectation.
MMI [12] | Web-based database with image sequences from frontal and side-face perspectives. | Images and videos of 8 basic emotions.
GEMEP [13] | Database of posed facial expressions performed by actors. | 8 basic emotions and 10 non-basic emotions stored in videos and images.

Each expression begins as a neutral expression and then moves to a peak expression (the most intense expression). Each expression can receive an emotion label. In the plus version, spontaneous expressions were added by recording 84 novel subjects while they were distracted between photo sessions. The Radboud Faces Database (RaFD) [10] also includes photos of eight basic emotions. The photos were taken using Caucasian Dutch adults and children. Participants showed the facial expressions with three gaze directions and five camera angles. In addition, they complied with requisites such as wearing a specific type of shirt and having no hair on the face. In other works like SEMAINE [11], in addition to expressing six basic emotions, the database contains four dimensions of an emotion, which are Valence, Activation, Power, and Anticipation/Expectation. In addition, spontaneous emotions were added by taking photos while participants talked to an agent system. M&M Initiative (MMI) [12] contains image sequences of faces in frontal and profile view. MMI contains more than 1500 samples and the database is available through a web-based direct-manipulation application. Two FACS coders labeled the images and videos. The Geneva Multimodal Emotion Portrayals Core Set (GEMEP) includes an important set of images [13]. In total, it contains 18 portrayed discrete emotions labeled using FACS. The databases were built using 10 professional French-speaking theater actors who were trained by a professional director. The corpus is comprised of over 7000 audio-visual emotion representations.

Table 1 shows 6 of the most popular datasets for facial expressions. We can see that most of the datasets only contain images of facial expressions representing the basic emotions. In addition, most of these emotions are not spontaneous (they are acted emotions). The work of building our own database fills the gap of not having a set of data with emotions related to education, which is necessary for the training and construction of a new recognizer for learning-centered emotions.

2.2. Facial expression recognition methods

Appearance-based techniques apply operators and filters over the pixels of the image in order to obtain a set of representative features of the face. Local Binary Pattern (LBP) is a method that takes the pixel value of the image center as the threshold [14]. Each pixel value is compared against the threshold; if the threshold is bigger than the pixel value then the result is zero, otherwise it is one. This technique was applied to identify face expressions in [15], and the results were satisfactory. Local Phase Quantization (LPQ) uses blur-insensitive texture classification through a local Fourier transformation neighborhood by computing its local Zernike moments. The process generates LPQ codes and collects them into a histogram. This descriptor is well suited for blurred images. Some works have proven LPQ can be used for expression recognition with FACS [16]. Histograms can reach up to 25,000 features, which indicates that LPQ covers an extensive area of the face. The Gabor representation [17] is obtained by convolving an input image with a set of Gabor filters with various scales and orientations. Gabor filters encode componential information and, depending on the registration scheme, the overall representation may implicitly convey configural information. This technique can be used with simple dimensionality reduction techniques such as min, max and mean grouping. The representation is robust to registration errors to an extent, as the filters are smooth and the magnitudes of the filtered images are robust to small translations and rotations. The feature amount can reach up to 165,000 values.
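To make the appearance-based family concrete, the minimal sketch below (ours, not taken from the cited works) implements the basic LBP operator described above: the 8 neighbors of each pixel are thresholded against the central value, the bits are packed into a code, and a 256-bin histogram of the codes serves as the descriptor.

```python
import numpy as np

def lbp_code(patch3x3):
    """LBP code of the central pixel of a 3x3 neighborhood.

    Neighbors greater than or equal to the center contribute a 1-bit,
    the rest a 0-bit; the 8 bits are packed into an integer in [0, 255].
    """
    center = patch3x3[1, 1]
    # Clockwise order starting at the top-left neighbor.
    neighbors = [patch3x3[0, 0], patch3x3[0, 1], patch3x3[0, 2],
                 patch3x3[1, 2], patch3x3[2, 2], patch3x3[2, 1],
                 patch3x3[2, 0], patch3x3[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbors]
    return sum(bit << i for i, bit in enumerate(bits))

def lbp_image(gray):
    """Apply the operator to every interior pixel and return the LBP map."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y - 1, x - 1] = lbp_code(gray[y - 1:y + 2, x - 1:x + 2])
    return out

# A 256-bin histogram of the LBP map is the appearance descriptor:
# hist, _ = np.histogram(lbp_image(face), bins=256, range=(0, 256))
```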

Geometric-based techniques [18] frequently represent faces as a set of facial points. These points describe a face by a concatenation of the X and Y coordinates of fiducial points. To represent a face, these techniques use two types of models. The first is the free model, which detects feature points individually by performing a local search; the located points are called facial landmarks. The second is model-based and focuses on measuring distances between the real face and a template; the template represents in most cases a neutral expression. Conditions such as illumination variations are not an issue because the intensity of the pixels is ignored, unlike in appearance-based techniques. Most research works complement their feature extraction techniques by adding facial points as additional data for improving the recognition. Majumder et al. [19] present a model of emotional recognition using a Kohonen self-organizing map (KSOM) which is trained with a 26-dimensional geometric feature vector. The vectors are built from feature points on eyes, lips, and eyebrows. The nose is the central reference for the measurements. The facial movements are measured using the neutral expression as a reference. Some features are measured using the calculation of the area of the opening of the eyes, the distance of the opening of the mouth from lip to lip, and the distance between the corners of the lip and the edges of the nose. Salmam et al. [20] focus on introducing a new feature extraction technique using a geometry-based approach. They used the Supervised Descent Method (SDM) for nonlinear least squares (NLS) problems. In their extraction technique of facial points, they obtain up to 80 feature points of the face, eyes, lips, eyebrows, mouth, and nose. After the distances are measured, they use three types of formulas on the measurements: Euclidean, Manhattan and Minkowski.

A convolutional neural network is composed of multiple processing layers which are used to learn data representations with multiple levels of abstraction. The method does not necessarily perform feature extraction or feature selection. A proposal for identification of high-level features is presented in [21]. The authors introduce their new deep learning approach which consists of adding a new layer named Deep hidden IDentity features (DeepID), which identifies a large number of classes using def-pooling. The work follows a normal configuration of a convolutional network with the difference that the DeepID layer is located between the last convolutional layer and the soft-max layer. In [22], the authors present a method to reduce the complexity of the problem domain by removing confounding factors. The authors used the feature extraction method local binary pattern (LBP). They preprocessed images by transforming them into grayscale images and cropping the region of the face. LBP codes are mapped to a 3D space by applying multi-dimensional scaling over code-to-code dissimilarity scores based on an approximation to the Earth Mover's Distance. Kim in [23] presents an interesting analysis of convolutional neural networks. They present a new pattern recognition framework. The framework consists of a set of deep CNNs that are interconnected with various committee machines (also known as classifier ensembles). Each CNN is independently configured; this means that each CNN is an individual member inside the framework; also, each CNN was trained using different datasets, where each dataset was created using a distinct preprocessing of the original image dataset. In the work in [24] the authors presented a method to recognize static facial expressions; they use three techniques to detect faces in the SFEW 2.0 dataset: joint cascade detection and alignment, a Deep-CNN-based detector, and mixtures of trees. They applied a pre-processing over the images; each one is resized to 48 × 48 and transformed to grayscale. They propose a CNN architecture of five convolutional layers but, instead of adding pooling layers in each connection among convolutional layers, they use stochastic pooling because it has proven to give a good performance with limited training data. The techniques used for building the CNN are the generation of randomized perturbations in the dataset, the modification of the loss function to consider the perturbation, a pre-training of the CNN using the FER dataset, a fine-tuning of the CNN using the SFEW dataset, and multiple networks for learning.

As we can see, the previous works are oriented to the recognition of basic emotions, predominantly the emotions of the Ekman model. On the other hand, although these works have tried to improve the algorithms for convolutional neural networks, an architecture for an educational domain has not been designed and implemented, which is the main point of our work.

3. Facial expression recognition

Next, we present three methods for recognizing facial expressions. The first two methods were previously reported in other works. These methods are explained to give a proper context of how they work. The third method is a CNN, and we describe the features and parameters used in its architecture.

3.1. Local binary pattern

In the work reported in [4] we described how the pattern recognizer was created using LBP. This method is based on the work of Happy [15]. The method detects the nose, mouth, eyebrows, and eyes as separate objects. Those objects are transformed into six separate images. For each image, the following filters are applied (in the order they appear): Gaussian blur, Sobel, Otsu's threshold, binary dilation, and removal of small objects. Then, the last pixels on the left and right ends of the eyebrows are established as key points. In the case of the nose and eyes, their central positions are established as key points. Using each key point on the face, facial patches are calculated. These facial patches have a proportion of one sixteenth of the face width. A uniform LBP operator is applied to each facial patch. The operator has a configuration of 9 neighbors with a radius of 2. The LBP operator is applied to each pixel in the facial patch. This action generates a binary number by comparing each pixel value against the center pixel value: when the pixel value is less than the center pixel value the result is zero, otherwise it is one. The histograms obtained from the LBP images are utilized as feature descriptors. Each histogram is generated with 256 bins. The histograms are concatenated and normalized into a vector. A support vector machine (SVM) classifier receives the histogram and uses a one-vs-the-rest scheme to take multi-class decisions. Figure 1 shows the left-to-right process for extracting features using LBP.

Fig. 1. Process to extract features using LBP operator and facial patches.
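A simplified sketch of this pipeline is given below using scikit-image's local_binary_pattern and scikit-learn's one-vs-rest SVM. It assumes the facial patches have already been cropped, and it uses the classic 8-neighbor operator so that the codes fit the 256-bin histograms mentioned above; the helper names are ours, not the original implementation.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def patch_descriptor(patches):
    """Concatenate normalized 256-bin LBP histograms of the facial patches.

    `patches` is a list of small grayscale arrays cropped around the key
    points (eyebrow ends, eye centers, nose). The paper reports a uniform
    operator with 9 neighbors and radius 2; this sketch simplifies to the
    classic 8-neighbor operator so codes stay in the range 0..255.
    """
    histograms = []
    for patch in patches:
        codes = local_binary_pattern(patch, P=8, R=2, method="default")
        hist, _ = np.histogram(codes, bins=256, range=(0, 256))
        histograms.append(hist / max(hist.sum(), 1))  # normalize each histogram
    return np.concatenate(histograms)

# X: one concatenated histogram vector per training face, y: emotion labels.
# classifier = OneVsRestClassifier(SVC()).fit(X, y)  # one-vs-the-rest SVM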

3.2. Geometric-based method

Our work reported in [5] explains how the geometry-based recognizer was developed. First, 68 landmark points are located on the face. These points are located using a template previously trained by the dlib software [25]. The points are related to areas of the human face that express an emotion; these areas are the lips, eyes, eyebrows, and nose. The face landmarks are a part of all the face features (X and Y coordinate values). However, one problem is that coordinate values may change depending on where the face is located in the photo. To solve that problem, the average value of both axes (X and Y) is calculated, so the center of gravity of all face landmarks is obtained. Those values represent the position of all points relative to the central point. The distances from the center to every landmark point are obtained. Each line has a magnitude (the distance between both points) and a direction whose value is an angle relative to the image, where 0° is the value of a horizontal line. Another issue to consider is that of tilted faces. It is normal that users move their necks during computing activities. The rotations are corrected by offsetting all calculated angles by the angle of the nose bridge. This rotates the set of features so that tilted faces become similar to non-tilted faces with the same expression. In this case, the angle is calculated with the arctangent function and, depending on whether the nose bridge is perpendicular to the horizontal plane, a compensation value (90 degrees) is added or subtracted. Coordinates, distances, and angles are concatenated as input to a support vector machine. Figure 2 shows the feature extraction procedure in this method.

Fig. 2. Process to extract features using the geometric-based approach.
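The feature construction can be sketched as follows, assuming dlib's standard 68-landmark predictor file is available locally; the centroid normalization and nose-bridge rotation compensation follow our reading of the description above and are not the authors' exact code.

```python
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
# Path to the standard 68-landmark model distributed with dlib's examples.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def geometric_features(gray):
    """Return [x, y, distance, angle] features for the 68 landmarks of one face."""
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], dtype=float)

    center = pts.mean(axis=0)                     # center of gravity of the landmarks
    rel = pts - center                            # coordinates relative to that center
    dist = np.hypot(rel[:, 0], rel[:, 1])         # magnitude of every center-to-point line
    ang = np.degrees(np.arctan2(rel[:, 1], rel[:, 0]))  # direction, 0 degrees = horizontal

    # Tilt compensation: offset every angle by the nose-bridge angle
    # (landmarks 27 and 30 in the 68-point scheme) so rotated faces line up.
    bridge = pts[30] - pts[27]
    ang -= np.degrees(np.arctan2(bridge[1], bridge[0])) - 90.0

    return np.concatenate([rel.ravel(), dist, ang])  # SVM input vector
```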
3.3. Convolutional neural network

Convolutional Neural Networks (CNNs) are organized in a multi-layer architecture. Each layer has a specialized function. The convolutional layers have the goal of extracting features and patterns from the images. The pooling layers have the goal of decreasing the number of final features and reducing bias problems. The neural network layers have the goal of classifying the data obtained from the previous adjacent layers. We tried several architectures based on a similar work (LeNet, which uses 2 convolutional layers, 2 max-pooling layers, and 3 fully connected layers). The architecture that showed the best performance consists of nine layers (excluding dropout connections); each convolutional layer contains 64 filters. Figure 3 shows the designed architecture, which consists of three convolutional layers, 3 max-pooling layers and 3 fully connected neural network layers.

Fig. 3. Convolutional neural network architecture.

3.3.1. Preprocessing

Preprocessing is not a part of the architecture. However, a CNN has the inconvenience of needing powerful hardware. The use of filters on a large number of images with multiple dimensions causes an important workload on the CPU. Applying a preprocessing step helps the CNN model to converge adequately. The process consists of locating a region of interest (ROI) in every facial image. The Viola-Jones method [26] and the OpenCV software were used. After the ROI is located, it is extracted from the image and transformed to a size of 75 × 75 pixels. At the end, the image is converted to a grayscale image and saved into a new database.
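A compact version of this preprocessing step, using the Haar-cascade implementation of the Viola-Jones detector that ships with OpenCV, might look as follows; the file paths are placeholders and detection is done on the grayscale image for simplicity.

```python
import cv2

# The Haar cascade shipped with OpenCV implements the Viola-Jones detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess(image_path, size=75):
    """Locate the face ROI, crop it, resize it to 75x75 and keep it in grayscale."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                      # no face found, discard the photo
    x, y, w, h = faces[0]                # keep the first detected face
    roi = gray[y:y + h, x:x + w]
    return cv2.resize(roi, (size, size))

# processed = preprocess("student_0001.png")
# if processed is not None:
#     cv2.imwrite("preprocessed/student_0001.png", processed)
```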
3.3.2. The convolutional layer

As mentioned above, a convolutional layer applies the mathematical convolution operation, which involves 3 elements. One is the data input, which is usually expressed as a multidimensional array of data. Another one is the kernel, which is a multidimensional array of parameters that are adapted by the learning algorithm. The last element is the output, which is known as the feature map. Multidimensional arrays are called tensors. The main idea behind the network is that the kernel can identify the visual patterns that come from the input (edges, lines, colors, etc.) and thus be able to differentiate the visual patterns of different objects. In the convolutional process, the kernel is overlapped on the input and then a cross-correlation operation is performed, which is equivalent to a convolutional operation. Each value of the input is multiplied by the value at the same position in the kernel and the resulting values are summed and placed in the output. The process is as follows: first, the kernel is overlapped on top of the input image; second, the product between each number in the kernel and each number in the overlapped input is computed; third, a single number is obtained by summing these products together; fourth, the obtained number is set in the convolutional output; fifth and last, the kernel is moved to the next section of the input image. An example of the convolution process for a part of the face is shown in Fig. 4, where the numbers are simplified for the example. The patterns we need to detect are related to the shapes rather than to the colors, so the images are preprocessed to black and white. This also helps to decrease the dimensionality of both the inputs and the kernel. In addition, a ReLU (Rectified Linear Units) function was added, which modifies the values in the convolutional layer without affecting its important properties. The function takes the values of the output and, in case they are less than zero, sets them to zero. One of the characteristics of the ReLU function is that it has a non-linear property. We consider the ReLU function as part of the activation function of the convolutional layers. The pixel values of the face are interpreted as an array and the kernel is overlapped to produce an output. In the architecture we placed 3 convolutional layers with the ReLU activation function. The configuration used is a kernel of size 3 × 3, 64 filters for each layer, and a stride of size 1.

Fig. 4. Example of the convolutional process on a part of the face.

3.3.3. Max-pooling layers

A typical convolutional neural network architecture consists of three stages for feature extraction. In the first stage, one or more convolutional layers perform in parallel a series of linear activations. In the second stage, each convolutional layer executes a non-linear activation function called ReLU. These first two stages are sometimes called the detection stage (similar to the feature extraction process of other methods). In the third stage, a pooling function is used to modify the result obtained from the previous steps. Pooling is a grouping function that replaces the values obtained from the convolutional layers. Max-pooling is the most used grouping function, which keeps the maximum values as output. Other common pooling functions include the average of a rectangular neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel. A pooling layer has the utility of reducing the spatial dimension of a convolutional layer before sending the data to the next convolutional layer (or any other type of layer). The operation performed by this layer leads to information loss and is referred to as down-sampling. The operation used in the designed architecture is max-pooling with a window of size 2 × 2. The max-pooling process selects the maximum value of a selected area (window). Figure 5 shows an example of the application of max-pooling on a data entry. The numbers shown in the figure are a simplified example.

Fig. 5. Max-pooling process on an input data.
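The two operations just described, a single-channel 3 × 3 convolution with ReLU followed by 2 × 2 max-pooling, can be sketched in a few lines of NumPy. This is a didactic illustration with our own helper names, not the library code used in the experiments.

```python
import numpy as np

def conv2d_relu(image, kernel):
    """Slide a kernel over the image with stride 1 and apply ReLU to the result."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            window = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(window * kernel)   # multiply-and-sum step
    return np.maximum(out, 0.0)                   # ReLU: negative values become zero

def max_pool(feature_map, size=2):
    """Keep the maximum of every non-overlapping size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size             # drop a border row/column if needed
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy example in the spirit of Figs. 4 and 5: a 5x5 "image" and a 3x3 kernel.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                    # simple averaging kernel
print(max_pool(conv2d_relu(image, kernel)))       # 3x3 feature map pooled down to 1x1
```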
3.3.4. Classification layers

The architecture has three fully connected neural network layers. A fully connected layer takes all the units in the previous layer (no matter what type of layer it is). Fully connected layers are not spatially located anymore, so it is not possible to have convolutional layers after a fully connected layer. The first two layers use ReLU as the activation in their outputs, but the third layer (the classification layer) uses Softmax as its activation function. Another feature of the first two dense layers in the CNN architecture is that they have a Dropout connection. The intention is to reduce the saturation of data between the layers of the neural network and thus avoid bias in the data that affects the classification process. The Dropout connection randomly selects a part of the input data and sets the values of that part to zero. This connection performs the dropout task in each training-stage iteration. The fraction of the selected data is 50%.

3.3.5. Configuration of the CNN architecture

Table 2 shows the configuration of the CNN architecture. It includes the name of the layer, the type of layer, and the sizes of the input and output dimensions for each layer. Conv2D means a convolutional layer of two dimensions; MaxPooling2D represents a max-pooling layer of two dimensions; Flatten indicates a layer that transforms a matrix into a one-dimensional vector; Dense specifies a densely connected neural network layer; and Dropout denotes a layer that performs a dropout operation. Convolutional layers have different input and output dimensions because a kernel of 3 × 3 is overlapped on them; this generates a reduction of data in each layer. Max-pooling uses a stride of 2 × 2, causing a reduction of up to almost half of the data in each layer. To improve the classification, we decided to flatten the data before giving it as input to the first dense layer. Dense layers do not modify the data dimension, except the final dense layer, which reduces the data to 15 units.

Table 2
Description of the CNN architecture

Name | Layer Type | Input | Output
conv2d_1_input | InputLayer | (75,75,3) | (75,75,3)
conv2d_1 | Conv2D | (75,75,3) | (73,73,64)
max_pooling2d_1 | MaxPooling2D | (73,73,64) | (36,36,64)
conv2d_2 | Conv2D | (36,36,64) | (34,34,64)
max_pooling2d_2 | MaxPooling2D | (34,34,64) | (17,17,64)
conv2d_3 | Conv2D | (17,17,64) | (15,15,64)
max_pooling2d_3 | MaxPooling2D | (15,15,64) | (7,7,64)
flatten_1 | Flatten | (7,7,64) | (3136)
dense_1 | Dense | (3136) | (500)
dropout_1 | Dropout | (500) | (500)
dense_2 | Dense | (500) | (500)
dropout_2 | Dropout | (500) | (500)
dense_3 | Dense | (500) | (15)
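Since the layer names in Table 2 follow Keras conventions, the architecture can be reconstructed approximately as follows. This is our reconstruction from the table and the text (75 × 75 input, 64 filters of size 3 × 3, 2 × 2 pooling, 500-unit dense layers with 50% dropout, 15 output classes); the optimizer and loss in the compile step are common defaults and are not specified in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(75, 75, 3), num_classes=15):
    """Rebuild the nine-layer architecture of Table 2 (dropout connections excluded
    from the layer count): 3x (Conv2D + MaxPooling2D) followed by 3 Dense layers."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.5),                      # 50% of the units are zeroed per step
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With a 75 × 75 input, the successive valid 3 × 3 convolutions and 2 × 2 poolings reproduce the shapes listed in Table 2, ending with a 7 × 7 × 64 map flattened to 3136 units.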

Fig. 6. Photos of a part of three databases.


4. Process of creating facial expression databases

An essential part of any recognition system is the database used for training (Fig. 6). Databases contain relevant information so that a recognition system is able to discriminate the important data to classify. We proposed a new method to build face expression databases using two EEG-based Brain-Computer Interface (BCI) systems: Emotiv Epoc and Emotiv Insight. They are interface systems that capture brain activity and give information about the emotion that the student is feeling. Next, we describe the devices and methods used.

4.1. EEG, Emotiv EPOC and Emotiv Insight

EEG is a technique for monitoring the brain's encephalographic signals. It is a non-invasive technique where electrodes are placed on the scalp. Emotiv EPOC is a device built by the bio-informatics and technology company EMOTIV Inc. [27]. The set of tools of Emotiv comprises a wireless neuroheadset which works with Bluetooth signals, an SDK to develop applications to gather and analyze data, a suite of desktop applications for the Emotiv EPOC, and a suite of mobile applications for the Emotiv Insight.

4.2. Protocol for building and filtering the facial expression database

We looked for a method to capture expressions during an educational context. In addition, we looked for an activity related to the domain of the intelligent tutoring system that uses the facial recognition system. The protocol followed for the creation of the database was as follows. The data was captured with 38 students from the Instituto Tecnológico de Culiacán, 28 men and 10 women. The participants were between 18 and 47 years old. The students wrote, compiled and executed programs in Java while wearing the Emotiv headset, which obtained their emotional state during the coding of the program. In most of the works for building facial expression databases, experts in judging emotions participate in the annotation process to tag the captured images of students or users. In our work, the labeling is carried out automatically by an application and with the help of Emotiv. Figure 7 shows the method used; it is explained as follows (a code sketch of the capture loop is given at the end of this subsection):

1. The student codes a Java program; meanwhile, the Emotiv device captures brain activity and the webcam takes a photograph every 5 seconds.
2. Every photograph is labeled by the system with the user emotion obtained at that moment from the Emotiv device. The annotation is made by an application that takes the emotion from the Emotiv device and labels the photograph with that emotion.
3. The photograph previously labeled is saved into the facial expression database.
4. After the previous steps are finished, a group of experts evaluates whether there is a match between the emotional label and the expression in the photo. If so, the photo is kept; otherwise it is discarded.

Fig. 7. Method to take photographs for the face expression database.

We obtained two databases, one for each Emotiv device. The Emotiv Epoc database stored a total of 7,019 photographs. However, several photos did not have a proper match with their labeled emotion. We proceeded to filter the database, eliminating incorrectly labeled photos and obtaining a database of 730 photographs. This debugging process gave the database an additional validation from human judges, so it contains properly labeled facial expressions. The Emotiv Insight database obtained a total of 5,560 labeled images; this database has not received a filtering process, so it keeps its original size. Figure 6 shows parts of both facial expression databases.
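The automatic labeling application of steps 1 to 3 can be sketched as below. The webcam capture uses OpenCV; read_current_emotion() is only a placeholder for whatever call the Emotiv SDK exposes to query the current affective state, and the output folder and timing parameters are ours.

```python
import time
import cv2

def read_current_emotion():
    """Placeholder for the Emotiv SDK query that returns the current emotion label."""
    raise NotImplementedError("replace with the Emotiv EPOC/Insight SDK call")

def capture_session(student_id, duration_s=3600, period_s=5, out_dir="captures"):
    """Take a webcam photo every 5 seconds and name it with the Emotiv emotion label."""
    camera = cv2.VideoCapture(0)                  # default webcam
    start = time.time()
    shot = 0
    while time.time() - start < duration_s:
        ok, frame = camera.read()
        if ok:
            emotion = read_current_emotion()      # step 2: label from the headset
            filename = f"{out_dir}/{student_id}_{shot:05d}_{emotion}.png"
            cv2.imwrite(filename, frame)          # step 3: save the labeled photo
            shot += 1
        time.sleep(period_s)                      # step 1: one photograph every 5 s
    camera.release()
```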
5. Tests and discussion

We show and explain the tests performed on the recognizers that were trained and tested with different databases. The tests consisted of measuring the accuracy of the recognizers using the RaFD database and the two databases built with Emotiv Epoc (dbE) and Emotiv Insight (dbI). In the case of RaFD we decided to use 6 of the 8 basic emotions because they are the most common emotions in other research works, so we could make comparisons of our results to better validate our emotion recognition. Table 3 describes the contents of the three databases.

Table 3
Description of the databases

Database | RaFD | Emotiv Epoc | Emotiv Insight
Face number | 1146 | 730 | 5056
Labels (classes) | 6 | 4 | 6

The RaFD database [10] contains a total of 1146 photos; 191 photographs for each basic emotion class. The Emotiv Epoc and Emotiv Insight databases contain images of learning-oriented facial expressions and have three emotion labels in common. The distribution of classes of the databases built with Emotiv is shown in Table 4.

Table 4
Class distribution for the databases built with Emotiv Epoc and Insight

Emotion | Emotiv Epoc | Emotiv Insight
Boredom | 17 | 1040
Engagement | 519 | 1955
Excitement | 91 | 1661
Frustration | 104 | –
Focus | – | 222
Relax | – | 28
Interesting | – | 150

The test consists of a cross-validation with k = 10. This means that the recognizers were trained and tested 10 times using a different segment of the input data in each training step. Data were divided into 90% for training and 10% for precision testing at each iteration. The data selection was random for both the training part and the testing part. The division of the data did not take into account whether a person existed in both parts of the data, since the objective is the general recognition of an emotion in people and not in a particular individual. The features obtained from the LBP and geometric-based techniques were used to train three classification algorithms: Support Vector Machine (SVM), Artificial Neural Network (ANN) and K-Nearest Neighbors (KNN). In addition, a test was added where the convolutional filters (CF) are applied to the images; the results of these filters were used as input features for the three classifiers mentioned above. The accuracy obtained on the three databases in combination with the classifiers and extraction techniques is shown below.

With the RaFD database, most of the methods did not obtain significant results. Only two combinations obtained results over 85% accuracy. SVM with the geometric-based method obtained a good result, but the CNN obtained a value close to 100%. Table 5 shows the results for the tests with the RaFD database.

Table 5
Accuracy obtained with the RaFD database

Classifier/Feature Extractor | LBP | Geometric-based | CF
KNN | 55% | 70% | 30%
ANN | 88% | 60% | 31%
SVM | 66% | 92% | 73%
CNN | – | – | 95%

With the dbE database, most methods obtained an accuracy between 80% and 85%. The accuracy values show how a filtered database can help classifiers to get a better accuracy. Table 6 shows the test data for the dbE database.

Table 6
Precision obtained with the database dbE

Classifier/Feature Extractor | LBP | Geometric-based | CF
KNN | 85% | 84% | 85%
ANN | 85% | 80% | 85%
SVM | 85% | 84% | 77%
CNN | – | – | 88%

With the dbI database, we obtained lower results than with the other two databases (RaFD and dbE). One reason for the lower results is that this database contains an amount of photos five times greater than RaFD or dbE, and they have not received a filtering process. The CNN obtained an average accuracy of 74%, which we consider a good result that should improve considerably once the dbI database goes through a filtering process. Table 7 shows the results for the dbI database.

Table 7
Precision obtained with the database dbI

Classifier/Feature Extractor | LBP | Geometric-based | CF
KNN | 70% | 61% | 65%
ANN | 74% | 51% | 61%
SVM | 69% | 61% | 63%
CNN | – | – | 74%

In the case of the RaFD database and using the CNN architecture we obtained a precision of 95%. Only the combination of the geometric-based method with SVM came close to what was obtained with this architecture. This clarifies that in the case of a database of basic emotions with discrete and acted emotions the architecture has no problem. In addition, if we contrast this result against other works that have been tested with different databases of basic emotions, we find that we obtained satisfactory results. For example, in Ilbeygi's work [28], they used a technique that combines traditional feature extraction with fuzzy sets and obtained 93.95% accuracy, and Recio [29] averaged less than 90% in all experiments. RaFD has not yet been tested with convolutional neural networks. However, we can compare our work with [25], which uses convolutional neural networks that were tested with two
databases of basic emotions (CK [9] and Geneva [13]), obtaining a maximum value of 98% accuracy.

In the case of dbE, the precision obtained had a value of 88% using the CNN architecture, which is superior to the other combinations of recognition techniques tested. In previous work [4, 5], in the tests with this database we reached a maximum value of 86% accuracy, clarifying that there were some notable differences between both tests. In those reported works the database was reduced because the recognizer could not identify all the parts of the face in several photos, which is fundamental for the feature extractor; for this reason only 20% of the faces were detected. Comparing this work with other research is complicated. To the best of our knowledge, the problem of recognizing non-basic emotions using convolutional neural networks has not been addressed. In [30], we found an emotion recognition work using the Acted Facial Expressions in the Wild database, a database that collects images from movies to capture more realistic expressions. This work obtained an average below 70%, where six of the seven emotions analyzed obtained a percentage below 70%. With this, we can conclude that our recognizer has high accuracy, since it identifies non-basic emotions obtained from a real programming context. There have also been works that have made a comparison with non-basic emotions but without using convolutional networks. The work of Bosch [31] is an example of this, where they obtained a precision of less than 70%.

In the case of dbI, lower percentages were obtained compared to the previous databases. By joining the LBP and ANN methods, we obtained precisions similar to the CNN architecture. This helps us to understand that the database still requires work similar to dbE. A filtering job has not yet been performed on this database, and it has an unbalanced class problem. Even so, we consider our results satisfactory, because the previous comparisons with [31] and [30] give a clear idea of how complex it is to work with a database of spontaneous expressions.

6. Conclusion and future work

This work presents an architecture of a convolutional neural network for the recognition of learning-centered emotions. The proposed architecture consists of 3 convolutional layers, each followed by a max-pooling layer, and finally 3 layers of fully-connected neural networks. The CNN was tested with 3 facial expression databases: one database contains posed or acted basic emotions and two databases contain spontaneous learning-centered emotions. The evaluation suggests that the architecture can reliably detect facial expressions of acted basic emotions, having similar or superior results to other popular methods when detecting similar emotions. Evidence also shows that learning-centered emotions can be successfully recognized provided that the database is validated and filtered. Our architecture is the first one to be proven with this type of emotions and expressions. Many of the tests with this type of architecture have been done with databases of acted expressions or of spontaneous expressions that have no relation with the learning process. In addition, we validate the importance of two new databases built by us and their effectiveness in the training of different types of classifiers. As future work we have to perform tasks such as increasing some classes (emotions) that are unbalanced in the dbI database, trying new architectures with other pooling layers, and performing different filtering and preprocessing methods on our databases dbE and dbI.

References

[1] R.E. Kaliouby, R. Picard and S. Baron-Cohen, Affective Computing and Autism, Ann N Y Acad Sci 1093(1) (2006), 228–248.
[2] P. Ekman, An argument for basic emotions, Cogn Emot 6(3) (1992), 169–200.
[3] R.Z. Cabada, M.L.B. Estrada, F.G. Hernandez and R.O. Bustillos, An Affective Learning Environment for Java, in 2015 IEEE 15th International Conference on Advanced Learning Technologies (2015), pp. 350–354.
[4] R. Zatarain-Cabada, et al., Building a Corpus and a Local Binary Pattern Recognizer for Learning-Centered Emotions, Adv Artif Intell Its Appl (2016).
[5] R. Zatarain-Cabada, M.L. Barrón-Estrada, F. González-Hernández and H. Rodriguez-Rangel, Building a Face Expression Recognizer and a Face Expression Database for an Intelligent Tutoring System, in Advanced Learning Technologies (ICALT), 2017 IEEE 17th International Conference on (2017), pp. 391–393.
[6] S. D'Mello and A. Graesser, Dynamics of affective states during complex learning, Learn Instr 22(2) (2012), 145–157.
[7] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature 521(7553) (2015), 436–444.
[8] T. Kanade, J. Cohn and Y. Tian, Comprehensive database for facial expression analysis, in Automatic Face and Gesture Recognition (2000), pp. 46–53.
[9] P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar and I. Matthews, The extended Cohn-Kanade dataset (CK+): A complete facial expression dataset for action unit and emotion-specified expression, in CVPRW (2010), pp. 94–101.
[10] O. Langner, R. Dotsch, G. Bijlstra, D.H.J. Wigboldus, S.T. Hawk and A. van Knippenberg, Presentation and validation of the Radboud Faces Database, Cogn Emot 24(8) (2010), 1377–1388.
[11] G. McKeown, M. Valstar, R. Cowie, M. Pantic and M. Schröder, The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent, IEEE Trans Affect Comput 3(1) (2012), 5–17.
[12] M.F. Valstar and M. Pantic, Induced Disgust, Happiness and Surprise: An Addition to the MMI Facial Expression Database, in Proceedings of Int'l Conf. Language Resources and Evaluation, Workshop on EMOTION (2010), pp. 65–70.
[13] T. Bänziger, M. Mortillaro and K.R. Scherer, Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception, Emotion 12(5) (2012), 1161–1179.
[14] T. Ojala, M. Pietikäinen and T. Mäenpää, Gray Scale and Rotation Invariant Texture Classification with Local Binary Patterns, IEEE Trans Pattern Anal Mach Intell 24(7) (2000), 404–420.
[15] S.L. Happy and A. Routray, Automatic facial expression recognition using features of salient facial patches, IEEE Trans Affect Comput 6(1) (2015), 1–12.
[16] B. Jiang, M. Valstar, B. Martinez and M. Pantic, A dynamic appearance descriptor approach to facial actions temporal modeling, IEEE Trans Cybern 44(2) (2014), 161–174.
[17] T. Wu, N.J. Butko, P. Ruvolo, J. Whitehill, M.S. Bartlett and J.R. Movellan, Action unit recognition transfer across datasets, in 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, FG 2011 (2011), pp. 889–896.
[18] K. Huang, S. Huang and Y. Kuo, Emotion Recognition Based on a Novel Triangular Facial Feature, in Neural Networks (IJCNN), The 2010 International Joint Conference on (2010), pp. 18–23.
[19] A. Majumder, L. Behera and V.K. Subramanian, Emotion recognition from geometric facial features using self-organizing map, Pattern Recognit 47(3) (2014), 1282–1293.
[20] F.Z. Salmam, A. Madani and M. Kissi, Facial Expression Recognition Using Decision Trees, in 2016 13th International Conference on Computer Graphics, Imaging and Visualization (CGiV) (2016), pp. 125–130.
[21] Y. Sun, X. Wang and X. Tang, Deep learning face representation from predicting 10,000 classes, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 1891–1898.
[22] G. Levi, Emotion recognition in the wild via convolutional neural networks and mapped binary patterns, in ICMI (2015), pp. 503–510.
[23] B.-K. Kim, H. Lee, J. Roh and S.-Y. Lee, Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition, in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (2015), pp. 427–434.
[24] Z. Yu, Image based Static Facial Expression Recognition with Multiple Deep Network Learning, in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (2015), pp. 435–442.
[25] D.E. King, Dlib-ml: A Machine Learning Toolkit, J Mach Learn Res 10 (2009), 1755–1758.
[26] P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, 1 (2001), pp. I-511–I-518.
[27] Emotiv, Emotiv Insight, Web Page, 2016.
[28] M. Ilbeygi and H. Shah-Hosseini, A novel fuzzy facial expression recognition system based on facial feature extraction from color face images, Eng Appl Artif Intell 25(1) (2012), 130–146.
[29] G. Recio, A. Schacht and W. Sommer, Recognizing dynamic facial expressions of emotion: Specificity and intensity effects in event-related brain potentials, Biol Psychol 96 (2014), 111–125.
[30] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang and X. Chen, Combining Multiple Kernel Methods on Riemannian Manifold for Emotion Recognition in the Wild, in Proceedings of the 16th International Conference on Multimodal Interaction - ICMI '14 (2014), pp. 494–501.
[31] N. Bosch et al., Automatic Detection of Learning-Centered Affective States in the Wild, in Proceedings of the 20th International Conference on Intelligent User Interfaces - IUI '15 (2015), pp. 379–388.
