
A supervised approach to support the analysis and the classification of non verbal humans communications

Vitoantonio Bevilacqua1,2,*, Marco Suma1, Dario D'Ambruoso1, Giovanni Mandolino1, Michele Caccia1, Simone Tucci1, Emanuela De Tommaso1, Giuseppe Mastronardi1,2

1 Dipartimento di Elettrotecnica ed Elettronica, Polytechnic of Bari, Italy
2 e.B.I.S. s.r.l. (electronic Business in Security), Spin-Off of Polytechnic of Bari, Italy
* corresponding author:

Abstract. Background: It is well known that non-verbal communication is sometimes more useful and robust than verbal communication for understanding sincere emotions, through the analysis of spontaneous body gestures and facial expressions acquired from video sequences. At the same time, automatic or semi-automatic procedures that segment humans from a video stream and then extract features for robust supervised classification remain an active field of interest in computer vision and intelligent data analysis.

Materials and Methods: We obtained data from four datasets. The first contains 100 images of human silhouettes (or templates) acquired from a video sequence dataset available online; the second contains 543 images of gestures from a pre-recorded video of MotoGP rider Jorge Lorenzo; the third contains 200 images of mouths; and the fourth contains 100 images of noses. The third and fourth datasets contain images acquired with a tool implemented by the authors as well as samples from public databases available in the literature. We used supervised methods to train the proposed classifiers: three different EBP (Error Back Propagation) neural network architectures for human templates, mouths and noses, and the J48 algorithm for gestures.

Results: On average, we obtained 80% correct classification for the binary classifier of human templates (with no false positives), 90% correct classification for the happy/non-happy emotion, 85% for the binary disgust/non-disgust emotion and 80% correct classification for the 4 different gestures.

Keywords: Neural Network, Emotions Recognition, Human Silhouettes, Gesture Recognition, Facial Expressions Recognition, Human Detection, Hands, Action Units, Centre of Gravity, Pose Estimation.

1 Introduction
Good communication is the foundation of successful relationships, both personal and professional. But we communicate with much more than words: many studies show that the majority of our communication is nonverbal. Nonverbal communication, or body language, includes facial expressions, gestures, eye contact, posture and even the tone of our voice. Emotions conveyed by facial expressions have received more attention than those conveyed by body expression (gestures), although evidence suggests that bodily behaviours may be associated with specific emotions [1],[2],[3],[4]. Although the details of his theory have evolved substantially since the 1960s, Ekman remains the most vocal proponent of the idea that emotions are discrete entities. In a survey he outlined his theory of basic emotions and their relationship with facial expressions [5]. The human face is extremely expressive, able to express countless emotions without saying a word. And unlike some forms of nonverbal communication, facial expressions are universal. In particular, the facial expressions for happiness, sadness, anger, surprise, fear and disgust are the same across cultures.

As for gesture recognition, we consider how our perceptions of people are affected by the way they sit, walk, stand up and move or hold their head. The way we move communicates a wealth of information to the world. This type of nonverbal communication includes our posture, stance and subtle movements. Gestures are omnipresent in our daily lives. We wave, point, beckon and use our hands when we are arguing or speaking animatedly, expressing ourselves with gestures often without thinking. However, the meaning of gestures can be very different across cultures and regions, so it is important to be careful to avoid misinterpretation. Building on these ideas, we want to provide an automatic system that is able to evaluate emotions in particular situations (videoconferences, meetings, neurological examinations, investigations).

2 Materials
The materials for all four datasets, which contain images of human silhouettes, gestures, mouths and noses, were collected with the goal of increasing the variance of the samples and thus enriching the information in the training examples needed by the proposed supervised classifiers.

2.1 Human silhouettes
The human silhouettes used in this paper belong to people walking in a video stream dataset. The training examples consist of only 20 different binary silhouette images, obtained after a pre-processing phase of background subtraction. With this method, the training examples comprise several different human silhouettes extracted from each frame.

Fig. 1. a) and b) sample frames; c) four examples of human silhouettes with different dimensions and behaviours.

2.2 Facial Expression
The facial expression classification used in this paper comes from the studies of P. Ekman. First of all, we introduce the concept of Action Units (AUs): minimal, and therefore not separable, facial actions that serve as elements for the construction of facial expressions. Combinations of these, with different intensities, generate facial expressions [6].

Fig. 2. Action Units (AUs) in order from the left: AU-10, AU-12, AU-13

According to our previous work [7], we can assert that, in general and regardless of other AUs, the presence of AU-10 unequivocally indicates the disgust emotion, while the presence of AU-12 or AU-13 unequivocally indicates the happy emotion. For this reason we are able to recognize two of the six primary emotions described by Paul Ekman: happiness and disgust. To extract the middle and lower parts of the face we used our own tool; in addition, we used different public databases of faces [8][9][10], from which we extracted our regions of interest.

2.3 Gestures
A pre-recorded video of MotoGP rider Jorge Lorenzo was analyzed, gathering 543 gestures. Each frame has a resolution of 640x480 pixels. The automatic classification of gestures is based on studies by the psychologist David McNeill [11], who divides gestures into four main categories:

deictic gestures: typical pointing movements, usually emphasized by the movement of the fingers or of other parts of the body that can be used for this purpose. These gestures rarely appear within discourse, except when indicating a concrete entity. Their meaning is independent of the area in which they occur;

Fig. 3. Representation of hands position in the deictic gestures

iconic gestures: gestures that bear a formal relation to the semantic content of the discourse. They usually represent an action or an object, and mainly occur in the area occupied by the torso of the subject being filmed;

Fig. 4. Representation of hands position in the iconic gestures

metaphoric gestures: similar in form to iconic gestures, which represent real figures, but these refer to abstract concepts, such as moods or language. Such gestures are concentrated in the area of the lower part of the torso;

Fig. 5. Representation of hands position in the metaphoric gestures

beat gestures: these carry no distinct meaning, and can be recognized only by focusing on the characteristics of their movement. They are usually made up of two short phases in which a rapid movement of the fingers or hands is noticeable.

Fig. 6. Representation of hands position in the beat gestures

In order to distinguish the type of gesture made by the subject being filmed, we monitor the movement of the centre of gravity (CG) of the hands in each frame. This makes it possible to calculate various evaluation parameters, such as the velocity with which gestures are made, the path of the gestures with respect to the position of the body, and whether they begin or end abruptly, all in accordance with the classification mentioned above.
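As an illustration of the parameters mentioned above, the following minimal sketch derives speed, body-relative path and abrupt start/end flags from a sequence of hand CG positions, one per frame. The function names, the frame rate of 25 fps and the speed threshold are assumptions made for this example; they are not taken from the authors' implementation.

```python
from math import hypot


def speeds(track, fps=25.0):
    """Speed (pixels per second) between consecutive CG positions.

    `track` is a list of (x, y) tuples, one per frame.
    The default frame rate is an assumption, not a value from the paper.
    """
    return [hypot(x1 - x0, y1 - y0) * fps
            for (x0, y0), (x1, y1) in zip(track, track[1:])]


def relative_path(track, torso_centre):
    """Express the CG path with respect to a reference point on the body."""
    tx, ty = torso_centre
    return [(x - tx, y - ty) for x, y in track]


def abrupt_ends(speed_values, threshold=200.0):
    """Return (starts_abruptly, ends_abruptly).

    A gesture is flagged when its first or last measured speed is at least
    `threshold` pixels per second (hypothetical threshold).
    """
    if not speed_values:
        return (False, False)
    return (speed_values[0] >= threshold, speed_values[-1] >= threshold)
```

For example, a hand that moves 5 pixels between the first two frames and then rests yields a high initial speed and a zero final speed, so only the start is flagged as abrupt.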

3 Methods
The application of supervised neural networks trained with the Error Back Propagation (EBP) algorithm offers a simpler solution to complex problems such as the correct classification of silhouette shapes, facial expressions and gestures. The advantages of neural networks include their high tolerance to noise and their ability to classify patterns not used for training, that is, their good performance in terms of generalization. In particular, we implemented supervised neural network classifiers for silhouettes and for the mouth and nose emotion features, and the J48 classifier for gestures.

3.1 Silhouettes classification
The neural network classifier used to detect silhouettes in the available frames is a two-layer feed-forward network with 396 inputs (corresponding to the 33*12 pixels of the smallest figure, previously resized to contain the smallest human silhouette), 6 logistic neurons in the first layer and 1 output neuron. The images passed to the neural network have the following characteristics:

the height is greater than the width;
the ratio between height and width ranges between 1.9 and 4;
the height is greater than 33 pixels;
the width is greater than 12 pixels.

Each frame is divided into several images, each containing a single human silhouette, which is always resized to 33*12 pixels. This procedure guarantees a constant number of neural network inputs for every classification sample. To achieve good performance in terms of generalization, the training images are selected with large variability in terms of poses and movements, and include the following positive (a, b, c) and negative (d) examples:

a. people not looking at the camera (non-frontal images);
b. people with their arms far from or close to the body;
c. people not clearly identified because only one arm is visible;
d. objects similar to people that are wrongly detected as people (as counter-examples).

3.2 Facial Expression classification
We built two NNs that work in parallel. The first one receives the lower part of the face, in particular the shape of the mouth: in happy expressions the mouth should be open, the teeth should be visible and the shape of the mouth is curved (AU-12, AU-13). The second one receives the middle part of the face, which contains the nose: in disgust expressions the nasolabial furrows are visible (AU-10).
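Returning to the silhouette classifier of Section 3.1, the sketch below illustrates the four geometric constraints, the resizing to 33*12 pixels and a forward pass through a 396-6-1 network. It is a minimal sketch under stated assumptions: the nearest-neighbour resizing, the logistic output neuron, the random (untrained) weights and all function names are choices made for this example and are not specified in the paper.

```python
import math
import random

TARGET_H, TARGET_W = 33, 12        # silhouette size stated in the text
N_INPUTS = TARGET_H * TARGET_W     # 396 network inputs
N_HIDDEN = 6                       # logistic neurons in the first layer


def is_candidate(width, height):
    """Apply the four constraints of Section 3.1 (strict 'greater than')."""
    if height <= TARGET_H or width <= TARGET_W:
        return False
    return height > width and 1.9 <= height / width <= 4.0


def resize_nearest(image, out_h=TARGET_H, out_w=TARGET_W):
    """Nearest-neighbour resize of a binary image (assumed resizing method)."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)] for r in range(out_h)]


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_network(seed=0):
    """Untrained 396-6-1 network with small random weights (illustration only)."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS)]
                for _ in range(N_HIDDEN)]
    b_hidden = [0.0] * N_HIDDEN
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
    return w_hidden, b_hidden, w_out, 0.0


def forward(image, network):
    """Resize the silhouette, flatten it to 396 inputs and compute the output."""
    w_hidden, b_hidden, w_out, b_out = network
    pixels = [p for row in resize_nearest(image) for p in row]
    hidden = [logistic(sum(w * p for w, p in zip(ws, pixels)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    # The output activation is assumed to be logistic as well.
    return logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
```

A foreground blob of 24x66 pixels passes the geometric filter (ratio 2.75), is resized to 33*12 and produces a score between 0 and 1; with trained weights, a threshold on that score would separate silhouettes from counter-examples.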



Fig. 7. Segmentation and vectorization of the face (steps shown: ROI extraction, gray-scale conversion).

Each gray-scale bitmap image is a band of 40x80 pixels containing, respectively, the lower or the middle part of the face. To use it as input for the neural network, the image is arranged in an array and then normalized, obtaining a 1x50 vector (a function computes the mean value of each 8x8-pixel block). For expressions that are neither happy nor disgusted the networks return 0 (zero); otherwise (happy for the first NN, disgust for the second NN) the network returns 1 (one). The transfer function of each layer is the log-sigmoid (logistic) function. The backpropagation training function is gradient descent with momentum.

3.2.1 Mouth
To train this NN we used a training set of 200 photos, composed of 100 negative and 100 positive examples, over 20000 epochs.
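The vectorization step described in Section 3.2 (one mean value per 8x8-pixel block of a 40x80 band, giving 50 values) can be sketched as follows. The function name, the row-major ordering of the blocks and the division by 255 used as normalization are assumptions made for this example.

```python
def block_mean_vector(image, block=8):
    """Reduce a gray-scale image to one mean value per block x block tile.

    `image` is a list of rows of integer gray levels in 0..255.
    Tiles are visited in row-major order and each mean is scaled to [0, 1]
    (assumed normalization). A 40x80 band yields 5 * 10 = 50 features.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image size must be a multiple of the block size")
    features = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            total = sum(image[y][x]
                        for y in range(top, top + block)
                        for x in range(left, left + block))
            features.append(total / (block * block) / 255.0)
    return features
```

The resulting list of 50 values plays the role of the 1x50 input vector of the mouth and nose networks.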

Fig. 8. Examples of mouths obtained from our tool.

Fig. 9. Examples of mouths obtained from public databases.

The NN has a first layer of 300 neurons, a second layer of 200 neurons, a third layer of 10 neurons and 1 output neuron (300x200x10x1).

3.2.2 Nose
To train this NN we used a training set of 100 examples, composed of 50 negative and 50 positive examples, over 20000 epochs.
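The training rule named in Section 3.2, gradient descent with momentum, can be illustrated on a single logistic neuron. This is a minimal sketch of one common formulation of the momentum update (velocity = mu * velocity - lr * gradient); the learning rate, the momentum coefficient, the absence of a bias term and the toy data are assumptions, and the authors' networks are much larger multi-layer architectures.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def momentum_step(weights, velocity, grads, lr=0.1, mu=0.9):
    """One classical momentum update: v <- mu*v - lr*g, then w <- w + v."""
    new_velocity = [mu * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def train_neuron(samples, epochs=200, lr=0.1, mu=0.9):
    """Batch-train one logistic neuron (no bias) on (inputs, target) pairs."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    velocity = [0.0] * n_inputs
    for _ in range(epochs):
        grads = [0.0] * n_inputs
        for inputs, target in samples:
            output = logistic(sum(w * x for w, x in zip(weights, inputs)))
            # Delta rule: squared-error gradient through the logistic unit.
            delta = (output - target) * output * (1.0 - output)
            for i, x in enumerate(inputs):
                grads[i] += delta * x / len(samples)
        weights, velocity = momentum_step(weights, velocity, grads, lr, mu)
    return weights
```

The velocity term accumulates past gradients, which smooths the weight trajectory and can speed up progress along directions where successive gradients agree.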

Fig. 10. Examples of noses obtained from our tool.

Fig. 11. Examples of noses obtained from public databases.

The NN has a first layer of 400 neurons, a second layer of 80 neurons, a third layer of 10 neurons and 1 output neuron (400x80x10x1).

3.3 Gestures
For gesture analysis, the supervised classifier is implemented by means of the J48 algorithm instead of an EBP NN classifier. J48 is an implementation of C4.5, one of the best-known decision-tree algorithms in the field of data mining; C4.5 was originally developed by Quinlan [12] and is one of the standard algorithms for translating raw data into useful knowledge. Rule induction systems are currently employed in several different environments, ranging from loan request evaluation to fraud detection, bioinformatics and medicine [13]. The main goal of this scheme is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization. It uses a measure taken from information theory to help with the attribute selection process: at any choice point in the tree, it selects the attribute that splits the data so as to yield the largest information gain. In this case the input is an array of 10 elements, whose features are:

the x coordinate of the right hand CG;
the y coordinate of the right hand CG;
the x coordinate of the left hand CG;
the y coordinate of the left hand CG;
the position of the right hand (with respect to the torso of the subject being filmed);
the position of the left hand;
the right hand slant (measured in radians);
the left hand slant;
the velocity of the movement of the left hand.

To find the CG, frames are processed according to the following workflow:

1. skin detection by colour-space conversion from RGB to HSV, because HSV is less sensitive to lightness variations;
2. background subtraction to isolate the hands region;
3. image smoothing;
4. image binarization;
5. tracing of the rectangles that contain the hands;
6. CG identification;
7. edge and feature detection;
8. template matching to detect the resting position of the hands;
9. gesture classification and storage of the data in a .csv file.
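Some of the steps above (colour-space conversion, binarization, bounding rectangle and CG identification) can be sketched with the Python standard library alone. The HSV thresholds used for skin are illustrative assumptions, not values from the paper, and the sketch omits background subtraction, smoothing, edge detection and template matching; it also treats all skin pixels as a single region, whereas the described system tracks each hand separately.

```python
import colorsys


def skin_mask(rgb_image, h_max=0.14, s_min=0.15, s_max=0.9, v_min=0.2):
    """Binarize an RGB image: 1 where the pixel falls in the assumed skin range.

    `rgb_image` is a list of rows of (r, g, b) tuples with values in 0..255.
    colorsys works on floats in [0, 1] and returns h, s, v in [0, 1].
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            is_skin = h <= h_max and s_min <= s <= s_max and v >= v_min
            mask_row.append(1 if is_skin else 0)
        mask.append(mask_row)
    return mask


def bounding_box_and_cg(mask):
    """Return ((x_min, y_min, x_max, y_max), (cg_x, cg_y)) of the foreground.

    Returns None when the mask has no foreground pixel.
    """
    points = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    box = (min(xs), min(ys), max(xs), max(ys))
    cg = (sum(xs) / len(xs), sum(ys) / len(ys))
    return box, cg
```

Applied frame by frame, the CG coordinates returned here would correspond to the x and y features listed for each hand.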

4 Experimental results

4.1 Silhouettes recognition results
The following figures (Fig. 12 to Fig. 19) show some of the frames used to test silhouette recognition and report the results obtained by the neural network, with no false positives.

Fig. 12. Frame 9.

Fig. 13. Frame 15.

Fig. 14. Frame 17.

Fig. 15. Frame 23.

Fig. 16. Frame 25.

Fig. 17. Frame 31.

Fig. 18. Frame 33.

Fig. 19. Frame 39.

Table 1. The elaboration results.

Frame    Foreground   Human images passed    Humans correctly
number   figures      to neural networks     detected
9        29           5                      4
15       24           6                      4
17       24           6                      4
23       27           4                      3
25       28           5                      2
31       31           5                      2
33       23           4                      3
39       20           5                      4

For each frame, the table shows the number of figures detected in the foreground, the number of figures passed to the neural network and the number of people recognized by the neural network (outlined in red in the figures).

4.2 Emotion recognition results
In this paper we have presented a multimodal system that recognizes shapes and two of the six primary emotions, and analyzes information derived from gestures. The complete project aims to recognize all primary emotions, using both facial expressions and speech. For facial processing, using about 150 test images, the NNs achieved a success rate of about 90% for the happy/non-happy emotion and 85% for the disgust/non-disgust emotion. We can assert that the results are reliable, also because in some particular cases not even human beings can distinguish emotions exactly.

Fig. 20. Picture 1.

Fig. 21. Picture 5.

Fig. 22. Picture 9.

Fig. 23. Picture 10.

Fig. 24. Picture 12.

Fig. 25. Picture 18.

Fig. 26. Picture 7.

Fig. 27. Picture 17.

Fig. 28. Picture 1a.

Fig. 29. Picture 3a.

Fig. 30. Picture 6a.

Fig. 31. Picture 18a.

Fig. 32. Picture 28a.

Fig. 33. Picture 32a.

Fig. 34. Picture 44a.

Fig. 35. Picture 47a.

Fig. 36. Picture 86a.

Fig. 37. Picture 88a.

Fig. 38. Picture 1b.

Fig. 39. Picture 3b.

Fig. 40. Picture 4b.

Fig. 41. Picture 14b.

Fig. 42. Picture 22b.

Fig. 43. Picture 1c.

Fig. 44. Picture 5c.

Fig. 45. Picture 10c.

Fig. 46. Picture 31c.

Table 2. The elaboration results.

Picture number   Output     Picture number   Output
1                0.9518     3a               0.0172
5                0.9518     6a               0.0386
9                0.9518     18a              0.0172
10               0.9518     28a              0.0172
12               0.9518     32a              0.0386
18               0.9518     44a              0.0386
7                0.0220     47a              0.0277
17               0.0331     86a              0.1871
1a               0.0172     88a              0.0172
1b               0.0105     1c               0.0105
3b               0.9660     5c               0.0105
4b               0.9660     10c              0.0105
14b              0.9660     31c              0.0105
22b              0.9660

4.3 Gestures recognition results
For gestures, the confusion matrix is shown in Table 3. The classifier correctly classified approximately 80% of the gestures (Table 4). In particular, it was able to label metaphoric gestures precisely, thanks to the speed at which these gestures are made, which allows an easy interpretation by a potential listener. Performance is not optimal for the recognition of deictic and beat gestures, because if movements belonging to the first category are made very quickly, they might be confused with those of the second category.

Table 3. Confusion matrix of the data set for gestures.

                                             Deictic   Spontaneous   Beat   Not recognized,   Metaphoric
                                                                            not spontaneous
Deictic Gestures                             53        6             1      1                 0
Spontaneous gesture                          71        280           9      2                 1
Beat Gestures                                3         10            54     1                 1
Gesture not recognized and not spontaneous   0         3             1      20                0
Metaphoric Gestures                          0         1             0      0                 25

Table 4. Results for gestures recognition.

Correctly Classified Instances: Incorrectly Classified Instances:
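The counts summarized in Table 4 can be recomputed from the confusion matrix of Table 3. The sketch below assumes that the detached row of values "3 10 54 1 1" belongs to the Beat Gestures row, which is the only assignment that places 54 on the diagonal and makes the entries sum to the 543 gestures in the dataset. Under that assumption the diagonal sums to 432 of 543 instances (about 79.6%), consistent with the approximately 80% reported in the text. Whether rows denote actual or predicted classes is not stated in the source, so the per-row rates are labelled accordingly.

```python
LABELS = [
    "Deictic",
    "Spontaneous",
    "Beat",
    "Not recognized and not spontaneous",
    "Metaphoric",
]

# Rows and columns follow LABELS; the Beat row is a reconstruction (assumption).
CONFUSION = [
    [53, 6, 1, 1, 0],
    [71, 280, 9, 2, 1],
    [3, 10, 54, 1, 1],
    [0, 3, 1, 20, 0],
    [0, 1, 0, 0, 25],
]


def summarize(matrix):
    """Return (correct, incorrect, accuracy) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct, total - correct, correct / total


def per_row_rates(matrix, labels):
    """Fraction of each row that lies on the diagonal."""
    return {label: row[i] / sum(row)
            for i, (label, row) in enumerate(zip(labels, matrix))}
```

Calling `summarize(CONFUSION)` gives the number of instances on and off the diagonal, and `per_row_rates(CONFUSION, LABELS)` shows which rows contribute most of the confusion.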





5 Conclusions
The goal of this paper is to investigate emotion-related signals and to build a multimodal system that recognizes emotional patterns of the body and face using neural networks. The research aims at developing an intelligent system that can interpret intellectual conversation between human beings. When we interact with others, we continuously give and receive countless wordless signals. The nonverbal signals we send either produce a sense of interest, trust and desire for connection, or they generate disinterest, distrust and confusion. The analyzed gestures and facial emotions represent non-verbal communication; they provide the user with cues about what the speaker is saying, thus helping the listener to interpret the meaning of the words.

References

1. Coulson M., Attributing emotion to static body postures: recognition accuracy, confusions, and viewpoint dependence. Journal of Nonverbal Behavior, 28(2), 117-139 (2004)
2. De Meijer M., The contribution of general features of body movement to the attribution of emotions. Journal of Nonverbal Behavior, 13(4), 247-268 (1989)
3. Kleinsmith A., De Silva P. R., Bianchi-Berthouze N., Cross-cultural differences in recognizing affect from body posture. Interacting with Computers, 18(6), 1371-1389 (2006)
4. Wallbott H. G., Bodily expression of emotion. European Journal of Social Psychology, 28(6), 879-896 (1998)
5. Paul Ekman, FACS: Facial Action Coding System, Research Nexus division of Network Information Research Corporation, Salt Lake City, UT 84107 (2002)
6. Victoria Contreras Flores, Anatomical basic of facial expression learning tool (2005)
7. Vitoantonio Bevilacqua, Dario D'Ambruoso, Giovanni Mandolino, Marco Suma, A new tool to support diagnosis of neurological disorders by means of facial expressions (submitted to ICIC 2011)
8. The Japanese Female Facial Expression (JAFFE) Database
9. Psychological Image Collection at Stirling (PICS)
10. Project dedicated for researches on facial emotionality
11. Center for Gesture and Speech Research
12. Quinlan J. R., C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
13. Filippo Menolascina, Vitoantonio Bevilacqua et al., Novel Data Mining Techniques in aCGH based Breast Cancer Subtypes Profiling: the Biological Perspective. Proceedings of the 2007 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2007), pp. 9-16