net/publication/357548985
Madhuka Nadeeshani, Sri Lanka Institute of Information Technology
All content following this page was uploaded by Madhuka Nadeeshani on 07 February 2023.
Abstract—Globalization and technology have made virtual interviews the preferred mode of recruitment. Even though online interviews/viva have eliminated time, budgetary, and geographical barriers, the lack of insight into the interviewee's behavioural aspects is yet to be overcome. Therefore, this research proposes a machine-based approach for detecting and assessing changes in interviewees' behaviour and personality traits based on nonverbal cues. Additionally, a group analysis across applicants, as well as a comparison of the interview environment with the non-interview environment, is obtained. To achieve this, we focus on the candidate's emotion, eye movement, smile, and head movements. The system was built using deep learning and machine learning models, which achieved accuracies over 85% for smile, eye gaze, emotion, and head pose analysis. Furthermore, several machine learning models were developed based on the analysed behavioural outcomes of the interviewee to identify the Big Five personality traits, with the Random Forest model yielding the highest accuracy rate of over 75%. Our findings indicate that nonverbal behavioural cues can be utilized to determine personality traits.

Keywords—Deep learning, Personality traits, Emotion analysis, Head movement, Eye gaze, Smile analysis

I. INTRODUCTION

Traditional methods of conducting viva/interviews have been transformed into a virtual format in the recent past. Compared to traditional approaches, the virtual style of interviewing presents additional challenges for both the interviewer and the interviewee. Understanding the interviewee's psychological state during the viva/interview is one of the most complicated tasks. Virtual interviews, unlike face-to-face interviews, fail to provide a complete picture of the interviewee, as restricted framing and low-quality video over poor network connections limit the interviewer's ability to observe the interviewee's subtle behavioural changes. However, Artificial Intelligence can overcome these issues and even eliminate interviewer bias. According to the survey described in Section III-A, the majority of respondents stated that they had low confidence in tracking the interviewee's nonverbal behaviour throughout a virtual interview.

Most previous research has considered only one or two nonverbal features at a time to address the problem raised in this study. Prosodic, lexical, and facial features have been gathered to develop systems for predicting and analysing job interviews using interview videos of students at MIT [1, 2]. Prosodic features reflect the speaking style and rhythm of the speech, lexical features provide counts of specific words, and facial features capture expressions such as smiles and head gestures. The three feature types were then concatenated and used to train regression and classification models that predict the overall interview score along with other interview-specific traits such as excitement, friendliness, engagement, and awkwardness. Support Vector Regression (SVR), Lasso, and Random Forest (RF) models produced the best regression and classification results.

Interview Performance Analyzer is a proposed system that analyses an interviewee's performance by combining emotion detection with speech fluency recognition [3]. A Convolutional Neural Network (CNN) model that uses the HaarCascade classifier and Gabor filters to recognize seven primary emotions, and another model employing Mel Frequency Cepstral Coefficient (MFCC) features and logistic regression to classify speech into four categories (Fluent, Stuttering, Cluttering, and Pauses), were developed. Predictions from both models were combined to give the interviewee a performance rating. They utilized the FER2013 and CK+ datasets for speech fluency and facial expression detection.

Studies have examined the importance of both verbal and nonverbal cues through three different channels (audio, video, and a questionnaire) in determining hirability for marketing and business analyst job positions [4, 5]. The dataset was manually labeled, and a questionnaire was used to classify personality traits along with several classification models. Nonverbal behaviour, together with prosodic features, was found to have predictive validity for hirability and stress resistance in those job positions.

Among current systems, the "Fetcher" AI recruitment platform employs AI to monitor its database and delivers a supply of diverse and qualified candidates based on the filled-out job description [6]. The system only displays the contact percentage, good-fit and bad-fit percentages, and the number of views on the job post. "MyInterview" is another AI-based solution that assists in getting to know candidates and identifying the best fit for the job role [7]. It builds a personality profile of each candidate based on vocals and behaviour, and its algorithm assesses their matchability. It uses the Big Five personality model to provide personality insights about the individual.

These studies have primarily relied on audio analysis and emotions to identify various nonverbal signs, with time as a variable. They have been able to predict the best fit using personality features. Furthermore, most existing systems have targeted candidate resume filtering, screening, scanning, and reference checks, as well as automating Human Resource (HR) management procedures. For instance, Paradox [8] uses AI-powered processes to arrange interviews with reminders, and systems like XOR [9] and Humanely [10] use chatbots as a modern communication tool.

978-1-6654-2637-4/21/$31.00 ©2021 IEEE
Furthermore, technologies such as Loxo [11] and Seekout [12] aid in the initial screening and evaluation of applicant resumes, and consequently fall outside the scope of our system.

Currently, no existing research focuses on a question-based approach to analysing personality traits based on the interviewee's behaviour. Recent studies have been limited to analysing interviewees through verbal qualities; in addition, they have focused solely on whether the candidate is a good fit for the organization rather than on a detailed behaviour analysis. The main challenge is identifying personal traits, which are regarded as the most significant aspects when evaluating a person. Our method identifies personal traits and provides a fair evaluation of the individual by combining behaviours gathered through emotional state, smile analysis, eye gaze, and head movement analysis, which are then employed to detect personal attributes and make an unbiased assessment of the candidate.

Furthermore, the system provides a group analysis, in which the performance of an individual candidate can be compared with the average performance. The proposed system also compares the candidate's performance in an interview environment with a non-interview (normal) environment. Apart from HR recruitment interviews, which are currently the main focus of existing systems, our approach can be used to evaluate student performance in viva examinations as well. The system may also be used as a substitute for, or in conjunction with, physical interviews. A more detailed comparison of the existing systems and related research is tabulated in Table I and Table II respectively.

TABLE I. EXISTING VS PROPOSED SYSTEM COMPARISON

Features                               | Fetcher [6] | MyInterview [7] | Proposed System
Role fit                               |             |                 |
Personality Traits                     |             |                 |
Behavioural Analysis                   |             |                 |
Description                            |             |                 |
Comparison with the average behaviour  |             |                 |
Comparison with the normal environment |             |                 |
Used for viva analysis                 |             |                 |

TABLE II. LITERATURE REVIEW COMPARISON

Features                                            | Automated prediction Framework MIT [1] | Leveraging Multimodal Analysis [2] | Interviewee performance Analyzer [3] | Proposed System
Speaker diarization through audio                   | | | |
Smile analysis                                      | | | |
Eye Gaze direction detection                        | | | |
Eye blink details detection                         | | | |
Emotion detection through video                     | | | |
Head nodding and shaking detection                  | | | |
Head pose features (Pitch, Roll, and Yaw) detection | | | |
Used for viva analysis                              | | | |
II. METHODOLOGY

A. Datasets

Since the proposed system consists of four main parts, namely smile, emotion, eye, and head analysis, each segment was developed and trained on its own dataset. The 'SMILEs' dataset, which includes 13,665 smile and non-smile images, was used to train the CNN model for smile detection [13]. The 'SPOS' dataset, comprising 84 posed and 147 spontaneous facial expression clips, and the USTC-NVIE dataset were used to train the CNN model for detecting the genuineness of a smile [14-16].

Kayvan Shan's 'Eye-Dataset', with 14,500 photos divided into forward-look, left-look, right-look, and down-look categories, was used for the eye gaze direction CNN [17]. The 'Eye Aspect Ratio' dataset, which contains 771 records, was used to detect eye blinks [18]; eye-open and eye-blink statuses were labelled 0 and 1 respectively. The 'Cohn-Kanade' dataset, which contains 5,876 images classified into Happy, Angry, Contempt, Disgust, Fear, Sadness, and Surprise categories, was used for the emotion detection CNN [19]. The 'Biwi' and 'Helen' datasets were used for the head movement component, with 10,000 images randomly selected from each dataset [20]. The head pose ranges are +-75 degrees in yaw and +-60 degrees in pitch.

1) Interview Dataset

Mock interviews were conducted as online virtual interviews over the 'Zoom' platform. Initially, 10 participants took part in the mock interviews. Each participant was asked an identical set of 26 questions by the interviewer. The recorded videos were analysed and labelled by experienced HR personnel and split in an 8:2 ratio into training and testing data. The personality traits of the interviewees, namely agreeableness, openness, neuroticism, conscientiousness, and extraversion, were ranked on a five-point scale [21].

B. Preliminary Phase

The Speech-To-Text API offered by Google Cloud Platform was used for speaker diarization, to differentiate the speakers and separate the questions in the video [22]. The diarized turns were then used to calculate the time it took the interviewee to respond. Afterwards, the video frames were extracted for the exact time period given by each question's start and finish timings. Subsequently, the dlib face detector was used to identify the landmarks of the interviewee's face [23]. All the identified frames, with their associated question numbers and face landmark details, were sent to all other subcomponents.
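The diarization-and-timing step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the speech API has already produced word-level (word, speaker_tag, start, end) tuples, and the tag values and function names are ours.

```python
# Sketch of the question-segmentation step, assuming diarization output as
# (word, speaker_tag, start_sec, end_sec) tuples such as the word-level speaker
# tags returned by a speech-to-text service. Names here are illustrative.

INTERVIEWER, INTERVIEWEE = 1, 2  # speaker tags assigned by the diarizer

def split_turns(words):
    """Group consecutive words by speaker tag into (tag, start, end) turns."""
    turns = []
    for word, tag, start, end in words:
        if turns and turns[-1][0] == tag:
            turns[-1] = (tag, turns[-1][1], end)  # extend the current turn
        else:
            turns.append((tag, start, end))
    return turns

def response_times(words):
    """Seconds between the end of each question and the start of the answer."""
    turns = split_turns(words)
    delays = []
    for prev, cur in zip(turns, turns[1:]):
        if prev[0] == INTERVIEWER and cur[0] == INTERVIEWEE:
            delays.append(cur[1] - prev[2])
    return delays
```

The question start/finish timings used for frame extraction fall out of the same turn boundaries.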
C. Subsystem

Fig. 1. System Diagram
S0 - Smiling percentage and smile genuineness level
E0 - Eye blinking rate, attention and drowsiness
Em0 - Prominent Emotion
H0 - Head nodding and head shaking count

1) Smile Analysis:

Detecting the smile and identifying its genuineness were accomplished using two CNN models. The smile detection model was built on the LeNet architecture, and binary classification was used to predict the result as 'smiling' or 'non-smiling' [24]. Another CNN model was implemented using ResNet-18 and convLSTM networks to identify discriminative smiling features and predict whether an observed smile was 'spontaneous' or 'posed'. If the output was predicted to be 'smiling', the frame associated with the prediction, as well as its previous and next frames, was scaled to 48x48 pixels and sent to the smile category detection CNN model; consecutive frames were employed to aid the identification of discriminative smile features. After the model predictions were completed, the system calculated the average percentages of smiles and non-smiles from the frames according to the duration of each question.

2) Eye Gaze Analysis and Blink Detection:

The eye component comprises three subcategories: determining the interviewee's eye gaze direction, ascertaining the average eye blink rate, and identifying the interviewee's attention and drowsiness. Initially, the landmark points of the right and left eyes were detected using dlib. The Eye Aspect Ratio (EAR) was calculated (1) using six points per eye: starting from the left corner of the eye, two points on the upper lid, the right corner of the eye, and two points on the lower lid (p1-p6), in both the left and right eyes. The EAR was then fed into the blink detection model to determine whether a blink occurred in that frame according to the EAR value.

EAR = (||p2 - p6|| + ||p3 - p5||) / (2 ||p1 - p4||)    (1)
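As a concrete illustration of (1), the EAR can be computed from the six landmark coordinates as follows; plain (x, y) pairs stand in for the dlib landmark objects, and the function names are ours.

```python
# Numerical sketch of (1): eye corners p1/p4, upper lid p2/p3, lower lid p5/p6,
# each an (x, y) pair ordered p1..p6 as in the text.
import math

def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def eye_aspect_ratio(p):
    """EAR = (||p2-p6|| + ||p3-p5||) / (2 * ||p1-p4||); p is [p1..p6]."""
    vertical = euclidean(p[1], p[5]) + euclidean(p[2], p[4])
    horizontal = euclidean(p[0], p[3])
    return vertical / (2.0 * horizontal)
```

The EAR drops toward zero as the eye closes, which is why a frame's EAR below a learned threshold can be read as a blink.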
Subsequently, the image was cropped by identifying the top, left, right, and bottom points of the left and right eyes separately, and each crop was passed to a CNN model to determine the gaze direction. The model identifies whether the interviewee is looking at the screen, down, left, or right.

The average person blinks between 12 and 15 times per minute, but this can vary between individuals [25]. Therefore, attention was estimated by storing the blinks-per-minute value of each question in an array and using it to determine the interviewee's first and third quartile blinks-per-minute values. The blinks per minute obtained for subsequent questions are compared with these quartile values: a value below the first quartile indicates increased attention, a value above the third quartile indicates decreased attention, and the remaining values fall within the normal range [26-28].
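The quartile heuristic above can be sketched as follows; the function name and label strings are illustrative, not the authors' exact code.

```python
# Sketch of the quartile-based attention heuristic: quartiles are taken over
# the per-question blinks-per-minute values, then each value is labelled by
# comparison against Q1 and Q3 as described in the text.
import statistics

def attention_labels(blinks_per_minute):
    qs = statistics.quantiles(blinks_per_minute, n=4)  # [Q1, Q2, Q3]
    q1, q3 = qs[0], qs[2]
    labels = []
    for bpm in blinks_per_minute:
        if bpm < q1:
            labels.append("increased attention")
        elif bpm > q3:
            labels.append("decreased attention")
        else:
            labels.append("normal")
    return labels
```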
3) Emotion Analysis:

To detect emotion, a CNN model with multiclass classification into happy, sad, angry, contempt, surprise, disgust, and fear was implemented. Since the displayed emotion cannot change within the span of a few frames (around 30 milliseconds each), the frames were grouped into one-second windows and analysed to determine the emotion. Subsequently, the system used the averaged predictions to determine the interviewee's most prominent emotion on a per-question basis by identifying the variation of the emotion.
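A minimal sketch of the per-second aggregation, assuming the CNN yields one class-probability vector per frame; the vectors, frame rate, and function names below are synthetic stand-ins.

```python
# Frame-level probability vectors are grouped into one-second windows,
# averaged, and the prominent emotion per question is the most frequent
# per-second label. Illustrative names; the paper's CNN supplies the vectors.
from collections import Counter

EMOTIONS = ["happy", "sad", "angry", "contempt", "surprise", "disgust", "fear"]

def per_second_emotions(frame_probs, fps):
    """Average frame probabilities over each one-second window."""
    labels = []
    for i in range(0, len(frame_probs), fps):
        window = frame_probs[i:i + fps]
        mean = [sum(p[j] for p in window) / len(window)
                for j in range(len(EMOTIONS))]
        labels.append(EMOTIONS[mean.index(max(mean))])
    return labels

def prominent_emotion(frame_probs, fps):
    counts = Counter(per_second_emotions(frame_probs, fps))
    return counts.most_common(1)[0][0]
```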
4) Head Movement Analysis:

A CNN regression model serves as the foundation of the head movement analysis component. The head motions were not precise to 30 milliseconds, so to minimize the prediction error rate the component evaluates at a rate of 18 frames per second. Each processed image was scaled to 224x224 pixels, the deep learning model's input size; the scaling was done using the bounding-box coordinates provided by the dlib face detector. The model was then used to determine the yaw, pitch, and roll angles for that video frame.

Head nodding and head shaking gestures were computed from the data obtained in the preceding phase. The pitch and yaw values represent rotation around the X-axis and Y-axis respectively. Based on functional testing, a 5-degree threshold difference over three consecutive pitch values was considered a head-nodding movement, and over three consecutive yaw values a head-shaking action. The yaw, pitch, and roll data were used to calculate the average head rotation in response to a specific question; the average values were determined by dividing the sum of the individual displacement magnitudes by the total number of frames.
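One plausible reading of the 5-degree rule is a direction reversal across three consecutive angle readings; the counting scheme below is our interpretation of the text, not the authors' exact code.

```python
# Sketch of the 5-degree rule: three consecutive pitch (or yaw) readings whose
# successive differences each exceed the threshold in opposite directions are
# counted as one nod (or shake). Treat this as an approximation of the text.
THRESHOLD = 5.0  # degrees

def count_gestures(angles, threshold=THRESHOLD):
    """Count down-up (or left-right) reversals over consecutive triples."""
    count = 0
    for a, b, c in zip(angles, angles[1:], angles[2:]):
        if (b - a) <= -threshold and (c - b) >= threshold:
            count += 1
        elif (b - a) >= threshold and (c - b) <= -threshold:
            count += 1
    return count
```

Applying `count_gestures` to the pitch series yields the nod count and to the yaw series the shake count.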
5) The Personality Model:

The interviewee's personality traits can be determined to assess the candidate's strengths, weaknesses, and adaptability, thereby enabling interviewers to identify the most suitable candidate. The system follows the Big Five traits, namely neuroticism, extraversion, openness, agreeableness, and conscientiousness, when considering the interviewee's personality characteristics [21].

An RF model per personality trait was used to provide a rating for the interviewee, owing to RF's capacity to handle high-dimensional datasets. The inputs to the prediction models are determined by the outputs of the four major components: smile percentage, genuineness of the smile, the most prominent and second most prominent gaze direction and emotion, as well as the blinking rate and head gesture motions.
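A minimal per-trait rater along these lines, using scikit-learn's RandomForestClassifier; the feature names and the synthetic training data are ours, standing in for the labelled interview dataset.

```python
# Sketch of one per-trait Random Forest rater. The feature vector holds the
# four components' outputs (categorical gaze/emotion labels assumed already
# encoded as numbers) and the label is the HR-assigned 1-5 rating.
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["smile_pct", "smile_genuineness", "gaze_1", "gaze_2",
            "emotion_1", "emotion_2", "blink_rate", "head_gestures"]

def train_trait_model(X, y):
    """One RF per trait; the paper trains five such models."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    return model
```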
6) Group Behavioural Analysis and Comparison with the Normal Environment

The interviewer is presented with an overall analysis that compares all of the interviews in a candidate group. Average group values are calculated by summing the results obtained from the subcomponents of each interview conducted within the given group and comparing them with the individual findings.

Interviewers also have the option of adding a video of the interviewee in a non-interview context to compare with the interview environment, where the non-interview context is an environment with no interview questions. The video is broken down into frames, which are provided to the normal-environment component, where the average values for the nonverbal cues are identified.
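The group comparison reduces to averaging the component outputs across candidates and reporting each candidate against that mean; a sketch with illustrative keys follows.

```python
# Per-candidate component outputs are averaged across the group and each
# candidate is reported as a signed difference from the group mean.
# The dictionary keys are illustrative, not the system's exact fields.

def group_average(candidates):
    """candidates: list of dicts of component outputs -> dict of means."""
    keys = candidates[0].keys()
    return {k: sum(c[k] for c in candidates) / len(candidates) for k in keys}

def compare_to_group(candidate, candidates):
    avg = group_average(candidates)
    # positive = above the group mean, negative = below it
    return {k: candidate[k] - avg[k] for k in avg}
```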
III. RESULT AND DISCUSSION

A. Survey

A total of 30 professionals, comprising interviewers from major companies and lecturers who conduct viva, participated in the study. The survey included a questionnaire that compared physical and virtual interviews, as well as the respondents' preferences, as shown in TABLE III.

TABLE III. SURVEY QUESTIONS

Question | Yes (%) | No (%)
Do you prefer a traditional interview over a virtual interview | 70 | 30
Can you determine the interviewee's genuineness through a virtual interview | 10 | 90
Can you monitor the interviewee's eye gaze and head movement throughout the interview | 23.5 | 76.5
Can you track the smile and emotion of the interviewee | 71.5 | 28.5

B. Result

1) Smile Analysis

The smile was analysed using separate CNN models. The smile detection CNN model attained its best performance when trained for 15 epochs with the 'Adam' optimizer. The model obtained an overall accuracy of 93%, with precision, recall, and F1-scores ranging from 80% to 90%.

The second CNN model, which identifies the genuineness of a smile, was first trained using the SPOS dataset. Due to the small amount of smile data in SPOS, the model achieved a low training accuracy of only 72%. The training dataset was then enlarged by combining it with the USTC-NVIE dataset, resulting in a training accuracy of 88% and a testing accuracy of 85%. Since both CNN models performed binary classification, binary cross-entropy was used while training the models.

2) Eye Gaze Analysis

The base model used for the CNN was VGG16, and the model obtained after training for 21 epochs performed best. When this model was trained with the 'Adam' optimizer, the training accuracy obtained was 95%; however, the precision, recall, and F1-score gained for this model ranged only from 25% to 45%.

When the 'Adadelta' optimizer was used instead, the training accuracy increased to 97.51%. Although the training and validation accuracies of the models trained with the two optimizers were nearly identical, the latter model's testing accuracy was found to be 85%, and it also achieved superior precision, recall, and F1-score.

The machine learning model developed to determine the blinking status, with the EAR value as the input feature, was compared with four other models to determine which provided the best accuracy. The Support Vector Machine (SVM) used a linear kernel. Two neighbours were used for the K-Nearest Neighbour model, which was identified as the least effective. The RF model with a maximum depth of 2 obtained the best accuracy; as the depth was raised, the accuracy dropped to 93%, while the random state had no effect on the accuracy. With high sensitivity and specificity, the RF classifier outperformed all the others.

TABLE IV. EYE BLINKING DETECTION MODEL

Model | Accuracy (%) | Sensitivity (%) | Specificity (%)
SVM | 90 | 91.3 | 89.6
Decision Tree | 93 | 98.4 | 87.7
Naïve Bayes | 90 | 91.2 | 89.6
RF | 94 | 98.4 | 89.6
KNN | 79 | 69.8 | 91.5

3) Emotion Analysis

The model was initially trained using the SPOS dataset. Although SPOS yielded a high accuracy, its results were limited to six output labels. Therefore, the CK+ and FER datasets were used to train the model [29]. Since the data labelling differs between the datasets, each was trained separately to find the most suitable training dataset for classifying emotion. TABLE V summarizes the model results obtained for the datasets; the cropped and augmented CK+ was chosen as the most suitable dataset for training the model.

TABLE V. EMOTION DETECTION MODEL RESULTS

Dataset Name | Number of Emotion Categories | Training Accuracy (%) | Validation Accuracy (%)
SPOS | 6 | 96 | 94
FER | 7 | 80 | 80
Original CK+ | 7 | 65 | 62
Cropped & Augmented CK+ | 7 | 90 | 90

4) Head Analysis

The images used to train the model were pre-processed and resized using YOLO head object detection. The backbone of the model is the EfficientNetB0 architecture. The model was evaluated using Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) to quantify the error of the predicted values.

The Mean Absolute Error Percentage (MAEP) was measured to compare the models derived from the BIWI dataset. The results were derived from a set of models
which were trained on random data selections. The MAEP of the model was identified to be 0.15%. In addition, the WHENET model was tested using transfer-learning methods to predict the head movement angles; since its weights were trained on a larger set of images, the model provides high accuracy [30]. TABLE VI summarizes the methods used to determine the optimal CNN model strategy.

TABLE VI. REGRESSION CNN MODEL EVALUATION - MSE FOR YAW, PITCH AND ROLL

Method | Yaw MSE | Pitch MSE | Roll MSE | Average MSE
Using load weights of WHENET | 4.39 | 4.45 | 3.46 | 4.1
Trained model using 300W_LP, BIWI | 5.05 | 6.2 | 4.8 | 5.35

The model includes three output layers to measure the yaw, pitch, and roll values. The 'Adam' optimizer was used to minimize the model's cost function during training. The MSE approach was utilized as a step in the custom-implemented loss function.
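Numerically, the custom loss described above combines a per-head MSE for the three outputs; equal head weights are an assumption on our part, as the paper does not give the weighting.

```python
# Sketch of a three-head pose loss: one MSE per output (yaw, pitch, roll),
# averaged with equal weights (an assumption; the paper's exact weighting is
# not stated). Angles are in degrees.

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def combined_pose_loss(true_angles, pred_angles):
    """true/pred_angles: dicts with 'yaw', 'pitch', 'roll' lists."""
    heads = ("yaw", "pitch", "roll")
    return sum(mse(true_angles[h], pred_angles[h]) for h in heads) / len(heads)
```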
TABLE VII shows the performance of the models that were used for the interviewee behavioural analysis. All datasets used to train the models in the subsystems were divided into training, validation, and testing subsets at a ratio of 3:1:1.
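The 3:1:1 split can be sketched as follows; shuffling with a fixed seed is our addition for reproducibility and the function name is illustrative.

```python
# Split a dataset into train/validation/test subsets at a 3:1:1 ratio,
# i.e. 60% / 20% / 20% of the samples.
import random

def split_3_1_1(samples, seed=0):
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = 3 * n // 5, n // 5
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```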
TABLE VII. CNN MODEL PERFORMANCE SUMMARY

CNN Model Name | Validation Accuracy (%) | Precision | Recall | F1 score
Smile Detection | 93 | 0.86 | 0.88 | 0.87
Smile Category | 86 | 0.79 | 0.88 | 0.83
Eye Gaze Detection | 97.41 | 0.95 | 0.94 | 0.95
Emotion Detection | 90 | 0.85 | 0.90 | 0.87

Two studies, which obtain personality traits through text [31] and facial images [32], were utilized for the comparison tabulated in TABLE IX.

TABLE IX. COMPARATIVE ANALYSIS WITH EXISTING SYSTEM

Method | Through text [31] | Through facial images [32] | Proposed system
Accuracy: Extraversion | 77.18% | 73.23% | 74.42%
Accuracy: Neuroticism | 61.47% | 64.35% | 86.05%
Accuracy: Agreeableness | 75.51% | 60.68% | 79.07%
Accuracy: Conscientiousness | 70.34% | 69.56% | 72.09%
Accuracy: Openness | 80.38% | 61.48% | 74.42%
Complexity of the system | Low | High | Fair
Ease of use | High | Low | High
Input features | Sentences | Facial images | Videos
Input dataset | Several essay datasets | ChaLearn First Impression dataset [33] | Interview dataset

6) Sample test results

Fig. 2 illustrates sample images acquired from the extracted test results of the smile detection and categorization, eye gaze, emotion, and head movement detection models. A set of frames per specific duration was analysed and utilized to obtain predictions for each personality trait. The results of the personality trait analysis were stated as a rating on a five-point scale.