
Measurement: Sensors 25 (2023) 100655


Emotional speech-based personality prediction using NPSO architecture in deep learning

Kalpana Rangra a,*, Virender Kadyan a,*, Monit Kapoor b

a Speech and Language Research Centre (SLRC), University of Petroleum and Energy Studies (UPES), Energy Acres, Bidholi, Dehradun, 248007, Uttarakhand, India
b Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India

* Corresponding authors. E-mail addresses: rangrakalpana26@gmail.com (K. Rangra), vkadyan@ddn.upes.ac.in (V. Kadyan).
https://doi.org/10.1016/j.measen.2022.100655
Received 13 October 2022; Received in revised form 8 December 2022; Accepted 19 December 2022; Available online 5 January 2023.
2665-9174/© 2022 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

A R T I C L E  I N F O

Keywords: Personality classification; Emotion classification; Speech features; MFCC; PSO; CNN; OCEAN (Big Five)

A B S T R A C T

Speech is an effective medium for analyzing the mental and psychological health of a speaker. Automatic speech recognition has been investigated extensively for human-computer interaction and for understanding the emotional and psychological anatomy of human behavior. Emotions and personality have been shown to be strongly linked when prosodic speech parameters are analyzed. This work proposes a novel personality and emotion classification model, NPSO, a PSO (particle swarm optimization) based CNN (convolutional neural network) that predicts both emotion and personality. The model is computationally efficient and outperforms language models. Cepstral speech features, MFCC (mel frequency cepstral coefficients), are used to predict emotions with 90% testing accuracy and personality with 91% accuracy on SAVEE (Surrey Audio-Visual Expressed Emotion) individually. The correlation between emotion and personality is also identified in this work. The experiments use four corpora, SAVEE, RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song), CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) and TESS (Toronto Emotional Speech Set), together with the Big Five personality model for finding associations between emotions and personality traits. Experimental results show that the classification accuracy scores for the combined datasets are 74% for emotion and 89% for personality classification. The proposed model works on seven emotions and five classes of personality. The results show that MFCC is sufficiently effective in characterizing and recognizing emotions and personality simultaneously.

1. Introduction

Voice signals are strongly influenced by the autonomic and the somatic nervous system. Speech utterances are modulated by numerous psychological and mental conditions, and speech analysis has been an intensive area of research for understanding the mental state of individuals. Speech prosody is articulated by the physiology of the respiratory system and the vagus nerve, and voice production involves the Central Nervous System (CNS). Because auditory feedback is so important to one's voice, voice patterns are susceptible to the coordination among phonation, articulation, and audition [1]. The human respiratory system influences speaker prosody due to the involvement of these nerves. The speech signal therefore allows researchers to investigate a variety of psychology-affected domains, including emotion [2-6], mood, and mental stress [7-9].

Personality not only predicts and describes a person's behavior, but also encompasses the way they think and feel, in addition to influencing their motives, preferences, emotions, and even mental health. Individuals increasingly use social networking sites such as Facebook and Twitter to share their thoughts and feelings, as well as their opinions about current and past news and events. The way a person presents himself or herself on the internet reflects their attitude, behavior, and personality. The work in Refs. [10,11] suggests a clear association between temperament and behavior expressed on social media platforms in the form of text, audio, and images.

The remaining sections of this article are organized as follows: Section 2 presents work related to personality trait recognition and multitask learning; the theoretical background is given in Section 3; Section 4 contains the proposed model and related information; Section 5 presents the experiments with the proposed approach; Section 6 discusses the results obtained. Finally, Section 7 concludes the work and outlines its future scope.


2. Related work

Researchers currently focus on the development of automated personality assessment systems, which demonstrates the importance of personality cognizance on social networks. These applications are based on the central philosophy of several standard personality models, such as the DiSC Assessment [12] and the Big Five Factor Personality Model [13].

The objective of automatic personality assessment is to forecast the speaker's personality based on nonverbal behaviors [14]. The speaker characteristics considered are conscientiousness, extroversion, agreeableness, neuroticism, and openness. Understanding human personality is crucial for natural and social connections [15]. Personality is a psychological construct that attempts to explain the great variety of speaker behaviors using a few consistent and quantifiable qualities exhibited by an individual. It is relevant to any field of computing that requires the comprehension, prediction, or synthesis of human expression [16]. Numerous methods have been used to establish the attributes that play a significant role in technology, concluding that personality and computers are inextricably linked [17-19].

An overview of technologies dealing with personality has been proposed around three models, namely APR (identification of a person's true personality from behavioral evidence), APP (determining the personality that others attribute to a person based on their observed behavior) and APS (e-learning based artificial personality detection) [20]. Numerous studies established that speech plays a significant technical role in linking personality and computing for human-computer interaction [17-21].

A prosody-based personality system that accurately predicted extroversion and conscientiousness was designed in Ref. [22] using low-level variables (pitch, the first two formants, speech signal energy, and the duration of voiced speech frames along with unvoiced segments) and four statistics (minimum, maximum, mean, and relative entropy of the low-level feature differences) recovered from sequential analysis windows. Logistic regression and support vector machine (SVM) classifiers were both deployed and performed equally well in terms of accurate personality prediction, with accuracy ranging from 60% to 72%. The authors of [17] presented an automatic detection and evaluation of personality using frequency domain linear prediction (FDLP) [23]. The research work in Refs. [25,26] summarizes cultural and personality differences in the unspoken mode of expression aligned with attitudes and emotions; the database for that work consisted of videos detailing nonverbal communication in various settings as well as the impact of personality on nonverbal behavior. The automatic detection of Big Five personality traits on the basis of Facebook data has been proposed in Ref. [26]. Researchers in Refs. [24,25] assessed and predicted speaker-independent perceived personality from speech, and discussed the data attributes, labeling scheme, and predictive quality. That work gives a framework for assessing an anonymous speaker's personality by people and machines using SVM regression, with model training based on sample-level features derived from audio descriptions. Feature selection solves high-dimensional classification issues [27]; on data provided by the Interspeech Speaker Trait Challenge organizers, an integrated feature selection algorithm combined with a K-nearest neighbor classifier produced results that exceeded the baseline. The authors of [28] identify seven types of speaker traits based on the Interspeech Speaker Trait Challenge; the approach followed includes a variety of prosodic and cepstral features and a subset of the OpenSMILE [29] functions. GMM-UBM, eigenchannel, support vector machine (SVM) and distance-based classifiers are all included in that paper as classification algorithms [30].

Before the identification process, it is necessary to parameterize speech signals to extract a feature set from the speech waveform. The features must be selected with a criterion that provides an appropriate representation of the speech signal and must distinctly discriminate among tones. The study of short-term speech windows relates to the time-varying nature of speech; parameters are assumed not to change within the selected window [31]. Data augmentation integrated with transfer learning is studied in Ref. [32]. Various variants of MFCC as a speech feature set have been proposed for a Punjabi automatic speech recognition system [33]. Neural network techniques for automatically recognizing Hindi and Punjabi speech were also studied in Refs. [34,35].

After examining different information exchange mechanisms for personality detection and emotion detection, we propose an MFCC-based NPSO_CNN model for simultaneous personality trait and emotion detection that defines a basic model to work with a sound-based personality-emotion dataset. The aim of this paper is personality detection using an NPSO_CNN framework. Because personality and emotion are complementary, we constructed a model that performs emotion prediction as an auxiliary task alongside personality prediction in the framework. The work also provides a comprehensive review of personality trait identification, emotion recognition, and associative learning in the relevant literature.

3. Theoretical background

Personality can be detected from various sources such as text, social media, and video. Multiple deep learning methods have been deployed for the identification of personality in humans. Researchers have used the digital footprint [20,36], text [37], and facial features [38,39] as predictors, and have studied visual-based personality recognition models, combinational feature sets at different levels, and their relationships [18].

Formant frequencies are the resonant frequencies of the vocal tract; the shape and size of the vocal tract tube can alter these frequencies. Formants are significant for speech since they contribute substantially to voice analysis. Selected sound clips, along with the formant details of highest significance in a specific category, are examined for personality trait prediction. The Mel scale relates a linear-scale frequency to a perceptual Mel-scale frequency as shown in equation (1):

Mel(f) = 2595 \log_{10}(1 + f/700)   (1)

where f is the linear-scale frequency and Mel(f) is the Mel-scale frequency.
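As a small illustration of Eq. (1), the sketch below (assuming only NumPy is available; it is not part of the authors' code) converts linear frequencies to the Mel scale and back. The inverse mapping is what is needed later when filter-bank center frequencies are placed on the Hz axis.

```python
import numpy as np

def hz_to_mel(f):
    """Eq. (1): map a linear-scale frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of Eq. (1), used to place filter-bank centers back on the Hz axis."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# By construction of the scale, a 1 kHz tone maps to roughly 1000 mel.
print(hz_to_mel([100.0, 1000.0, 8000.0]))
```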
The Big Five personality test follows the Five-Factor Model [3], an empirically based theory in psychology that appraises five predominant dimensions of personality: Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism. The BIG FIVE (OCEAN) test can be applied across job roles; it should not be used for hiring decisions, however, but rather as a tool to gain a deeper understanding of a test-taker's personality inclination. The insights into personality trends can help enhance learning and improve conversations. The Big Five personality test is predominantly based on the five-factor model (FFM) [40], which posits five broad trait dimensions or domains as the basis of personality. The model was shaped by the work of numerous researchers over three decades (from the 1960s to the 1990s) who analyzed verbal descriptors of human behavior. Each trait is quantified as a spectrum: openness to experience ranges from inventive and curious to consistent and cautious; conscientiousness ranges from efficient and organized to extravagant and careless; extraversion ranges from outgoing and energetic to solitary and reserved; agreeableness ranges from friendly and compassionate to challenging and callous; and neuroticism ranges from sensitive and nervous to resilient and confident [13]. The Big Five model of personality is generally considered to be the most scientifically robust means to explain personality differences. To take this test, one rates a set of statements on a scale of 1-5; these statements are accessible on the Trusty website [4].

Fig. 1. Particle Swarm Optimizer for obtaining Learning Rate.


Researchers have proposed particle swarm algorithms that allow variations for multiple optimization targets [40,41]. The process consists of repeatedly searching for the best solution for each particle in every iteration; the expected result is that the particle swarm converges to the best solution, as shown in Fig. 1. PSO does not use gradient descent, so it can be applied to nonlinear problems where the objective is not required to be differentiable.

Swarm intelligence has been widely used and has achieved tremendous success in solving many highly nonlinear, multimodal problems algorithmically. Deep neural networks, especially convolutional neural networks (CNN), provide solutions to image-related problems [42-44]; however, their performance greatly depends on selected hyperparameter values, which are time-consuming to fine-tune.

CNN is a well-studied deep learning algorithm that shows exceptional performance in image classification tasks. A CNN consists of convolutional layers, pooling layers and fully connected layers. The convolution layer preserves spatial information of the input and adjusts the learnable weights during learning; the final output is the product of the input and the learned weights [37,45,46]. Fig. 2 is the diagrammatic representation of the layered CNN architecture.

Fig. 2. Underlying architecture for the Convolutional Neural Network.

4. Proposed model

The flow diagram in Fig. 3 elucidates the steps of the proposed model.

Fig. 3. Proposed model for Emotional Speech Based Personality Prediction.

4.1. Data augmentation

A speech signal S(t) is seen as an output signal at a specific period t when modeling an ASR system. For the ith sine wave, these signals carry a glottal excitation pulse e(t), calculated using the linear equation [32] presented in Eq. (3):

e(t) = \sum_{i=1}^{N} A_i(t) \cos\left( \int_0^t w_i(E)\, dE + \phi_i \right)   (3)

where \phi_i is a fixed phase offset, w_i is the instantaneous frequency, A_i(t) is the audio signal's amplitude, and N is the count of sine waves that compose the audio speech signal.
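For intuition, the toy sketch below synthesizes a short signal from the sinusoidal model of Eq. (3). The number of components, amplitudes, frequencies and phases are arbitrary illustrative values, not parameters taken from the paper; with constant w_i the integral in Eq. (3) reduces to w_i * t.

```python
import numpy as np

fs = 16000                          # sampling rate in Hz (the paper quotes 16 kHz for Eq. (4))
t = np.arange(0, 0.05, 1.0 / fs)    # 50 ms of signal
N = 3                               # illustrative number of sine components
A = [0.8, 0.4, 0.2]                 # A_i(t), kept constant here for simplicity
w = [2 * np.pi * 220, 2 * np.pi * 440, 2 * np.pi * 660]   # rad/s
phi = [0.0, 0.3, 0.7]               # fixed phase offsets

# Eq. (3) with constant w_i: cos(integral_0^t w_i dE + phi_i) = cos(w_i * t + phi_i)
e = sum(A[i] * np.cos(w[i] * t + phi[i]) for i in range(N))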
4.2. Vocal tract length perturbation (VTLP)

Speaker-variability issues have been addressed earlier by using a perturbation-based method, VTLP, as a standard procedure [17]. Audio produced by different speakers varies in several characteristics, the most prevalent of which is the vocal tract length. The vocal tract length is normalized with the help of a frequency warping factor (α) that warps the corresponding frequency axis. VTLP is also used to generate synthetic data expansion using several vocal tract length warps. The distinct frequencies f(k) are calculated using Eq. (4) and then mapped to another frequency at a certain time t for a given input S(t) as given in Eq. (5).

f_n(t,k) = \begin{cases} f(k)\,\alpha, & \text{if } f(k) \le f_b(k)\,\min(\alpha,1)/\alpha \\[4pt] \dfrac{f_s(k)}{2} - \dfrac{f_s(k)/2 - f_b(k)\,\min(\alpha,1)}{f_s(k)/2 - f_b(k)\,\min(\alpha,1)/\alpha}\left(\dfrac{f_s(k)}{2} - f(k)\right), & \text{otherwise} \end{cases}   (4)

Here f_s(k) is the sampling frequency (i.e., 16 kHz) and f_b(k) is a boundary frequency chosen to cover the significant formants.
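A minimal sketch of the piecewise warping in Eq. (4) is given below, assuming the standard VTLP form with a warp factor alpha around 1.0. The boundary frequency fb = 4800 Hz and the example alpha are assumptions for illustration; the paper only fixes fs = 16 kHz.

```python
import numpy as np

def vtlp_warp(f, alpha, fs=16000.0, fb=4800.0):
    """Piecewise frequency warping of Eq. (4).

    f     : frequency (or array of frequencies) in Hz to warp
    alpha : warp factor, typically drawn around 1.0 (e.g. 0.9 to 1.1)
    fs    : sampling frequency (16 kHz per the text)
    fb    : assumed boundary frequency covering the significant formants
    """
    f = np.asarray(f, dtype=float)
    cut = fb * min(alpha, 1.0) / alpha
    lower = f * alpha                                  # linear scaling below the boundary
    ratio = (fs / 2 - fb * min(alpha, 1.0)) / (fs / 2 - fb * min(alpha, 1.0) / alpha)
    upper = fs / 2 - ratio * (fs / 2 - f)              # compress the remaining band toward fs/2
    return np.where(f <= cut, lower, upper)

# Example: warp the bin frequencies of a 512-point spectrum by alpha = 1.05.
bins = np.linspace(0, 8000, 257)
warped = vtlp_warp(bins, alpha=1.05)
```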
4.3. Mel filter bank for MFCC feature extraction

MFCCs were extracted as features, having been found to be the best-performing features for speech recognition in our experiments. Each attribute has qualities that use similar features to distinguish among speaker personalities. MFCC is characterized as a function of time; the coefficients are computed in frames with a constant sampling of 25 ms. Spectral examination is performed on the analyzed results using a Mel frequency scale, and the mel spectrogram values are converted to MFCC. The extracted set of MFCC features is passed to the CNN for emotion and personality classification.

Mel filter banks are the spaces between two successive triangular-shaped filters [31]. The information recorded inside a signal frame of a certain duration through different channels is employed in such a way that only a small amount of distortion is visible. The signal is also expressed as S'(t) after boosting its high-frequency components, evaluated as per Eq. (5):

S'(t) = S(t) - a\, S(t-1)   (5)

where S(t) is the actual input signal and a is a constant that varies over the t frames. The Mel scale for a frequency f is characterized as per Eq. (6):

Mel(f) = 2595 \log_{10}(1 + f/700)   (6)

From this it follows that for N filter banks, the center frequency f_{cm} must lie between the minimum frequency f_{min} and the maximum frequency f_{max} [34], which according to Eq. (7) is:

f_{cm}(k) = m^{-1}\left( Mel(f_{min}) + \frac{\left(Mel(f_{max}) - Mel(f_{min})\right)(k-1)}{N-1} \right)   (7)

where m^{-1} is the inverse of the Mel-scale function defined by Eq. (6).
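The sketch below illustrates Eqs. (5) and (7): a pre-emphasis filter and the placement of filter-bank center frequencies uniformly on the Mel axis. The values a = 0.97 and 26 filters are common defaults used here as assumptions; the paper only states that a is a constant and that N filters are used.

```python
import numpy as np

def pre_emphasis(s, a=0.97):
    """Eq. (5): S'(t) = S(t) - a * S(t-1). a = 0.97 is an assumed, common default."""
    s = np.asarray(s, dtype=float)
    return np.append(s[0], s[1:] - a * s[:-1])

def mel_center_frequencies(n_filters, f_min=0.0, f_max=8000.0):
    """Eq. (7): space N center frequencies uniformly between Mel(f_min) and Mel(f_max),
    then map them back to Hz with the inverse Mel function m^-1."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)        # Eq. (6)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # m^-1
    k = np.arange(1, n_filters + 1)
    mel_centers = mel(f_min) + (mel(f_max) - mel(f_min)) * (k - 1) / (n_filters - 1)
    return inv_mel(mel_centers)

centers = mel_center_frequencies(26)   # 26 filters is an assumed, typical count
```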

Fig. 4. PSO optimization for improving learning rate.

4.4. Particle swarm optimization algorithm

In the general PSO algorithm, every particle represents a potential solution to the problem within the search space. Fig. 4 represents the PSO optimization used to improve the learning rate in the experiments carried out in the current work. Within this space, the position vector and velocity vector of the ith particle are x_i = (x_{i1}, x_{i2}, ..., x_{iD}) and v_i = (v_{i1}, v_{i2}, ..., v_{iD}), respectively. The values of x_i and v_i for the ith particle are updated after randomization using Eqs. (8) and (9):

v_i(t+1) = w\, v_i(t) + c_1 r_1 \left(p_i - x_i(t)\right) + c_2 r_2 \left(p_g - x_i(t)\right)   (8)

x_i(t+1) = x_i(t) + v_i(t+1)   (9)

where w is the inertia weight; c_1 and c_2 are two constants that determine the weights given to p_i and p_g; p_i is the best previous position of the ith particle; p_g is the best previous position among all particles in the current generation; and r_1 and r_2 are uniformly distributed random numbers in the range [0, 1]. The flowchart of the standard particle swarm optimization is shown in Fig. 4. For our experiments, the best fitness value for the learning rate, i.e. p_i, is kept at 1.
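A minimal PSO sketch following the update rules of Eqs. (8) and (9) is shown below, searching for a learning rate in an assumed range [1e-4, 1e-1]. The fitness function here is a stand-in toy objective; in the paper the fitness of a candidate learning rate would be the CNN's validation accuracy. The particle count, iteration count and coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(lr):
    # Stand-in objective peaked near 1e-3; replace with CNN validation accuracy in practice.
    return -abs(np.log10(lr) - np.log10(1e-3))

n_particles, n_iter = 10, 30
w, c1, c2 = 0.7, 1.5, 1.5                      # inertia weight and acceleration constants
x = rng.uniform(1e-4, 1e-1, n_particles)       # positions = candidate learning rates
v = np.zeros(n_particles)
p_best = x.copy()
p_best_fit = np.array([fitness(xi) for xi in x])
g_best = p_best[np.argmax(p_best_fit)]

for _ in range(n_iter):
    r1, r2 = rng.random(n_particles), rng.random(n_particles)
    v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)   # Eq. (8)
    x = np.clip(x + v, 1e-4, 1e-1)                                # Eq. (9), kept in bounds
    fit = np.array([fitness(xi) for xi in x])
    improved = fit > p_best_fit
    p_best[improved], p_best_fit[improved] = x[improved], fit[improved]
    g_best = p_best[np.argmax(p_best_fit)]

print("PSO-selected learning rate:", g_best)
```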
4.5. CNN layer description for the model

The model uses the following layers.

i. Convolution layer

The convolution layer extracts local features from the input speech pattern. The width and height of the input volume are convolved with the convolution kernels as the input is passed into the convolution layer. The dot product between the entries of the kernel and the input is then computed to create a feature map, according to Eq. (10):

f_{map}[t] = (x * w)[t] = \sum_m x(m)\, w(m + t)   (10)

where f_{map} is the feature map, x the input features (MFCC), and w the kernel function.

ii. ReLU layer (activation)

g(t) \rightarrow \max\left(0, f_{map}\right)   (11)

iii. Dropout layer

D_{layer} \rightarrow \frac{1}{2}\left( t - \sum_{i=1}^{m} p_i w_i g_i \right)^2   (12)

where p_i is the learning rate calculated based on the Bernoulli probability equation.

Fig. 5. The dense layer before final output in CNN for personality detection.

iv. Dense (SoftMax) layer

The architecture's classifier, SoftMax, predicts values based on the features entered. SoftMax is a generalization of logistic regression to multiclass classification problems, where more than two values can be assigned to the class label y. The SoftMax function is defined in Eq. (13):

y(i) = \frac{\exp(g_i)}{\sum_j \exp(g_j)}   (13)

where g_i = \sum_j h_j w_{ji} is the input of the SoftMax function, h_j is the activation in the penultimate layer, and w_{ji} is the weight connecting the last layer and the SoftMax layer. Based on y(i), the predicted class label is

\hat{y} = \arg\max_i y(i)   (14)

The final output layer for personality detection in the CNN used is shown in Fig. 5, and the layers of the network are listed as follows:

Layer (type): Output Shape
conv2d (Conv2D): (None, 155, 13, 16)
activation (Activation): (None, 155, 13, 16)
conv2d_1 (Conv2D): (None, 153, 11, 32)
activation_1 (Activation): (None, 153, 11, 32)
conv2d_2 (Conv2D): (None, 151, 9, 32)
activation_2 (Activation): (None, 151, 9, 32)
conv2d_3 (Conv2D): (None, 149, 7, 64)
activation_3 (Activation): (None, 149, 7, 64)
conv2d_4 (Conv2D): (None, 147, 5, 128)
activation_4 (Activation): (None, 147, 5, 128)
conv2d_5 (Conv2D): (None, 145, 3, 256)
activation_5 (Activation): (None, 145, 3, 256)
max_pooling2d (MaxPooling2D): (None, 72, 1, 256)
dropout (Dropout): (None, 72, 1, 256)
flatten (Flatten): (None, 18432)
dense (Dense): (None, 100)
activation_6 (Activation): (None, 100)
dropout_1 (Dropout): (None, 100)
dense_1 (Dense): (None, 7)
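The listing above corresponds to a plain sequential CNN. The Keras sketch below reproduces those shapes under the assumption that the input is a (157, 15, 1) MFCC "image" (inferred from the first Conv2D output); the dropout rates, optimizer and loss are assumptions, and the code is an illustration of the listed shapes rather than the authors' released implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_npso_cnn(input_shape=(157, 15, 1), n_classes=7, learning_rate=1e-3):
    """Sequential CNN matching the layer listing above; learning_rate is the value
    that the PSO stage (Section 4.4) would supply."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3), layers.Activation("relu"),
        layers.Conv2D(32, 3), layers.Activation("relu"),
        layers.Conv2D(32, 3), layers.Activation("relu"),
        layers.Conv2D(64, 3), layers.Activation("relu"),
        layers.Conv2D(128, 3), layers.Activation("relu"),
        layers.Conv2D(256, 3), layers.Activation("relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Dropout(0.3),                       # rate not given in the paper
        layers.Flatten(),
        layers.Dense(100), layers.Activation("relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),   # 7 emotion classes (5 for personality)
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

build_npso_cnn().summary()   # output shapes should match the listing above
```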
5. Experiments and results

The following section discusses the details of the datasets used and the tasks carried out to achieve the objectives.

5.1. Dataset description and preparation

The experiments have been carried out on the following four datasets:

1. CREMA-D
2. RAVDESS
3. TESS
4. SAVEE

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) incorporates 24 actors (12 female, 12 male) in three modalities. Speech emotions consist of calm, happy, sad, angry, fearful, surprised, and disgusted expressions, each displayed at normal, strong, and neutral levels of intensity. Speech audio files (16-bit, 48 kHz .wav) from the database are used for the experiments.

CREMA-D [47] is a dataset of 7,442 original clips from 91 performers. These clips are taken from 48 male and 43 female actors aged 20 to 74 years and representing a variety of racial and ethnic backgrounds. Actors spoke from a choice of 12 phrases. The statements were presented using six distinct emotions (angry, sad, fear, disgust, happy, neutral) and four distinct emotion levels.

The TESS [31] dataset is exclusively female and has high-quality audio.

There are 200 target words spoken in the carrier phrase "Say the word ..." by two women (aged 26 and 64 years), with recordings expressing seven emotions (disgust, happiness, anger, pleasant surprise, fear, sadness, and neutral). The WAV format is used for the audio files.

The SAVEE [48] database contains data from four English male speakers (DC, JE, JK and KL) from the University of Surrey, aged 27-31 years. Anger, fear, happiness, disgust, sadness, and surprise have all been classified as distinct emotions in psychology; this is supported by Ekman's cross-cultural studies [6] and by studies on automatic emotion recognition [49]. Neutral was added to the emotion categories. The recordings included 15 TIMIT sentences per emotion: three common sentences, two emotion-specific sentences, and ten generic sentences that were phonetically balanced and differed for each emotion, plus 30 neutral utterances.

All four datasets are combined to avoid problems with overfitting. Initially, the model was tested on a single dataset; the accuracy achieved was high, but the model failed to give good results on real-time audio because the classifier was trained on the same dataset and a similar recording environment. The speech samples were therefore randomly selected from all four datasets and labeled as follows. The personality mapping was done with the help of a psychologist who labeled the selected files for personality, and is described in Table 1. Table 2 describes the groups of samples taken from the four datasets. The label encoding used for the experiments is shown in Table 3 and Table 4.

Table 1
Ratings obtained for Personality on the Emotional Dataset (scale of 100).

Emotion | Openness to experience | Conscientiousness | Extraversion | Agreeableness | Neuroticism
Anger | 30-40 | 60 and above | 50 | 10 | 20-30
Fear | 22 | 20 | 20 | 10 | 50 and above
Surprise | 24 | 30 | 50-60 | 10 | 17
Sad | 10 | 13 | 11 | 14 | 30-50
Happy | 21 | 28 | 30-50 | 7-10 | 14-20
Neutral | 10 | 20 | 20-30 | 5 | 5
Disgust | 25 | 25 | 30-50 | 5 | 20

Table 2
Numbering of sound samples from each dataset.

Emotion | Personality | CREMA-D | RAVDESS | TESS | SAVEE
Anger | Openness to experience | 1-30 | 31-222 | 223-622 | 623-682
Disgust | Conscientiousness | 683-712 | 713-904 | 905-1304 | 1305-1364
Fear | Neuroticism | 1365-1394 | 1395-1586 | 1587-1986 | 1987-2046
Happy | Extraversion | 2047-2076 | 2077-2268 | 2269-2588 | 2589-2648
Sad | Neuroticism | 2649-2678 | 2679-2870 | 2871-3270 | 3271-3330
Neutral | Conscientiousness | - | 3331-3426 | 3427-3826 | 3827-3946
Surprise | Neuroticism | - | 3947-4138 | 4139-4538 | 4539-4598

Table 3
Label encoding for emotions.

Emotion | Label
Anger | 0
Disgust | 1
Fear | 2
Happy | 3
Sad | 4
Neutral | 5
Surprise | 6

Table 4
Label encoding for BIG FIVE traits of personality.

Personality | Label
O | 1
C | 2
E | 3
N | 4
A | 5

Fig. 6. Samples of audio from the dataset.
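Read together, Tables 2-4 define how each audio sample receives both an emotion target and a personality target. The small sketch below shows that mapping; the helper function is hypothetical and not taken from the paper's code.

```python
# Label maps taken from Tables 3 and 4, and the emotion-to-trait pairing of Table 2.
EMOTION_LABELS = {"anger": 0, "disgust": 1, "fear": 2, "happy": 3,
                  "sad": 4, "neutral": 5, "surprise": 6}            # Table 3

PERSONALITY_LABELS = {"O": 1, "C": 2, "E": 3, "N": 4, "A": 5}        # Table 4

EMOTION_TO_TRAIT = {"anger": "O", "disgust": "C", "fear": "N", "happy": "E",
                    "sad": "N", "neutral": "C", "surprise": "N"}     # Table 2

def labels_for(emotion):
    """Return the (emotion label, personality label) pair for one annotated clip."""
    trait = EMOTION_TO_TRAIT[emotion]
    return EMOTION_LABELS[emotion], PERSONALITY_LABELS[trait]

print(labels_for("anger"))   # (0, 1): emotion 'anger', trait 'O'
```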
Fig. 7. Performance metrics for SAVEE Emotion classification.

Fig. 8. Performance metrics for SAVEE Personality classification.

Fig. 9. Performance comparison for SAVEE Personality classification.

Data preprocessing

The experiments involved four different datasets, so the data was normalized and resampled to a frame of 3 s for each voice sample. Noise was added to make the model learn more robustly and to distinguish between the various emotions and personalities.

The shape of the vocal tract, including the tongue, teeth, and so on, filters the sounds made by humans. This shape influences what sound is produced; if we can identify the shape, we should be able to get an accurate depiction of the phoneme being produced. The envelope of the short-term power spectrum manifests the form of the vocal tract, and the aim of MFCCs is to represent this envelope appropriately. With 25 ms frames over each 3 s sample, MFCC was extracted using Python language libraries. Fig. 6 shows an audio sample from the dataset.

The processing steps are as follows (a minimal sketch of these steps is given after the list):

1. Load the dataset (train set, test set)
2. Extract features using the MFCC method
3. Calculate the learning rate using the PSO optimization technique
4. Select features based on the PSO-optimized cuckoo search algorithm
5. Implement the CNN architecture model on the selected features
6. Train the features against the emotion labels and the personality labels
7. Test the test features using the emotion model and the personality model
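The sketch below outlines steps 1, 2, 6 and 7 of the pipeline. It assumes librosa as the MFCC backend (the text only says "Python language libraries"), an assumed 16 kHz sampling rate and noise level, and it reuses the hypothetical build_npso_cnn helper from the Section 4.5 sketch; the PSO learning-rate search (step 3) and the feature selection of step 4 are not repeated here.

```python
import numpy as np
import librosa                                   # assumed MFCC backend
from sklearn.model_selection import train_test_split

SAMPLE_RATE = 16000                              # assumed; clips are normalized to 3 s as described
CLIP_SECONDS = 3.0

def extract_mfcc(path, n_mfcc=13):
    """Steps 1-2: load a clip, pad/trim to 3 s, inject light noise, return an MFCC 'image'."""
    y, _ = librosa.load(path, sr=SAMPLE_RATE, duration=CLIP_SECONDS)
    y = np.pad(y, (0, max(0, int(SAMPLE_RATE * CLIP_SECONDS) - len(y))))
    y = y + 0.005 * np.random.randn(len(y))      # noise injection for robustness, as described above
    mfcc = librosa.feature.mfcc(y=y, sr=SAMPLE_RATE, n_mfcc=n_mfcc)
    return mfcc.T[..., np.newaxis]               # shape (frames, n_mfcc, 1)

def run_pipeline(paths, y_emotion, y_personality, learning_rate):
    """Steps 6-7: train and test one CNN for emotions (7 classes) and one for personality (5)."""
    X = np.stack([extract_mfcc(p) for p in paths])
    for targets, n_classes in ((np.asarray(y_emotion), 7), (np.asarray(y_personality), 5)):
        X_tr, X_te, y_tr, y_te = train_test_split(X, targets, test_size=0.2, random_state=0)
        model = build_npso_cnn(input_shape=X.shape[1:], n_classes=n_classes,
                               learning_rate=learning_rate)
        model.fit(X_tr, y_tr, epochs=50, validation_data=(X_te, y_te))   # epochs assumed
        model.evaluate(X_te, y_te)
```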

6. Results & discussion CRediT authorship contribution statement

The given figure shows the results of experiments on described Kalpana Rangra: data Acquisition, Data curation, Formal analysis,
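The chi-square association test mentioned above can be reproduced along the following lines; the sketch assumes SciPy is available, and the contingency counts shown are placeholders rather than the study's observations.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: predicted emotions; columns: Big Five traits (O, C, E, A, N).
# The counts below are illustrative placeholders, not the paper's data.
contingency = np.array([
    [40,  5,  3,  2, 10],   # anger
    [ 4, 30,  2,  1, 15],   # fear
    [ 6,  8, 35,  5,  6],   # surprise
])

chi2, p_value, dof, expected = chi2_contingency(contingency)
# A small p-value would support the hypothesis that the emotion and personality
# labels are associated rather than independent.
print(chi2, p_value, dof)
```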
7. Conclusion and future scope

Extracting significant information from speech signals is a way to find emotions and the corresponding personality. The present paper proposed a novel approach for a personality-emotion classification model using the NPSO technique. For data preprocessing, four different emotion-based datasets were normalized into a single frame format of 3 s per voice sample. The authors intentionally added noise signals to check the robustness of the proposed system under external factors. The cepstral speech features represented on the mel scale (MFCC) from the SAVEE dataset were given as input, resulting in accuracy scores of 90% for emotion detection and 91% for personality classification. The approach was then applied to find the associations among emotions from the four corpora and the Big Five personality model; the classification results on the combined datasets gave accuracies of 74% for emotion and 89% for personality classification. The classification results show that MFCC can be considered an effective feature for characterizing and recognizing emotions and personality simultaneously from audio.

The model can be integrated with other modalities, such as text and images, to analyze its performance further. A combined feature set (audio, video, text) can be useful in deriving the actual short-term personality from the emotional behavior of the speaker. The observations can further be used as medical and psychometric records.

CRediT authorship contribution statement

Kalpana Rangra: Data acquisition, Data curation, Formal analysis, Investigation, Validation, Testing, Writing – original draft. Virender Kadyan: Conceptualization, Methodology, Supervision, Writing – review & editing, final approval. Monit Kapoor: Conceptualization, Supervision, Writing – review & editing, final approval.
Fig. 10. Confusion matrix for Emotion classification on the Combined dataset.

Fig. 11. Confusion matrix for Personality classification on the Combined dataset.

Table 5
Training metrics of NPSO_CNN for Emotion and Personality classification.

TRAINING (%) | EMOTION | PERSONALITY
Accuracy | 0.865869 | 0.999819
Precision | 0.879201 | 0.999819
Recall | 0.864976 | 0.999819
F1-score | 0.866330 | 0.999819

Table 6
Testing metrics of NPSO_CNN for Emotion and Personality classification.

TESTING (%) | EMOTION | PERSONALITY
Accuracy | 0.735254 | 0.888285
Precision | 0.766725 | 0.888285
Recall | 0.739211 | 0.888285
F1-score | 0.742049 | 0.888285

Table 7
Actual vs. predicted results for classification.

Actual emotion | Predicted emotion | Actual personality | Predicted personality
Anger | Anger | O | O (99%)
Fear | Fear | C | C (97%)
Surprise | Surprise | E | E (35%), C (65%)
Sad | Neutral | N | N (95%)
Happy | Happy | A | A (45%), C (55%)
Neutral | Neutral | C | C (50%)
Disgust | Neutral | N | N (45%)
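Metrics like those reported in Tables 5 and 6, and confusion matrices like Figs. 10 and 11, can be computed with scikit-learn. In the sketch below the label arrays are placeholders; in the experiments they would be the held-out test labels and the corresponding model predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

# Placeholder arrays standing in for the test labels and model predictions.
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 0, 3, 4])
y_pred = np.array([0, 1, 2, 3, 5, 5, 6, 0, 3, 5])

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                   average="weighted", zero_division=0)
cm = confusion_matrix(y_true, y_pred)   # the basis for plots like Figs. 10 and 11
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```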
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data that has been used is confidential.

References

[1] Andrea Guidi, et al., Analysis of speech features and personality traits, Biomed. Signal Process Control 51 (2019) 1-7.
[2] Dimitrios Ververidis, Constantine Kotropoulos, Emotional speech recognition: resources, features, and methods, Speech Commun. 48 (9) (2006) 1162-1181.
[3] Pavan Paikrao, et al., Smart Emotion Recognition Framework: A Secured IOVT Perspective, IEEE Consumer Electronics Magazine, 2021.
[4] Ioulia Grichkovtsova, Michel Morel, Anne Lacheret, The role of voice quality and prosodic contour in affective speech perception, Speech Commun. 54 (3) (2012) 414-429.
[5] William Apple, Lynn A. Streeter, Robert M. Krauss, Effects of pitch and speech rate on personal attributions, J. Pers. Soc. Psychol. 37 (5) (1979) 715.
[6] Zhong Yin, et al., Recognition of emotions using multimodal physiological signals and an ensemble deep learning model, Comput. Methods Progr. Biomed. 140 (2017) 93-110.
[7] Kun-Yi Huang, et al., Mood detection from daily conversational speech using denoising autoencoder and LSTM, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
[8] Sharifa Alghowinem, From joyous to clinically depressed: mood detection using multimodal analysis of a person's appearance and speech, in: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 2013.
[9] Hilmy Muhammad Syazani Hafiy, Stress classification based on speech analysis of MFCC feature via machine learning, in: 2021 8th International Conference on Computer and Communication Engineering (ICCCE). IEEE, 2021.
[10] Yash Mehta, et al., Recent trends in deep learning based personality detection, Artif. Intell. Rev. 53 (4) (2020) 2313-2339.
[11] Di Xue, et al., Deep learning-based personality recognition from text posts of online social networks, Appl. Intell. 48 (11) (2018) 4232-4246.
[12] Rodolfo Migon Favaretto, et al., Detecting personality and emotion traits in crowds from video sequences, Mach. Vis. Appl. 30 (5) (2019) 999-1012.
[13] Alessandro Vinciarelli, Gelareh Mohammadi, A survey of personality computing, IEEE Trans. Affect. Comput. 5 (3) (2014) 273-291.
[14] Zachariah N.K. Marrero, et al., Evaluating voice samples as a potential source of information about personality, Acta Psychol. 230 (2022), 103740.
[15] Zachariah N.K. Marrero, et al., Evaluating voice samples as a potential source of information about personality, Acta Psychol. 230 (2022), 103740.
[16] Aiste Dirzyte, et al., Computer programming E-learners' personality traits, self-reported cognitive abilities, and learning motivating factors, Brain Sci. 11 (9) (2021) 1205.
[17] J. Sangeetha, R. Brindha, S. Jothilakshmi, Speech-based automatic personality trait prediction analysis, Int. J. Adv. Intell. Paradigms 17 (1-2) (2020) 91-108.
[18] Xiaoming Zhao, Zhiwei Tang, Shiqing Zhang, Deep personality trait recognition: a survey, Front. Psychol. (2022) 2390.
[19] Alessandro Vinciarelli, Gelareh Mohammadi, A survey of personality computing, IEEE Trans. Affect. Comput. 5 (3) (2014) 273-291.
[20] Le Vy Phan, John F. Rauthmann, Personality computing: new frontiers in personality assessment, Soc. Personal. Psychol. Compass 15 (7) (2021), e12624.
[21] Clifford Nass, Kwan Min Lee, Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction, J. Exp. Psychol. Appl. 7 (3) (2001) 171.
[22] Aidan G.C. Wright, Current directions in personality science and the potential for advances through computing, IEEE Trans. Affect. Comput. 5 (3) (2014) 292-296.
[23] Ravi R. Shenoy, Chandra Sekhar Seelamantula, Frequency domain linear prediction based on temporal analysis, in: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.
[24] P. Tim, S. Katrin, M. Sebastian, M. Florian, G. Mohammadi, A. Vinciarelli, On speaker-independent personality perception and prediction from speech, Quality and Usability Lab, TU-Berlin/Telekom Innovation Laboratories, Germany; Department of Computing Science, University of Glasgow, UK, in: Proceedings of Interspeech, 2012, pp. 2-5.
[25] Tim Polzehl, Sebastian Möller, Florian Metze, Automatically assessing personality from speech, in: 2010 IEEE Fourth International Conference on Semantic Computing. IEEE, 2010, pp. 134-140.
[26] Firoj Alam, Giuseppe Riccardi, Predicting personality traits using multimodal information, in: Proceedings of the 2014 ACM Multi Media on Workshop on Computational Personality Recognition, 2014, pp. 15-18.
[27] Jouni Pohjalainen, Paavo Alku, Multi-scale modulation filtering in automatic detection of emotions in telephone speech, in: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 980-984.
[28] Shashidhar G. Koolagudi, Ramu Reddy, K. Sreenivasa Rao, Emotion recognition from speech signal using epoch parameters, in: 2010 International Conference on Signal Processing and Communications (SPCOM). IEEE, 2010, pp. 1-5.
[29] Florian Eyben, Martin Wöllmer, Björn Schuller, Opensmile: the Munich versatile and fast open-source audio feature extractor, in: Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459-1462.
[30] Sabur Ajibola Alim, N. Khair Alang Rashid, Some Commonly Used Speech Feature Extraction Algorithms, IntechOpen, London, UK, 2018, pp. 2-19.
[31] Meenakshi Sood, Shruti Jain, Speech recognition employing MFCC and dynamic time warping algorithm, Innov. Inform. Commun. Technol. (IICT-2020) (2021) 235-242. Springer, Cham.
[32] Virender Kadyan, Puneet Bawa, Transfer learning through perturbation-based in-domain spectrogram augmentation for adult speech recognition, Neural Comput. Appl. (2022) 1-19.
[33] Virender Kadyan, Puneet Bawa, Taniya Hasija, In domain training data augmentation on noise robust Punjabi Children speech recognition, J. Ambient Intell. Hum. Comput. 13 (5) (2022) 2705-2721.
[34] Mohit Dua, Rajesh Kumar Aggarwal, Mantosh Biswas, Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling, Neural Comput. Appl. 31 (10) (2019) 6747-6755.
[35] Virender Kadyan, Mohit Dua, Poonam Dhiman, Enhancing accuracy of long contextual dependencies for Punjabi speech recognition system using deep LSTM, Int. J. Speech Technol. 24 (2) (2021) 517-527.
[36] Clemens Stachl, et al., Predicting personality from patterns of behavior collected with smartphones, Proc. Natl. Acad. Sci. USA 117 (30) (2020) 17680-17687.
[37] Navonil Majumder, et al., Deep learning-based document modeling for personality detection from text, IEEE Intell. Syst. 32 (2) (2017) 74-79.
[38] Chanchal Suman, et al., A multi-modal personality prediction system, Knowl. Base Syst. 236 (2022), 107715.
[39] Xu Jia, Weijian Tian, Yangyu Fan, Physiognomy in new era: a survey of automatic personality prediction based on facial image, in: International Conference on Internet of Things as a Service, Springer, Cham, 2018.
[40] Yulong Wang, Haoxin Zhang, Guangwei Zhang, cPSO-CNN: an efficient PSO-based algorithm for fine-tuning hyper-parameters of convolutional neural networks, Swarm Evol. Comput. 49 (2019) 114-123.
[41] Tianyang Li, Haoyan Luo, Chenyu Wu, A PSO-based fine-tuning algorithm for CNN, in: 2021 5th Asian Conference on Artificial Intelligence Technology (ACAIT). IEEE, 2021, pp. 704-709.
[42] Navya Damodar, H.Y. Vani, M.A. Anusuya, Voice emotion recognition using CNN and decision tree, Int. J. Innovative Technol. Explor. Eng. 8 (2019) 4245-4249.
[43] Li Deng, Geoffrey Hinton, Brian Kingsbury, New types of deep neural network learning for speech recognition and related applications: an overview, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8599-8603.
[44] Ruhul Amin Khalil, et al., Speech emotion recognition using deep learning techniques: a review, IEEE Access 7 (2019) 117327-117345.
[45] Semiye Demircan, Humar Kahramanlı Örnek, Comparison of the effects of mel coefficients and spectrogram images via deep learning in emotion classification, Trait. Du. Signal 37 (2020) 51-57.
[46] Andrew L. Maas, et al., Building DNN acoustic models for large vocabulary speech recognition, Comput. Speech Lang 41 (2017) 195-213.
[47] Steven R. Livingstone, Frank A. Russo, The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English, PLoS One 13 (5) (2018), e0196391.
[48] Houwei Cao, et al., Crema-D: crowd-sourced emotional multimodal actors dataset, IEEE Trans. Affect. Comput. 5 (4) (2014) 377-390.
[49] Philip Jackson, Haq SjuoSG, Surrey Audio-Visual Expressed Emotion (SAVEE) Database, University of Surrey: Guildford, UK, 2014.
