Emotion Analysis and Prediction From Voice Recordings


Narek Tumanyan
Edvard Avagyan
Supervisor: Ashot Harutyunyan

BS Capstone Project
American University of Armenia
Akian College of Science and Engineering
Yerevan, Armenia
2020


Abstract

This project addresses the problem of recognizing the mood of a speaker from their
voice. Statistical and exploratory analyses were performed on speech data, and
statistical learning and deep learning models were trained on features obtained from
diverse signal processing techniques. Various classification and regression approaches
were tried, and the obtained results and identified difficulties were analysed.

1. Introduction
2. Related Work
3. Datasets
4. Feature Extraction
4.1 Frequency Decomposition
4.2 Mel Frequency Cepstral Coefficients (MFCCs)
4.3 Continuous Wavelet Transform (CWT)
4.4 Component-Activation Decomposition
4.5 Fundamental Frequency Features
5. Methodology
5.1 Statistical Analysis
5.2 Prediction Approaches
5.2.1 Discrete Emotion Classification
5.2.2 Emotional Dimensions, Continuous Sentiment Arousal Space
5.3 Neural Network Architectures
5.3.1 Convolutional Neural Networks
5.3.2 Long Short Term Memory Networks
6. Experiments
6.1 Exploratory Data Analysis
6.1.1 Analysing Frequency Spectrum
6.1.2 Analysing Cepstral Peaks
6.1.3 MFCC Mean and Variance
6.1.4 Component-Activation Analysis
6.2 Classification Using Discrete 8 Emotions
6.2.1 CNN model on MFCCs
6.2.2 LSTM model on MFCCs
6.3 Classification Using Sentiment-Arousal Space
6.3.1 LSTM Models on MFCCs for Sentiment-Arousal Space Zones
6.3.2 CNN Model on Wavelet Transform for Sentiment-Arousal Space Zones
6.4 Mapping Emotion to Continuous Sentiment Arousal Space
7. Conclusion and Future Work
7.1 Retrospective and Future Work
7.2 Systems and Applications
1. Introduction
Existing approaches to determining speaker emotion make use of spectral features of the
voice signal. In particular, mel frequency cepstral coefficients (MFCCs) are popular
among approaches to this problem. Their popularity stems from speech recognition
systems, which use this feature with success. Moreover, due to their well-known
predictive power, machine learning algorithms are widely used to solve the emotion
recognition problem. However, there are several obstacles to providing generalizable
solutions. Labeled speech datasets for speaker emotion recognition are scarce. This is a
problem because human speech varies widely across languages, pronunciations and
individual voices. Hence, it is hard for ML algorithms trained on small datasets to
classify emotions in speech in general. There is also the question of how well mel
frequency cepstral coefficients describe emotions in human speech. To overcome these
issues, some approaches combine spectral features of the speech with other features. For
instance, facial features are used as descriptive features of human emotion.

This paper tackles the speaker emotion recognition problem from several viewpoints.
First, we applied statistical analysis methods to spectral features of the human voice
to classify emotion in speech and to identify which characteristics determine the
emotion in the speech. We also experimented with methods such as the continuous wavelet
transform, component-activation decomposition and fundamental frequencies in order to
compare their performance against the widely used mel frequency cepstral coefficients.
We used several neural network architectures to solve the prediction problems due to
their high predictive capacity. Lastly, we explored some potential methods of reducing
the complexity of the speaker emotion recognition problem. Reduced problems yield better
results and can be used instead of the original problem, depending on the application.
Namely, we experimented with reducing the discrete emotion classification problem to
classifying the activity, the binary sentiment, and the combination of both in the
speech. The work on feature extraction, methods and results is described in Sections 4,
5 and 6.

The paper closes with a retrospective on the experiments and a discussion of some
potential applications for an emotion classification algorithm.
2. Related Work
There have been previous approaches to classifying the emotion of speech using machine
learning models. Most of these approaches are similar to each other, and their steps can
be generalized in the following way:

1. Extraction of waveforms of voices from databases. Each voice recording in the
database is sampled at a fixed sampling rate. This step results in a floating
point time series which describes the amplitude of the recording along the time
axis.
2. Conversion of the time-domain representation of the recording to the frequency
domain. This step is accomplished using a fast Fourier transform algorithm. It results
in a sequence which describes the presence of different frequency ranges in the original
time series.
3. Extraction of Mel Frequency Cepstral Coefficients (MFCCs) from the frequency
representation of each voice recording. MFCCs are the most commonly used features
for classification problems related to speech.
4. Training a classifier on the extracted features to classify the mood of the recording.
Most of the approaches tried to classify discrete emotions or to classify the relative
intensity of the speech, which also implies emotion.

Results varied depending on the chosen architectures and databases. What the approaches
have in common, though, is their limitations. These are mostly related to the databases:
there are too few recordings and too few actors performing them. This commonly leads to
classifiers learning speaker-specific information, which prevents generalization to new
speakers or new phrases.

One paper uses an SVM classification algorithm to classify natural human emotion
expressions into 5 categories: angry, happy, neutral, sad, or excited. The dataset used
is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [7]. The approach
is distinctive in that it also considers the facial expression of the speaker during
speech to classify it into one of the 5 categories [5]. Another paper examines the Deep
Neural Network Extreme Learning method, which is efficient for small datasets. It also
uses the IEMOCAP database [6]. These approaches preprocessed the recordings and removed
parts of speech that corresponded to silence or noise.
3. Datasets
This paper focuses on classifying 8 emotions from voice recordings. The databases used in
the paper are the following: the Ryerson Audio-Visual Database of Emotional Speech and
Song (RAVDESS) [1], Surrey Audio-Visual Expressed Emotion (SAVEE) [2] and the Toronto
Emotional Speech Set (TESS) [3]. Our data are voice recordings of different actors who
pronounce some statements and express a certain emotion in each recording. Voice
recordings in the databases come in the .wav format, which describes the amplitude of
air pressure oscillations in the time domain. Each voice recording has an emotion label
attached to it. The RAVDESS database has 24 actors who pronounce 2 phrases, “Kids are
talking by the door” and “Dogs are sitting by the door”, at 2 intensities, normal and
high, each repeated twice. The neutral emotion has no high intensity, so it is only
repeated twice. The emotion labels are: (01 = neutral, 02 = calm, 03 = happy, 04 = sad,
05 = angry, 06 = fearful, 07 = disgust, 08 = surprised). The TESS dataset has 2 actors,
one young and one old, both of them female. There are 2800 recordings in total, with each
phrase being of the form “Say the word x”, where x stands for some word. Recordings in
the TESS dataset have the same emotion labels as in RAVDESS, except for the calm label,
which is not present in this dataset. The SAVEE dataset has 4 English male actors with
480 voice recordings. 7 emotions are present, with the calm emotion missing. In total,
there are 4720 samples. The distribution of samples and classes is summarised below:

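| Database | Number of recordings | Number of actors | Emotion labels |
|----------|----------------------|------------------|----------------|
| RAVDESS  | 1440                 | 24               | 8              |
| SAVEE    | 480                  | 4                | 7              |
| TESS     | 2800                 | 2                | 7              |

| Neutral | Calm | Sad | Fear | Anger | Surprise | Happiness | Disgust |
|---------|------|-----|------|-------|----------|-----------|---------|
| 616     | 192  | 652 | 652  | 652   | 652      | 652       | 652     |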
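
As a small illustration of how the labels above can be attached to recordings, the sketch below decodes the emotion code of a RAVDESS file. It assumes the RAVDESS file naming convention of seven two-digit fields separated by hyphens, with the emotion code in the third field; that convention and the helper name `ravdess_emotion` are not taken from this paper. The recording counts are the ones from the table above.

```python
# Emotion codes as listed in the text above.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

# Recording counts per database, copied from the table above.
RECORDINGS = {"RAVDESS": 1440, "SAVEE": 480, "TESS": 2800}


def ravdess_emotion(filename: str) -> str:
    """Return the emotion label encoded in a RAVDESS-style file name.

    Assumption: names look like '03-01-06-01-02-01-12.wav', where the
    third field is the emotion code.
    """
    stem = filename.rsplit("/", 1)[-1].split(".")[0]
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected file name: {filename!r}")
    return RAVDESS_EMOTIONS[fields[2]]


total_recordings = sum(RECORDINGS.values())
example_label = ravdess_emotion("03-01-06-01-02-01-12.wav")
print(example_label, total_recordings)
```

The SAVEE and TESS databases use different naming schemes, so each would need its own decoder.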
4. Feature Extraction
To extract audio features from voice recordings we used the librosa library for Python
[9]. It handles most of the transformations applied to voice recordings to get the final
features used for classification. The first step before extracting features is to
resample the voice recording files to obtain their time-domain amplitude representation.
Voice recordings in our databases have different original sampling rates, which range
from 22 kHz to 48 kHz. However, the content that we are trying to analyze in those
recordings is the human voice itself. Normally, the human voice ranges from low
frequencies around 300 Hz to higher ranges of 4-10 kHz. This means that we can use lower
sampling rates to resample our voice recordings. We chose a 22.05 kHz sampling rate,
which by the Nyquist criterion captures frequencies up to about 11 kHz. It therefore
preserves all human voices in the original audio recordings and also preserves possible
frequency deviations from the normal range, which can be caused by pronouncing
high-frequency sounds such as fricatives. The result is a floating point time series
describing the amplitude of air pressure oscillations around a mean of 0 at each time
point. Thus, we obtain a time-domain representation of the signal. An example is
illustrated below in Fig 4.1.1.

Fig 4.1.1
After obtaining the time-domain representation of each audio sample, the features used
in the rest of the paper are obtained through the transformations described below.
4.1 Frequency Decomposition
A signal in the time domain can be represented as a combination of sinusoids with
different oscillation frequencies. A sinusoid of frequency f hertz can be described as
follows [10]:
$$x(t) = a \sin(2\pi f t + \phi)$$
The frequency decomposition of a time-domain signal therefore identifies, conceptually,
how much of each sinusoid of a given frequency was combined to produce the signal.
Hence, the intensities of the frequencies present in the original signal can be
computed. A tool used to achieve this is the Discrete Fourier Transform (DFT), which
converts a time-domain signal of length N into another signal in the frequency domain
[11]:
$$X(k) = \sum_{n=1}^{N} x(n)\, e^{-i 2\pi k n / N}, \quad 1 \le k \le K$$
After this transformation we obtain the sequence X(k), which describes the intensity of
the k-th frequency component in the original signal.
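
The DFT sum can be evaluated directly as a matrix-vector product. The following is a minimal sketch rather than the implementation used in this work; it uses zero-based indices (n and k running from 0 to N-1), which is the convention of `numpy.fft.fft`, and the test signal and the name `naive_dft` are made up for illustration.

```python
import numpy as np


def naive_dft(x):
    """Evaluate X(k) = sum_n x(n) exp(-i 2 pi k n / N) for every k (zero-based)."""
    x = np.asarray(x, dtype=complex)
    N = x.shape[0]
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # Row k of the matrix holds the complex sinusoid of frequency index k.
    return np.exp(-2j * np.pi * k * n / N) @ x


# One second of a 5 Hz sinusoid sampled at 64 Hz, so bin k corresponds to k Hz.
sr = 64
t = np.arange(sr) / sr
signal = 1.5 * np.sin(2 * np.pi * 5 * t)

spectrum = naive_dft(signal)
# Only the first half of the bins is unique for a real-valued signal.
peak_bin = int(np.argmax(np.abs(spectrum[: sr // 2])))
print(peak_bin)
```

The direct sum costs O(N^2) operations; the FFT computes the same values in O(N log N).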

In signal processing, frequency decomposition is often performed by dividing the signals
into time intervals of specified window size and performing DFT on each windowed
signal, thus coming up with frequency components in multiple time intervals. Such
representation of a signal is called the Short-Time Fourier Transform (STFT) of a signal.
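
The windowing procedure described above can be sketched with NumPy as follows. The frame length, hop size and test signal are illustrative assumptions, not the settings used in our experiments, and the Hamming window is applied to each frame before the DFT.

```python
import numpy as np


def stft_magnitude(x, frame_len=256, hop=128):
    """Split x into overlapping frames and return the DFT magnitude of each frame."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.abs(np.fft.rfft(frames, axis=1))


# One second at 8 kHz: 440 Hz in the first half, 1760 Hz in the second half.
sr = 8000
t = np.arange(sr) / sr
x = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 1760 * t))

S = stft_magnitude(x)
freqs = np.fft.rfftfreq(256, d=1 / sr)
first_peak = freqs[np.argmax(S[0])]   # dominant frequency in the first frame
last_peak = freqs[np.argmax(S[-1])]   # dominant frequency in the last frame
print(S.shape, first_peak, last_peak)
```

Each row of `S` describes one time window, so the change from the first tone to the second is visible along the first axis, which a single DFT of the whole signal would not show.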
4.2 Mel Frequency Cepstral Coefficients (MFCCs)

MFCCs are features which represent a given signal by cepstral energy coefficients at
specific short intervals of time. The advantage of MFCC features is that they represent the
signal in a way that is close to the signal perception by the human ear, which, intuitively,
is achieved by applying smaller window-sized cepstral filters on low frequencies on a
signal and increasing the window size of the filters as the considered frequency increases.
The reason behind such intuition is that the human ear perceives frequencies in lower
ranges much better than in higher ones, so one obtains higher frequency resolution on
lower ranged frequencies while computing MFCCs [4].

To calculate the MFCCs, the signal is first decomposed into several overlapping
sub-signals (windows). The principles for choosing the length of those windows are
described later. The FFT is applied to each window of the original signal, and the
result is another sequence which describes the frequencies in the original signal
through time. This process is called a short-time Fourier transform (STFT). An
illustration is given in Fig 4.2.1.

Fig 4.2.1

$$S_i(k) = \sum_{n=1}^{N} s_i(n)\, h(n)\, e^{-i 2\pi k n / N}, \quad 1 \le k \le K \tag{1}$$
where $S_i(k)$ is the intensity of frequency k in the i-th time window, $s_i(n)$ is the
value of the i-th window of the original signal at time point n, and h(n) is an N-point
Hamming window. We then take the absolute value of the complex Fourier transform and
square it to obtain the power spectrum estimate:
$$P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2 \tag{2}$$
The next step is to calculate the MFCCs from the frequency decomposition of the signal.
MFCCs are coefficients which describe the overall shape of the frequency spectrum. They
are beneficial because there are far fewer of them than there are STFT coefficients, and
yet they are able to represent the sound wave accurately. After the STFT, each window of
the original signal has around 2048-8192 coefficients, each describing how much of the
corresponding frequency was present in the window. After computing MFCCs we are left
with far fewer features. Computing MFCCs involves passing the FFT results through mel
filters and computing the DCT of the filter outputs to obtain coefficients which
describe the frequency spectrum itself. Mel filters are necessary to adjust the computed
frequencies to human hearing [4]. The more DCT coefficients we keep, the more features
describing the frequency spectrum are employed.

Now, let us elaborate on the steps of computing MFCCs.
1. We have the power spectrum estimates $P_i(k)$ of each windowed signal, where i is the
index of the window and k is the index of the frequency; thus $P_i(k)$ gives the
intensity of frequency k in time window i.
2. We compute triangular mel filter banks, which are essentially vectors whose length
equals the range of frequencies obtained from the FFTs (let us denote the latter by
$f_{max}$). Each filter bank is used for computing the energy of the windowed signal in
a specific range of frequencies.

Fig 4.2.2

An example of a mel filter bank is depicted above; it is used for computing the
energy of the signal in the frequency range 100-150 Hz. The values of the vector are 0
outside this range and, inside it, are set so that the central frequency gets the
largest coefficient. In principle, the closer the central frequency is to 0, the
narrower the triangular filter bank (and vice versa). The reason is that the human ear
distinguishes energies more clearly at lower frequencies, so we need narrower (and
therefore more numerous) filter banks for lower frequency ranges in order to resolve
low-frequency energies better.
3. For the power spectrum $P_i(k)$ of each windowed signal and for each filter bank
vector $H_n(k)$, where n ranges from 0 to the number of mel coefficients that we want
to compute, we calculate the mel coefficient as the sum
$$m_n = \sum_{k=0}^{f_{max}} H_n(k)\, P_i(k)$$
Then we take the log of each mel coefficient, $m_n = \log(m_n)$, so that we finally
obtain the log filterbank energies for each window.
4. We apply the discrete cosine transform to the log filterbank energies:
$$M_k = \frac{1}{2} m_0 + \sum_{n=1}^{N} m_n \cos\left[\frac{\pi}{N}\, n \left(k + \frac{1}{2}\right)\right], \quad 0 \le k \le N-1 \tag{3}$$
where N is the number of mel coefficients, $m_n$ is the n-th log mel coefficient from
the step above and $M_k$ is the k-th MFCC [4].
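
The steps above can be sketched for a single window as follows. This is an illustrative NumPy version under stated assumptions, not the librosa implementation: the Hz-to-mel conversion uses the common formula 2595 log10(1 + f/700), which is not given in this paper; the log energies are indexed from 0 to N-1, so the cosine sum stops at N-1; and the frame length, number of filters and function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the paper does not state one).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_filters, len(bin_freqs)))
    for n in range(n_filters):
        lo, mid, hi = hz_edges[n : n + 3]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        # Zero outside [lo, hi], largest weight at the central frequency.
        bank[n] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc_frame(frame, sr, n_filters=20):
    """MFCCs of one window: power spectrum, mel energies, log, cosine transform."""
    N = len(frame)
    spectrum = np.fft.rfft(frame * np.hamming(N))
    power = np.abs(spectrum) ** 2 / N                    # power spectrum estimate
    energies = mel_filterbank(n_filters, N, sr) @ power  # mel coefficients m_n
    log_e = np.log(energies + 1e-10)                     # log filterbank energies
    n = np.arange(1, n_filters)
    k = np.arange(n_filters).reshape(-1, 1)
    # Cosine transform of the log energies, with the sum running to n_filters - 1.
    return 0.5 * log_e[0] + np.cos(np.pi / n_filters * n * (k + 0.5)) @ log_e[1:]


sr = 22050
frame = np.sin(2 * np.pi * 440 * np.arange(512) / sr)
bank = mel_filterbank(20, 512, sr)
coeffs = mfcc_frame(frame, sr)
print(bank.shape, coeffs.shape)
```

In the filter bank, the lowest filter spans only a few frequency bins while the highest spans many, which is the narrowing toward low frequencies described in step 2.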

MFCCs are visualised below.

Fig 4.2.3

The number of windows determines how accurately we can describe the change of
frequencies over time. If the number of windows is large, more individual parts of the
original signal are decomposed into frequencies. On the other hand, if the number of
windows is too large, we lose the general frequency description of the whole signal,
which means there is a tradeoff between describing frequency changes over time and
describing frequencies over the whole signal. It is therefore necessary to try
classifying the same examples with different window lengths in order to find an optimal
one. We should also choose the length of the Fourier transform on each window. Since
human voices range from around 300 Hz to 5 kHz, it is reasonable to choose a value from
that range. The precise process of decomposing the signal into frequencies involves the
discrete Fourier transform (DFT), which describes oscillations as a sum of sinusoids of
different periodicity. The algorithm used in librosa to compute the exact DFT is the
fast Fourier transform (FFT).
4.3 Continuous Wavelet Transform (CWT)

The continuous wavelet transform is a method of analysing the frequency components of
a signal in specific time intervals. The advantage that the CWT has over the STFT is
that it addresses the trade-off between frequency resolution and time resolution. When
performing an STFT on a signal, one has to choose the window length for dividing the
whole signal into sub-signals and performing the DFT on each window. The larger the
window size, the higher the frequency resolution (the frequency components are better
explained on the scale of the whole signal) and the lower the time resolution (the
temporal changes of frequencies in the signal are not explained thoroughly). The
opposite holds as well: an STFT with a smaller window has a higher time resolution (it
indicates changes in frequency over time better) but a lower frequency resolution (it
does not properly show the frequency distribution over the whole signal). The CWT
addresses this “resolution trade-off” problem by analysing the signal at different
frequency scales, a larger scale meaning a lower frequency and vice versa. Large-scale
analysis preserves the frequency resolution of the transformation by deriving
frequency-related data from large parts of the signal, and small-scale analysis
preserves the time resolution by deriving data on abrupt changes of the signal in short
intervals.

The CWT makes use of wave-like functions called wavelets. At each step of the algorithm,
a chosen wavelet function is convolved with the original signal to derive the
corresponding frequency-domain value. As described in [15], the conditions for a
function f(t) to qualify as a wavelet are the following (complex wavelets are not
considered in this paper, so the conditions relate to real-valued wavelets only):
1. The function has finite energy,
$$E = \int_{-\infty}^{\infty} |f(t)|^2 \, dt < \infty$$
where E is called the energy of the wavelet,
2. If F(k) is the Fourier transform of f(t), then it must be true that
$$\int_{0}^{\infty} \frac{|F(k)|^2}{k} \, dk < \infty$$
which, intuitively speaking, means that f(t) has a zero mean.

As stated in [15], the most commonly used wavelet functions are the Gaussian wave,
Mexican hat, Haar and Morlet wavelets, the last of which we used in speech signal
processing (visualised below).

Fig 4.3.1

After choosing a wavelet function ψ(t), the CWT of the signal x(t) is computed as
follows:
$$T(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t - b}{a}\right) dt$$
where a is the scale at which the signal is analysed, and b is the time shift of the
wavelet function. Thus, the original signal is processed at different scales and time
points. It is crucial to note that the larger the scale, the lower the frequency of the
transformed wavelet function; it then covers a wider window, which raises the frequency
resolution while reducing the number of observable time points along the signal (the
opposite is true as well). Thus, with the CWT, we obtain the representation of the
signal at various frequency scales for specific time intervals. An example of a heatmap
resulting from the CWT is visualised in Fig 4.3.2.

Fig 4.3.2
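
The CWT integral can be approximated by a discrete sum over the samples of the signal. The sketch below is a direct, unoptimized evaluation and not the routine used to produce the figures; it assumes the real-valued Morlet form exp(-t^2/2) cos(5t), and the scales and test signal are chosen purely for illustration.

```python
import numpy as np


def morlet(t, w0=5.0):
    # Real-valued Morlet wavelet (assumed form; w0 sets its centre frequency).
    return np.exp(-t ** 2 / 2.0) * np.cos(w0 * t)


def cwt(x, scales, dt):
    """Approximate T(a, b) on a grid: rows are scales a, columns are shifts b."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x)) * dt
    out = np.empty((len(scales), len(x)))
    for i, a in enumerate(scales):
        # kernel[b_index, t_index] = psi((t - b) / a)
        kernel = morlet((t[None, :] - t[:, None]) / a)
        out[i] = kernel @ x * dt / np.sqrt(a)
    return out


# A 10 Hz sinusoid sampled at 200 Hz for two seconds.
sr = 200
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 10 * t)

# With w0 = 5, scale a responds most strongly to frequency w0 / (2 pi a),
# so the middle scale below is matched to 10 Hz.
scales = [0.02, 5.0 / (2 * np.pi * 10), 0.3]
coeffs = cwt(x, scales, dt=1 / sr)
best_scale = int(np.argmax(np.abs(coeffs).mean(axis=1)))
print(coeffs.shape, best_scale)
```

Plotting the absolute value of `coeffs` with scale on one axis and time on the other gives a heatmap of the same kind as the one referenced above.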
4.4 Component-Activation Decomposition
Component-activation decomposition is a method of decomposing a given signal
spectrogram into two nonnegative matrices by means of nonnegative matrix factorization
(NMF). As described in [16], NMF approximates the spectrogram of a signal by a product
of two nonnegative matrices, where the first matrix represents the spectral patterns
present in the signal, and the second matrix represents the activations of those
spectral patterns at specific time points. [16] describes such a decomposition as a
significant advantage in machine learning algorithms and spectrogram processing, as the
derived factors make the properties and structures present in the given spectrogram
easier to interpret.

Formally speaking, given a nonnegative spectrogram matrix $V \in \mathbb{R}^{K \times N}$,
we aim to find nonnegative matrices $W \in \mathbb{R}^{K \times R}$ and
$H \in \mathbb{R}^{R \times N}$ such that
$$V \approx W \cdot H$$
where W is referred to as the template or components matrix, and H is called the
activations matrix. Achieving such an approximation requires optimization techniques,
which are described in detail in [16].

In the context of our problem, considering the fact that all data observations are speech
recordings performed by only one person in each recording, we leveraged the advantage
of component-activation decomposition by factoring the signal to a template matrix with
only one component, which represents the spectral patterns in the voice of the speaker.
This method also allows us to derive a representation of a signal without a time
dimension, which makes the usage of statistical learning tools easier.
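
One common way to compute such a factorization is the multiplicative update rule for the squared-error objective. The sketch below uses that rule as an assumption; it is not necessarily the optimizer described in [16] or the one used in our experiments, and the toy matrix stands in for a real spectrogram.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (K x N) into W (K x rank) and H (rank x N)."""
    rng = np.random.default_rng(seed)
    K, N = V.shape
    W = rng.random((K, rank)) + eps
    H = rng.random((rank, N)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy "spectrogram": one spectral pattern whose activation changes over time.
pattern = np.array([1.0, 2.0, 3.0])
activation = np.array([4.0, 5.0, 6.0, 7.0])
V = np.outer(pattern, activation)

# A single component, as in the one-speaker setting described above.
W, H = nmf(V, rank=1)
print(W.shape, H.shape)
```

With one component, the single column of `W` is a fixed-length description of the recording that no longer depends on its duration.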
4.5 Fundamental Frequency Features

As defined in [17], the fundamental frequency of a signal (also referred to as the pitch) is
the lowest frequency component (partial) of the given signal. It is stated in [17] that the
fundamental frequency holds significant semantic information related to the signal that is
beyond the human perception of the voice, which can be efficiently utilized in speech
processing algorithms.

For estimating the pitch of a given signal, [17] suggests three general approaches:
time-domain methods, frequency-domain methods and statistical frequency-domain
methods. For our purposes, we decided to apply the second approach, in the form of
cepstrum analysis. This approach is encouraged by [17] because frequency analysis can
provide information about harmonically related partials in the signal, which are closely
related to the fundamental frequency.

In short, cepstrum analysis is accomplished by performing a Fourier transform on the log
of the frequency spectrum of the given signal (intuitively, the cepstrum domain is
close to the time domain because the Fourier transform is applied to the signal twice).
The process of cepstrum analysis is illustrated in Fig 4.5.1.

Fig 4.5.1

After taking the Fourier transform of the signal, we have to detect the regularly spaced
peaks of the obtained spectrum, which represent the harmonic spectrum of the signal.
After taking the log of the spectrum to bring the amplitudes to a usable scale, the
resulting periodic log spectrum is Fourier transformed to detect the distance between
the peaks, which is related to the fundamental frequency of the original signal.
Finally, the position of the maximum of the obtained cepstrum gives the period of the
original waveform.

In our analysis, for each recording, we executed the calculation described above and
stored the amplitude of the peak of the cepstrum of each recording.
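
The calculation can be sketched as follows. This is an illustrative version under assumptions: the search range for the pitch, the Hamming window, and the synthetic pulse train used as input are not taken from our experiments, and the function name `cepstral_peak` is hypothetical.

```python
import numpy as np


def cepstral_peak(x, sr, fmin=60.0, fmax=400.0):
    """Return (estimated pitch in Hz, amplitude of the cepstral peak)."""
    x = np.asarray(x, dtype=float)
    windowed = x * np.hamming(len(x))
    # Log of the magnitude spectrum; the small constant avoids log(0).
    log_spectrum = np.log(np.abs(np.fft.fft(windowed)) + 1e-10)
    # Second Fourier transform: index q of the result is a lag of q samples.
    cepstrum = np.abs(np.fft.fft(log_spectrum))
    # Only search lags that correspond to pitches between fmin and fmax.
    q_lo, q_hi = int(sr / fmax), int(sr / fmin)
    q_peak = q_lo + int(np.argmax(cepstrum[q_lo:q_hi]))
    return sr / q_peak, float(cepstrum[q_peak])


# Synthetic voiced sound: one pulse every 40 samples at 8 kHz, i.e. 200 Hz.
sr = 8000
x = np.zeros(2000)
x[::40] = 1.0

# The range is kept narrow here so that the repeated peak at twice the
# period (lag 80) is excluded from the search.
f0, amplitude = cepstral_peak(x, sr, fmin=120.0, fmax=400.0)
print(f0, amplitude)
```

The second return value is the peak amplitude, which is the quantity stored per recording in our analysis.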
5. Methodology
5.1 Statistical Analysis

Exploratory data analysis was used as one of the approaches to the problem. Statistical
models were used to predict the mood of the speech and to understand which
characteristics determine the mood of the voice. To summarise and compare speech with
different moods, we produced visualisations that give insight into the features of the
voice recordings. The statistical methods used for predictions and insights are random
forests, CN2 rule induction and SVM. Random forests and CN2 rule induction were used to
analyse the frequency spectrum of the voice recordings: random forests for
classification, and CN2 rule induction to explain the characteristics of mood in the
recordings.

Tree-based methods try to split the dataset into smaller parts in such a way that the
observations in each part belong to the same class. They do so with an iterative
algorithm which poses a condition on an input feature and thereby splits the data into
2 parts. During each split, the decision tree construction algorithm considers all
features to find the condition that best splits the data into smaller partitions. To
find the best split, the algorithm computes an impurity measure for each candidate and
chooses the split that results in the lowest impurity. In this way, after each split,
there is more certainty that the new subgroup of samples belongs to the same category.
The more splits are performed, the more complex the decision rules become and the more
accurate the results can be. For the impurity measure, we used the Gini index, defined
as:
$$G = 1 - \sum_{i=1}^{m} p_i^2$$
where $p_i$ is the probability of samples in the subgroup belonging to the i-th class
and m is the number of classes. To obtain a random forest, many decision trees are
constructed, but each tree uses only a subset of the features for splitting the dataset
and classifying observations. In the end, the class predicted by the random forest is
the class that the majority of decision trees predict.
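
A short sketch of the impurity computation is given below. The Gini index follows the definition above; scoring a split by the size-weighted average impurity of its two parts is a common convention that we assume here, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index G = 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini index of the two parts produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


pure = gini(["sad", "sad", "sad"])
mixed = gini(["sad", "happy"])
uniform_8 = gini(["neutral", "calm", "happy", "sad",
                  "angry", "fearful", "disgust", "surprised"])
perfect_split = split_impurity(["sad", "sad"], ["happy", "happy"])
print(pure, mixed, uniform_8, perfect_split)
```

A subgroup containing a single class has impurity 0, while a subgroup with the 8 moods in equal proportion has the largest possible value for 8 classes, 1 - 1/8.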

CN2 rule induction tries to find rules of the form IF R1, R2, R3, … THEN CLASS. The
CN2 algorithm works iteratively: at each iteration it searches for a set of conditions
that covers many examples of a single class but few from other classes. As a result,
the obtained rules show which characteristics lead to each classification decision. To
find the best rule at each iteration, the CN2 algorithm computes a function that scores
the quality of the rule on the examples that the rule covers [12]. In order to find
more general rules that define each of the 8 moods, we used the Laplace expected
accuracy, defined as:
$$L = \frac{n_c + 1}{n_a + m}$$
where $n_c$ is the number of samples of the predicted class covered by the rule, $n_a$
is the total number of samples covered by the rule and m is the total number of classes.
Compared to entropy, this measure favors rules that cover more examples of the target
class [13].
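
The effect of this measure can be seen on a few made-up rules. The coverage counts below are hypothetical and only illustrate the formula with m = 8 classes.

```python
def laplace_accuracy(n_class, n_all, n_classes):
    """Laplace expected accuracy L = (n_c + 1) / (n_a + m)."""
    return (n_class + 1) / (n_all + n_classes)


# Three hypothetical rules evaluated for m = 8 moods.
tiny_pure_rule = laplace_accuracy(n_class=2, n_all=2, n_classes=8)      # 3 / 10
small_pure_rule = laplace_accuracy(n_class=10, n_all=10, n_classes=8)   # 11 / 18
large_rule = laplace_accuracy(n_class=95, n_all=100, n_classes=8)       # 96 / 108
print(tiny_pure_rule, small_pure_rule, large_rule)
```

Although the first two rules make no mistakes on the samples they cover, the third one scores highest because it covers many more examples of the target class, which is the preference for general rules described above.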

Support vector machine (SVM) classification is a statistical learning algorithm suited
for binary classification. The goal of SVM is to define a hyperplane in an N-dimensional
space, where N is the number of features of the observations, that clearly separates the
examples of one class from the other. That is, the algorithm seeks parameters (w, b) of
an N-dimensional hyperplane given by $w^T x + b = 0$ so that the samples from one class
are on one side of the plane while the observations from the other class are on the
other side. Moreover, SVM seeks to maximize the distance from the hyperplane to the
closest observations of both classes, meaning that the gap between the hyperplane and
the observations is maximal. The algorithm solves an optimization problem where the
objective is to maximize the gap to the closest point of each class under the constraint
that the observations lie on separate sides of the hyperplane. In case the observations
are not separable, the algorithm uses a feature mapping to map the observations to a
higher-dimensional space and then searches for the hyperplane in that space. Further
details of the theory and implementation are discussed in [19].
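
The quantity that SVM maximizes can be illustrated by comparing two candidate hyperplanes on a toy dataset. The sketch below only evaluates the gap for given parameters (w, b); it does not solve the optimization problem, and the data points are made up.

```python
import numpy as np


def geometric_margin(w, b, X, y):
    """Smallest signed distance from the hyperplane w.x + b = 0 to any sample.

    Labels y must be +1 or -1. A positive result means that every sample lies
    on the correct side of the hyperplane.
    """
    w = np.asarray(w, dtype=float)
    return float(np.min(y * (X @ w + b)) / np.linalg.norm(w))


# Two classes separated along the first feature.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])

centered = geometric_margin([1.0, 0.0], -1.5, X, y)  # hyperplane x1 = 1.5
shifted = geometric_margin([1.0, 0.0], -1.0, X, y)   # hyperplane x1 = 1.0
print(centered, shifted)
```

Both hyperplanes separate the classes, but the centered one leaves a larger gap to the closest samples, so it is the one an SVM would prefer.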
5.2 Prediction Approaches
5.2.1 Discrete Emotion Classification

One of the approaches to achieve emotion recognition was to classify speech
recordings from our datasets into 8 categories according to the mood of the speaker.
This is essentially the 8 class classification problem commonly used to solve speaker
emotion recognition.
5.2.2 Emotional Dimensions, Continuous Sentiment Arousal Space

The challenges that we faced while solving the problem with the discrete emotion
classification method made us come up with a novel approach to the problem. We
decided to reduce the problem of classifying eight discrete emotions to a more
abstract one, which can be explained by the visualization depicted below.

Fig 5.2.1

An interesting interpretation of the eight discrete emotions can be obtained when
defining the two dimensions presented above: a dimension of measuring
positivity/negativity in speech and a dimension of measuring the arousal or
activity/passivity in speech. Thus, we obtain quite an intuitive mapping from discrete
emotions to measures in arousal and appraisal dimensions [8].

In our experiments, we used the arousal and appraisal dimensions to reduce the
classification problem to two sub-problems, each of which labels the speech with one of
two possible classes. In the first sub-problem, we consider two classes:
neutral/passively aroused speech and highly aroused speech (in our dataset, sad, calm,
neutral and disgust are mapped to the first class, and the other four labels,
surprised, happy, angry and fearful, are mapped to the second class). Likewise, in the
second sub-problem, we classify speech as either negatively appraised or
neutrally/positively appraised (in our dataset, angry, fearful, sad and disgust are
mapped to the first class, and the other four labels, calm, happy, surprised and
neutral, are mapped to the second class).
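
The two reductions can be written as a simple label mapping. The sketch below uses the eight label names of the dataset and assumes that the four labels not listed in the first class of each sub-problem form its second class; the zone names and the function `to_zones` are hypothetical.

```python
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

# First class of each sub-problem, as listed in the text above.
LOW_AROUSAL = {"sad", "calm", "neutral", "disgust"}
NEGATIVE = {"angry", "fearful", "sad", "disgust"}


def to_zones(emotion):
    """Map a discrete emotion to its (arousal, appraisal) binary classes."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    arousal = "low" if emotion in LOW_AROUSAL else "high"
    appraisal = "negative" if emotion in NEGATIVE else "non-negative"
    return arousal, appraisal


zones = {emotion: to_zones(emotion) for emotion in EMOTIONS}
print(zones)
```

Combining the two binary labels gives four zones, which corresponds to the combined activity and sentiment problem mentioned in the introduction.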

Instead of solving a discrete 8-class classification problem, we also tried to map
speech to a 2-dimensional continuous sentiment-arousal space describing both the
arousal and the sentiment of the speech. The horizontal axis of the sentiment-arousal
space indicates the polarity of the sentiment of the speech: -1 means that the emotion
is very negative and +1 means that the emotion is very positive. Similarly, the vertical
axis describes the degree of arousal of the speech. The axes of this space are the same
as in Fig 5.2.1. This way, the classification problem is reduced to a regression problem
where the goal is to find the coordinates of the observation in this space. The existing
discrete emotions were manually mapped to the sentiment-arousal space as follows:

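| Discrete emotion | Sentiment-arousal coordinates |
|------------------|-------------------------------|
| Neutral          | [0, 0]                        |
| Calm             | [0.25, -1]                    |
| Sad              | [-0.75, -0.5]                 |
| Happy            | [1, 0.75]                     |
| Angry            | [-0.75, 1]                    |
| Fear             | [-1, 0.25]                    |
| Disgust          | [-0.25, 0.25]                 |
| Surprise         | [0.25, 1]                     |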

This mapping was performed in accordance with Fig 5.2.1 above.
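
With this mapping, a regression output can be translated back to a discrete emotion by picking the closest labeled point. Decoding by Euclidean distance is an assumption made for this sketch and is not a step described in this paper; the coordinates are the ones listed above.

```python
import math

# (sentiment, arousal) coordinates of each discrete emotion, from the table above.
COORDS = {
    "neutral": (0.0, 0.0),
    "calm": (0.25, -1.0),
    "sad": (-0.75, -0.5),
    "happy": (1.0, 0.75),
    "angry": (-0.75, 1.0),
    "fear": (-1.0, 0.25),
    "disgust": (-0.25, 0.25),
    "surprise": (0.25, 1.0),
}


def nearest_emotion(sentiment, arousal):
    """Return the discrete emotion whose coordinates are closest to the point."""
    point = (sentiment, arousal)
    return min(COORDS, key=lambda emotion: math.dist(COORDS[emotion], point))


prediction_a = nearest_emotion(0.9, 0.7)
prediction_b = nearest_emotion(-0.7, -0.4)
print(prediction_a, prediction_b)
```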
5.3 Neural Network Architectures

Because of their predictive power and ability to work with unstructured data sources,
neural network architectures were considered for solving the prediction problems.
Section 5.3.1 describes the usage of convolutional neural networks and Section 5.3.2 is
devoted to long short-term memory units.
5.3.1 Convolutional Neural Networks

Two-dimensional convolutional neural networks (CNNs) were considered for the
prediction problems because they are among the more suitable architectures for working
with sequences of data. In contrast to a standard multilayer perceptron (MLP), a CNN
obtains feature maps by convolving kernels of parameters over small parts of the data
instead of taking linear combinations of all features. Another difference is that a
CNN shares weights: it convolves the same kernel over the whole input to obtain a
feature map, which allows it to capture spatial patterns in the data, whereas an MLP
uses a separate set of parameters for every output. To obtain more feature maps, more
kernels are used, rather than different kernels for different parts of the input
data. The process can be generalized as follows:
$$O_{i,j} = \sum_{m} \sum_{n} H[m, n] \cdot x[i - m, j - n]$$
where O is the output 2D feature map, H is the kernel of parameters and x is the
input sequence. Fig 5.3.1 below illustrates the process.

Fig 5.3.1
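
The convolution formula can be evaluated directly with NumPy. The sketch below is illustrative only: the function name `conv2d_full` is hypothetical, and treating input values outside the borders as zero ("full" convolution) is an assumption, since the text does not specify border handling.

```python
import numpy as np


def conv2d_full(x, H):
    """Evaluate O[i, j] = sum_m sum_n H[m, n] * x[i - m, j - n].

    Entries of x outside its borders are treated as zero (assumption).
    """
    x = np.asarray(x, dtype=float)
    H = np.asarray(H, dtype=float)
    x_rows, x_cols = x.shape
    h_rows, h_cols = H.shape
    O = np.zeros((x_rows + h_rows - 1, x_cols + h_cols - 1))
    for m in range(h_rows):
        for n in range(h_cols):
            # The same kernel entry H[m, n] multiplies every input position,
            # which is the weight sharing described in the text.
            O[m:m + x_rows, n:n + x_cols] += H[m, n] * x
    return O


x = np.array([[1.0, 2.0], [3.0, 4.0]])
H = np.array([[1.0, 0.0], [0.0, 1.0]])
feature_map = conv2d_full(x, H)
```

Each additional kernel H would produce one more feature map from the same input.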

Due to these characteristics, a CNN is suitable for learning from sequences of
features, as convolutional kernels can capture how the input variables change
throughout the sequence. Moreover, it can capture useful information from the
relation between neighbouring features at the same point of the sequence. Depending
on the type of prediction problem, different loss functions can be used with this
architecture.
5.3.2 Long Short Term Memory Networks
Before describing the purpose and algorithm of Long Short-Term Memory networks
(LSTMs), let us elaborate on the networks they are derived from: Recurrent Neural
Networks (RNNs). Intuitively, RNNs mimic the way the human brain processes temporal
input by keeping a memory of previously encountered data values while processing
the input at each time point. That is, the output of an RNN at the current step is
passed as an input to the network at the next step.

Fig 5.3.2

Fig 5.3.2 illustrates one step of an RNN cell computation. $x_t$ is the current
observation in the time series data and $h_{t-1}$ is the output of the previous cell.
Both vectors are passed as input to the current cell, where they are concatenated and
passed through a tanh layer, and the result is output to the next cell. The process
repeats recursively until the last time point is reached.
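
The RNN step described above can be sketched in a few lines of NumPy. The weight matrix `W`, bias `b` and the layer sizes are placeholders chosen for illustration, not values used in this work.

```python
import numpy as np


def rnn_step(x_t, h_prev, W, b):
    """One RNN step: concatenate [h_prev, x_t] and pass it through tanh."""
    z = np.concatenate([h_prev, x_t])
    return np.tanh(W @ z + b)


def run_rnn(xs, h0, W, b):
    """Feed each time point in turn, passing the output on to the next step."""
    h = h0
    for x_t in xs:
        h = rnn_step(x_t, h, W, b)
    return h


rng = np.random.default_rng(0)
hidden, n_features = 4, 3                      # illustrative sizes
W = rng.normal(scale=0.1, size=(hidden, hidden + n_features))
b = np.zeros(hidden)
xs = rng.normal(size=(5, n_features))          # 5 time points
h_last = run_rnn(xs, np.zeros(hidden), W, b)
```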

Although RNNs take previously encountered data into account at each iteration, which
gives good results in some cases, the architecture has major shortcomings. The main
disadvantage of RNNs is known as the vanishing/exploding gradient problem: during
backpropagation, due to the recursive nature of such networks, the gradient is
multiplied by a similar factor at every time step, so if the sequence is long enough
the gradients either grow or shrink drastically. This problem was discussed in [20],
where the authors presented Long Short-Term Memory (LSTM) networks as a solution.
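
A toy calculation illustrates the vanishing/exploding gradient effect. It assumes a scalar recurrence in which backpropagation multiplies the gradient by the same factor at every time step; real networks multiply by Jacobian matrices, so this is only a simplified picture.

```python
steps = 100  # number of time points the gradient is propagated through

# A per-step factor slightly below 1 makes the gradient vanish,
# a factor slightly above 1 makes it explode.
shrinking = 0.9 ** steps
growing = 1.1 ** steps

print(f"0.9^{steps} = {shrinking:.2e}")
print(f"1.1^{steps} = {growing:.2e}")
```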


Fig 5.3.3

Fig 5.3.3 shows the architecture of an LSTM cell. The key difference between LSTM and
RNN architecture is that apart from the general output of the cell, an additional vector
called the cell state is passed to the next cell as an input.

Fig 5.3.4

Fig 5.3.4 highlights the portion of the cell architecture that corresponds to the cell
state. As shown, the cell state is only slightly modified in each iteration, which
helps to mitigate the vanishing/exploding gradient problem, and long-term memory is
preserved during computations as a result.

Now, let us elaborate on the steps of computations performed in a single LSTM iteration.

1. Firstly, based on the current time point vector and the output of the previous cell, the
LSTM cell decides what information to preserve/discard from the cell state.

Fig 5.3.5

As shown in Fig 5.3.5, we compute $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$, where
$[h_{t-1}, x_t]$ is the concatenation of $h_{t-1}$, the output of the previous cell, and
$x_t$, the data at the current time point. $f_t$ is then used to decide what
information to preserve from the cell state.

Also note that throughout the computations of an LSTM cell iteration, the sigmoid
function is used whenever the importance of a piece of information is decided, which
leads to the preservation or discarding of that information. The reason is that the
output of the sigmoid function is between 0 and 1, and, during the training phase of
the model, unimportant information is pushed towards 0 while important information
gets values closer to 1.
2. Secondly, based on the current time point vector and the output of the previous cell,
the LSTM cell decides what new information to add to the cell state.

Fig 5.3.6

As depicted in Fig 5.3.6 above, we compute $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ and
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ (the input variables are the same as in
the first step). These values are used in the next step for updating the value of the
cell state.
3. Thirdly, using the values calculated in the first two steps, we update the LSTM cell
state.

Fig 5.3.7

As shown in Fig 5.3.7, we compute the new cell state as
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$, where * is the pointwise multiplication
operation. Thus, the cell state is updated by deciding what information to preserve
($f_t * C_{t-1}$) and what new information to add ($i_t * \tilde{C}_t$) to it.
4. The final step in LSTM cell iteration is to compute the cell output, which is
accomplished by deciding which part of the updated cell state should be outputted
from the cell.

Fig 5.3.8

As depicted in Fig 5.3.8, we compute $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ for
deciding what information to preserve and what to discard from the updated cell state,
and compute the final output $h_t = o_t * \tanh(C_t)$ [21].
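
The four steps above can be combined into a single LSTM cell iteration. The following NumPy sketch follows the equations term by term. The parameter dictionary layout and the sizes are assumptions made for illustration, and the all-zero parameters are used only to make the arithmetic easy to check by hand.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM cell iteration; p maps names such as 'W_f', 'b_f' to arrays."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # step 1: what to keep
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # step 2: what to add
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])   # step 2: candidate values
    C_t = f_t * C_prev + i_t * C_tilde           # step 3: new cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # step 4: what to output
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t


hidden, n_features = 2, 3                        # illustrative sizes
zero_params = {f"W_{g}": np.zeros((hidden, hidden + n_features)) for g in "fiCo"}
zero_params.update({f"b_{g}": np.zeros(hidden) for g in "fiCo"})

h_t, C_t = lstm_step(
    np.ones(n_features), np.zeros(hidden), np.array([1.0, -2.0]), zero_params
)
```

With all parameters at zero every gate equals 0.5 and the candidate is 0, so the cell state is simply halved; trained weights would instead decide per component how much of the old state survives.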

For our problem, we conducted several experiments with LSTM networks because,
intuitively, they are a good fit for our purposes. Voice recordings are represented as
time series of air pressure (amplitude) values, to which we apply several signal
processing algorithms such as STFT, wavelet transform and MFCC computation. The
features extracted from the voice are therefore themselves time series of energy
coefficient values, and LSTM networks can learn emotion-relevant patterns by
memorizing which energy coefficients they encountered before (what they "heard") while
taking the coefficients of the next time point as input.
6. Experiments
6.1 Exploratory Data Analysis
6.1.1 Analysing Frequency Spectrum

To analyse the overall distribution of frequency intensities in the voice recordings,
we grouped those intensities into bins of width 256 Hz. This means that intensities
in the range 0-256 Hz were summed for the 1st bin, 256-512 Hz for the 2nd bin, etc.
We did not consider frequencies above 8192 Hz, as human speech intelligibility is
mostly covered by frequencies of up to 8 kHz [14]. We calculated the mean intensity in
each bin, grouped by emotion. Min-max normalization was also performed to show
relative frequency intensities. The visualisation of the distribution of mean
intensities per emotion reveals that the frequency spectrum differs between speeches
of different emotions. Voices with neutral, calm and sad emotion have more frequency
intensity concentrated in the lower part of the spectrum. In contrast, fear, anger,
happiness and surprise show higher frequency intensities overall, and intensity in the
high frequency range is also present in those emotions. The distributions are
visualised below in Fig 6.1.1:


Fig 6.1.1
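
The binning procedure can be sketched as follows. This is an illustration under stated assumptions: "intensity" is taken to be the FFT magnitude of the whole signal, and the function names `binned_spectrum` and `min_max` as well as the 22050 Hz sampling rate of the synthetic tone are hypothetical.

```python
import numpy as np


def binned_spectrum(signal, sr, bin_hz=256, max_hz=8192):
    """Sum FFT magnitudes into bins of width bin_hz, up to max_hz."""
    mags = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    n_bins = max_hz // bin_hz
    keep = freqs < max_hz                       # ignore frequencies above max_hz
    idx = (freqs[keep] // bin_hz).astype(int)   # bin index of each frequency
    return np.bincount(idx, weights=mags[keep], minlength=n_bins)


def min_max(v):
    """Min-max normalization to the range [0, 1]."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)


sr = 22050                                      # assumed sampling rate
t = np.arange(sr) / sr                          # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)             # synthetic 1000 Hz tone
bins = min_max(binned_spectrum(tone, sr))
```

Calling `binned_spectrum(tone, sr, bin_hz=64)` instead yields 128 values, one per 64 Hz bin.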

To perform classification using random forests and CN2 rule induction, we grouped
frequencies into bins of width 64 Hz and used each bin as a feature. Thus, we used
128 features, with 8128-8192 Hz being the bin with the highest frequency range. We
set aside all voice recordings of the 1st and 2nd actors of the RAVDESS dataset, one
male and one female, to obtain a test set of 120 samples. The rest were used for
training. The results of classification using tree-based models and CN2 rule induction
were very poor; however, analysing some of the results still provides insights.
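
An actor-based split of this kind can be sketched as below. It assumes the RAVDESS file naming scheme, in which each file name consists of seven hyphen-separated two-digit fields and the last field is the actor number. The helper names and the example file names are illustrative.

```python
def actor_id(filename):
    """Extract the actor number from a RAVDESS-style file name."""
    stem = filename.rsplit("/", 1)[-1].split(".")[0]
    return int(stem.split("-")[-1])


def split_by_actor(filenames, test_actors=(1, 2)):
    """Put every recording of the test actors into the test set."""
    held_out = set(test_actors)
    train = [f for f in filenames if actor_id(f) not in held_out]
    test = [f for f in filenames if actor_id(f) in held_out]
    return train, test


files = [
    "03-01-05-01-01-01-01.wav",   # actor 1
    "03-01-03-02-02-01-02.wav",   # actor 2
    "03-01-08-01-01-02-17.wav",   # actor 17
]
train_files, test_files = split_by_actor(files)
```

Splitting by actor rather than at random ensures that no voice in the test set was heard during training.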

By looking at some of the top rules extracted by the CN2 induction algorithm we can
try to get a description of the frequency spectrum of speeches with different
emotions.

IF 6528Hz - 6592Hz ≥ 3.9886746406555176 AND 8064Hz - 8128Hz ≥ 3.8793866634368896 AND
128Hz - 192Hz ≥ 0.0574270561337471 → Angry

IF 7744Hz - 7808Hz ≤ 0.04547011852264404 AND 448Hz - 512Hz ≥ 1.8838613033294678 AND
64Hz - 128Hz ≥ 0.13342395424842834 → Sad

IF 1600Hz - 1664Hz ≤ 0.1066826581954956 AND 320Hz - 384Hz ≥ 2.2769432067871094
→ Neutral

IF 1088Hz - 1152Hz ≥ 33.377315521240234 AND 3840Hz - 3904Hz ≥ 2.9604032039642334 →
Happy

IF 768Hz - 832Hz ≥ 7.9639387130737305 AND 192Hz - 256Hz ≤ 0.10243330895900726 AND
5376Hz - 5440Hz ≥ 0.3536578118801117 AND 704Hz - 768Hz ≥ 5.466808319091797
→ Surprised

The CN2 induction algorithm distinguishes between active and passive emotions. It
assigns one of the more passive emotion labels when the intensity of high-range
frequencies is low. The opposite also holds for active emotions: when the energy in
high frequency ranges is high, the classification results in one of the more active
emotions. These results are in accordance with Fig 6.1.1, which describes the
differences in frequency intensities across the 8 mood categories.

Looking at the confusion matrix of the random forest classifier on the test dataset,
shown in Fig 6.1.2 below, we can notice that the classifier confuses emotions within
the active group and within the passive group. For example, it confuses anger with
disgust and surprise, and happiness with disgust and surprise. The same trend exists
within passive emotions.


Fig 6.1.2

The results mean that the frequency domain alone is not descriptive enough to classify
voice recordings by emotion. By transforming the whole time domain signal into the
frequency domain, we lose information on how the frequency spectrum changes during
the speech: we obtain a frequency spectrum representation of the whole signal but
cannot observe how it evolves over time. However, the frequency domain is descriptive
enough to capture the intensity of a voice recording in terms of whether the speaker
expresses an active or a passive emotion. Moreover, we obtained some characteristics
of the frequency domain related to the emotion in the speech.
6.1.2 Analysing Cepstral Peaks

For exploratory analysis, fundamental frequency related features were also considered.
Following section 4.5, we applied cepstral analysis to audio data from the RAVDESS
dataset and obtained the peak amplitude of each cepstral signal. The cepstral peaks
were grouped by the mood of the recordings, and Fig 6.1.3 illustrates the result:

Fig 6.1.3

Although the overlap of cepstral peak values between recordings of different moods is
high, which makes training statistical models on the obtained data less feasible, an
interesting pattern emerges when the arousal level of the recordings is considered.
For neutrally/passively aroused recordings (neutral, calm, sad and disgust), the
cepstral peak values are higher than those of highly aroused recordings (happy, angry,
fearful, surprised).

Fig 6.1.4

In Fig 6.1.4 above, recordings from the TESS dataset were added to the analysis as
well. It can be noticed that the pattern of passively aroused recordings having on
average lower cepstral peaks is still present.
6.1.3 MFCC Mean and Variance

The final experiment used statistical models to classify the arousal level of a
speech from the mean and variance of each MFCC feature. This removes the time
dimension, which statistical models cannot handle directly. For each recording, we
form a vector of the means and variances of each MFCC coefficient.

We used an SVM as our statistical model and RAVDESS as the dataset. The dataset was
split into 10% test and 90% train. The results of classifying the arousal level of
recordings were modest: 78% train accuracy and 79% test accuracy.
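
The feature vector of means and variances can be sketched as below. The MFCC matrix is assumed to have shape (number of coefficients, number of frames), and random numbers stand in for real MFCCs; the function name `mfcc_stats` is hypothetical.

```python
import numpy as np


def mfcc_stats(mfcc):
    """Collapse an (n_mfcc, n_frames) matrix into means followed by variances."""
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])


rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(40, 160))   # placeholder for a real MFCC matrix
features = mfcc_stats(fake_mfcc)         # fixed-length vector of 80 values
```

Because the vector has the same length for every recording regardless of duration, it can be passed directly to a classifier such as an SVM.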
6.1.4 Component-Activation Analysis

As mentioned in section 4.4, decomposing a spectrum time series matrix into component
and activation matrices is a good way of acquiring spectral patterns in the signal.
In our experiments, we decomposed MFCC feature matrices into component and activation
matrices, where the number of components was set to 1 to acquire the spectral
patterns of the speaker (as there was no other sound source in the recordings). Such
a decomposition also removes the time dimension when only the component features are
used, which makes classification with statistical models easier.

In this experiment, we also applied the technique of speaker normalization, as
described in [21], which aims to compensate for the variations in speech caused by
speaker diversity.

The classification approach in this experiment was the reduced binary classification
approach described in section 5.2.2. An SVM was used as the statistical model for
solving two classification problems: classifying the arousal level of the speech, and
classifying the positivity/negativity of the speech. In one case, we also tested the
model on a dataset whose actors were never encountered in the training phase. This
way we wanted to test how well the model captures the patterns of novel voices.

The results of this experiment are summarized in the table below.

Speaker      Classification   Train accuracy   Test accuracy   New actor
normalized   type             (%)              (%)             accuracy (%)
NO           Arousal          98               70              -
YES          Arousal          87               93              67
YES          Positivity       78               61              -

It can be noticed that classifying positivity is a bigger challenge for the model than
classifying the arousal level of the speech. This can be explained by the fact that
MFCCs (and many other signal features) represent the amount of energy in the signal in
specific frequency or cepstral ranges, and, intuitively, larger amounts of energy
correspond to a higher arousal level. However, both negative and positive emotions can
correspond to a high arousal level (e.g. surprised and angry), and it is harder to
tell how energy features can reveal the positivity of a given speech. This problem is
also present in the experiments described later in the paper.
6.2 Classification Using Discrete 8 Emotions
6.2.1 CNN model on MFCCs

A CNN classifier was trained on MFCC sequences. Different window and overlap sizes
were tried in the calculation process described in Section 4.2. Eventually, a window
size of 4096 samples and an overlap of ¾ between subsequent windows were chosen, as on
average these settings produced the best results; halving the window size degraded
the performance of the network. A window size of 4096 is suitable because it allows
an FFT of length 4096 to be computed on each window to capture the frequency spectrum
of up to 4 kHz. This means that the majority of human speech in those recordings is
captured in each window. After calculating MFCCs for every recording and padding the
sequences shorter than the longest one, we obtain input matrices to our network of
size (40 x 160), where at each sequence point we have 40 MFCCs.
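
With a window of 4096 samples and an overlap of ¾, consecutive windows start 1024 samples apart. The sketch below shows this arithmetic and one way of padding MFCC sequences to a common length. Right-padding with zeros is an assumption, as the text does not state the padding value or side, and the helper names are hypothetical.

```python
import numpy as np


def hop_length(window, overlap):
    """Number of samples between the starts of consecutive windows."""
    return int(window * (1 - overlap))


def pad_sequences(seqs, value=0.0):
    """Right-pad (n_features, n_frames) matrices to the longest sequence."""
    n_features = seqs[0].shape[0]
    max_len = max(s.shape[1] for s in seqs)
    out = np.full((len(seqs), n_features, max_len), value)
    for k, s in enumerate(seqs):
        out[k, :, : s.shape[1]] = s
    return out


hop = hop_length(4096, 0.75)
# Two placeholder MFCC matrices with 120 and 160 frames.
batch = pad_sequences([np.ones((40, 120)), np.ones((40, 160))])
```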

The architecture of the CNN used is summarised as follows:
Fig 6.2.1

There are 3 convolutional layers in the network, each followed by an average pooling
layer of size (2x2). The last layer is a fully connected layer that maps the output of
the convolutional layers to a vector of length 8. Log softmax activation is applied so
that cross entropy loss can be used. Each layer has 32 kernels of parameters. The
first layer has kernels of size (10x3), deliberately chosen to be tall and narrow to
capture features from the change of MFCCs through the sequence. Between layers, the
leaky ReLU activation function, given as h(x) = max(x, 0) + 0.01 * min(0, x), is used
both to enable fast training and to prevent neurons from dying. Leaky ReLU adds a
small slope for non-activated neurons, thus preventing them from becoming 0 and no
longer contributing to backpropagation in later epochs [18]. Since our dataset is very
small, we used dropout with high probability (p = 0.5) as well as L2 regularization to
prevent overfitting.
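
The leaky ReLU formula above translates directly into NumPy. This is a plain illustration of the formula rather than the implementation used in the network.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    """h(x) = max(x, 0) + slope * min(0, x)."""
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0) + slope * np.minimum(0, x)


activations = leaky_relu([-2.0, 0.0, 3.0])
```

Negative inputs are scaled by 0.01 instead of being set to zero, so their gradient never vanishes completely.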
All recordings of the 1st and 2nd actors (one male and one female) from the RAVDESS
database were used for testing, and the neural network did not see those samples
during training. All remaining recordings were used for training. We used the Adam
optimizer with a learning rate of 0.00005 and L2 regularization with decay $10^{-4}$.
The final loss function becomes:
(1) $l(\hat{y}_i) = \log\left(\frac{\exp(\hat{y}_i)}{\sum_j \exp(\hat{y}_j)}\right)$
(2) $L(\hat{y}) = -\sum_i y_i \, l(\hat{y}_i) + \lambda \sum_{w \in W} w^2$
where W is the set of all network parameters
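
Equations (1) and (2) can be checked numerically with the sketch below. The function names are hypothetical, the labels are assumed to be one-hot encoded, and subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import numpy as np


def log_softmax(logits):
    """Equation (1): log of the softmax of the network outputs."""
    shifted = logits - logits.max()          # stability only, same result
    return shifted - np.log(np.exp(shifted).sum())


def regularized_cross_entropy(logits, y_onehot, weights, lam=1e-4):
    """Equation (2): cross entropy plus lam times the sum of squared weights."""
    data_term = -np.sum(y_onehot * log_softmax(logits))
    penalty = lam * sum(np.sum(w ** 2) for w in weights)
    return data_term + penalty


logits = np.zeros(8)                         # an undecided 8-class prediction
y = np.eye(8)[3]                             # true class is index 3
loss_plain = regularized_cross_entropy(logits, y, weights=[])
loss_reg = regularized_cross_entropy(logits, y, weights=[np.ones((2, 2))])
```

For a network that assigns equal scores to all 8 classes, the data term equals log(8), which is a useful reference value when reading training curves.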
The network was trained multiple times with these settings and, on average, it
achieved 67.5% accuracy on the test dataset and 96% accuracy on the training dataset.
The average ROC AUC over all classes was 0.927. ROC curves for all classes are shown
below:

Fig 6.2.2
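
For reference, a per-class ROC AUC and its average over classes can be computed as sketched below. Treating each class one-vs-rest and averaging the per-class values with equal weight is an assumption about how the reported average was obtained. The pairwise formulation is chosen for clarity, not speed.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]       # every positive/negative pair
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def macro_auc(labels, score_matrix):
    """Average of one-vs-rest AUCs; every class must occur in labels."""
    labels = np.asarray(labels)
    n_classes = score_matrix.shape[1]
    return float(
        np.mean([roc_auc(labels == c, score_matrix[:, c]) for c in range(n_classes)])
    )


example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```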

As we can see in Fig 6.2.2, the network captures some emotions more easily than
others. For instance, Neutral, Calm, Angry and Surprise were captured better than the
rest. The ROC AUC metric also suggests that the model is meaningful and does not
guess labels at random. Overall, these results can be considered only fair, as there
is a lack of generalization. The test dataset contained voices of people the network
did not observe during training, but it did observe the phrases spoken in the dataset.
In essence, if we use this network on different voices speaking arbitrary phrases, the
results will be worse.

6.2.2 LSTM model on MFCCs

An LSTM neural network was trained on sequences of MFCCs for 8-class emotion
classification. The window length and the overlap between subsequent windows for the
MFCC calculation described in Section 4.2 were taken from the presets that come with
librosa: the window length was set to 2048 and the overlap was ¾ between windows.
After calculation and padding of short sequences to match the longest sequence, we
obtained MFCC sequences of size (40 x 228). The architecture of the LSTM used is
summarised below:
Fig 6.2.3

MFCC sequences are fed into an LSTM recurrent layer with a hidden dimension of size
1024. There are 2 LSTM layers stacked on top of each other meaning that the outputs
of the first layer are processed by the second one. This increases the perceptiveness of
the network towards the features present in the sequence. Due to the dataset being
small, we used dropout with high probability (p = 0.5) on the outputs of the first LSTM
unit. The output

The network was trained using only the RAVDESS dataset. The recordings of the 1st and
2nd actors (one male and one female) were used as the test set, and the rest of the
recordings were used for training. The Adam optimizer with a learning rate of 0.0005
was used, and the loss function to minimize was the cross entropy loss given by:
(1) $l(\hat{y}_i) = \log\left(\frac{\exp(\hat{y}_i)}{\sum_j \exp(\hat{y}_j)}\right)$
(2) $L(\hat{y}) = -\sum_i y_i \, l(\hat{y}_i)$
The network achieved 65% test accuracy and 93.58% train accuracy. Clearly, the
model overfitted the provided data and gave poor performance on the voices of actors
it has never encountered during training. The ROC curves of the model are depicted
below.

Fig 6.2.4

By observing the ROC curves in Fig 6.2.4, we can notice that the model faced the same
difficulties as described in section 6.2.1, although it classifies some of the moods
better. To explore the reasons behind the weak performance of the network in
classifying 8 discrete emotions, beyond the possible reasons mentioned before, we
decided to reduce the problem to classification in the sentiment-arousal space
(section 5.2.2), which is explained in the upcoming sections.
6.3 Classification Using Sentiment-Arousal Space

In this part of our experiments, taking into account the sentiment-arousal space
depicted in Fig 5.2.1, we define more abstract classes of speeches based on their
emotion in three ways. The first way defines 4 classes of speech, which are the
quadrants of the sentiment-arousal space. The other two ways are based either on the
arousal level or on the positivity of the speech, and both lead to a binary
classification approach as described in 5.2.2.
6.3.1 LSTM Models on MFCCs for Sentiment-Arousal Space Zones

First, we conducted the experiment with LSTMs and the extracted MFCC features.
Only the first 40 cepstral coefficients were considered. The datasets used were
RAVDESS and TESS (in some scenarios, only RAVDESS was considered). There were 4
scenarios of splitting the dataset into train and test subsets: 10% test and 90%
train (standard); all the recordings of the first 2 actors as the test dataset and the
rest as train; all the recordings of the first 3 actors as the test dataset and the
rest as train; and all the recordings of the first 4 actors as the test dataset and
the rest as train.
The same LSTM architecture was used for all scenarios and is visualised in the figure
below. A dropout layer with probability p=0.3 was used.
Fig 6.3.1

The results of the experiment are summarized in the table below.

Zones             Data separation   Datasets used   Train acc. (%)   Test acc. (%)   AUC
4 zones           standard          RAVDESS         96.30            67.36           -
4 zones           2 new actors      RAVDESS         96.13            74.16           -
Arousal zones     standard          RAVDESS         97.76            87.14           0.91
Arousal zones     2 new actors      RAVDESS         99.54            90.83           0.94
Arousal zones     3 new actors      RAVDESS+TESS    94.23            86.11           0.91
Arousal zones     4 new actors      RAVDESS+TESS    98.40            83.33           0.87
Arousal zones     2 new actors      RAVDESS+TESS    95.63            93.30           0.97
Positivity zones  standard          RAVDESS         98.30            80.00           0.81
Positivity zones  2 new actors      RAVDESS+TESS    96.89            84.14           0.91
Positivity zones  3 new actors      RAVDESS+TESS    93.69            79.44           0.84

The ROC curves for each binary classification scenario are depicted below.


Fig 6.3.2

Firstly, the same LSTM architecture gave the best performance for all classification
scenarios, which indicates that the architecture is a good fit considering the datasets
available.

Secondly, the results indicate that the harder challenge for the model was to classify the
positivity, but its performance was good in classifying the arousal level of the speech. This
problem was also encountered in section 6.1.4, where the intuition behind the problem is
described as well.
6.3.2 CNN Model on Wavelet Transform for Sentiment-Arousal Space Zones
We experimented with solving the problems of arousal level classification and
positivity classification using CNNs on heatmap images resulting from wavelet
transforms. Only the RAVDESS dataset was considered in this experiment, and it was
split into 10% test and 90% train in both classification problems. The CNN
architectures used are visualised below. Fig 6.3.3 illustrates the architecture of the
CNN used to classify the sentiment in the speech and Fig 6.3.4 shows the architecture
for arousal classification.

Fig 6.3.3

Fig 6.3.4

The difference between the two architectures is that the network for classifying
sentiment has one more convolutional layer. Dropout with p=0.4 was used between
convolutional layers to prevent overfitting. Leaky ReLU was used as the activation
function between layers; its benefit is described in section 6.2.1.
The results of the experiment are summarized in the table and figure below.

Zones             Train accuracy (%)   Test accuracy (%)   AUC
Arousal zones     98.70                83.76               0.84
Positivity zones  87.76                75.71               0.77

Fig 6.3.5

Again, the models encounter the difficulty of classifying the positivity of the speeches,
as described in the prior sections.
6.4 Mapping Emotion to Continuous Sentiment Arousal Space

A CNN was used to map the emotion in the speech recordings to the continuous
sentiment-arousal space defined in Section 5.2.2. The training process and the
architecture of the network are almost the same as described in Section 6.2.1. The
difference is the last layer, which now maps the output of the convolutional layers to
a 2-dimensional vector instead of an 8-dimensional one. The architecture is depicted
in Fig 6.4.1 below.


Fig 6.4.1

This effectively allows us to turn the classification problem into a vector-regression
problem. So, the loss function is also changed to be mean squared error loss instead of
cross entropy loss.
$$L(\hat{y}) = \frac{1}{2} (y - \hat{y})^T (y - \hat{y}) + \lambda \sum_{w \in W} w^2$$
The result after multiple training runs was an MSE loss of around 0.21. One
interpretation of this loss is that the coordinates that the network outputs are at a
Euclidean distance of ~0.6 from their correct positions in the continuous
sentiment-arousal space. But this interpretation might be misleading, because the
network might correctly predict that a speech is both aroused and negative but fail to
find the exact degree of negativity and arousal. For this reason, we constructed some
visualizations of the results in sentiment-arousal space to gain a better
understanding of the results.
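
The relation between the loss value and the Euclidean distance can be made explicit. The sketch below assumes that the regularization term is negligible and that the reported loss is the half squared error of a typical sample, so that the distance d satisfies d²/2 = loss. Since the loss is averaged over samples, the result is strictly a root-mean-square distance. The function names are hypothetical.

```python
import numpy as np


def half_squared_error(y, y_hat):
    """Data term of the loss: 0.5 * (y - y_hat)^T (y - y_hat)."""
    d = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return 0.5 * float(d @ d)


def distance_from_loss(loss):
    """Invert loss = d^2 / 2 to get the Euclidean distance d."""
    return float(np.sqrt(2.0 * loss))


typical_distance = distance_from_loss(0.21)   # sqrt(0.42), roughly 0.65
```

For comparison, the axes of the sentiment-arousal space span the range from -1 to +1, so an error of this size is about a third of one axis.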

Fig 6.4.2

Fig 6.4.2 shows that overall the performance was good. In the majority of cases the
network correctly identifies both the sentiment and the arousal of the speech. It
rarely fails to identify both components, and it can at least identify the arousal of
a speech. One of the shortcomings we see is that a significant number of speeches
labeled happy were identified as negative by the network. In contrast, fear, disgust,
anger and sadness were almost perfectly positioned in the new plane. This again shows
that the network struggles to determine the sentiment but is good at differentiating
between active and passive emotions. Solving this regression problem is more flexible
than solving the discrete emotion classification problem. From the perspective of
optimization, it is easier to find optimal solutions since the network has to
identify only 2 components of the speech. Moreover, the approach is more flexible in
terms of categorizing speech, as one can define a different continuous space and thus
obtain a solution to a completely different problem.
7. Conclusion and Future Work
7.1 Retrospective and Future Work

Based on the experiments explored above, it can be stated that among all the signal
processing techniques and extracted features that were used, MFCCs are the most
efficient in solving speech emotion prediction problems. Even with the limited dataset,
deep learning models were the most efficient in the conducted experiments.

One of the main difficulties that we faced was the limitation of the datasets. For our
problem, a diverse dataset can be thought of as having two properties: a diversity of
phrases and a diversity of actors/voice sources. Thus, a considerably larger and more
diverse dataset is required for the trained models to generalize better and to avoid
overfitting. Part of this lack of diversity can be addressed with voice data
augmentation techniques. However, those techniques should be applied very carefully
in order to avoid changing mood-related features present in the dataset (for example,
the pitch of the voice), which can turn into a whole new research and evaluation
project.

As was shown in the experiments, considering the sentiment-arousal space and the
reduced form of the problem, the other main difficulty for the models was to learn the
positivity level in a speech, which, therefore, can also be considered a difficulty in the
discrete 8-class classification problem. On the other hand, the arousal prediction trials
were rather efficient.

Additionally, fundamental frequency analysis can further be explored for feature
extraction. It is possible to derive spectral representations of a voice signal based on
the fundamental frequency or the pitch of the signal on specific time intervals, since
pitch is a good indicator of frequential features and patterns present in a signal.
7.2 Systems and Applications

Emotion recognition from voice recordings can have a wide range of applications in
everyday life. For example, extracting emotion from the voice of users can be a basis
for building recommendation systems for movies or music. A workspace system could be
built for employers and managers to track the mood changes of employees over time.
Another area where voice emotion recognition can be used is security systems, such as
detecting a possibly harmful emotional state of a car driver and blocking an action.
Finally, voice emotion recognition models can be integrated into customer service
centers for analysing the moods of clients and using them as a tool for service
efficiency analysis.
References
[1] Steven R. Livingstone and Frank A. Russo, “RAVDESS Emotional speech audio.” Kaggle,
doi: 10.34740/KAGGLE/DSV/256618.
[2] SAVEE Database. Retrieved from http://kahlan.eps.surrey.ac.uk/savee/
[3] TESS Database. Retrieved from
https://tspace.library.utoronto.ca/handle/1807/24487
[4] Sahidullah, M., & Saha, G. (2012). Design, analysis and experimental evaluation of block
based transformation in MFCC computation for speaker recognition. Speech
Communication, 54(4), 543–565. doi:10.1016/j.specom.2011.11.004
[5] Mower, E., Mataric, M. J., & Narayanan, S. (2011). A Framework for Automatic Human
Emotion Classification Using Emotion Profiles. IEEE Transactions on Audio, Speech, and
Language Processing, 19(5), 1057-1070. doi:10.1109/tasl.2010.2076804
[6] Glüge, S., Böck, R., & Ott, T. (2017). Emotion Recognition from Speech using
Representation Learning in Extreme Learning Machines. Proceedings of the 9th
International Joint Conference on Computational Intelligence.
doi:10.5220/0006485401790185
[7] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S.
Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal
of Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, December 2008.
[8] Wieczorkowska, A., Synak, P., Lewis, R., Ras, Z. W., Extracting Emotions from
Music Data, Proceedings of 15th International Symposium, ISMIS 2005, Saratoga
Springs, NY, USA, May 25-28, 2005, pp. 456-465
[9] Librosa sound processing library. https://librosa.github.io/librosa/
[10] Frequency Domain and Fourier Transforms. Retrieved from
https://www.princeton.edu/~cuff/ele201/kulkarni_text/frequency.pdf
[11] A. Kulkarni, M. F. Qureshi, M. Jha. Discrete Fourier Transform: Approach to Signal
Processing.
[12] P. Clark, T. Niblett (1988) The CN2 induction algorithm, The Turing Institute
[13] R. Boswell (1998) Rule Induction with CN2: Some Recent Improvements
[14] Facts About Speech Intelligibility - DPA Microphones. (n.d.). Retrieved from
https://www.dpamicrophones.com/mic-university/facts-about-speech-intelligibility
[15] The continuous wavelet transform. (n.d.). The Illustrated Wavelet Transform
Handbook. doi:10.1887/0750306920/b833c2
[16] MULLER, M. (2016). FUNDAMENTALS OF MUSIC PROCESSING: Audio, analysis,
algorithms, applications. Place of publication not identified: SPRINGER, pages
415-568
[17] Gerhard, D. (2003). Pitch Extraction and Fundamental Frequency: History and
Current Techniques. Department of Computer Science University of Regina, 9-12.
[18] B. Xu, W. Naiyan, C. Tianqi, L. Mu (2017). Empirical Evaluation of Rectified
Activations in Convolutional Networks
[19] Awad, Mariette & Khanna, Rahul. (2015). Support Vector Machines for Classification.
10.1007/978-1-4302-5990-9_3.
[20] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural
Computation, 9(8), 1735-1780. doi:10.1162/neco.1997.9.8.1735
[21] Wu, S. (2009). Recognition of Human Emotion in Speech Using Modulation
Spectral Features and Support Vector Machines. Queen’s University, Department of
Electrical and Computer Engineering, 70-74.
