Narek Tumanyan
BS Capstone Project
American University of Armenia
Akian College of Science and Engineering
Yerevan, Armenia
2020
Abstract
This project addresses the problem of recognizing a speaker's mood from their voice. Statistical and exploratory analysis was performed on speech data, and statistical learning and deep learning models were trained on features obtained from diverse signal processing techniques. Various classification and regression approaches were experimented with, and the obtained results and identified difficulties were analysed.
1. Introduction
2. Related Work
3. Datasets
4. Feature Extraction
4.1 Frequency Decomposition
4.2 Mel Frequency Cepstral Coefficients (MFCCs)
4.3 Continuous Wavelet Transform (CWT)
4.4 Component-Activation Decomposition
4.5 Fundamental Frequency Features
5. Methodology
5.1 Statistical Analysis
5.2 Prediction Approaches
5.2.1 Discrete Emotion Classification
5.2.2 Emotional Dimensions, Continuous Sentiment Arousal Space
5.3 Neural Network Architectures
5.3.1 Convolutional Neural Networks
5.3.2 Long Short Term Memory Networks
6. Experiments
6.1 Exploratory Data Analysis
6.1.1 Analysing Frequency Spectrum
6.1.2 Analysing Cepstral Peaks
6.1.3 MFCC Mean and Variance
6.1.4 Component-Activation Analysis
6.2 Classification Using Discrete 8 Emotions
6.2.1 CNN model on MFCCs
6.2.2 LSTM model on MFCCs
6.3 Classification Using Sentiment-Arousal Space
6.3.1 LSTM Models on MFCCs for Sentiment-Arousal Space Zones
6.3.2 CNN Model on Wavelet Transform for Sentiment-Arousal Space Zones
6.4 Mapping Emotion to Continuous Sentiment Arousal Space
7. Conclusion and Future Work
References
1. Introduction
Existing approaches to determining speaker emotion make use of spectral features of the
voice signal. In particular, mel frequency cepstral coefficients are found to be popular
amongst approaches towards solving this problem. The popularity of this feature stems
from speech recognition systems which use this feature with success. Moreover, due to
their well-known predictive power, machine learning algorithms are widely used to solve
the emotion recognition problem. However, there are several hindrances towards
providing generalizable solutions. There is a lack of labeled speech datasets for speaker
emotion recognition. This is a problem as there is a large variability of human speech
which stems from different languages, pronunciation and difference in human voice.
Hence, it is hard for ML algorithms to generally classify emotions in speech based on
small datasets. Also, there is a question of how good mel frequency cepstral coefficients
describe emotions in human speech. To overcome these issues, some approaches also use
other features in combination with spectral features of the speech. For instance, facial
features are used as descriptive features of human emotion.
This paper tackles the speaker emotion recognition problem from several viewpoints. The experiments are as follows: we used statistical analysis methods on spectral features of the human voice to classify emotion in speech and to identify which characteristics determine the emotion in the speech. We also experimented with methods such as the continuous wavelet transform, component-activation decomposition and fundamental frequency features in order to compare their performance against the widely used mel frequency cepstral coefficients. We used several neural network architectures to solve prediction problems due to their high predictive capacity. Lastly, we explored some potential ways of reducing the complexity of the speaker emotion recognition problem; reduced problems yield better results and can be used instead of the original problem depending on the application. Namely, we experimented with reducing the discrete emotion classification problem to classifying activity, binary sentiment, and the combination of both in the speech. The work on feature extraction, methods and results is further delineated in sections 4, 5 and 6.
In the end, there is also a retrospective on the experiments and an explanation of some potential applications of an emotion classification algorithm.
2. Related Work
There have been previous approaches to classifying emotion of the speech using machine
learning models. Mostly, the approaches to the classification problem were similar to each
other. The steps can be generalized in the following way:
1. Extraction of waveforms of voices from databases. Each voice recording in the
database is being sampled with a fixed sampling rate. This step results in a floating
point time series which describes the amplitude of the recording through the time
axis.
2. Conversion of time domain representation of the recording to the frequency domain.
This step is accomplished using a fast fourier transform algorithm. This results in a
sequence which describes the presence of different frequency ranges in original time
series.
3. Extraction of Mel Frequency Cepstral Coefficients (MFCCs) from frequency
representation of each voice recording. MFCCs are the most commonly used features
for classification problems related to speech.
4. Training a classifier with extracted features to classify the mood of the recording. Most
of the approaches tried to classify discrete emotions or to classify the relative intensity
of the speech which also implies emotion.
Results varied depending on the chosen architectures and databases. What is observed
though commonly is the limitations. Mostly, limitations are related to databases, there is a
lack of number of recordings and number of actors who perform those recordings. These
commonly lead to classifiers learning speaker specific information during classification
which prevents generalization to new speakers or new phrases.
One paper uses the SVM classification algorithm to classify natural human emotion expressions into 5 categories: angry, happy, neutral, sad, or excited, using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [7]. It is distinguished in that it also considers the facial expression of the speaker during speech when classifying it into one of the 5 categories [5]. Another paper examines the Deep Neural Network Extreme Learning method, which is efficient for small datasets; it also utilizes the IEMOCAP database [6]. These approaches preprocessed the recordings and removed parts of speech that corresponded to silence or noise.
3. Datasets
This paper focuses on classifying 8 emotions from voice recordings. The databases used in
the paper are the following: Ryerson Audio-Visual Database of Emotional Speech and
Song (RAVDESS) [1], Surrey Audio-Visual Expressed Emotion (SAVEE) [2] and Toronto
Emotional Speech Set (TESS) [3] Generally, our data are voice recordings from different
actors who pronounce some statements and exert certain emotion for each recording.
Voice recordings in the databases come in a .wav f ormat which describes the amplitude of
air pressure oscillations in the time domain. Each voice recording has an emotion label
attached to it. The RAVDESS database has 24 actors that pronounce 2 phrases: “Kids are
talking by the door” and “Dogs are sitting by the door” with 2 intensities: Normal and High
each repeated twice. Neutral emotion has no high intensity so it is only repeated twice.
The emotion labels are: (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 =
fearful, 07 = disgust, 08 = surprised). TESS dataset has 2 actors, young and old and both of
them are female. There are 2800 voices in total with each phrase being of the form “Say
the word x” where x stands for some word. Recordings in the TESS dataset have the same
labeled emotions as in RAVDESS except the calm label which is not present in this dataset.
SAVEE dataset has 4 English male actors with 480 voice recordings. 7 emotions are
present with the calm emotion missing. In total, there are 4720 samples. The distribution
The distribution of samples, actors and emotion classes is summarised below:

Dataset   Samples  Actors  Emotions
RAVDESS   1440     24      8
SAVEE     480      4       7
TESS      2800     2       7
4. Feature Extraction
To extract audio features from voice recordings we used librosa library for python [9]. It
handles most of the transformations done to voice recordings to get final features used
for classification. The first step before extracting features is to resample voice recording
files to obtain their time domain and amplitude representation. Voice recordings from our
databases have different original sampling rates which range from 22Khz to 48Khz.
However, the content that we are trying to analyze from those recordings are the human
voices themselves. Normally, human voice ranges from low range frequencies 300Hz to
higher ranges 4 - 10Khz. This means that we can use lower sampling rates to resample our
voice recording. We chose 22.05Khz sampling rate which preserves all human voices in
original audio recordings and also preserves some possible frequency deviations from
normal range which can be caused by pronouncing high frequency tones such as
fricatives. The result is a floating point time series describing the amplitude of air
pressure oscillations from mean frequency of 0 at each time point. Thus, we obtain a
time-domain representation of the signal. An example is illustrated below in Fig 4.1.1
Fig 4.1.1
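As a minimal sketch of this loading and resampling step with librosa [9] (the file path and variable names are illustrative):

import librosa

# Load a recording and resample it to 22.05 kHz. `y` is the floating point
# time series of amplitudes, `sr` is the sampling rate actually used.
y, sr = librosa.load("recording.wav", sr=22050)
print(y.shape, sr)  # e.g. (66150,) 22050 for a 3-second recording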
After obtaining the time domain representation of each audio sample, several features
used further can be obtained through certain transformations.
4.1 Frequency Decomposition
A signal in the time domain can be represented as a combination of sinusoids with different frequencies of oscillation. A sinusoid of frequency f hertz can be described as follows [10]:

x(t) = a sin(2πft + φ)

So a frequency decomposition of the signal in the time domain can be obtained, which conceptually identifies how much of each sinusoid of a certain frequency was combined to produce the signal. Hence, the intensities of the frequencies present in the original signal can be computed. A tool used to achieve this is the Discrete Fourier Transform (DFT), which converts a time domain signal of length N into another signal in the frequency domain [11]:

X(k) = Σ_{n=1}^{N} x(n) e^{−i2πkn/N}, 1 ≤ k ≤ K

After this transformation we obtain a sequence X(k) describing the intensity of the k-th frequency component in the original signal.
In signal processing, frequency decomposition is often performed by dividing the signals
into time intervals of specified window size and performing DFT on each windowed
signal, thus coming up with frequency components in multiple time intervals. Such
representation of a signal is called the Short-Time Fourier Transform (STFT) of a signal.
4.2 Mel Frequency Cepstral Coefficients (MFCCs)
An FFT is applied to each small window of the original signal, and the result is a sequence which describes the frequencies present through time in the original signal. This process is the short-time Fourier transform (STFT). An illustration is given in Fig 4.2.1.
Fig 4.2.1

(1) S_j(k) = Σ_{n=1}^{N} s_j(n) h(n) e^{−i2πkn/N}, 1 ≤ k ≤ K

where S_j(k) is the intensity of frequency k in the j-th time window, s_j(n) is the value of the j-th window of the original signal at time point n, and h(n) is an N-point Hamming window. We then take the absolute value of the complex Fourier transform and square it to obtain the power spectrum estimate

(2) P_i(k) = (1/N) |S_i(k)|²
The next step is to calculate MFCCs from the frequency decomposition of the signal. MFCCs, in their turn, are coefficients which describe the overall shape of the frequency spectrum. MFCCs are beneficial because the number of coefficients is far smaller than the number of coefficients that come out of the STFT, and yet they are able to accurately represent the sound wave. After the STFT, for each window of the original signal we have around 2048-8192 coefficients, each describing how much of the corresponding frequency was present in the window. After computing MFCCs we are left with far fewer features. Computing MFCCs involves passing the FFT results through mel filters and computing the DCT of the filter outputs to obtain coefficients which describe the frequency spectrum itself. Mel filters are necessary to adjust the computed frequencies to human hearing [4]. The more DCT coefficients we use, the more features describing the frequency spectrum are employed.
Now, let us elaborate on the steps of computing MFCCs.
1. We have power spectrum estimates of each windowed signal P_i(k), where i is the index of the window and k is the index of the frequency; thus P_i(k) gives the intensity of frequency k in time window i.
2. We compute triangular mel filter banks, which are essentially vectors with the length of the range of frequencies after computing the FFTs (let us denote the latter by f_max). Each filter bank is used for computing the energy of the windowed signal in a specific range of frequencies.
Fig 4.2.2
An example of a mel filter bank is depicted above; it is used for computing the energy of the signal in the frequency range 100-150 Hz, so the values of the vector are 0 outside the range and are adjusted inside the range so that the central frequency gets the largest coefficient. In principle, the closer the central frequency is to 0, the shorter the range of the triangular filter bank (and vice versa), the reason being that the human ear distinguishes energies more clearly at lower frequencies; we therefore need shorter filter banks (which also increases the number of filter banks) for lower frequency ranges, so that the lower frequency energies are better distinguished.
3. For the power spectrum of each windowed signal P_i(k) and for each filter bank vector H_n(k), where n ranges from 0 to the number of mel filter banks, we calculate the mel coefficient by computing the sum m_n = Σ_{k=0}^{f_max} H_n(k) P_i(k). Then we take the log of each mel coefficient, m_n = log(m_n), so finally we obtain the log filterbank energies for each window.
(3) M_k = (1/2) m_0 + Σ_{n=1}^{N} m_n cos[ (πn/N)(k + 1/2) ], 0 ≤ k ≤ N − 1

where N is the number of mel coefficients, m_n is the n-th mel coefficient from the step above and M_k is the k-th MFCC [4].
MFCCs are visualised below.
Fig 4.2.3
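As a rough sketch, the whole STFT, mel filter bank and DCT pipeline described above is available in librosa as a single call; the window length, hop length and number of coefficients below are illustrative choices, not necessarily the ones used in our experiments:

import librosa

y, sr = librosa.load("recording.wav", sr=22050)

# n_fft is the STFT window length, hop_length the step between windows,
# n_mfcc the number of DCT coefficients kept per window.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_fft=2048, hop_length=512, n_mfcc=40)
print(mfccs.shape)  # (40, number_of_windows)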
The number of windows determines how accurately we can describe this change. If the number of windows is large, we have more individual parts of the original signal decomposed into frequencies. On the other hand, if the number of windows is too large, we lose the general frequency description of the whole signal, which means there is a tradeoff between describing frequency changes over time and describing frequencies over the whole signal. It is therefore necessary to try classifying the same examples with different window lengths to find an optimal one. We should also choose the length of the Fourier transform on each window; since human voices range from around 300 Hz to 5 kHz, it is reasonable to choose a value from that range. The precise process of decomposing a signal into frequencies involves the discrete Fourier transform (DFT), which describes oscillations in terms of a sum of sinusoids of different periodicities. The algorithm used in librosa to compute the DFT is the fast Fourier transform (FFT).
4.3 Continuous Wavelet Transform (CWT)
An STFT computed with a larger window size has a higher frequency resolution (the frequency content of the signal is described in more detail) but a lower time resolution (the temporal changes of frequencies in the signal are not explained thoroughly). The opposite holds as well: an STFT with a smaller window size has a higher time resolution (it indicates changes in frequency over time better) but a low frequency resolution (it does not properly show the frequency distribution over the whole signal).
CWT solves this "resolution trade-off" problem by analysing the signal at different frequency scales, a larger scale meaning a lower frequency and vice versa, where large-scale analysis preserves the frequency resolution of the transformation by deriving the frequency-related data of large parts of the signal, and small-scale analysis preserves the time resolution by deriving data on the abrupt changes of the signal in short intervals.
CWT makes use of wave-like functions called wavelets; at each step of the algorithm, a chosen wavelet function is convolved with the original signal for deriving the corresponding frequency-domain value. As described in [15], the conditions for a function f(t) to qualify as a wavelet are the following (complex wavelets are not considered in this paper; the conditions below relate to real-valued wavelets only):
1. E = ∫_{−∞}^{∞} |f(t)|² dt < ∞, where E is called the energy of the wavelet.
2. If F(k) is the Fourier transform of f(t), then it must be true that

∫_{0}^{∞} |F(k)|²/k dk < ∞

which, intuitively speaking, means that f(t) has a zero mean.
As stated in [15], the most commonly used wavelet functions are Gaussian wave, Mexican
hat, Haar and Morlet, the latter of which we utilized in speech signal processing
(visualised below.)
Fig 4.3.1
After choosing a wavelet function ψ(t), the CWT of the signal x(t) is computed as follows:

T(a, b) = (1/√a) ∫_{−∞}^{∞} x(t) ψ((t − b)/a) dt
where a is the scale at which the signal is analysed, and b is the point in time at which the wavelet function is shifted. Thus, the original signal is processed at different scales and time points, and it is crucial to note that the larger the scale, the lower the frequency of the transformed wavelet function will be, making it grasp a wider window and increase the frequency resolution while decreasing the number of observable time points along the signal (the opposite statement is true as well). Thus, with CWT, we obtain a representation of the signal at various frequency scales for specific time intervals. An example of a heatmap resulting from the CWT is visualised in Fig 4.3.2.
Fig 4.3.2
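A sketch of computing such a scale-time representation with a Morlet wavelet, here using the PyWavelets package (an assumption: any CWT implementation with a Morlet wavelet would do, and the scale range is illustrative):

import numpy as np
import pywt
import librosa

y, sr = librosa.load("recording.wav", sr=22050)

# Analyse the signal at 64 scales; larger scales correspond to lower frequencies.
scales = np.arange(1, 65)
coefficients, frequencies = pywt.cwt(y, scales, "morl", sampling_period=1.0 / sr)
print(coefficients.shape)  # (64, len(y)): a scale-by-time map as in Fig 4.3.2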
4.4 Component-Activation Decomposition
Component-Activation Decomposition is a method of decomposing a given signal
spectrogram into two nonnegative matrices by utilizing nonnegative matrix factorization
(NMF.) As described in [16], NMF is used for approximating a spectrogram of a signal by a
product of two nonnegative matrices, where the first matrix represents the spectral
patterns present in the signal, and the second matrix represents the activations of the
spectral patterns in specific time points. [16] describes such decomposition as a
significant advantage in machine learning algorithms and spectrogram processing as the
derived factors better interpret the properties and structures present in the given
spectrogram.
Formally speaking, given a nonnegative spectrogram matrix V ∈ ℝ^{K×N}, we aim to find nonnegative matrices W ∈ ℝ^{K×R} and H ∈ ℝ^{R×N} such that

V ≈ W · H
where W is referred to as the template or components matrix, and H is called the
activations matrix. For achieving such an approximation, optimization techniques are
required, which are described in [16] elaborately.
In the context of our problem, considering that every data observation is a speech recording performed by a single person, we leveraged component-activation decomposition by factorizing each signal's spectrogram into a template matrix with only one component, which represents the spectral patterns in the voice of the speaker. This method also gives us a representation of the signal without a time dimension, which makes the usage of statistical learning tools easier.
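A sketch of this one-component factorization, assuming librosa's NMF-based decomposition routine is applied to the magnitude spectrogram (the spectrogram parameters are illustrative):

import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=22050)

# Nonnegative magnitude spectrogram V (frequency bins x time frames).
V = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# V ≈ W · H with a single component: W holds the spectral pattern of the
# speaker's voice, H its activation over time.
W, H = librosa.decompose.decompose(V, n_components=1)
print(W.shape, H.shape)  # (1025, 1) and (1, number_of_frames)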
4.5 Fundamental Frequency Features
Another group of features relates to the fundamental frequency of the signal, which we estimated in the form of cepstrum analysis. This approach is encouraged by [17], the reason being that
frequency analysis can provide information about harmonically related partials in the
signal, which are closely related to the fundamental frequency.
In short, cepstrum analysis is accomplished by performing a Fourier transform on the log of the frequency spectrum of the given signal (intuitively, the cepstrum domain is very close to the time domain because the Fourier transform is applied twice to the signal). The process of cepstrum analysis is described in the plot depicted in Fig 4.5.1.
Fig 4.5.1
After taking the Fourier transform of the signal, we have to detect the regularly spaced peaks of the obtained spectrum, which represent the harmonic spectrum of the signal. After taking the log of the spectrum to bring the amplitudes to a usable scale, the resulting log spectrum is Fourier transformed to detect the spacing between the peaks, which is related to the fundamental frequency of the original signal. Finally, the location of the maximum peak of the obtained cepstrum corresponds to the period of the original waveform.
In our analysis, for each recording, we executed the calculation described above and
stored the amplitude of the peak of the cepstrum of each recording.
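A minimal NumPy sketch of this cepstral peak computation, assuming a single Fourier transform over the whole recording and a peak search restricted to typical voice pitch periods (both simplifications):

import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=22050)

# Cepstrum: inverse Fourier transform of the log magnitude spectrum.
spectrum = np.abs(np.fft.rfft(y))
cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))  # small constant avoids log(0)

# Search for the peak among quefrencies corresponding to pitches of 60-400 Hz.
low, high = int(sr / 400), int(sr / 60)
peak_amplitude = cepstrum[low:high].max()
print(peak_amplitude)  # the value stored per recording in our analysis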
5. Methodology
5.1 Statistical Analysis
One group of models we used consists of decision trees and random forests. A decision tree splits the dataset at each node on the feature value that minimizes an impurity measure computed from the class probabilities, where p_i is the probability of samples in the subgroup belonging to the i-th class and m is the number of classes. In order to obtain a random forest, many decision trees are constructed, but each tree only uses a subset of features for splitting the dataset and classifying observations. In the end, the class predicted by the random forest is the class that the majority of the decision trees predict.
CN2 rule extraction tries to find rules of the form IF R1, R2, R3, … THEN CLASS. The CN2 algorithm works in an iterative manner: at each iteration it searches for a rule that covers many examples of a single class but few from the other classes. As a result, the obtained rules define which characteristics lead to the classification decision. In order to find the best rule at each iteration, the CN2 algorithm computes a function that outputs the quality of the rule on the examples that the rule covers [12]. For the purpose of finding more general rules that define each of the 8 moods, we used the Laplace expected accuracy, defined as:

L = (n_c + 1) / (n_a + m)

where n_c is the number of samples covered by the rule that belong to the predicted class, n_a is the total number of samples covered by the rule and m is the total number of classes. This measure favors rules that cover more examples of the target class, compared to entropy [13].
Support vector machine (SVM) classification is a statistical learning algorithm suited for binary classification. The goal of SVM is to define a hyperplane in an N-dimensional space, where N is the number of features of the observations, that clearly separates the examples of one class from the other. That is, the algorithm seeks parameters (w, b) of an N-dimensional hyperplane given by w^T x + b = 0 such that the samples from one class are on one side of the plane while the observations from the other class are on the other side. Moreover, SVM seeks to maximize the distance of the hyperplane from the closest observations of both classes, meaning that the gap between the hyperplane and the observations is maximal. The algorithm solves an optimization problem whose objective is to maximize the gap from the closest point of each class under the constraint that the observations lie on separate sides of the hyperplane. In case the observations are not separable, the algorithm uses a feature mapping to map observations to a higher dimension and then searches for the hyperplane in that dimension. Further details of the theory and implementation are discussed in [19].
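As an illustration of how such models are fit on fixed-length feature vectors, here is a sketch using scikit-learn (an assumption; the toolkit actually used is not specified here, and the data below is a random placeholder):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X: one fixed-length feature vector per recording (e.g. binned frequency
# intensities); y: binary labels, since SVM is a binary classifier.
X = np.random.rand(200, 128)
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel="rbf", C=1.0)  # non-separable data handled via the RBF kernel
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))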
5.2 Prediction Approaches
Fig 5.2.1
An interesting interpretation of the eight discrete emotions can be obtained when
defining the two dimensions presented above: a dimension of measuring
positivity/negativity in speech and a dimension of measuring the arousal or
activity/passivity in speech. Thus, we obtain quite an intuitive mapping from discrete
emotions to measures in arousal and appraisal dimensions [8].
In our experiments, we used the arousal and appraisal dimensions to reduce the classification problem to two sub-problems of labeling the speech with two possible classes each. In the first sub-problem, we consider two classes: neutrally/passively aroused speech and highly aroused speech (in our dataset, sad, calm, neutral and disgust are mapped to the first class, and the other four labels, excited, happy, angry and afraid, are mapped to the second class). Likewise, in the second sub-problem, we classify a speech as either negatively appraised or neutrally/positively appraised (in our dataset, angry, fearful, sad and disgust are mapped to the first class, and the other four labels, excited, happy, surprised and neutral, are mapped to the second class).
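These two relabelings can be written down directly as lookup tables; the sketch below uses the RAVDESS label names from section 3 and assumes that "afraid" and "excited" above refer to the fearful and surprised labels, with calm placed in the neutral/positive class:

# Arousal sub-problem: 0 = neutrally/passively aroused, 1 = highly aroused.
AROUSAL_CLASS = {
    "neutral": 0, "calm": 0, "sad": 0, "disgust": 0,
    "happy": 1, "angry": 1, "fearful": 1, "surprised": 1,
}

# Appraisal (sentiment) sub-problem: 0 = negative, 1 = neutral/positive.
SENTIMENT_CLASS = {
    "angry": 0, "fearful": 0, "sad": 0, "disgust": 0,
    "happy": 1, "surprised": 1, "neutral": 1, "calm": 1,
}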
Instead of solving a discrete 8-class classification problem, we also tried to map speeches to a 2-dimensional continuous sentiment-arousal space describing both the arousal and the sentiment of the speech. The horizontal axis of the sentiment-arousal space indicates the polarity of the sentiment of the speech: -1 means that the emotion is very negative and +1 means that the emotion is very positive. Similarly, the vertical axis describes the degree of arousal of the speech. The axes of this space are the same as in Fig 5.2.1. This way, the classification problem is reduced to a regression problem where the goal is to find the coordinates of the observation in this space. The existing discrete emotions have been manually mapped to the sentiment-arousal space in accordance with the figure above:

Discrete emotion | Sentiment-Arousal coordinates
5.3 Neural Network Architectures
Section 5.3.1 describes the usage of convolutional neural networks and section 5.3.2 is devoted to long short-term memory units.
5.3.1 Convolutional Neural Networks
Fig 5.3.1
Due to those characteristics, a CNN is suitable for learning from sequences of features, as convolutional kernels can capture information about the change of input variables throughout the sequence. Moreover, it can capture useful information from the change of neighbouring features at the same point of the sequence. Depending on the type of prediction problem, different loss functions can be used with this architecture.
5.3.2 Long Short Term Memory Networks
Prior to understanding the purpose and algorithm of Long Short Term Memory Networks
(LSTMs), let us elaborate on the networks they are “inherited” from - Recurrent Neural
Networks (RNNs.) Intuitively, RNNs mimic the way the human brain processes a temporal
data input by keeping a memory of previously encountered data values while processing
each input at a specific time point. That is, the output of an RNN in the current step is
passed as an input to the network on the next step.
Fig 5.3.2
Fig 5.3.2 illustrates one step of an RNN cell computation. x_t is the current observation in the time series data and h_{t−1} is the output of the previous cell; both vectors are passed as input to the current cell, concatenated, passed through a tanh layer, and the result is passed on to the next cell. The process repeats recursively until the last time point observation is reached.
Although the algorithm of RNNs does consider previously encountered data at each iteration, which does provide good results in some cases, there are major shortcomings to the architecture as well. The main disadvantage of RNNs is known as the vanishing/exploding gradient problem: during backpropagation, due to the recursive nature of such networks, if the sequence is long enough the gradients will either drastically increase or decrease depending on the outputs of the cells. This problem was brought up in [20], where the authors presented Long Short Term Memory networks (LSTMs) as a solution.
Fig 5.3.3
Fig 5.3.3 shows the architecture of an LSTM cell. The key difference between LSTM and
RNN architecture is that apart from the general output of the cell, an additional vector
called the cell state is passed to the next cell as an input.
Fig 5.3.4
Fig 5.3.4 highlights the portion of the cell architecture that corresponds to the cell state.
As it is shown, the cell state is barely modified in the iteration, which helps to solve the
vanishing/exploding gradient problem, and long term memory is preserved during
computations as a result.
Now, let us elaborate on the steps of computations performed in a single LSTM iteration.
1. Firstly, based on the current time point vector and the output of the previous cell, the
LSTM cell decides what information to preserve/discard from the cell state.
Fig 5.3.5
As shown in Fig 5.3.5, we compute f_t = σ(W_f · [h_{t−1}, x_t] + b_f), where [h_{t−1}, x_t] is the concatenation of the two vectors, h_{t−1} is the output of the previous cell, and x_t is the data at the current time point. f_t is then used to decide what information to preserve from the cell state.
Also note that throughout the computations of an LSTM cell iteration, the sigmoid function is used whenever the importance of a piece of information is decided, which leads to its preservation or discarding. The reason is that the output of the sigmoid function is between 0 and 1, and, during the training phase of the model, non-important information will be "forced" towards a value of 0 while important information will get values closer to 1.
2. Secondly, based on the current time point vector and the output of the previous cell,
the LSTM cell decides what new information to add to the cell state.
Fig 5.3.6
As depicted in Fig 5.3.6 above, we compute i_t = σ(W_i · [h_{t−1}, x_t] + b_i) and C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C) (the input variables are the same as mentioned in the first step). All the values calculated above are used in the next step for updating the value of the cell state.
3. Thirdly, using the values calculated in the first two steps, we update the LSTM cell
state.
Fig 5.3.7
As shown in Fig 5.3.7, we compute the new cell state as C_t = f_t * C_{t−1} + i_t * C̃_t, where * is the pointwise multiplication operation. Thus, the cell state is updated by deciding what information to preserve (f_t * C_{t−1}) and what new information to add (i_t * C̃_t).
4. The final step in LSTM cell iteration is to compute the cell output, which is
accomplished by deciding which part of the updated cell state should be outputted
from the cell.
Fig 5.3.8
As depicted in Fig 5.3.8, we compute o_t = σ(W_o · [h_{t−1}, x_t] + b_o) for deciding what information to preserve and what to discard from the updated cell state, and compute the final output h_t = o_t * tanh(C_t) [21].
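Putting steps 1-4 together, a single LSTM cell update can be sketched in a few lines of NumPy (weight matrices are taken as given here; in a real network they are learned during training):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM cell iteration following the equations of Figs 5.3.5-5.3.8."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # what to keep from the old cell state
    i_t = sigmoid(W_i @ z + b_i)           # what new information to let in
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde     # updated cell state
    o_t = sigmoid(W_o @ z + b_o)           # which part of the state to output
    h_t = o_t * np.tanh(C_t)               # cell output
    return h_t, C_t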
As far as our problem is concerned, we carried out several experiments with LSTM networks because, intuitively, they are a good fit for our purposes. The voice recordings are represented as time series of air pressure values (amplitudes), on which we apply signal processing algorithms such as the STFT, the wavelet transform and MFCC computation; the extracted voice features are therefore themselves time series of energy coefficient values, and LSTM networks can learn emotion-relevant patterns by memorizing which energy coefficients they encountered before (what they "heard") while taking the coefficients of the next time point as input.
6. Experiments
6.1 Exploratory Data Analysis
6.1.1 Analysing Frequency Spectrum
Fig 6.1.1
To perform classification using random forests and CN2 rule induction, we grouped frequencies into bins of 64 Hz width and used each bin as a feature. Thus, we used 128 features, with 8128-8192 Hz being the bin with the highest frequency range. We separated all voice recordings of the 1st and 2nd actors of the RAVDESS dataset, one male and one female, to obtain a test set of 120 samples; the rest were used for training. The results of classification using tree based models and CN2 rule induction were very poor, however by analysing some of the results we can still get insights.
By looking at some of the top rules extracted by the CN2 induction algorithm we can
try to get a description of the frequency spectrum of speeches with different
emotions.
IF 6528Hz - 6592Hz ≥ 3.9886746406555176 AND 8064Hz - 8128Hz ≥ 3.8793866634368896 AND
128Hz - 192Hz ≥ 0.0574270561337471 → Angry
IF 7744Hz - 7808Hz ≤ 0.04547011852264404 AND 448Hz - 512Hz ≥ 1.8838613033294678 AND
64Hz - 128Hz ≥ 0.13342395424842834 → Sad
IF 1600Hz - 1664Hz ≤ 0.1066826581954956 AND 320Hz - 384Hz ≥ 2.2769432067871094
→ Neutral
IF 1088Hz - 1152Hz ≥ 33.377315521240234 AND 3840Hz - 3904Hz ≥ 2.9604032039642334 →
Happy
IF 768Hz - 832Hz ≥ 7.9639387130737305 AND 192Hz - 256Hz ≤ 0.10243330895900726 AND
5376Hz - 5440Hz ≥ 0.3536578118801117 AND 704Hz - 768Hz ≥ 5.466808319091797
→ Surprised
The CN2 induction algorithm distinguishes between active and passive emotions. It assigns one of the more passive emotion labels when the intensity of high range frequencies is low, and the opposite holds for active emotions: when the energy in high frequency ranges is high, the classification results in one of the more active emotions. The results are in accordance with Fig 6.1.1, which describes the differences in frequency intensities across the 8 mood categories.
By taking a look at the confusion matrix of the random forest classifier on the test dataset, shown in Fig 6.1.2 below, we can notice that the classifier also confuses active emotions and passive emotions. For example, anger is confused with disgust and surprise, and happiness is confused with disgust and surprise. The same trend exists within the passive emotions.
Fig 6.1.2
These results mean that the frequency domain alone is not descriptive enough to classify voice recordings by emotion. This is because, by transforming the whole time domain signal into the frequency domain, we lose information on how the frequency spectrum changes during the speech. That is, we obtain a frequency spectrum representation of the whole signal but have no way to observe how it changes during the speech. However, the frequency domain is descriptive enough to obtain the intensity of the voice recordings in terms of whether the speaker expresses an active or a passive emotion. Moreover, we obtained some characteristics of the frequency domain relative to the emotion in the speech.
6.1.2 Analysing Cepstral Peaks
Fig 6.1.3
Although the overlap of cepstral peak values between recordings of different moods is high, which makes training statistical models on the obtained data less feasible, an interesting pattern is noticed when considering the arousal level of the recordings. If we consider the neutrally/passively aroused recordings (neutral, calm, sad and disgust), we can notice that their cepstral peak values are higher than those of the highly aroused recordings (happy, angry, fearful, surprised).
Fig 6.1.4
In Fig 6.1.4 above, recordings from the TESS dataset were added to the analysis as well. It can be noticed that the pattern of passively aroused recordings having lower cepstral peaks on average is still present.
Speaker normalized | Classification type | Train accuracy (%) | Test accuracy (%) | New actor accuracy (%)
6.2 Classification Using Discrete 8 Emotions
6.2.1 CNN model on MFCCs
Fig 6.2.1
There are 3 convolutional layers in the network, each followed by an average pooling layer of size (2x2). The last layer is a fully connected layer that maps the output of the convolutional layers to a vector of length 8. A log softmax activation is applied so that the cross entropy loss can be used. Each convolutional layer has 32 kernels. The first layer has kernels of size (10x3); this shape is deliberately chosen to be narrow and tall to capture features from the change of MFCCs along the sequence. Between layers, the leaky ReLU activation function, given as h(x) = max(x, 0) + 0.01 * min(0, x), is used both to enable fast training and to prevent neurons from dying: leaky ReLU adds a small slope for non-activated neurons, thus preventing them from becoming 0 and no longer contributing to backpropagation in later epochs [18]. Since our dataset is very small, we used dropout with a high probability (p = 0.5) as well as L2 regularization to prevent overfitting.
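A sketch of this model, written here in PyTorch (the framework is an assumption, as are the input size, the kernel sizes of the second and third layers and the lazily sized final layer; only the details stated above are taken from the experiment):

import torch
import torch.nn as nn

class MfccCnn(nn.Module):
    """Three convolutional layers over an MFCC "image", then an 8-way classifier."""
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(10, 3)),  # narrow and tall first kernel
            nn.LeakyReLU(0.01),
            nn.AvgPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3),       # kernel size assumed
            nn.LeakyReLU(0.01),
            nn.AvgPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3),       # kernel size assumed
            nn.LeakyReLU(0.01),
            nn.AvgPool2d(2),
            nn.Dropout(p=0.5),
        )
        self.classifier = nn.LazyLinear(n_classes)  # maps to an 8-length vector

    def forward(self, x):                           # x: (batch, 1, time, n_mfcc)
        x = self.features(x).flatten(1)
        return torch.log_softmax(self.classifier(x), dim=1)

# L2 regularization enters through the weight decay of the optimizer, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=1e-4)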
All recordings of the 1st and 2nd actors (one male and one female) from the RAVDESS database were used for testing, and the neural network did not see those samples during training. All remaining recordings were used for training. We used the Adam optimizer with a learning rate of 0.00005 and L2 regularization with decay 10^{-4}. The final loss function becomes:
(1) l(ŷ_i) = log( exp(ŷ_i) / Σ_j exp(ŷ_j) )

(2) L(y, ŷ) = − Σ_i y_i l(ŷ_i) + λ Σ_{w∈W} w²

where W is the set of all network parameters.
The network was trained multiple times with these settings and, on average, it achieved 67.5% accuracy on the test dataset and 96% accuracy on the training dataset. The average ROC AUC over all classes was 0.927. ROC curves for all classes are shown below.
Fig 6.2.2
As we can see in Fig 6.2.2, the network captures some emotions more easily than others. For instance, Neutral, Calm, Angry and Surprised were captured better than the rest. The ROC-AUC metric also suggests that the model is significant and does not guess labels randomly. Overall, these results can be considered only fair, as there is a lack of generalization: the test dataset contained voices from people the network did not observe during training, but it did observe the phrases spoken in the dataset. In essence, if we try to use this network on different voices with arbitrary spoken phrases, the results will be worse.
6.2.2 LSTM model on MFCCs
Fig 6.2.3
MFCC sequences are fed into an LSTM recurrent layer with a hidden dimension of size 1024. There are 2 LSTM layers stacked on top of each other, meaning that the outputs of the first layer are processed by the second one; this increases the perceptiveness of the network towards the features present in the sequence. Due to the dataset being small, we used dropout with a high probability (p = 0.5) on the outputs of the first LSTM unit. The output of the final time step is mapped to a vector of length 8 representing the class scores.
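A sketch of this model in PyTorch (again a framework assumption; feeding the last time step's output to the final layer is also an assumption consistent with the description above):

import torch.nn as nn

class MfccLstm(nn.Module):
    """Two stacked LSTM layers over an MFCC sequence, then an 8-way classifier."""
    def __init__(self, n_mfcc=40, hidden_size=1024, n_classes=8):
        super().__init__()
        # dropout (p = 0.5) is applied to the outputs of the first LSTM layer
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden_size,
                            num_layers=2, dropout=0.5, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x):                  # x: (batch, time, n_mfcc)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])      # class scores from the last time step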
The network was trained using only the RAVDESS dataset. The recordings of the 1st and 2nd actors (one male and one female) were used as the test set; the rest of the recordings were used for training. The Adam optimizer with a learning rate of 0.0005 was used, and the loss function to minimize was the cross entropy loss given by:
(1) l(ŷ_i) = log( exp(ŷ_i) / Σ_j exp(ŷ_j) )

(2) L(y, ŷ) = − Σ_i y_i l(ŷ_i)
The network achieved 65% test accuracy and 93.58% train accuracy. Clearly, the
model overfitted the provided data and gave poor performance on the voices of actors
it has never encountered during training. The ROC curves of the model are depicted
below.
Fig 6.2.4
By observing the ROC curves in Fig 6.2.4, we can notice that the model faced the same difficulties as described in section 6.2.1, although it classifies some of the moods better. To further explore the reasons behind the network's weak performance in classifying 8 discrete emotions, beyond the possible reasons mentioned before, we decided to reduce the problem to classification over the sentiment-arousal space (section 5.2.2), which is explained in the upcoming sections.
6.3 Classification Using Sentiment-Arousal Space
6.3.1 LSTM Models on MFCCs for Sentiment-Arousal Space Zones
Fig 6.3.1
The results of the experiment are summarized in the table below.
Zones | Data separation | Datasets used | Train acc. (%) | Test acc. (%) | AUC
Fig 6.3.2
Firstly, the same LSTM architecture gave the best performance for all classification
scenarios, which indicates that the architecture is a good fit considering the datasets
available.
Secondly, the results indicate that the harder challenge for the model was classifying the positivity of the speech, while its performance in classifying the arousal level was good. This problem was also encountered in section 6.1.4, where the intuition behind it is described as well.
6.3.2 CNN Model on Wavelet Transform for Sentiment-Arousal Space Zones
Fig 6.3.3
Fig 6.3.4
The difference between the 2 architectures is that the network for classifying sentiment has one more convolutional layer. Dropout with p = 0.4 was used between the convolutional layers to prevent overfitting. Leaky ReLU was used as the activation function between layers; its benefit is described in section 6.2.1.
The results of the experiment are summarized in the figures below.
Zones | Train accuracy (%) | Test accuracy (%) | AUC
Fig 6.3.5
Again, the models encounter the difficulty of classifying the positivity of the speeches,
as described in the prior sections.
6.4 Mapping Emotion to Continuous Sentiment Arousal Space
Fig 6.4.1
This effectively allows us to turn the classification problem into a vector regression problem, so the loss function is also changed from cross entropy to a mean squared error loss:

L(y, ŷ) = (1/2)(y − ŷ)^T (y − ŷ) + λ Σ_{w∈W} w²
The result after multiple trainings was around 0.21 MSE loss. One interpretation of this loss is that the coordinates the network outputs are about 0.6 in Euclidean distance away from their correct destinations in the continuous sentiment-arousal space. But this may give a misleading sense of the performance, because the network might correctly predict that a speech is both aroused and negative yet fail to find the exact degree of negativity and arousal. To this end, we constructed some visualizations of the results in the sentiment-arousal space to gain a better understanding of the results.
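As a quick check of the figure above, assuming the reported 0.21 refers to the squared-error term of the loss, the implied distance follows directly from the loss definition:

‖y − ŷ‖ = √(2L) = √(2 · 0.21) ≈ 0.65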
Fig 6.4.2
Fig 6.4.2 shows that the overall performance was good. In the majority of cases the network correctly identifies both the sentiment and the arousal of the speech; it rarely fails to identify both components, and it can at least identify the arousal of a speech. One shortcoming we see is that a significant number of speeches labeled happy were identified as negative by the network. On the contrary, fear, disgust, anger and sadness were almost perfectly positioned in the new plane. This again shows that the network struggles to determine the sentiment but is good at differentiating between active and passive emotions. Solving this regression problem is more flexible than solving the discrete emotion classification problem. From the perspective of optimization, it is easier to find optimal solutions since the network has to identify only 2 components from the speech. Moreover, this formulation is more flexible in terms of categorizing speech, as one can define a different continuous space and thus obtain a solution to a completely different problem.
7. Conclusion and Future Work
such as detecting a possibly harmful emotional state of a car driver and blocking an action. Finally, voice emotion recognition models can be integrated into customer service centers for analysing the moods of clients and used as a tool for service efficiency analysis.
References
[1] Steven R. Livingstone and Frank A. Russo, “RAVDESS Emotional speech audio.” Kaggle,
doi: 10.34740/KAGGLE/DSV/256618.
[2] SAVEE Database. Retrieved from http://kahlan.eps.surrey.ac.uk/savee/
[3] TESS Database. Retrieved from
https://tspace.library.utoronto.ca/handle/1807/24487
[4] Sahidullah, M., & Saha, G. (2012). Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Communication, 54(4), 543-565. doi: 10.1016/j.specom.2011.11.004
[5] Mower, E., Mataric, M. J., & Narayanan, S. (2011). A Framework for Automatic Human Emotion Classification Using Emotion Profiles. IEEE Transactions on Audio, Speech, and Language Processing, 19(5), 1057-1070. doi:10.1109/tasl.2010.2076804
[6] Glüge, S., Böck, R., & Ott, T. (2017). Emotion Recognition from Speech using
Representation Learning in Extreme Learning Machines. Proceedings of the 9th
International Joint Conference on Computational Intelligence.
doi:10.5220/0006485401790185
[7] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S.
Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal
of Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, December 2008.
[8] Wieczorkowska, A., Synak, P., Lewis, R., Ras, Z. W., Extracting Emotions from Music Data, Proceedings of the 15th International Symposium, ISMIS 2005, Saratoga Springs, NY, USA, May 25-28, 2005, pp. 456-465
[9] Librosa sound processing library. https://librosa.github.io/librosa/
[10] Frequency Domain and Fourier Transforms. Retrieved from
https://www.princeton.edu/~cuff/ele201/kulkarni_text/frequency.pdf
[11] A. Kulkarni, M. F. Qureshi, M. Jha, Discrete Fourier Transform: Approach to Signal Processing.
[12] P. Clark, T. Niblett (1988) The CN2 induction algorithm, The Turing Institute
[13] R. Boswell (1998) Rule Induction with CN2: Some Recent Improvements
[14] Facts About Speech Intelligibility - DPA Microphones. (n.d.). Retrieved from https://www.dpamicrophones.com/mic-university/facts-about-speech-intelligibility
[15] The continuous wavelet transform. (n.d.). The Illustrated Wavelet Transform
Handbook. doi:10.1887/0750306920/b833c2
[16] Müller, M. (2016). Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, pages 415-568
[17] Gerhard, D. (2003). Pitch Extraction and Fundamental Frequency: History and
Current Techniques. Department of Computer Science University of Regina, 9-12.
[18] B. Xu, N. Wang, T. Chen, M. Li, Empirical Evaluation of Rectified Activations in Convolutional Networks.
[19] Awad, Mariette & Khanna, Rahul. (2015). Support Vector Machines for Classification.
10.1007/978-1-4302-5990-9_3.
[20] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. doi:10.1162/neco.1997.9.8.1735
[21] Wu, S. (2009). Recognition of Human Emotion in Speech Using Modulation
Spectral Features and Support Vector Machines. Queen’s University, Department of
Electrical and Computer Engineering, 70-74.