Submitted by
Rahul Wadhwa
19EC39028
rahulwadhwa157@gmail.com
I, Rahul Wadhwa, declare that this thesis titled ‘Exploring Self Supervised Learning for Speech Emotion Recognition’ and the work presented in it are my own. I confirm the following:
• The entirety of this work was completed while pursuing a Bachelor’s degree at this
university.
• The work has not been submitted to any other Institute for any degree or diploma.
• I have complied with the standards and recommendations outlined in the Institute’s
Ethical Code of Conduct.
• All significant aid sources have been acknowledged.
Indian Institute of Technology, Kharagpur
Certificate
This is to certify that the project report entitled “Exploring Self Supervised Learning for Speech Emotion Recognition”, submitted by Rahul Wadhwa (Roll No. 19EC39028) to the Indian Institute of Technology Kharagpur towards partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Electronics and Electrical Communication Engineering, is a record of bona fide work carried out by him under my supervision and guidance during the Autumn Semester, 2022-23.
Acknowledgements
I would like to thank my guide, Prof. Goutam Saha, for his exceptional guidance and support, without which this project would not have been possible. He always motivated me to explore as much as I could, to study multiple papers and to try as many ideas as possible, and he supported me through every problem I faced during the project.
I am grateful to Premjeet Singh, PhD candidate at IIT Kharagpur, for his unwavering support throughout the project.
In conclusion, I recognize that this project would not have been possible without the
support from the Department of Electronics and Electrical Communication Engineering,
IIT Kharagpur. Many thanks to all those who made this project possible.
Yours Sincerely,
Rahul Wadhwa
19EC39028
Contents
1 Introduction
2 Motivation
3 Problem definition
4 Datasets
4.1 EmoDB
4.2 RAVDESS
5 Transformer Architecture
5.1 HuBERT Pre-trained Model
5.2 Wav2Vec2 Pre-trained Model
7 Results
7.1 EmoDB Dataset
7.1.1 HuBERT
7.1.2 Wav2Vec2
7.2 RAVDESS Dataset
7.2.1 HuBERT
7.2.2 Wav2Vec2
1 Introduction
Speech is the most vital part of day-to-day communication with other people. Speech signals can be modelled as continuous time-domain signals generated by pressure variations. They are characterized by basic building blocks such as words, syllables and phones, and by the arrangement of these units within each other, which is known as the linguistic structure of speech signals.
Speech signals are digitized and converted to a vector representation for analysis and pre-processing, much as words are converted to embeddings in natural language processing (NLP). Sentences and speech are both examples of sequential data, and we typically cannot interpret much without analysing the entire sequence. Sampling and quantization are the fundamental steps in the digitization of any signal: the sampling rate must exceed the Nyquist rate, i.e., twice the highest frequency present in the signal. Several schemes can be used for quantization, including linear quantization, logarithmic quantization, and mu-law quantization.
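As an illustration of these steps, here is a minimal sketch of mu-law quantization; the 8-bit resolution and mu = 255 are assumptions matching the common telephony setting, not values from this project:

```python
import numpy as np

def mu_law_quantize(x, mu=255, bits=8):
    """Compress a signal in [-1, 1] with mu-law companding, then
    uniformly quantize the compressed value to 2**bits levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Map [-1, 1] -> integer codes [0, 2**bits - 1]
    return np.round((compressed + 1) / 2 * (2**bits - 1)).astype(int)

# A 440 Hz tone sampled at 16 kHz (well above twice its highest frequency)
fs = 16000
t = np.arange(0, 0.01, 1 / fs)
codes = mu_law_quantize(np.sin(2 * np.pi * 440 * t))
```

The logarithmic companding allocates more quantization levels to quiet portions of the signal, which suits the amplitude statistics of speech.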
Besides the linguistic content, speech carries other kinds of information such as the language, the speaker's identity, the topic spoken about, and the speaker's emotional state [1]. The process of determining the speaker's emotional state from speech is known as Speech Emotion Recognition (SER). The subjective nature of emotion makes SER a difficult task.
2 Motivation
Emotions play a vital role in communication, and their detection and analysis is quite important. Humans are naturally good at recognising the emotions contained in voice signals: regardless of who is speaking, we concentrate on the speaker's tone and words to determine their emotions. In the field of speech emotion recognition this is known as speaker-independent emotion recognition. For speech emotion recognition (SER), researchers use a collection of rules that process and classify speech signals to detect the emotions present in them. SER systems aim to create efficient, real-time methods of detecting the emotions in communication. SER is a complex task, and before the advent of deep learning, traditional ML systems such as kernel SVMs were used for such complex tasks. The choice of features is a common challenge with this approach: generally speaking, it is unknown which features will result in the most effective separation of the data into distinct classes.
This optimal feature-selection problem is largely solved by deep learning: the idea is to employ an end-to-end network that takes raw data as input and produces a class label as output. Nowadays, pre-trained deep learning models such as transformers and CNNs are used for SER. They are first trained on a large dataset, not in a supervised way but to learn the distribution of the data; they are then fine-tuned in a supervised manner for the downstream task, which in our case is speech emotion recognition. Inspired by this, we aim to investigate different deep learning models pre-trained on speech-related tasks and fine-tune them to analyse the effect on SER performance.
Figure 1: Architecture of SER system. [2]
3 Problem definition
• Utilize a pre-trained transformer-based deep learning framework for SER which takes raw speech utterances as input and provides the most probable emotion class as output, i.e.,
Input Audio → Transformer Architecture → Emotion Class
• Fine-tune the model for speaker independent Emotion Recognition.
• Evaluate SER performance for different test and validation speakers and observe
the emotion-wise accuracy.
• Observe the behaviour of different pre-trained models on the same dataset for SER.
4 Datasets
Databases are an essential part of speech emotion recognition because the classification process relies on labelled data. The quality of the data has an impact on how well the recognition process works. Two datasets have been considered:
• Berlin Emotional Database (EmoDB)
• Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
4.1 EmoDB
• It contains utterances of seven emotions spoken by ten professional actors (five male, five female) in German. Each text is spoken by different actors in different emotions. [3]
• Every utterance is named according to the same scheme:
– Positions 1-2: number of speaker
– Positions 3-5: code for text
– Position 6: emotion
– Position 7: if there are more than two versions these are numbered a, b, c ....
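The naming scheme above can be parsed directly from the filename. A small sketch; the emotion letters follow EmoDB's German codes (e.g. W = Wut/anger, F = Freude/happiness), and the example filename is illustrative:

```python
# EmoDB emotion codes: German initial letters of the seven emotions
EMOTIONS = {"W": "anger", "L": "boredom", "E": "disgust", "A": "fear",
            "F": "happiness", "T": "sadness", "N": "neutral"}

def parse_emodb(filename):
    """Split an EmoDB filename such as '03a01Fa.wav' into its fields."""
    stem = filename.rsplit(".", 1)[0]
    return {
        "speaker": stem[0:2],          # positions 1-2: number of speaker
        "text": stem[2:5],             # positions 3-5: code for text
        "emotion": EMOTIONS[stem[5]],  # position 6: emotion letter
        "version": stem[6:] or None,   # position 7 (optional): a, b, c ...
    }

info = parse_emodb("03a01Fa.wav")  # speaker 03, text a01, happiness, version a
```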
4.2 RAVDESS
• The RAVDESS contains recordings from 24 professional actors (12 female, 12 male). [4]
• Every utterance is named according to the same scheme:
– Modality
– Vocal channel
– Emotion
– Emotional intensity
– Statement
– Repetition
– Actor
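These fields appear as hyphen-separated two-digit codes in the filename; a sketch assuming the standard RAVDESS convention (the emotion-code mapping is RAVDESS's; the example filename is illustrative):

```python
# RAVDESS emotion codes (third field of the filename)
RAVDESS_EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy",
                    "04": "sad", "05": "angry", "06": "fearful",
                    "07": "disgust", "08": "surprised"}

# Field order matches the naming scheme listed above
FIELDS = ["modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor"]

def parse_ravdess(filename):
    """Split a RAVDESS filename such as '03-01-06-01-02-01-12.wav'."""
    parts = filename.rsplit(".", 1)[0].split("-")
    info = dict(zip(FIELDS, parts))
    info["emotion"] = RAVDESS_EMOTIONS[info["emotion"]]
    return info

info = parse_ravdess("03-01-06-01-02-01-12.wav")  # fearful speech, actor 12
```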
5 Transformer Architecture
Transformers are designed to process sequential input data with a major breakthrough
in field of deep learning that all the input data can be processed simultaneously. To know
about Transformers we first need to know about Attention because Attention is all you
need!! [5]
The Attention mechanism is a neural architecture that mimics this process:

$$\mathrm{Attention}(q, k, v) = \sum_i \mathrm{similarity}(q, k_i)\, v_i \tag{1}$$

The input embeddings are augmented with sinusoidal positional encodings:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{2}$$

$$PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{3}$$

where our $d_{model} = 512$. Next, these embeddings are transformed by 3 different weight matrices to obtain the vectors $q$, $k$ and $v$, each of size 64.

$$\mathrm{similarity}(q, k) = \frac{q^T k}{8} \tag{4}$$

where $8 = \sqrt{64}$ is the square root of the query and key vector dimension. Both the Encoder and Decoder parts of the transformer are composed of these self-attention layers along with feed-forward neural network layers and intermediate normalization and regularization layers.
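Equations (1)-(4) can be sketched in a few lines of numpy. This is a toy illustration, not the project's code: it includes the softmax normalization of the similarity scores used in [5], and uses small sequence lengths in place of real speech inputs.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, Eqs. (2)-(3); d_model must be even."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions, Eq. (2)
    pe[:, 1::2] = np.cos(angles)   # odd dimensions, Eq. (3)
    return pe

def attention(Q, K, V):
    """Scaled dot-product attention, Eqs. (1) and (4): similarity
    scores q^T k / sqrt(d_k), softmax-normalized, then a weighted
    sum over the value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # for d_k = 64 the divisor is 8
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 64)) for _ in range(3))
out = attention(Q, K, V)   # one 64-dim output per query position
```

For $d_k = 64$ the divisor $\sqrt{d_k}$ is exactly the 8 of Eq. (4).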
For the Speech Emotion Recognition task, two different pre-trained transformer-based models have been used:
• HuBERT
• Wav2Vec2
5.1 HuBERT Pre-trained Model
Hubert is a Pre trained Speech model based on Transformer architecture. [6] It will take
a float array representing the speech signal’s raw waveform. It is pretrained on 16kHz
sampled speech audio files of Libri Light dataset.In order to use this model for speech
emotion recognition the model should be fine tuned on labeled text data. [7]
The internal architecture of HuBERT comprises a CNN encoder followed by the transformer architecture. The CNN encoder includes 7 convolution layers with the GELU activation function [8]:

$$\mathrm{GELU}(x) = x\,P(X \le x) \approx 0.5x\left(1 + \tanh\!\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right) \tag{5}$$

The transformer encoder part includes 12 self-attention layers with 12 feed-forward layers, also using the GELU activation function.
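The tanh approximation in Eq. (5) can be checked numerically against the exact form $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF; a quick sketch:

```python
import math

def gelu_exact(x):
    """GELU(x) = x * P(X <= x) with X ~ N(0, 1), via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of Eq. (5)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# The two forms agree closely over typical activation ranges
max_err = max(abs(gelu_exact(x / 10) - gelu_tanh(x / 10))
              for x in range(-50, 51))
```

The maximum discrepancy over [-5, 5] stays well below 1e-3, which is why the cheap tanh form is used inside large networks.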
5.2 Wav2Vec2 Pre-trained Model
Wav2Vec2 is a pre-trained model used for feature extraction and classification of audio signals. It is pre-trained on 60 hours of LibriSpeech, sampled at 16 kHz. For SER, the pre-trained weights are fine-tuned with the emotions present in the speech signals as labels. [9]
The internal architecture of Wav2Vec2 is similar to HuBERT's. It also includes a CNN encoder with 7 convolution layers with GELU activation. The transformer encoder part includes 24 self-attention layers with 24 feed-forward layers, also using the GELU activation function.
However, the training process of Wav2Vec2 is quite different from HuBERT's: wav2vec 2.0 learns its targets concurrently with model training, whereas HuBERT builds targets through a separate clustering procedure.
CNN Encoder    strides           5, 2, 2, 2, 2, 2, 2
               kernel widths     10, 3, 3, 3, 3, 2, 2
               channels          512
Transformer    layers            24
               embedding dim.    1024
               layerdrop prob.   0.1
               attention heads   16
               projection dim.   1024
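From the strides and kernel widths above, the hop and receptive field of the CNN encoder can be computed; a small sketch (the 16 kHz input rate follows the pre-training setup described earlier):

```python
import math

strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]

# Total hop between output frames is the product of the per-layer strides
hop = math.prod(strides)          # 320 samples

# Receptive field: input samples seen by one output frame
rf, jump = 1, 1
for k, s in zip(kernels, strides):
    rf += (k - 1) * jump
    jump *= s                     # rf ends at 400 samples

fs = 16000
frame_hop_ms = 1000 * hop / fs    # 20 ms per output frame
receptive_ms = 1000 * rf / fs     # 25 ms of audio per frame
```

So the CNN encoder emits one feature vector every 20 ms, each summarizing 25 ms of raw audio, before the transformer layers process the resulting sequence.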
Test speaker   Validation speaker   Validation accuracy   Test accuracy
09             03                   95.83 %               80.95 %
03             10                   86.84 %               95.83 %
10             11                   81.48 %               81.58 %
11             12                   91.18 %               70.37 %
12             13                   83.33 %               94.12 %
13             14                   92.65 %               86.67 %
14             15                   92.86 %               89.71 %
15             16                   90.00 %               85.71 %
16             09                   92.86 %               84.29 %
7 Results
Two different datasets have been used, and for each dataset two different pre-trained models are fine-tuned.
7.1 EmoDB Dataset
7.1.1 HuBERT
This is the result obtained using one test and validation speaker combination. The model is trained for 50 epochs for each validation and test speaker combination. In total, 9 different test and validation speaker combinations are used, and finally their average is taken.
So the final result of HuBERT on the EmoDB dataset is:
• Average Validation Accuracy: 89.67 %
• Average Test Accuracy: 85.47 %
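These averages follow directly from the nine per-combination accuracies tabulated earlier, e.g.:

```python
# Per-fold accuracies for the nine test/validation speaker
# combinations (HuBERT on EmoDB), in table order
val_acc  = [95.83, 86.84, 81.48, 91.18, 83.33, 92.65, 92.86, 90.00, 92.86]
test_acc = [80.95, 95.83, 81.58, 70.37, 94.12, 86.67, 89.71, 85.71, 84.29]

avg_val = sum(val_acc) / len(val_acc)     # ≈ 89.67 %
avg_test = sum(test_acc) / len(test_acc)  # ≈ 85.47 %
```

Averaging over held-out speakers in this leave-one-speaker-out fashion is what makes the evaluation speaker-independent.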
7.1.2 Wav2Vec2
This is the result obtained using one test and validation speaker combination. The model is trained for 30 epochs for each validation and test speaker combination. Here too, a total of 9 different test and validation speaker combinations are used, and finally their average is taken.
The final result of Wav2Vec2 on the EmoDB dataset is:
• Average Validation Accuracy: 85.04 %
• Average Test Accuracy: 82.69 %
7.2 RAVDESS Dataset
RAVDESS is a bigger dataset compared to EmoDB, so training models on it took a much longer time.
7.2.1 HuBERT
This is the result obtained using one test and validation speaker combination. The model is trained for 30 epochs for each validation and test speaker combination.
Figure 11: Validation Speaker: 2, Test Speaker: 1, Best Validation Accuracy = 95%, Test Accuracy = 75%
In total, 3 different test and validation speaker combinations are used, and finally their average is taken.
So the final result of HuBERT on the RAVDESS dataset is:
• Average Validation Accuracy: 91.11 %
• Average Test Accuracy: 78.33 %
7.2.2 Wav2Vec2
This is the result obtained using one test and validation speaker combination. The model is trained for 30 epochs for each validation and test speaker combination.
Figure 14: Validation Speaker: 2, Test Speaker: 1, Best Validation Accuracy = 90%, Test Accuracy = 60%
In total, 2 different test and validation speaker combinations are used, and finally their average is taken.
So the final result of Wav2Vec2 on the RAVDESS dataset is:
• Average Validation Accuracy: 68.35 %
• Average Test Accuracy: 50 %
All the results are available in this Colab notebook: 19EC39028-BTP-Code
References
[1] M. M. H. E. Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognit., vol. 44, pp. 572–587, 2011.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin, “Attention is all you need,” 2017.
[6] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” 2021.
[8] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” 2016.