Submitted by
Rahul Wadhwa
19EC39028
rahulwadhwa157@gmail.com
I, Rahul Wadhwa, declare that this thesis titled ‘Exploring Self Supervised Learning for Speech Emotion Recognition’ and the work presented in it are my own. I confirm the following:
• The entirety of this work was completed while pursuing a Bachelor’s degree at this
university.
• The work has not been submitted to any other Institute for any degree or diploma.
• I have complied with the standards and recommendations outlined in the Institute’s
Ethical Code of Conduct.
• All significant aid sources have been acknowledged.
Indian Institute of Technology, Kharagpur
Certificate
This is to certify that the project report entitled “Exploring Self Supervised Learning for Speech Emotion Recognition”, submitted by Rahul Wadhwa (Roll No. 19EC39028) to the Indian Institute of Technology Kharagpur towards partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Electronics and Electrical Communication Engineering, is a record of bona fide work carried out by him under my supervision and guidance during the Autumn Semester, 2022-23.
Acknowledgements
I would like to thank my guide, Prof. Goutam Saha, for his exceptional guidance and support, without which this project would not have been possible. He always motivated me to explore as much as I could, to study multiple papers and to try as many ideas as possible, and he supported me through every problem I faced during the project.
I am grateful to Premjeet Singh, PhD candidate at IIT Kharagpur, for his unwavering support throughout the project.
In conclusion, I recognize that this project would not have been possible without the
support from the Department of Electronics and Electrical Communication Engineering,
IIT Kharagpur. Many thanks to all those who made this project possible.
Yours Sincerely,
Rahul Wadhwa
19EC39028
Contents
1 Introduction
2 Motivation
3 Problem definition
4 Datasets
4.1 EmoDB
4.2 RAVDESS
5 Transformer Architecture
5.1 HuBERT Pre-trained Model
5.2 Wav2Vec2 Pre-trained Model
7 Results
7.1 EmoDB Dataset
7.1.1 HuBERT
7.1.2 Wav2Vec2
7.2 RAVDESS Dataset
7.2.1 HuBERT
7.2.2 Wav2Vec2
1 Introduction
Speech is the most vital part of day-to-day communication with other people. Speech signals can be modelled as continuous time-domain signals generated by pressure variations. They are characterized by basic building blocks such as words, syllables and phones, and by the arrangement of these units within each other, which is known as the linguistic structure of speech signals.
Speech signals are digitized and converted to a vector representation for analysis and pre-processing, much as words are converted to embeddings in natural language processing (NLP). Sentences and speech are both examples of sequential data, and we typically cannot interpret much without analysing the entire sequence. Sampling and quantization are the fundamental steps in the digitization of any signal: the sampling rate must exceed the Nyquist rate, i.e., twice the highest frequency present in the signal. Several schemes can be used for quantization, including linear quantization, logarithmic quantization, and mu-law quantization.
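As an illustration of these steps, here is a minimal sketch of mu-law quantization; the 8-bit resolution and mu = 255 are assumptions matching the common telephony setting, not values from this project:

```python
import numpy as np

def mu_law_quantize(x, mu=255, bits=8):
    """Compress a signal in [-1, 1] with mu-law companding, then
    uniformly quantize the compressed value to 2**bits levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Map [-1, 1] -> integer codes [0, 2**bits - 1]
    return np.round((compressed + 1) / 2 * (2**bits - 1)).astype(int)

# A 440 Hz tone sampled at 16 kHz (well above twice its highest frequency)
fs = 16000
t = np.arange(0, 0.01, 1 / fs)
codes = mu_law_quantize(np.sin(2 * np.pi * 440 * t))
```

The logarithmic companding allocates more quantization levels to quiet portions of the signal, which suits the amplitude statistics of speech.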
Besides the linguistic content, speech carries other kinds of information such as the language, the speaker's identity, the topic spoken about, and the speaker's emotional state [1]. The process of determining the speaker's emotional state from speech is known as Speech Emotion Recognition (SER). The subjective nature of emotion makes SER a difficult task.
2 Motivation
Emotions play a vital role in communication, and their detection and analysis is quite important. Humans are naturally good at recognising the emotions contained in voice signals: regardless of who is speaking, we concentrate on the speaker's tone and words to determine their emotions. In the field of speech emotion recognition this is known as speaker-independent emotion recognition. For speech emotion recognition (SER), researchers use a collection of rules that process and classify speech signals to detect the emotions present in them. SER systems aim to create efficient, real-time methods of detecting the emotions in communication. SER is a complex task, and before the advent of deep learning, traditional ML systems such as kernel SVMs were used for such complex tasks. The choice of features is a common challenge with this approach: generally speaking, it is unknown which features will result in the most effective separation of the data into distinct classes.
This optimal feature-selection problem is largely solved by deep learning: the idea is to employ an end-to-end network that takes raw data as input and produces a class label as output. Nowadays, pre-trained deep learning models such as transformers and CNNs are used for SER. They are first trained on a large dataset, not in a supervised way but to learn the distribution of the data; they are then fine-tuned in a supervised manner for the downstream task, which in our case is speech emotion recognition. Inspired by this, we aim to investigate different deep learning models pre-trained on speech-related tasks and fine-tune them to analyse the effect on SER performance.
Figure 1: Architecture of SER system. [2]
3 Problem definition
• Utilize a pre-trained transformer-based deep learning framework for SER which takes raw speech utterances as input and provides the most probable emotion class as output, i.e.,
Input Audio → Transformer Architecture → Emotion Class
• Fine-tune the model for speaker independent Emotion Recognition.
• Evaluate SER performance for different test and validation speakers and observe
the emotion-wise accuracy.
• Observe the behaviour of different pre-trained models on the same dataset for SER.
4 Datasets
Databases are an essential part of speech emotion recognition because the classification process relies on labelled data. The quality of the data has an impact on how well the recognition process works. Two datasets have been considered:
• Berlin Emotional Database (EmoDB)
• Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
4.1 EmoDB
• It contains utterances of seven emotions spoken by ten professional actors (five male, five female) in German. Each text is spoken by different actors in different emotions. [3]
• Every utterance is named according to the same scheme:
– Positions 1-2: number of speaker
– Positions 3-5: code for text
– Position 6: emotion
– Position 7: if there are more than two versions these are numbered a, b, c ....
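The naming scheme above can be parsed directly from the filename. A small sketch; the emotion letters follow EmoDB's German codes (e.g. W = Wut/anger, F = Freude/happiness), and the example filename is illustrative:

```python
# EmoDB emotion codes: German initial letters of the seven emotions
EMOTIONS = {"W": "anger", "L": "boredom", "E": "disgust", "A": "fear",
            "F": "happiness", "T": "sadness", "N": "neutral"}

def parse_emodb(filename):
    """Split an EmoDB filename such as '03a01Fa.wav' into its fields."""
    stem = filename.rsplit(".", 1)[0]
    return {
        "speaker": stem[0:2],          # positions 1-2: number of speaker
        "text": stem[2:5],             # positions 3-5: code for text
        "emotion": EMOTIONS[stem[5]],  # position 6: emotion letter
        "version": stem[6:] or None,   # position 7 (optional): a, b, c ...
    }

info = parse_emodb("03a01Fa.wav")  # speaker 03, text a01, happiness, version a
```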
4.2 RAVDESS
• The RAVDESS contains recordings from 24 professional actors (12 female, 12 male). [4]
• Every utterance is named according to the same scheme:
– Modality
– Vocal channel
– Emotion
– Emotional intensity
– Statement
– Repetition
– Actor
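These fields appear as hyphen-separated two-digit codes in the filename; a sketch assuming the standard RAVDESS convention (the emotion-code mapping is RAVDESS's; the example filename is illustrative):

```python
# RAVDESS emotion codes (third field of the filename)
RAVDESS_EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy",
                    "04": "sad", "05": "angry", "06": "fearful",
                    "07": "disgust", "08": "surprised"}

# Field order matches the naming scheme listed above
FIELDS = ["modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor"]

def parse_ravdess(filename):
    """Split a RAVDESS filename such as '03-01-06-01-02-01-12.wav'."""
    parts = filename.rsplit(".", 1)[0].split("-")
    info = dict(zip(FIELDS, parts))
    info["emotion"] = RAVDESS_EMOTIONS[info["emotion"]]
    return info

info = parse_ravdess("03-01-06-01-02-01-12.wav")  # fearful speech, actor 12
```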
5 Transformer Architecture
Transformers are designed to process sequential input data with a major breakthrough
in field of deep learning that all the input data can be processed simultaneously. To know
about Transformers we first need to know about Attention because Attention is all you
need!! [5]
The Attention mechanism is a neural architecture that mimics this process:

$$\mathrm{Attention}(q, k, v) = \sum_i \mathrm{similarity}(q, k_i)\, v_i \tag{1}$$

The input embeddings are augmented with sinusoidal positional encodings:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{2}$$

$$PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{3}$$

where our $d_{model} = 512$. Next, these embeddings are transformed by 3 different weight matrices to obtain the vectors $q$, $k$ and $v$, each of size 64.

$$\mathrm{similarity}(q, k) = \frac{q^T k}{8} \tag{4}$$

where $8 = \sqrt{64}$ is the square root of the query and key vector dimension. Both the Encoder and Decoder parts of the transformer are composed of these self-attention layers along with feed-forward neural network layers and intermediate normalization and regularization layers.
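Equations (1)-(4) can be sketched in a few lines of numpy. This is a toy illustration, not the project's code: it includes the softmax normalization of the similarity scores used in [5], and uses small sequence lengths in place of real speech inputs.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, Eqs. (2)-(3); d_model must be even."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions, Eq. (2)
    pe[:, 1::2] = np.cos(angles)   # odd dimensions, Eq. (3)
    return pe

def attention(Q, K, V):
    """Scaled dot-product attention, Eqs. (1) and (4): similarity
    scores q^T k / sqrt(d_k), softmax-normalized, then a weighted
    sum over the value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # for d_k = 64 the divisor is 8
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 64)) for _ in range(3))
out = attention(Q, K, V)   # one 64-dim output per query position
```

For $d_k = 64$ the divisor $\sqrt{d_k}$ is exactly the 8 of Eq. (4).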
For the Speech Emotion Recognition task, two different pre-trained transformer-based models have been used:
• HuBERT
• Wav2Vec2
5.1 HuBERT Pre-trained Model
Hubert is a Pre trained Speech model based on Transformer architecture. [6] It will take
a float array representing the speech signal’s raw waveform. It is pretrained on 16kHz
sampled speech audio files of Libri Light dataset.In order to use this model for speech
emotion recognition the model should be fine tuned on labeled text data. [7]
The internal architecture of HuBERT comprises a CNN encoder followed by the transformer architecture. The CNN encoder includes 7 convolution layers with the GELU activation function [8]:

$$\mathrm{GELU}(x) = x\,P(X \le x) \approx 0.5x\left(1 + \tanh\!\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right) \tag{5}$$

The transformer encoder part includes 12 self-attention layers with 12 feed-forward layers, also using the GELU activation function.
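The tanh approximation in Eq. (5) can be checked numerically against the exact form $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF; a quick sketch:

```python
import math

def gelu_exact(x):
    """GELU(x) = x * P(X <= x) with X ~ N(0, 1), via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of Eq. (5)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# The two forms agree closely over typical activation ranges
max_err = max(abs(gelu_exact(x / 10) - gelu_tanh(x / 10))
              for x in range(-50, 51))
```

The maximum discrepancy over [-5, 5] stays well below 1e-3, which is why the cheap tanh form is used inside large networks.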
5.2 Wav2Vec2 Pre-trained Model
Wav2Vec2 is a pre-trained model used for feature extraction and classification of audio signals. It is pre-trained on 60 hours of LibriSpeech, sampled at 16 kHz. For SER, the pre-trained weights are fine-tuned with the emotions present in the speech signals as labels. [9]
The internal architecture of Wav2Vec2 is similar to HuBERT's. It also includes a CNN encoder with 7 convolution layers with GELU activation. The transformer encoder part includes 24 self-attention layers with 24 feed-forward layers, also using the GELU activation function.
However, the training process of Wav2Vec2 is quite different from HuBERT's: wav2vec 2.0 learns its targets concurrently with model training, whereas HuBERT builds targets through a separate clustering procedure.
CNN Encoder    strides           5, 2, 2, 2, 2, 2, 2
               kernel widths     10, 3, 3, 3, 3, 2, 2
               channels          512
Transformer    layers            24
               embedding dim.    1024
               layerdrop prob.   0.1
               attention heads   16
               projection dim.   1024
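From the strides and kernel widths above, the hop and receptive field of the CNN encoder can be computed; a small sketch (the 16 kHz input rate follows the pre-training setup described earlier):

```python
import math

strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]

# Total hop between output frames is the product of the per-layer strides
hop = math.prod(strides)          # 320 samples

# Receptive field: input samples seen by one output frame
rf, jump = 1, 1
for k, s in zip(kernels, strides):
    rf += (k - 1) * jump
    jump *= s                     # rf ends at 400 samples

fs = 16000
frame_hop_ms = 1000 * hop / fs    # 20 ms per output frame
receptive_ms = 1000 * rf / fs     # 25 ms of audio per frame
```

So the CNN encoder emits one feature vector every 20 ms, each summarizing 25 ms of raw audio, before the transformer layers process the resulting sequence.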
Test speaker   Validation speaker   Validation accuracy   Test accuracy
09             03                   95.83 %               80.95 %
03             10                   86.84 %               95.83 %
10             11                   81.48 %               81.58 %
11             12                   91.18 %               70.37 %
12             13                   83.33 %               94.12 %
13             14                   92.65 %               86.67 %
14             15                   92.86 %               89.71 %
15             16                   90.00 %               85.71 %
16             09                   92.86 %               84.29 %
7 Results
Two different datasets have been used, and for each dataset two different pre-trained models are fine-tuned.
7.1 EmoDB Dataset
7.1.1 HuBERT
This is the result obtained using one test and validation speaker combination. The model is trained for 50 epochs for each validation and test speaker combination. In total, 9 different test and validation speaker combinations are used, and finally their average is taken.
So the final result of HuBERT on the EmoDB dataset is:
• Average Validation Accuracy: 89.67 %
• Average Test Accuracy: 85.47 %
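These averages follow directly from the nine per-combination accuracies tabulated earlier, e.g.:

```python
# Per-fold accuracies for the nine test/validation speaker
# combinations (HuBERT on EmoDB), in table order
val_acc  = [95.83, 86.84, 81.48, 91.18, 83.33, 92.65, 92.86, 90.00, 92.86]
test_acc = [80.95, 95.83, 81.58, 70.37, 94.12, 86.67, 89.71, 85.71, 84.29]

avg_val = sum(val_acc) / len(val_acc)     # ≈ 89.67 %
avg_test = sum(test_acc) / len(test_acc)  # ≈ 85.47 %
```

Averaging over held-out speakers in this leave-one-speaker-out fashion is what makes the evaluation speaker-independent.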
7.1.2 Wav2Vec2
This is the result obtained using one test and validation speaker combination. The model is trained for 30 epochs for each validation and test speaker combination. Here too, a total of 9 different test and validation speaker combinations are used, and finally their average is taken.
The final result of Wav2Vec2 on the EmoDB dataset is:
• Average Validation Accuracy: 85.04 %
• Average Test Accuracy: 82.69 %
7.2 RAVDESS Dataset
RAVDESS is a bigger dataset compared to EmoDB, so training models on it took a much longer time.
7.2.1 HuBERT
This is the result obtained using one test and validation speaker combination. The model is trained for 30 epochs for each validation and test speaker combination.
Figure 11: Validation Speaker: 2, Test Speaker: 1, Best Validation Accuracy = 95%, Test Accuracy = 75%
In total, 3 different test and validation speaker combinations are used, and finally their average is taken.
So the final result of HuBERT on the RAVDESS dataset is:
• Average Validation Accuracy: 91.11 %
• Average Test Accuracy: 78.33 %
7.2.2 Wav2Vec2
This is the result obtained using one test and validation speaker combination. The model is trained for 30 epochs for each validation and test speaker combination.
Figure 14: Validation Speaker: 2, Test Speaker: 1, Best Validation Accuracy = 90%, Test Accuracy = 60%
In total, 2 different test and validation speaker combinations are used, and finally their average is taken.
So the final result of Wav2Vec2 on the RAVDESS dataset is:
• Average Validation Accuracy: 68.35 %
• Average Test Accuracy: 50 %
All the results are available in this Colab notebook: 19EC39028-BTP-Code
References
[1] M. M. H. E. Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognit., vol. 44, pp. 572–587, 2011.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin, “Attention is all you need,” 2017.
[6] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” 2021.
[8] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” 2016.