
SPEECH EMOTION RECOGNITION

Progress Report submitted by

Chethana H N
2019PEV5184
MTech, VLSI

Under the Supervision of

Dr. Tarun Varma


Associate Professor

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING


MALAVIYA NATIONAL INSTITUTE OF TECHNOLOGY

SPEECH EMOTION RECOGNITION

INTRODUCTION:
Speech is the most natural and significant mode of communication among human beings and also a
potential method for human-computer interaction (HCI). It not only carries explicit linguistic
content but also contains implicit paralinguistic information about the speaker. The key to
effective communication is to make robots or virtual agents understand a speaker's true intentions.
Emotions play an important role in human communication: emotional displays convey considerable
information about the mental state of an individual and help us understand the feelings of others
and give appropriate feedback. Research has revealed that emotion plays a powerful role in shaping
human social interaction. Vocal emotion information, as a kind of nonlinguistic information, can
therefore significantly help robots or virtual agents understand a speaker's true intentions. This
has opened up a research field called speech emotion recognition, whose basic goal is to understand
and recognize the emotions expressed in speech.

Speech Emotion Recognition (SER) is the task of recognizing the emotional aspects of speech
irrespective of its semantic content. Speech is a natural and fast channel of communication between
humans and computers and plays an important role in real-time applications of human-machine
interaction. Successfully detecting emotional states helps improve the efficiency of human-computer
interaction. For instance, in call centers, tracking customers' emotional states can be used for
quality measurement, and calls from angry customers can be routed to experienced agents.

During the last two decades, enormous effort has been devoted to developing methods for
automatically identifying human emotions from speech signals, a task known as speech emotion
recognition. At present, SER is an attractive research topic in signal processing, pattern
recognition, and artificial intelligence, owing to its importance in human-machine interaction. It
is an active area of digital signal processing research that aims to recognize the emotional state
of a speaker from the speech signal, which carries more information than the spoken words alone.
Nevertheless, effective SER remains a very challenging problem, partly due to cultural differences,
varied expression styles, context, ambient noise, and so on.

Many researchers are working in this domain to make machines intelligent enough to identify the
emotional condition of a speaker from their speech. In SER, the selection and extraction of salient,
discriminative features is a challenging task [1]. Recently, researchers have been trying to find
robust and salient features for SER using artificial intelligence and deep learning approaches [2]
to extract hidden information, and CNN-based features to train different CNN models [3,4], in order
to increase the performance and decrease the computational complexity of SER for human behavior
assessment.

RELATED WORK:

1. Y. Zhang et al. [5] presented a novel attention-based fully convolutional network for speech
   emotion recognition that can handle variable-length speech without segmentation, so that critical
   information is not lost. The attention mechanism makes the model aware of which time-frequency
   regions of the speech spectrogram are more emotion-relevant. Given the limited data, transfer
   learning is also adopted to improve accuracy. Validated on the publicly available IEMOCAP corpus,
   the model produced a weighted accuracy of 70.4% and an unweighted accuracy of 63.9%. An
   illustrative attention-pooling sketch is given after this list.

2. In [6], Mousmita Sarma et al. investigated a number of Deep Neural Network (DNN) architectures
   for emotion identification on the IEMOCAP database, comparing different feature-extraction
   front-ends: high-dimensional MFCC input versus frequency-domain and time-domain approaches that
   learn filters as part of the network. The best results are obtained with the time-domain
   filter-learning approach, which uses a separate label per frame and a TDNN-LSTM architecture with
   time-restricted self-attention, achieving a weighted accuracy of 70.6%. A simplified sketch of the
   time-domain filter-learning idea appears after this list.

3. In [7], Mustaqeem and Soonil Kwon proposed an artificial-intelligence-assisted deep stride
   convolutional neural network (DSCNN) architecture that uses the plain-nets strategy to learn
   salient and discriminative features from spectrograms of speech signals that are enhanced in
   prior steps. Local hidden patterns are learned in convolutional layers that use special strides,
   rather than pooling layers, to down-sample the feature maps, and global discriminative features
   are learned in fully connected layers. A SoftMax classifier is used for the classification of
   emotions in speech. The proposed technique is evaluated on the IEMOCAP and RAVDESS datasets,
   improving accuracy by 7.85% and 4.5%, respectively. A short sketch of the strides-instead-of-
   pooling idea appears after this list.

4. In [8], S. Zhang et al. utilized a DCNN to bridge the affective gap in speech signals. Three
   channels of log Mel-spectrograms (static, delta and delta-delta) are extracted as the DCNN input
   and modeled with the AlexNet DCNN to learn high-level feature representations on each segment
   divided from an utterance. The learned segment-level features are aggregated by a Discriminant
   Temporal Pyramid Matching (DTPM) strategy, which combines temporal pyramid matching and optimal
   Lp-norm pooling to form a global utterance-level feature representation, followed by linear
   Support Vector Machines (SVM) for emotion classification. Experimental results on four public
   datasets, i.e., EMO-DB, RML, eNTERFACE05 and BAUM-1s, show accuracies of 87.31%, 75.34%, 79.25%,
   and 44.61%, respectively. A short example of building this three-channel input appears after this
   list.

5. In [9], Z. Zhao et al. developed a model leveraging a parallel combination of attention-based
   bidirectional long short-term memory recurrent neural networks and attention-based fully
   convolutional networks (FCN). Experiments were undertaken on the interactive emotional dyadic
   motion capture (IEMOCAP) and FAU Aibo emotion corpus (FAU-AEC) datasets to highlight the
   effectiveness of the approach. The experiments achieved a WA of 68.1% and a UA of 67.0% on
   IEMOCAP, and a UA of 45.4% on the FAU-AEC dataset. A minimal sketch of such a parallel
   combination appears after this list.

6. In [10], Z. Peng et al. exploited auditory and attention mechanisms by first investigating
   temporal modulation cues from auditory front-ends and then proposing a joint deep learning model
   that combines 3D convolutions and attention-based sliding recurrent neural networks (ASRNNs) for
   emotion recognition. Experiments produced an accuracy of 62.6% on IEMOCAP and 55.7% on
   MSP-IMPROV.

7. In [11], W. Zheng et al. proposed a speech emotion recognition method based on a least square
   regression (LSR) model, in which an incomplete sparse LSR (ISLSR) model is introduced to
   characterize the linear relationship between speech features and the corresponding emotion
   labels. Both labeled and unlabeled speech data are used in training the ISLSR model, where the
   unlabeled data are intended to enhance the compatibility of the model so that it generalizes well
   to out-of-sample speech data. The ISLSR method achieves an average recall of 60.50% and an
   average precision of 60.25%. A plain least-squares baseline illustrating this linear mapping is
   sketched after this list.

8. In [12], Suman Deb et al. explored the analysis and classification of speech under stress using a
   new feature, the harmonic peak to energy ratio (HPER), computed from the Fourier spectrum of the
   speech signal. The harmonic amplitudes are closely related to the breathiness level of speech,
   which differs across stress conditions. A Support Vector Machine (SVM) classifier with a binary
   cascade strategy is used to evaluate the performance of the HPER feature on a simulated stressed
   speech database (SSD). The performance of the HPER feature is compared with mel frequency
   cepstral coefficients (MFCC), linear prediction coefficients (LPC) and the Teager-Energy-Operator
   (TEO) based Critical Band TEO Autocorrelation Envelope (TEO-CB-Auto-Env) features. When the model
   is trained on session 1 data and tested on session 2 data, the average accuracy obtained is
   77.92%, and 79.19% vice versa. Overall, the proposed HPER feature improves speech-under-stress
   classification by 3.93% with respect to the MFCC feature, and the performance is further enhanced
   by 4.02% when the HPER and MFCC features are combined.

9. In [13], Kunxia Wang et al. proposed a Fourier parameter model using the perceptual content of
voice quality and the first- and second-order differences for speaker-independent speech emotion
recognition. They improve the recognition rates over the methods using Mel frequency cepstral
coefficient (MFCC) features by 16.2, 6.8 and 16.6 points on the German database (EMODB),
Chinese language database (CASIA) and Chinese elderly emotion database (EESDB). In
particular, when combining FP with MFCC, the recognition rates can be further improved on the
aforementioned databases by 17.5, 10 and 10.5 points, respectively.

10. In [14], Suman Deb et al. proposed a region-switching-based classification method for speech
    emotion classification using vowel-like regions (VLRs) and non-vowel-like regions (non-VLRs).
    The work analyzes the emotion information contained independently in segmented VLRs and
    non-VLRs. The region-switching method is implemented by choosing the features of either the
    VLRs or the non-VLRs for each emotion. The VLRs are detected by identifying hypothesized VLR
    onset and end points, and the non-VLRs are segmented using knowledge of the VLRs and the active
    speech regions. The performance is evaluated on the EMODB, IEMOCAP and FAU AIBO databases,
    achieving an 85.1% average recognition rate using the 39-dimensional MFCC feature; with six
    emotions, the method shows an average recognition rate of 64.2% using the same 39 features.

11. In [15], Lili Guo et al. proposed a dynamic fusion framework to exploit the potential
    complementarity of spectrogram-based statistical features and auditory-based empirical features.
    A kernel extreme learning machine (KELM) is adopted as the classifier to distinguish emotions.
    To validate the framework, experiments were conducted on two public emotional databases, Emo-DB
    and IEMOCAP. The results show that by integrating the auditory-based features with the
    spectrogram-based features, the proposed method achieves a notable improvement over conventional
    methods. A small closed-form KELM sketch is given after this list.

12. Yongjin Wang et al. [16], explored a systematic approach for recognition of human emotional
state from audiovisual signals. The audio characteristics of emotional speech are represented by
the extracted prosodic, Mel-frequency Cepstral Coefficient (MFCC), and formant frequency
features. A face detection scheme based on HSV color model is used to detect the face from the
background. The visual information is represented by Gabor wavelet features. Feature selection
is done by using a stepwise method based on Mahalanobis distance. The selected audiovisual
features are used to classify the data into their corresponding emotions. Based on a comparative
study of different classification algorithms and specific characteristics of individual emotion, a
novel multi-classifier scheme is proposed to boost the recognition performance. The feasibility
of the proposed system is tested over a database that incorporates human subjects from different
languages and cultural backgrounds. The multi-classifier scheme achieves the best overall
recognition rate of 82.14%.

13. Suman Deb et al. [17] presented a Fourier model for the analysis of out-of-breath speech using
    mutual information (MI) on the difference and ratio values of the Fourier parameters, amplitude
    and frequency. The differences and ratios are calculated between two contiguous values of the
    Fourier parameters. To analyze out-of-breath speech, a new stressed speech database, the
    out-of-breath speech (OBS) database, is created; it contains three classes of speech:
    out-of-breath speech, low out-of-breath speech and normal speech. The effectiveness of the
    proposed features is evaluated with statistical analysis. The proposed features not only
    differentiate normal speech from out-of-breath speech but can also discriminate different
    breath-emission levels of speech. A hidden Markov model (HMM) and a support vector machine (SVM)
    are used to evaluate the performance of the proposed features on the OBS database; for the
    multi-class classification problem, the SVM classifier is used with a binary cascade approach.
    The performance of the proposed features is compared with the breathiness feature, the mel
    frequency cepstral coefficient (MFCC) feature and the Teager energy operator (TEO) based
    critical band TEO autocorrelation envelope (TEO-CB-Auto-Env) feature. The combined features show
    an average recognition rate of 82.83%.
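
For the attention-based fully convolutional network of [5] (item 1), the following minimal PyTorch
sketch shows attention-weighted pooling over convolutional time-frequency features. The layer sizes,
the four-emotion output and the class name AttentionPoolSER are assumptions for illustration, not the
architecture of [5]:

import torch
import torch.nn as nn

class AttentionPoolSER(nn.Module):
    # Attention-weighted pooling over CNN features of a log-Mel spectrogram.
    # Input shape: (batch, 1, n_mels, time); output: emotion logits.
    def __init__(self, n_mels=64, n_emotions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.attn = nn.Linear(64 * n_mels, 1)     # one attention score per time frame
        self.cls = nn.Linear(64 * n_mels, n_emotions)

    def forward(self, spec):
        h = self.conv(spec)                               # (B, 64, n_mels, T)
        B, C, M, T = h.shape
        h = h.permute(0, 3, 1, 2).reshape(B, T, C * M)    # one feature vector per frame
        w = torch.softmax(self.attn(h), dim=1)            # attention weights over time
        pooled = (w * h).sum(dim=1)                       # weighted sum over time
        return self.cls(pooled)

model = AttentionPoolSER()
print(model(torch.randn(1, 1, 64, 231)).shape)            # torch.Size([1, 4])

Because the attention weights are normalized over however many frames an utterance has, the same
model handles variable-length spectrograms without segmentation.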
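
For the time-domain filter-learning front-end explored in [6] (item 2), this simplified PyTorch
sketch uses a 1-D convolution as a learned filterbank on the raw waveform, followed by an LSTM. The
filter length, hop size, hidden size and class name WaveformFrontendLSTM are assumed values, and this
is far simpler than the TDNN-LSTM with time-restricted self-attention of [6]:

import torch
import torch.nn as nn

class WaveformFrontendLSTM(nn.Module):
    # A 1-D conv bank acts as a learned filterbank on the raw waveform;
    # an LSTM then models the frame sequence before classification.
    def __init__(self, n_filters=40, n_emotions=4):
        super().__init__()
        # roughly 25 ms filters with a 10 ms hop at 16 kHz (400/160 samples), assumed values
        self.filters = nn.Conv1d(1, n_filters, kernel_size=400, stride=160)
        self.lstm = nn.LSTM(n_filters, 128, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(256, n_emotions)

    def forward(self, wav):                          # wav: (B, 1, samples)
        feats = torch.relu(self.filters(wav))        # (B, n_filters, frames)
        out, _ = self.lstm(feats.transpose(1, 2))    # (B, frames, 256)
        return self.cls(out.mean(dim=1))             # average over frames

model = WaveformFrontendLSTM()
print(model(torch.randn(2, 1, 16000)).shape)         # torch.Size([2, 4])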
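
For the strides-instead-of-pooling idea of the DSCNN in [7] (item 3), a short PyTorch sketch is given
below; the filter counts and kernel sizes are placeholders, not the exact configuration of [7]:

import torch
import torch.nn as nn

class StridedConvSER(nn.Module):
    # Every second convolution uses stride 2 to halve the feature maps,
    # so no pooling layer is needed for down-sampling.
    def __init__(self, n_emotions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),   # stride-2 down-sampling
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),   # stride-2 down-sampling
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # global aggregation before the FC layer
            nn.Linear(64, n_emotions),               # SoftMax is applied inside the loss
        )

    def forward(self, spec):                         # spec: (B, 1, freq, time)
        return self.head(self.features(spec))

model = StridedConvSER()
print(model(torch.randn(2, 1, 128, 300)).shape)      # torch.Size([2, 4])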
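
The three-channel log-Mel input (static, delta, delta-delta) used in [8] (item 4) can be produced
directly with librosa; the file name, sampling rate and number of Mel bands below are illustrative:

import librosa
import numpy as np

# Build a 3-channel log-Mel input: static, first-order and second-order differences.
# "speech.wav", 16 kHz and 64 Mel bands are placeholder choices.
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)                 # static channel
delta = librosa.feature.delta(log_mel)             # delta channel
delta2 = librosa.feature.delta(log_mel, order=2)   # delta-delta channel

three_channel = np.stack([log_mel, delta, delta2])     # shape: (3, 64, frames)
print(three_channel.shape)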
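
For the parallel recurrent/convolutional combination of [9] (item 5), the following minimal sketch
concatenates a BiLSTM branch with attention pooling and a small convolutional branch before the
classifier; the sizes and the class name ParallelRNNCNN are illustrative assumptions rather than the
configuration of [9]:

import torch
import torch.nn as nn

class ParallelRNNCNN(nn.Module):
    # Two parallel branches over a log-Mel spectrogram: a BiLSTM with attention
    # pooling and a small CNN; their outputs are concatenated for classification.
    def __init__(self, n_mels=64, n_emotions=4):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, 128, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(256, 1)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # (B, 32)
        )
        self.cls = nn.Linear(256 + 32, n_emotions)

    def forward(self, spec):                     # spec: (B, n_mels, T)
        h, _ = self.lstm(spec.transpose(1, 2))   # (B, T, 256)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time
        rnn_vec = (w * h).sum(dim=1)             # (B, 256)
        cnn_vec = self.cnn(spec.unsqueeze(1))    # (B, 32)
        return self.cls(torch.cat([rnn_vec, cnn_vec], dim=1))

model = ParallelRNNCNN()
print(model(torch.randn(2, 64, 300)).shape)      # torch.Size([2, 4])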
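
The linear feature-to-label mapping that the ISLSR model of [11] (item 7) builds on can be
illustrated with a plain ridge-regularized least-squares baseline on synthetic data; the
regularization weight is an assumed value, and this is not the incomplete sparse formulation of [11]:

import numpy as np

# Ridge-regularized least squares from features X to one-hot labels Y,
# illustrating the linear mapping that ISLSR extends. Data are synthetic.
rng = np.random.default_rng(0)
n, d, k = 200, 40, 4                       # samples, feature dimension, emotion classes
X = rng.normal(size=(n, d))
labels = rng.integers(0, k, size=n)
Y = np.eye(k)[labels]                      # one-hot targets

lam = 1.0                                  # regularization weight (assumed)
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # closed-form solution

pred = np.argmax(X @ W, axis=1)            # predicted emotion = largest regression score
print("training accuracy:", (pred == labels).mean())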
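
For the kernel extreme learning machine used as the classifier in [15] (item 11), a NumPy sketch of
the usual closed form, beta = (K + I/C)^(-1) T, is shown below; the RBF kernel, regularization
constant and synthetic data are assumptions standing in for the fused features of [15]:

import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    # RBF kernel matrix between the row vectors of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Kernel ELM: beta = (K + I/C)^(-1) T, prediction scores = k(x, X_train) @ beta.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(150, 30))
y_train = rng.integers(0, 4, size=150)
T = np.eye(4)[y_train]                         # one-hot targets

C = 10.0                                       # regularization constant (assumed)
K = rbf_kernel(X_train, X_train)
beta = np.linalg.solve(K + np.eye(len(K)) / C, T)

X_test = rng.normal(size=(20, 30))
scores = rbf_kernel(X_test, X_train) @ beta
print(scores.argmax(axis=1))                   # predicted emotion indices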

IMPLEMENTATION:
1. Speech Emotion Recognition using Librosa
Librosa: Librosa is a Python package for music and audio analysis. It provides the building blocks
necessary to create music information retrieval systems.
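A minimal example of the kind of building block librosa provides (the file path is a placeholder):

import librosa

# Load an audio file at its native sampling rate and compute 40 MFCCs.
y, sr = librosa.load("speech.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print(mfcc.shape)   # (40, number_of_frames)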

Jupyter Notebook: The Jupyter Notebook is an open-source web application that allows you to
create and share documents that contain live code, equations, visualizations and narrative text. Uses
include: data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.
Objective: To build a model to recognize emotion from speech using the librosa and sklearn libraries
and the RAVDESS dataset.
About Implementation: In this implementation we will use the libraries librosa, soundfile, and
sklearn (among others) to build a model using an MLPClassifier. This will be able to recognize
emotion from sound files. We will load the data, extract features from it, then split the dataset into
training and testing sets. Then, we’ll initialize an MLPClassifier and train the model. Finally, we’ll
calculate the accuracy of our model.

The Dataset: the DataFlair tutorial this implementation follows uses the RAVDESS (Ryerson
Audio-Visual Database of Emotional Speech and Song) dataset; the run reported below was carried out
on the Berlin emotional speech database (EMO-DB), whose file names carry single-letter German emotion
codes, as reflected in the emotion dictionary and data path in the code.

Code:
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# DataFlair - Extract features (mfcc, chroma, mel) from a sound file
def extract_feature(file_name, mfcc, chroma, mel):
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        if chroma:
            stft = np.abs(librosa.stft(X))
        result = np.array([])
        if mfcc:
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma))
        if mel:
            mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel))
        return result

# DataFlair - Emotion codes in the Berlin EMO-DB dataset (single-letter German codes)
emotions = {
    'W': 'anger',
    'A': 'anxiety',
    'L': 'boredom',
    'E': 'disgust',
    'F': 'happy',
    'N': 'neutral',
    'T': 'sad'
}
# DataFlair - Emotions to observe
# observed_emotions = ['anger', 'disgust', 'happy', 'sad']

# DataFlair - Load the data and extract features for each sound file
def load_data(test_size=0.2):
    x, y = [], []
    for file in glob.glob("C:\\all emotions Berlin\\emotion_*\\*.wav"):
        file_name = os.path.basename(file)
        emotion = emotions[file_name.split("-")[1]]
        # if emotion not in observed_emotions:
        #     continue
        feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

# DataFlair - Split the dataset
x_train, x_test, y_train, y_test = load_data(test_size=0.25)

# DataFlair - Get the shape of the training and testing datasets
print((x_train.shape[0], x_test.shape[0]))

# DataFlair - Get the number of features extracted
print(f'Features extracted: {x_train.shape[1]}')

# DataFlair - Initialize the Multi Layer Perceptron Classifier
model = MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08,
                      hidden_layer_sizes=(300,), learning_rate='adaptive',
                      max_iter=500)

# DataFlair - Train the model
model.fit(x_train, y_train)

# DataFlair - Predict for the test set
print(x_test)
y_pred = model.predict(x_test)

# DataFlair - Calculate the accuracy of our model
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)

# DataFlair - Print the accuracy
print("Accuracy: {:.2f}%".format(accuracy * 100))

Output:
(401, 134)
Features extracted: 180
[[-1.96837631e+02 3.91609154e+01 -3.18292675e+01 ... 1.18120888e-03 7.52544263e-04
4.97290748e-04]
[-1.92062027e+02 1.13047318e+02 -7.00292170e-01 ... 1.92426227e-03 1.28824555e-03
6.82404614e-04]
[-1.38566086e+02 4.94683342e+01 -1.15768385e+01 ... 3.86706144e-02 1.67271513e-02
1.16589162e-02]
...
[-1.53069397e+02 6.04137650e+01 2.30412884e+01 ... 1.46246880e-01 7.27017894e-02
3.93074863e-02]
[-2.01625946e+02 5.50481339e+01 5.33991623e+00 ... 1.16678432e-01 1.34125724e-01
7.25167170e-02]
[-1.68781616e+02 9.58256607e+01 8.23898411e+00 ... 3.59166856e-03 1.53693173e-03
8.90273659e-04]]
Accuracy: 73.13%
Conclusion: We learned to recognize emotions from speech. We used an MLPClassifier for this, the
soundfile library to read the sound files, and the librosa library to extract features from them.
The model delivered an accuracy of 73.13%, which is a reasonable baseline for our SER work.
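
The script imports pickle but never uses it; a natural next step is to persist the trained classifier
so it can be reused without retraining. The snippet below continues from the script above (the file
name ser_mlp.pkl is arbitrary):

import pickle

# Save the trained MLPClassifier to disk.
with open("ser_mlp.pkl", "wb") as f:
    pickle.dump(model, f)

# Reload it later and predict on the same held-out features.
with open("ser_mlp.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(x_test[:5]))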

2. Implementation of Training Pipeline (Model, Loss and Optimizer) using PyTorch
Model: A model represents what was learned by a machine learning algorithm. The model is the
“thing” that is saved after running a machine learning algorithm on training data and represents the
rules, numbers, and any other algorithm-specific data structures required to make predictions.
Loss: Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's
prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise,
the loss is greater.
Optimizer: Optimizers are algorithms or methods used to change the attributes of your neural
network such as weights and learning rate in order to reduce the losses. Optimizers help to get results
faster.

Code:
# design model (input, output size, forward pass)
# construct loss & optimizer
# training loop:
#   - forward pass: compute prediction
#   - backward pass: gradients
#   - update our weights
import torch
import torch.nn as nn

# f = w*x
# f = 2*x
x = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32)
y = torch.tensor([[2], [4], [6], [8]], dtype=torch.float32)
x_test = torch.tensor([5], dtype=torch.float32)

n_samples, n_features = x.shape
print(n_samples, n_features)

input_size = n_features
output_size = n_features

# model = nn.Linear(input_size, output_size)
class LinearRegression(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LinearRegression, self).__init__()
        # define layer
        self.lin = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.lin(x)

model = LinearRegression(input_size, output_size)
print(f'prediction before training: f(5) = {model(x_test).item():.3f}')

# training
learning_rate = 0.01
n_iters = 100
loss = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(n_iters):
    # prediction = forward pass
    y_pred = model(x)
    # loss
    l = loss(y, y_pred)
    # gradients
    l.backward()  # dl/dw
    # update weights
    optimizer.step()
    # zero gradients
    optimizer.zero_grad()
    if epoch % 10 == 0:
        [w, b] = model.parameters()
        print(f'epoch {epoch + 1}: w = {w[0][0].item()}, loss = {l:.8f}')

print(f'prediction after training: f(5) = {model(x_test).item():.3f}')
Output:
epoch 1: w =-0.07654359936714172, loss = 42.73521423
epoch 11: w =1.427065134048462, loss = 1.22151613
epoch 21: w =1.6759765148162842, loss = 0.14071432
epoch 31: w =1.7228628396987915, loss = 0.10640065
epoch 41: w =1.737051248550415, loss = 0.09953168
epoch 51: w =1.7457839250564575, loss = 0.09372092
epoch 61: w =1.753448486328125, loss = 0.08826540
epoch 71: w =1.760756254196167, loss = 0.08312786
epoch 81: w =1.7678272724151611, loss = 0.07828946
epoch 91: w =1.7746859788894653, loss = 0.07373256
prediction after training: f(5) = 9.548

Conclusion: We have carried out further implementations in PyTorch, such as gradient descent and the
activation functions (ReLU, Leaky ReLU, Sigmoid, Tanh, Softmax), among others. We have learned the
basics of deep learning that we will use in our SER project and now understand linear regression,
logistic regression, gradient descent, and activation functions.
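
As a quick illustration of the activation functions mentioned above, the short snippet below applies
each of them to the same input values:

import torch
import torch.nn.functional as F

# Apply each activation function to the same values for comparison.
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print("ReLU:      ", F.relu(x))
print("Leaky ReLU:", F.leaky_relu(x, negative_slope=0.01))
print("Sigmoid:   ", torch.sigmoid(x))
print("Tanh:      ", torch.tanh(x))
print("Softmax:   ", F.softmax(x, dim=0))   # outputs sum to 1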

REFERENCES:

1. Wei, B.; Hu, W.; Yang, M.; Chou, C.T. From real to complex: Enhancing radio-based activity
recognition using complex-valued CSI. ACM Trans. Sens. Netw. (TOSN) 2019, 15, 35.

2. Zhao, W.; Ye, J.; Yang, M.; Lei, Z.; Zhang, S.; Zhao, Z. Investigating capsule networks with
dynamic routing for text classification. arXiv 2018, arXiv:1804.00538.

3. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic routing between capsules. In Proceedings of the
Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December
2017; pp. 3856–3866.

4. Bae, J.; Kim, D.-S. End-to-End Speech Command Recognition with Capsule Network. In
Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 776–780.

5. Y. Zhang, J. Du, Z. Wang, J. Zhang and Y. Tu, "Attention Based Fully Convolutional Network
for Speech Emotion Recognition," 2018 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 2018, pp.
1771-1775, doi: 10.23919/APSIPA.2018.8659587.

6. Sarma, Mousmita & Ghahremani, Pegah & Povey, Daniel & Goel, Nagendra & Sarma, Kandarpa
& Dehak, Najim. (2018). Emotion Identification from Raw Speech Signals Using DNNs. 3097-
3101. 10.21437/Interspeech.2018-1353.

7. Mustaqeem & Kwon, Soonil. (2019). A CNN-Assisted Enhanced Audio Signal Processing for
Speech Emotion Recognition. Sensors. 20. 183. 10.3390/s20010183.

8. S. Zhang, S. Zhang, T. Huang and W. Gao, "Speech Emotion Recognition Using Deep
Convolutional Neural Network and Discriminant Temporal Pyramid Matching," in IEEE
Transactions on Multimedia, vol. 20, no. 6, pp. 1576-1590, June 2018, doi:
10.1109/TMM.2017.2766843.

9. Z. Zhao et al., "Exploring Deep Spectrum Representations via Attention-Based Recurrent and
Convolutional Neural Networks for Speech Emotion Recognition," in IEEE Access, vol. 7, pp.
97515-97525, 2019, doi: 10.1109/ACCESS.2019.2928625.

10. Z. Peng, X. Li, Z. Zhu, M. Unoki, J. Dang and M. Akagi, "Speech Emotion Recognition Using
3D Convolutions and Attention-Based Sliding Recurrent Networks with Auditory Front-Ends,"
in IEEE Access, vol. 8, pp. 16560-16572, 2020, doi: 10.1109/ACCESS.2020.2967791.

11. W. Zheng, M. Xin, X. Wang and B. Wang, "A Novel Speech Emotion Recognition Method via
Incomplete Sparse Least Square Regression," in IEEE Signal Processing Letters, vol. 21, no. 5,
pp. 569-572, May 2014, doi: 10.1109/LSP.2014.2308954.

12. Deb, Suman & Dandapat, S. (2016). Classification of speech under stress using harmonic peak
to energy ratio. Computers & Electrical Engineering. 55. 12-23.
10.1016/j.compeleceng.2016.09.027.

13. K. Wang, N. An, B. N. Li, Y. Zhang and L. Li, "Speech Emotion Recognition Using Fourier
Parameters," in IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 69-75, 1 Jan.-
March 2015, doi: 10.1109/TAFFC.2015.2392101.

14. S. Deb and S. Dandapat, "Emotion Classification Using Segmentation of Vowel-Like and Non-
Vowel-Like Regions," in IEEE Transactions on Affective Computing, vol. 10, no. 3, pp. 360-
373, 1 July-Sept. 2019, doi: 10.1109/TAFFC.2017.2730187.

15. L. Guo, L. Wang, J. Dang, Z. Liu and H. Guan, "Exploration of Complementary Features for
Speech Emotion Recognition Based on Kernel Extreme Learning Machine," in IEEE Access,
vol. 7, pp. 75798-75809, 2019, doi: 10.1109/ACCESS.2019.2921390.

16. Y. Wang and L. Guan, "Recognizing Human Emotional State From Audiovisual Signals,"
in IEEE Transactions on Multimedia, vol. 10, no. 4, pp. 659-668, June 2008, doi:
10.1109/TMM.2008.921734.

17. Deb, Suman & Dandapat, S. (2017). Fourier Model based Features for Analysis and
Classification of Out-of-Breath Speech. Speech Communication. 90.
10.1016/j.specom.2017.04.002.

18. M. E. Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features,
classification schemes, and databases,” Patt. Recognit., vol. 44, pp. 572–587, 2011.

19. D.-N. Jiang and L.-H. Cai, “Speech emotion recognition with the combination of statistical
features and temporal features,” in Proc. IEEE Int. Conf. Multimedia and Expo, 2004, pp. 1967–
1970.

20. Y. Li and Y. Zhao, “Recognizing emotions in speech using short-term and long-term features,”
in Proc. Fifth Int. Conf. Spoken Language Processing, 1998, pp. 2255–2258.

21. Bachu, R. G.; Kopparthi, S.; Adapa, B.; Barkana, B. D. Separation of Voiced and Unvoiced Using
Zero Crossing Rate and Energy of the Speech Signal.
