
Speech Emotion

Recognition (SER)
By Arash Mari Oriyad

An Internship Project

At Institute of AI, Isfahan University of Technology, Summer 2019


Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods
● Model Designing
● Results And Conclusion
● Future Work
● References
2
What Is Speech Emotion Recognition?
● SER is the process of taking an audio file as input and recognizing the
class of the dominant emotion.
● Input: An Audio File
● Model: Classifier
● Output: Recognized Emotion Class

4
Typical SER System

5
Emotion Classes/Categories
● A field interrelated with psychology and medicine
● Classic Paradigm: more than 50 distinct emotion classes
● Palette Theory: any emotion can be derived as a mixture of 5 or 6 basic
emotion classes

6
Basic Emotion Classes

Happiness

Surprise Anger

Fear Sadness

Neutral

7
Other Emotion Categories

● High activation vs low activation (Energy)


● High pitch vs low pitch (Frequency)
● ...

8
SER Applications

● Human-Machine Interaction (HMI)


● Voice Assistants
● Medical Applications
● Self-Driving Vehicles
● E-Learning
● ...

9
SER Referenced Problems

● Multi-lingual SER
● SER as a culture-based task
● Data gathering in laboratory environments (acted, artificial emotions)
● Imbalanced SER datasets (emotion classes, gender, …)
● ...

10
SER Subtasks

● Stress Classification
● Text Processing
● Speaker Recognition

11
Growth of IEEE SER Published Papers
from 2000 until 2017

12
Overview

● Problem Definition

● Data Gathering
● Feature Selection Methods
● Model Designing
● Results And Conclusion
● Future Work
● References
13
Appropriate SER Dataset Characteristics

● Being balanced in emotion classes and genders


● Representing natural emotions
● Culture independence
● No ambiguity in text
● Low noise (avoiding noisy environment)

14
Popular SER Dataset

RAVDESS
● English
● 24 professional actors (12 male, 12 female)
● 7 emotion classes
● Sampling at 48 kHz and 96 kHz
● 4-6 s per utterance
● 60 audio instances per actor × 24 actors = 1440 utterances

15
Popular SER Dataset Cont.

EMO-DB
● German
● 10 professional actors (5 male, 5 female)
● 7 emotion classes
● Sampled at 16 kHz
● 535 utterances

16
Popular SER Dataset Cont.

SUSAS
● English
● 32 semi-professional actors (19 male, 13 female)
● More than 10 emotion classes
● 16000 utterances

17
Popular SER Dataset Cont.

Database produced by Cowie and Douglas-Cowie (1996)


● English
● 40 professional actors (20 male, 20 female)
● 5 emotion classes
● 5-10s
● 5000 utterances

18
Popular SER Dataset Cont.

ShEMO
● Persian
● 87 native Persian speakers (from radio)
● 5 emotion classes
● 3.5h
● 3000 semi-natural utterances

19
Appropriate Recording Devices

20
Appropriate Recording Devices Cont.

21
Appropriate Recording Devices Cont.

22
Overview

● Problem Definition
● Data Gathering

● Feature Selection Methods


● Model Designing
● Results And Conclusion
● Future Work
● References
23
Audio Signal Preprocessing
● Denoising methods:
○ Wavelet Transform
○ NN Approaches
● Removing the silent parts of samples

24
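The silence-removal step above can be sketched with a simple frame-energy threshold (the frame length and threshold here are illustrative values, not from the slides):

```python
import numpy as np

def remove_silence(signal, frame_len=400, threshold=0.01):
    """Drop frames whose mean energy falls below a threshold (a sketch)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)      # mean energy per frame
    voiced = frames[energy >= threshold]       # keep only energetic frames
    return voiced.reshape(-1)
```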
Categories of Speech Features

25
Feature Extraction Phases

01 Pre-Emphasis
02 Frame Blocking & Windowing
03 Feature Extraction
26
Pre-Emphasis Phase
● Using high-pass filters
● Emphasizing the high-frequency band by increasing its amplitude and
decreasing the amplitude of lower frequencies
● Higher frequencies hold more important information to extract, while
lower frequencies may be mingled with noise.

27
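Pre-emphasis is typically the first-order high-pass filter y[n] = x[n] − α·x[n−1]; a minimal sketch (α = 0.97 is a common choice, not a value from the slides):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; boosts high frequencies
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```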
Windowing Phase
● Decomposing speech signals into short sequences called frames
● Each frame is an overlapping window with a length of 20 ms to 40 ms
● Windowing Methods:
○ Triangular Windowing
○ Rectangular Windowing
○ Hamming Windowing

28
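Frame blocking with overlapping Hamming windows can be sketched as follows (the 16 kHz sample rate, 25 ms frames, and 10 ms hop are common choices assumed here, not values from the slides):

```python
import numpy as np

def frame_signal(signal, sr=16000, frame_ms=25, hop_ms=10):
    frame_len = int(sr * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sr * hop_ms / 1000)       # e.g. 160 samples -> frames overlap
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Index matrix: row i selects samples of frame i
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx]
    return frames * np.hamming(frame_len)   # taper the edges of each frame
```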
Feature Extraction Phase
● Linear Predictive Coding Coefficients (LPCCs)
● Mel-Frequency Cepstral Coefficients (MFCCs)
● Teager Energy Operator (TEO)

29
LPCC Method
● A digital method for encoding an analog signal
● Predicts the next value of a signal as a linear combination of the values
it has received in the past
● First, the analog speech signal is sampled into n digital points

30
LPCC Method Cont.
Consider a frame called x:

● x[n]: value of frame x at point n


● p: order of LPCC
● e[n]: prediction error at point n

31
LPCC Method Cont.
So we can define an optimization problem on the total squared error E:

min over a1, …, ap of  E = Σn e[n]² = Σn ( x[n] − Σk=1..p ak · x[n−k] )²

An LP solver or a classical ML approach can give us appropriate
coefficients.

32
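The least-squares formulation above can be sketched directly with NumPy (the order p = 8 is an assumed default; production code usually uses the Levinson-Durbin recursion instead):

```python
import numpy as np

def lpc_coefficients(x, p=8):
    """Least-squares estimate of the LPC coefficients a_1..a_p (a sketch)."""
    n = len(x)
    # Row i of A holds x[i+p-1], ..., x[i], i.e. the p past samples
    A = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    b = x[p:]                                   # the samples to predict
    a, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizes sum of e[n]^2
    return a
```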
MFCC Method
● One of the most popular audio features
● A representation of the speech signal in which a feature called the
cepstrum of a windowed short-time signal is derived from the FFT of
that signal
● Works well for n-way classification tasks

33
MFCC Steps
● Pre-Emphasis
● Frame Blocking & Windowing
● FFT Magnitude
● Mel Filterbank
● Log Energy
● Discrete Cosine Transform (DCT)

34
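The steps above can be sketched for a single pre-emphasized, windowed frame (the FFT size, filter count, and coefficient count are common choices assumed here, not values from the slides):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame -- an illustrative sketch."""
    # 1. FFT power spectrum
    n_fft = 512
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # 2. Triangular mel filterbank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    # 3. Log filterbank energies
    log_energy = np.log(fbank @ spectrum + 1e-10)
    # 4. DCT-II to decorrelate; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return basis @ log_energy
```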
Mel Scale
● Humans are much better at discerning small changes in pitch at low
frequencies than at high frequencies
● The Mel scale (commonly mel(f) = 2595 · log10(1 + f/700)) makes the
features match more closely what humans hear

35
TEO Method
● Proposed by Herbert M. Teager and Shushan M. Teager in 1983
● Unlike LPCC, represents a non-linear predictive method
● More robust in noisy environments than MFCC and LPCC
● Detects the stress level of emotion

36
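For a discrete signal, the Teager energy operator is Ψ[x(n)] = x(n)² − x(n−1)·x(n+1); a minimal sketch:

```python
import numpy as np

def teager_energy(x):
    # Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1); defined for interior samples
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

For a pure sinusoid x(n) = sin(ω·n), the operator returns the constant sin²(ω), which is why it tracks the signal's amplitude-frequency energy.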
Feature Extraction Methods Comparison

37
Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods

● Model Designing
● Results And Conclusion
● Future Work
● References
38
Popular SER Classification Models
● Hidden Markov Model (HMM)
● Gaussian Mixture Model (GMM)
● K-Nearest Neighbor (KNN)
● Deep Neural Network (DNN)
● Vector Quantization (VQ)

39
Hidden Markov Model
● A probabilistic learning approach consisting of a first-order Markov
chain whose states are hidden from the observer
● A kind of graphical model
● Key Concepts:
○ Markov Chain
○ Markov Process
○ First-Order Markov Decision Process

40
Hidden Markov Model Cont
● Text Independent
● Significant increase in computational complexity
● The need of a proper initialization for the model parameters before
training

41
Gaussian Mixture Model
● A robust probabilistic framework
● Forms multivariate Gaussian density models that represent all the
frames
● Text independent
● Computationally efficient in comparison with HMM
● Easy to implement
● Comparatively slow to train

42
Vector Quantization
● Unsupervised clustering
● Simple to implement
● Low computational burden
● Comparatively slow to train
● Text-dependent

43
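The VQ codebook is typically trained with a clustering loop such as k-means; a minimal sketch (the naive first-k initialization is an assumption for brevity; real implementations use random or k-means++ initialization):

```python
import numpy as np

def train_codebook(X, k=2, iters=20):
    """Minimal k-means for VQ codebook training (an illustrative sketch)."""
    codebook = X[:k].astype(float).copy()   # naive init: first k vectors
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each feature vector to its nearest code vector
        d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each code vector to the centroid of its assigned vectors
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = X[labels == j].mean(axis=0)
    return codebook, labels
```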
Deep Neural Network
Because of their multiple layers, the explicit feature extraction process can
be skipped and raw data (with less preprocessing in comparison with other
methods) used as input.

Besides DNNs, Convolutional Neural Networks (CNNs) are another type of
NN used frequently in the SER literature.

44
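As an illustrative sketch (the layer sizes and random weights are assumptions, not the model from the slides), a minimal forward pass mapping a 13-dimensional MFCC vector to probabilities over 6 basic emotion classes:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Assumed sizes: 13 MFCCs per frame in, 6 basic emotion classes out
n_features, n_hidden, n_classes = 13, 64, 6
W1 = rng.standard_normal((n_features, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_classes)) * 0.1
b2 = np.zeros(n_classes)

def predict(mfcc_vec):
    h = np.maximum(0.0, mfcc_vec @ W1 + b1)   # ReLU hidden layer
    return softmax(h @ W2 + b2)               # class probabilities
```

In practice the weights would be trained with cross-entropy loss in a framework such as PyTorch or TensorFlow.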
Deep Neural Network Cont.

45
Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods
● Model Designing

● Results And Conclusion


● Future Work
● References
46
Top Results (Accuracy)

        LPCC      MFCC      TEO
DNN     72.4 %    78.4 %    77.2 %
HMM     74.2 %    82.3 %    75.3 %
GMM     74.5 %    79.6 %    77.3 %

47
My Results
● On RAVDESS Dataset

DNN-LPCC: 55.3 %    DNN-MFCC: 61.2 %    HMM-LPCC: 53.2 %

48
Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods
● Model Designing
● Results And Conclusion

● Future Work
● References
49
Future Work On Feature Extraction
● Using image data instead of audio data:
○ Plotting time-frequency diagram for each frame using FFT
○ Applying a kind of heat-map to the diagram
○ Using image processing for feature extraction
● Using game theory approaches for phoneme based SER
● Using TEO method for feature selection

50
Future Work On Classification Models
● Convolutional neural networks for classification
● Long short-term memory neural networks
● Recurrent neural network

51
Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods
● Model Designing
● Results And Conclusion
● Future Work

● References
52
References
● A Review on Emotion Recognition Algorithms Using Speech Analysis,
Teddy Surya Gunawan, 2018
● A Review on Emotion Recognition using Speech, Saikat Basu, 2017
● Databases, features and classifiers for speech emotion recognition: a review,
Monorama Swain, 2018
● An Approach to Extract Feature using MFCC, Parwinder Pal Singh, 2014

53
Contact

● Email: arashmarioriyad@gmail.com
● Github: https://github.com/Arash-Mari-Oriyad
● LinkedIn: https://www.linkedin.com/in/arash-mari-oriyad/

54
Thanks

Any Questions?
