Speech Emotion Recognition (SER)
By Arash Mari Oriyad
As an Internship Project
Overview
● Problem Definition
● Data Gathering
● Feature Selection Methods
● Model Designing
● Results And Conclusion
● Future Work
● References
What Is Speech Emotion Recognition?
● SER is the process of taking an audio file as input and recognizing
the class of its dominant emotion.
● Input: An Audio File
● Model: Classifier
● Output: Recognized Emotion Class
Typical SER System
Emotion Classes/Categories
● Interrelated with psychology and medicine
● Classic Paradigm: more than 50 distinct emotion classes
● Palette Theory: any emotion can be described as a mixture of 5 or 6 basic
emotion classes
Basic Emotion Classes
● Happiness
● Surprise
● Anger
● Fear
● Sadness
● Neutral
Other Emotion Categories
SER Applications
SER Referenced Problems
● Multi-lingual SER
● SER as a culture-based task
● Data gathering in laboratory environments (artificial emotions)
● Imbalanced SER datasets (emotion classes, gender, …)
● ...
SER Subtasks
● Stress Classification
● Text Processing
● Speaker Recognition
Growth of IEEE-Published SER Papers from 2000 to 2017
Appropriate SER Dataset Characteristics
Popular SER Dataset
RAVDESS
● English
● 24 professional actors (12 male, 12 female)
● 7 emotion classes
● Sampling at 48kHz and 96kHz
● 4-6ms
● 60 audio instances per actor × 24 actors = 1440 utterances
Popular SER Dataset Cont.
EMO-DB
● German
● 10 professional actors (5 male, 5 female)
● 6 emotion classes
● Sampling at 96kHz
● 500 utterances
Popular SER Dataset Cont.
SUSAS
● English
● 32 semi-professional actors (19 male, 13 female)
● More than 10 emotion classes
● 16000 utterances
Popular SER Dataset Cont.
ShEMO
● Persian
● 87 native Persian speakers (from radio)
● 5 emotion classes
● 3.5 hours of speech
● 3000 semi-natural utterances
Appropriate Recording Devices
Categories of Speech Features
Feature Extraction Phases
01 Pre-Emphasis, 02 Windowing, 03 Feature Extraction
Pre-Emphasis Phase
● Uses a high-pass filter
● Emphasizes the high-frequency band by raising its amplitude
relative to the lower frequencies
● Higher frequencies hold more of the important information to extract, while
lower frequencies may be mingled with noise
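A minimal sketch of the pre-emphasis step, assuming the common first-order filter y[n] = x[n] - alpha * x[n-1]. The function name `pre_emphasis` and the coefficient value 0.97 are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is a commonly used value, assumed here for illustration.
    The first sample has no predecessor and is passed through unchanged.
    """
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[:1], signal[1:] - alpha * signal[:-1])

# A constant (zero-frequency) signal is almost removed, while a signal that
# flips sign every sample (the highest representable frequency) is boosted.
low = pre_emphasis(np.ones(8))
high = pre_emphasis(np.array([1.0, -1.0] * 4))
```

After the first sample, `low` settles at 1 - alpha and `high` has magnitude 1 + alpha, which is the high-frequency boost the slide describes.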
Windowing Phase
● Decomposing speech signals into short speech sequences called
frames
● Each frame is an overlapping window with a length of 20 ms to
40 ms
● Windowing Methods
○ Triangular Windowing
○ Rectangular Windowing
○ Hamming Windowing
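The windowing phase can be sketched as follows. The 25 ms frame length and 10 ms hop are assumed values chosen inside the 20 ms to 40 ms range above, and `frame_signal` is a hypothetical helper name. Hamming windowing is used via NumPy's `np.hamming`.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    frame_ms and hop_ms are illustrative defaults. Trailing samples that do
    not fill a whole frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

# One second of synthetic noise at 16 kHz, used only as placeholder input.
frames = frame_signal(np.random.default_rng(0).standard_normal(16000), 16000)
```

With these settings each frame has 400 samples and consecutive frames share 240 of them.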
Feature Extraction Phase
● Linear Prediction Cepstral Coefficients (LPCCs)
● Mel-Frequency Cepstral Coefficients (MFCCs)
● Teager Energy Operator (TEO)
LPCC Method
● A digital method for encoding an analog signal
● Predicts the next value of a signal as a linear combination of the
values it has taken in the past
● First, the analog speech signal is sampled into n digital points
LPCC Method Cont.
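Consider a frame called x. In the standard linear-prediction formulation, each sample is approximated by a weighted sum of the p previous samples, and E is the total squared prediction error:
$$\hat{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k], \qquad E = \sum_{n} \big( x[n] - \hat{x}[n] \big)^2$$
So we can define an optimization problem on E:
$$\min_{a_1, \dots, a_p} E$$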
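One way to solve this minimization numerically is ordinary least squares. The sketch below assumes the standard linear-prediction setup, in which each sample is predicted from the `order` samples before it. The helper name `lpc_coefficients` is hypothetical, and the further conversion from prediction coefficients to cepstral coefficients is not shown.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Least-squares linear-prediction coefficients a_1..a_order.

    Minimizes sum_n (x[n] - sum_k a_k * x[n-k])**2 over the samples of the
    frame that have `order` predecessors.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Column k-1 holds x[n-k] for n = order .. len(frame)-1.
    past = np.column_stack([frame[order - k : n - k] for k in range(1, order + 1)])
    target = frame[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    return coeffs

# Synthetic frame built from a known recursion, so the recovered
# coefficients can be checked against the generating ones.
x = np.zeros(50)
x[0], x[1] = 1.0, 0.5
for i in range(2, 50):
    x[i] = 1.5 * x[i - 1] - 0.7 * x[i - 2]
a = lpc_coefficients(x, order=2)
```

On this noise-free toy frame the fit recovers the generating coefficients 1.5 and -0.7 up to floating-point error.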
MFCC Method
● One of the most popular audio features
● A representation of the speech signal in which the cepstrum of a
windowed short-time signal is derived from the FFT of that signal
● Works well for n-way classification tasks
MFCC Steps
● Pre-Emphasis
● Frame Blocking & Windowing
● FFT Magnitude
● Mel Filterbank
● Log Energy
● Discrete Cosine Transform (DCT)
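The six steps above can be sketched end to end with NumPy alone. This is an illustrative pipeline, not the implementation used in the project. The parameter values (0.97 pre-emphasis, 25 ms frames, 10 ms hop, 26 filters, 13 coefficients), the choice of the power spectrum for the filterbank input, and all function names are assumptions.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(fft_freqs)))
    for i in range(n_filters):
        left, center, right = edges_hz[i : i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate, frame_ms=25.0, hop_ms=10.0,
         n_filters=26, n_coeffs=13, alpha=0.97):
    signal = np.asarray(signal, dtype=float)
    # 1. Pre-emphasis
    emphasized = np.append(signal[:1], signal[1:] - alpha * signal[:-1])
    # 2. Frame blocking and Hamming windowing
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # 3. FFT magnitude (squared here to obtain the power spectrum)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 4. Mel filterbank energies
    energies = power @ mel_filterbank(n_filters, frame_len, sample_rate).T
    # 5. Log energy (small offset avoids log(0))
    log_energies = np.log(energies + 1e-10)
    # 6. Discrete cosine transform
    return dct_ii(log_energies, n_coeffs)

# One second of a 440 Hz tone at 16 kHz as placeholder input.
t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2.0 * np.pi * 440.0 * t), 16000)
```

The result has one row per frame and one column per cepstral coefficient, which is the matrix a downstream classifier would consume.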
Mel Scale
● Humans are much better at discerning small changes in pitch at low
frequencies than at high frequencies
● The Mel scale makes the features match more closely what humans hear
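A small numeric illustration of this point, assuming the widely used conversion mel = 2595 * log10(1 + f / 700). The slides do not state which variant of the formula was used, so this choice is an assumption.

```python
import numpy as np

def hz_to_mel(hz):
    """Common Hz-to-mel conversion (one of several variants in use)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

# The same 100 Hz step covers many more mels at low frequency
# than at high frequency.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(4100.0) - hz_to_mel(4000.0)
```

Here `low_step` comes out several times larger than `high_step`, mirroring the finer pitch resolution of human hearing at low frequencies.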
TEO Method
● Proposed by Herbert M. Teager and Shushan M. Teager in 1983
● Unlike LPCC, it is a non-linear method
● More robust in noisy environments than MFCC and
LPCC
● Detects the stress level in emotional speech
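A minimal sketch of the operator in its usual discrete form, psi[n] = x[n]^2 - x[n-1] * x[n+1] (the discretization commonly attributed to Kaiser). The function name `teager_energy` is an assumption for illustration.

```python
import numpy as np

def teager_energy(signal):
    """Discrete Teager energy: psi[n] = x[n]**2 - x[n-1] * x[n+1].

    The output is two samples shorter than the input because the first and
    last samples lack a neighbor on one side.
    """
    x = np.asarray(signal, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(omega*n) the operator is constant and equals
# A**2 * sin(omega)**2, so it tracks both amplitude and frequency.
n = np.arange(200)
amplitude, omega = 2.0, 0.3
psi = teager_energy(amplitude * np.cos(omega * n))
```

Because the output depends on the product of amplitude and frequency, changes in vocal effort show up directly in the energy track, which is one reason the operator is used for stress detection.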
Feature Extraction Methods Comparison
Popular SER Classification Models
● Hidden Markov Model (HMM)
● Gaussian Mixture Model (GMM)
● K-Nearest Neighbor (KNN)
● Deep Neural Network (DNN)
● Vector Quantization (VQ)
Hidden Markov Model
● A probabilistic learning approach consisting of a first-order Markov
chain whose states are hidden from the observer
● A kind of graphical model
● Key Concepts:
○ Markov Chain
○ Markov Process
○ First Order Markov Decision Process
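To make the hidden-state idea concrete, here is a sketch of the forward algorithm, which computes how likely an observation sequence is under a given HMM. The two-state model and its probabilities are made-up toy numbers, and the observations are assumed to be discrete symbols (for example, quantized feature vectors). A common SER setup, assumed here rather than taken from the slides, trains one HMM per emotion and picks the emotion whose model gives the highest likelihood.

```python
import numpy as np

def forward_likelihood(initial, transition, emission, observations):
    """P(observations | model) for a discrete-output, first-order HMM.

    initial[i]       : probability of starting in hidden state i
    transition[i, j] : probability of moving from state i to state j
    emission[i, o]   : probability that state i emits symbol o
    """
    alpha = initial * emission[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ transition) * emission[:, obs]
    return float(alpha.sum())

# Toy model with 2 hidden states and 2 observable symbols.
initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3],
                       [0.4, 0.6]])
emission = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
likelihood = forward_likelihood(initial, transition, emission, [0, 1])  # 0.209
```

For long sequences the products underflow, so practical implementations rescale `alpha` at each step or work in the log domain.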
Hidden Markov Model Cont.
● Text-independent
● Significant increase in computational complexity
● Requires proper initialization of the model parameters before
training
Gaussian Mixture Model
● A robust probabilistic framework
● Forms multivariate Gaussian density models that represent all the
frames
● Text-independent
● Computationally efficient in comparison with HMM
● Easy to implement
● Takes comparatively long to train
Vector Quantization
● Unsupervised clustering
● Simple to implement
● Low computational burden
● Takes comparatively long to train
● Text-dependent
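A minimal sketch of VQ-based classification, assuming one codebook per emotion built with plain k-means and a decision rule that picks the codebook with the lowest average distortion. The helper names, the codebook size, the class labels, and the synthetic 4-dimensional "feature frames" are all illustrative assumptions.

```python
import numpy as np

def train_codebook(frames, n_codewords, n_iters=20, seed=0):
    """Build a codebook by k-means clustering of feature frames."""
    rng = np.random.default_rng(seed)
    frames = np.asarray(frames, dtype=float)
    codebook = frames[rng.choice(len(frames), n_codewords, replace=False)].copy()
    for _ in range(n_iters):
        dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for j in range(n_codewords):
            members = frames[nearest == j]
            if len(members) > 0:  # keep the old codeword if its cluster is empty
                codebook[j] = members.mean(axis=0)
    return codebook

def distortion(frames, codebook):
    """Average distance from each frame to its nearest codeword."""
    frames = np.asarray(frames, dtype=float)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

def classify(frames, codebooks):
    """Return the label whose codebook quantizes the frames with least error."""
    return min(codebooks, key=lambda label: distortion(frames, codebooks[label]))

# Synthetic stand-ins for per-emotion training features.
rng = np.random.default_rng(1)
codebooks = {
    "calm": train_codebook(rng.normal(0.0, 1.0, size=(200, 4)), n_codewords=8),
    "angry": train_codebook(rng.normal(5.0, 1.0, size=(200, 4)), n_codewords=8),
}
predicted = classify(rng.normal(5.0, 1.0, size=(30, 4)), codebooks)
```

In a real system the frames would be feature vectors such as MFCCs extracted from labelled utterances rather than random draws.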
Deep Neural Network
Because of its multiple layers, a DNN can skip the feature extraction
step and take raw data as input, with less preprocessing than other
methods.
My Results
● On RAVDESS Dataset
55.3% 61.2% 53.2%
Future Work On Feature Extraction
● Using image data instead of audio data:
○ Plotting time-frequency diagram for each frame using FFT
○ Applying a kind of heat-map to the diagram
○ Using image processing for feature extraction
● Using game theory approaches for phoneme-based SER
● Using TEO method for feature selection
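The image-based idea in the first bullet can be sketched as follows: compute a log-magnitude spectrogram with the FFT and scale it to an 8-bit grayscale array that an image model could consume. Frame sizes, the decibel scaling, and the function names are assumptions. Applying a color map to produce the heat-map is left out.

```python
import numpy as np

def log_spectrogram(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Time-frequency matrix in decibels, shape (frequency bins, frames)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return (20.0 * np.log10(magnitude + 1e-10)).T

def to_uint8_image(spec):
    """Linearly rescale a matrix to the 0..255 range of a grayscale image."""
    lo, hi = spec.min(), spec.max()
    scaled = (spec - lo) / (hi - lo) if hi > lo else np.zeros_like(spec)
    return np.round(255.0 * scaled).astype(np.uint8)

# A 1 kHz tone sampled at 16 kHz: the brightest row should sit at 1 kHz.
t = np.arange(16000) / 16000.0
image = to_uint8_image(log_spectrogram(np.sin(2.0 * np.pi * 1000.0 * t), 16000))
```

With 400-sample frames the frequency bins are 40 Hz apart, so the tone lands in row 25 of the image.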
Future Work On Classification Models
● Convolutional neural networks for classification
● Long short-term memory neural networks
● Recurrent neural network
References
● A Review on Emotion Recognition Algorithms Using Speech Analysis,
Teddy Surya Gunawan, 2018
● A Review on Emotion Recognition using Speech, Saikat Basu, 2017
● Databases, features and classifiers for speech emotion recognition: a review,
Monorama Swain, 2018
● An Approach to Extract Feature using MFCC, Parwinder Pal Singh, 2014
Contact
● Email: arashmarioriyad@gmail.com
● Github: https://github.com/Arash-Mari-Oriyad
● LinkedIn: https://www.linkedin.com/in/arash-mari-oriyad/
Thanks
Any Questions?