
Speech Emotion

Recognition (SER)
By Arash Mari Oriyad

An Internship Project

At Institute of AI, Isfahan University of Technology, Summer 2019


Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods
● Model Designing
● Results And Conclusion
● Future Work
● References
2
What Is Speech Emotion Recognition?
● SER is the process of taking an audio file as input and recognizing the
class of the dominant emotion.
● Input: An Audio File
● Model: Classifier
● Output: Recognized Emotion Class

4
Typical SER System

5
Emotion Classes/Categories
● A field interrelated with psychology and medicine
● Classic Paradigm: more than 50 distinct emotion classes
● Palette Theory: any emotion can be derived as a mixture of 5 or 6 basic
emotion classes

6
Basic Emotion Classes

Happiness

Surprise Anger

Fear Sadness

Neutral

7
Other Emotion Categories

● High activation vs low activation (Energy)


● High pitch vs low pitch (Frequency)
● ...

8
SER Applications

● Human-Machine Interaction (HMI)


● Voice Assistants
● Medical Applications
● Self-Driving Vehicles
● E-Learning
● ...

9
SER Referenced Problems

● Multi-lingual SER
● SER as a culture-based task
● Data gathering in laboratory environments (acted, artificial emotions)
● Imbalanced SER datasets (emotion classes, gender, …)
● ...

10
SER Subtasks

● Stress Classification
● Text Processing
● Speaker Recognition

11
Growth of IEEE SER Published Papers
from 2000 until 2017

12
Overview

● Problem Definition

● Data Gathering
● Feature Selection Methods
● Model Designing
● Results And Conclusion
● Future Work
● References
13
Appropriate SER Dataset Characteristics

● Being balanced in emotion classes and genders


● Representing natural emotions
● Culture independence
● No ambiguity in text
● Low noise (avoiding noisy environment)

14
Popular SER Dataset

RAVDESS
● English
● 24 professional actors (12 male, 12 female)
● 7 emotion classes
● Sampling at 48 kHz and 96 kHz
● 4-6 s per utterance
● 60 audio instances per actor × 24 actors = 1440 utterances

15
Popular SER Dataset Cont.

EMO-DB
● German
● 10 professional actors (5 male, 5 female)
● 7 emotion classes
● Sampled at 16 kHz
● 535 utterances

16
Popular SER Dataset Cont.

SUSAS
● English
● 32 semi-professional actors (19 male, 13 female)
● More than 10 emotion classes
● 16000 utterances

17
Popular SER Dataset Cont.

Database produced by Cowie and Douglas-Cowie (1996)


● English
● 40 professional actors (20 male, 20 female)
● 5 emotion classes
● 5-10s
● 5000 utterances

18
Popular SER Dataset Cont.

ShEMO
● Persian
● 87 native Persian speakers (from radio)
● 5 emotion classes
● 3.5h
● 3000 semi-natural utterances

19
Appropriate Recording Devices

20
Appropriate Recording Devices Cont.

21
Appropriate Recording Devices Cont.

22
Overview

● Problem Definition
● Data Gathering

● Feature Selection Methods


● Model Designing
● Results And Conclusion
● Future Work
● References
23
Audio Signal Preprocessing
● Denoising methods:
○ Wavelet Transform
○ NN Approaches
● Removing the silent parts of samples

24
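The silence-removal step above can be sketched with a simple frame-energy threshold (the frame length and threshold here are illustrative values, not from the slides):

```python
import numpy as np

def remove_silence(signal, frame_len=400, threshold=0.01):
    """Drop frames whose mean energy falls below a threshold (a sketch)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)      # mean energy per frame
    voiced = frames[energy >= threshold]       # keep only energetic frames
    return voiced.reshape(-1)
```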
Categories of Speech Features

25
Feature Extraction Phases

01 Pre-Emphasis
02 Frame Blocking & Windowing
03 Feature Extraction
26
Pre-Emphasis Phase
● Using high-pass filters
● Emphasizing the high-frequency band by increasing its amplitude and
decreasing the amplitude of lower frequencies
● Higher frequencies hold more important information to extract, while
lower frequencies may be mingled with noise.

27
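Pre-emphasis is typically the first-order high-pass filter y[n] = x[n] − α·x[n−1]; a minimal sketch (α = 0.97 is a common choice, not a value from the slides):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; boosts high frequencies
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```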
Windowing Phase
● Decomposing speech signals into short sequences called frames
● Each frame is an overlapping window with a length of 20 ms to 40 ms
● Windowing Methods:
○ Triangular Windowing
○ Rectangular Windowing
○ Hamming Windowing

28
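Frame blocking with overlapping Hamming windows can be sketched as follows (the 16 kHz sample rate, 25 ms frames, and 10 ms hop are common choices assumed here, not values from the slides):

```python
import numpy as np

def frame_signal(signal, sr=16000, frame_ms=25, hop_ms=10):
    frame_len = int(sr * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sr * hop_ms / 1000)       # e.g. 160 samples -> frames overlap
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Index matrix: row i selects samples of frame i
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx]
    return frames * np.hamming(frame_len)   # taper the edges of each frame
```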
Feature Extraction Phase
● Linear Predictive Coding Coefficients (LPCCs)
● Mel-Frequency Cepstral Coefficients (MFCCs)
● Teager Energy Operator (TEO)

29
LPCC Method
● A digital method for encoding an analog signal
● Predicts the next value of a signal as a linear combination of the values
it has received in the past
● First, the analog speech signal is sampled into n digital points

30
LPCC Method Cont.
Consider a frame called x:

● x[n]: value of frame x at point n


● p: order of LPCC
● e[n]: prediction error at point n

31
LPCC Method Cont.
So we can define an optimization problem on the total squared error E:

min over a1, …, ap of  E = Σn e[n]² = Σn ( x[n] − Σk=1..p ak · x[n−k] )²

An LP solver or a classical ML approach can give us appropriate
coefficients.

32
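The least-squares formulation above can be sketched directly with NumPy (the order p = 8 is an assumed default; production code usually uses the Levinson-Durbin recursion instead):

```python
import numpy as np

def lpc_coefficients(x, p=8):
    """Least-squares estimate of the LPC coefficients a_1..a_p (a sketch)."""
    n = len(x)
    # Row i of A holds x[i+p-1], ..., x[i], i.e. the p past samples
    A = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    b = x[p:]                                   # the samples to predict
    a, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizes sum of e[n]^2
    return a
```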
MFCC Method
● One of the most popular audio features
● A representation of the speech signal in which a feature called the
cepstrum of a windowed short-time signal is derived from the FFT of
that signal
● Works well for n-way classification tasks

33
MFCC Steps
● Pre-Emphasis
● Frame Blocking & Windowing
● FFT Magnitude
● Mel Filterbank
● Log Energy
● Discrete Cosine Transform (DCT)

34
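The steps above can be sketched for a single pre-emphasized, windowed frame (the FFT size, filter count, and coefficient count are common choices assumed here, not values from the slides):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame -- an illustrative sketch."""
    # 1. FFT power spectrum
    n_fft = 512
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # 2. Triangular mel filterbank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    # 3. Log filterbank energies
    log_energy = np.log(fbank @ spectrum + 1e-10)
    # 4. DCT-II to decorrelate; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return basis @ log_energy
```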
Mel Scale
● Humans are much better at discerning small changes in pitch at low
frequencies than at high frequencies
● The Mel scale (commonly mel(f) = 2595 · log10(1 + f/700)) makes the
features match more closely what humans hear

35
TEO Method
● Proposed by Herbert M. Teager and Shushan M. Teager in 1983
● Unlike LPCC, represents a non-linear predictive method
● More robust in noisy environments than MFCC and LPCC
● Detects the stress level of emotion

36
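For a discrete signal, the Teager energy operator is Ψ[x(n)] = x(n)² − x(n−1)·x(n+1); a minimal sketch:

```python
import numpy as np

def teager_energy(x):
    # Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1); defined for interior samples
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

For a pure sinusoid x(n) = sin(ω·n), the operator returns the constant sin²(ω), which is why it tracks the signal's amplitude-frequency energy.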
Feature Extraction Methods Comparison

37
Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods

● Model Designing
● Results And Conclusion
● Future Work
● References
38
Popular SER Classification Models
● Hidden Markov Model (HMM)
● Gaussian Mixture Model (GMM)
● K-Nearest Neighbor (KNN)
● Deep Neural Network (DNN)
● Vector Quantization (VQ)

39
Hidden Markov Model
● A probabilistic learning approach consisting of a first-order Markov
chain whose states are hidden from the observer
● A kind of graphical model
● Key Concepts:
○ Markov Chain
○ Markov Process
○ First-Order Markov Decision Process

40
Hidden Markov Model Cont
● Text Independent
● Significant increase in computational complexity
● The need of a proper initialization for the model parameters before
training

41
Gaussian Mixture Model
● A robust probabilistic framework
● Forms multivariate Gaussian density models that represent all the
frames
● Text independent
● Computationally efficient in comparison with HMM
● Easy to implement
● Comparatively slow to train

42
Vector Quantization
● Unsupervised clustering
● Simple to implement
● Low computational burden
● Comparatively slow to train
● Text-dependent

43
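The VQ codebook is typically trained with a clustering loop such as k-means; a minimal sketch (the naive first-k initialization is an assumption for brevity; real implementations use random or k-means++ initialization):

```python
import numpy as np

def train_codebook(X, k=2, iters=20):
    """Minimal k-means for VQ codebook training (an illustrative sketch)."""
    codebook = X[:k].astype(float).copy()   # naive init: first k vectors
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each feature vector to its nearest code vector
        d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each code vector to the centroid of its assigned vectors
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = X[labels == j].mean(axis=0)
    return codebook, labels
```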
Deep Neural Network
Because of their multiple layers, the explicit feature extraction process can
be skipped and raw data (with less preprocessing in comparison with other
methods) used as input.

Besides DNNs, Convolutional Neural Networks (CNNs) are another type of
NN used frequently in the SER literature.

44
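As an illustrative sketch (the layer sizes and random weights are assumptions, not the model from the slides), a minimal forward pass mapping a 13-dimensional MFCC vector to probabilities over 6 basic emotion classes:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Assumed sizes: 13 MFCCs per frame in, 6 basic emotion classes out
n_features, n_hidden, n_classes = 13, 64, 6
W1 = rng.standard_normal((n_features, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_classes)) * 0.1
b2 = np.zeros(n_classes)

def predict(mfcc_vec):
    h = np.maximum(0.0, mfcc_vec @ W1 + b1)   # ReLU hidden layer
    return softmax(h @ W2 + b2)               # class probabilities
```

In practice the weights would be trained with cross-entropy loss in a framework such as PyTorch or TensorFlow.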
Deep Neural Network Cont.

45
Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods
● Model Designing

● Results And Conclusion


● Future Work
● References
46
Top Results (Accuracy)

        LPCC      MFCC      TEO
DNN     72.4 %    78.4 %    77.2 %
HMM     74.2 %    82.3 %    75.3 %
GMM     74.5 %    79.6 %    77.3 %

47
My Results
● On RAVDESS Dataset

DNN-LPCC: 55.3 %    DNN-MFCC: 61.2 %    HMM-LPCC: 53.2 %

48
Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods
● Model Designing
● Results And Conclusion

● Future Work
● References
49
Future Work On Feature Extraction
● Using image data instead of audio data:
○ Plotting time-frequency diagram for each frame using FFT
○ Applying a kind of heat-map to the diagram
○ Using image processing for feature extraction
● Using game theory approaches for phoneme based SER
● Using TEO method for feature selection

50
Future Work On Classification Models
● Convolutional neural networks for classification
● Long short-term memory neural networks
● Recurrent neural network

51
Overview

● Problem Definition
● Data Gathering
● Feature Selection Methods
● Model Designing
● Results And Conclusion
● Future Work

● References
52
References
● A Review on Emotion Recognition Algorithms Using Speech Analysis,
Teddy Surya Gunawan, 2018
● A Review on Emotion Recognition using Speech, Saikat Basu, 2017
● Databases, features and classifiers for speech emotion recognition: a review,
Monorama Swain, 2018
● An Approach to Extract Feature using MFCC, Parwinder Pal Singh, 2014

53
Contact

● Email: arashmarioriyad@gmail.com
● Github: https://github.com/Arash-Mari-Oriyad
● LinkedIn: https://www.linkedin.com/in/arash-mari-oriyad/

54
Thanks

Any Questions?
