
Guided By: Dr. Rajesh Kumar Dubey

Presented By: Sailesh Tomar
Enrollment ID: 15316015
Branch: Communication System
M.Tech (4th sem)
INTRODUCTION
BLOCK DIAGRAM OF SPEECH RECOGNITION
SPEECH EMOTION FEATURES
DEEP LEARNING TECHNIQUE
DEEP AUTOENCODER
DESIGN OF EXPERIMENT
EXPERIMENTAL RESULTS AND ANALYSIS
CONCLUSION
FUTURE WORK
Based on artificial intelligence.
Works on human-computer interaction (HCI) technology.
Gives the computer a more humanized function.
An important part of human-computer interaction.
Makes the interaction between computer and human being more natural.
The purpose of speech emotion recognition (SER) is to make the computer discover a person's current emotional state from the voice signal and understand their emotional thinking.
Judging emotion states:
The voice feature extraction algorithm for speech emotion has a great effect on the recognition result.
Scientists try to find ideal features that fully describe the speech signal.
Traditional features have limited effect in speech emotion recognition technology.
Deep learning technology solves the problem of learning relevant features from large data sets.
A deep learning approach for speech emotion feature extraction:
Reduces the manual workload.
Makes the recognition and classification system more intelligent.
Speech features are an essential element in the SER process.
The features must reflect the communicator's emotion and be devoid of other effects such as the communicator's voice, the method of delivering the speech, the content of the speech, and so on.
The phonetic features of speech are divided into three categories: prosodic features, spectrum features, and other features.
Prosodic Features
Prosodic features contain information related to emotions in the speech and have long been used in the SER process. The classical prosodic features consist of the following:
a. temporal features, such as duration of time;
b. pitch frequency;
c. energy;
d. formants, such as the first-order formant, second-order formant and formant bandwidth;
e. contents such as phrases, idioms, words, etc.
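As an illustrative sketch (not part of the original slides), two of these prosodic features, short-time energy and pitch frequency, can be estimated from a raw signal with NumPy. The frame length, hop size and pitch search range below are assumptions chosen for the example:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Energy of each frame: sum of squared samples."""
    return np.sum(frames ** 2, axis=1)

def pitch_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate pitch (F0) of one voiced frame by autocorrelation:
    the first strong autocorrelation peak inside the search range
    corresponds to the pitch period."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# Example: a pure 200 Hz tone should give a pitch estimate near 200 Hz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t)
frames = frame_signal(x, 640, 320)      # 40 ms frames, 20 ms hop
energy = short_time_energy(frames)
f0 = pitch_autocorr(frames[0], sr)
```

Real emotional speech would of course need voiced/unvoiced detection before the pitch estimate is meaningful; this only shows the mechanics of the two features.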
Spectrum Features
Spectrum features are generally derived from a short-time representation of the speech signal. Some of the important spectral features are the short-time Fourier transform, Perceptual Linear Prediction cepstral coefficients (PLP) and Zero-Crossings with Peak Amplitudes (ZCPA). Further, researchers have proposed new features, such as applying a wavelet transformation to each frame of the speech signal and then taking the Fourier transform of the result.
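A minimal sketch of the short-time Fourier representation mentioned above, using only NumPy; the Hann window, frame length and hop size are illustrative assumptions, not parameters from the experiment:

```python
import numpy as np

def stft_mag(x, frame_len=512, hop=256):
    """Magnitude spectrogram via the short-time Fourier transform:
    window each frame (Hann) and take the real FFT along frames."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: a 1 kHz tone concentrates energy in the bin nearest 1 kHz.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)
S = stft_mag(x)                          # shape: (frames, frequency bins)
peak_bin = int(np.argmax(S[0]))
peak_hz = peak_bin * sr / 512            # bin index -> frequency
```

PLP or ZCPA features would be built on top of such a short-time spectrum; they are not shown here.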
Other Features
In addition to the prosodic and spectral features, several other common characteristics are used in SER.
Irrelevant information present in the speech features also affects the accuracy of the speech recognition process.
Deep learning is the fastest-growing field in artificial intelligence, helping computers make sense of vast amounts of data in the form of images, sound, and text. Using multiple levels of neural networks, computers now have the capacity to see, learn, and react to complex situations as well as or better than humans.
The deep network is derived from artificial neural networks.
In 2006, Hinton proposed the deep belief network (DBN).
Deep networks and deep learning have since attracted wide attention in the field of artificial intelligence.
It is among the latest technologies explored in AI.
Each layer can be trained independently, one layer at a time.
A deep autoencoder (DAE) is a special type of deep neural network in which the input and output have the same dimension.
The DAE makes a reversible conversion between the spatial distribution of the given data and the special features present in that space, which can be interpreted as the decomposition and reconstruction of the input signal.
This essentially means that a large amount of unlabeled data can be analyzed using unsupervised learning.
The encoder is used to map the raw input x to the hidden layer, and this mapping is given by

    h = f(Wx + b)

where θ = {W, b} is the set of parameters and f is the non-linear activation function.

The decoder maps the learned features from the hidden layer back by reconstruction at the output layer, and this reconstruction is expressed as

    y = g(W'h + b')

where θ' = {W', b'} is the set of parameters and g is the linear activation function, and the parameters satisfy the tied-weight relation W' = W^T.
Further, the activation function between the hidden layers of the network may be a linear or non-linear function. In general, the non-linear activation function is the sigmoid function, which is expressed as

    f(z) = 1 / (1 + e^(-z))
The objective during training is to determine the set of parameters which minimizes the reconstruction error. The reconstruction error can be expressed as

    L(x, y) = ||x - y||^2

and the optimal set of parameters is expressed as

    (θ*, θ'*) = arg min over (θ, θ') of Σ L(x, y)
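The encoder-decoder mapping above can be sketched in a few lines of NumPy. This is an illustrative example only: the tied weights (W' = W^T), sigmoid activations on both sides, layer sizes, and random initialization are assumptions for the sketch, not details from the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny single-layer autoencoder: 8-D input, 4-D hidden code.
d_in, d_hid = 8, 4
W = rng.normal(0, 0.1, (d_hid, d_in))   # encoder weights
b = np.zeros(d_hid)                     # encoder bias
W2 = W.T                                # tied decoder weights, W' = W^T
b2 = np.zeros(d_in)                     # decoder bias

def encode(x):
    return sigmoid(W @ x + b)           # h = f(Wx + b)

def decode(h):
    return sigmoid(W2 @ h + b2)         # y = g(W'h + b')

x = rng.random(d_in)
y = decode(encode(x))
err = 0.5 * np.sum((x - y) ** 2)        # reconstruction error L(x, y)
```

Training would adjust W, b, b2 to drive err down over the whole data set; the training procedure is described next.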
Initially, a layer-by-layer pre-training algorithm is used to set the parameters of the encoding part of the DAE; the decoding parameters are then initialized from the encoder as the network is unrolled, and the network parameters are finally set during network training. It is necessary to mention that the DAE, by itself, cannot classify data; it can only extract features in the hidden layers. Thus, a classifier is attached at the end to perform the classification of the data.
The objective of the layer-by-layer algorithm is to train a single layer of the network at a particular point in time. The training of the first two layers of the network begins only after the training of the first layer is complete, and so on by analogy. Thus, one can say that the parameters of the first L-1 layers are set initially, and then the L-th layer is finally added.
The training process for the deep network consists of the following steps:
1. Train the first layer using the unsupervised method, obtaining its output by minimizing the reconstruction error of the original input;
2. Use the output of the first hidden layer as the input to the next layer of the network, and use the unlabeled data samples to train the subsequent layer to within a certain error;
3. Repeat step 2 until the training of the whole network is complete;
4. The output of the last hidden layer acts as the input to the supervised layer, and the parameters of the trained layers initialize the parameters of the whole network;
5. Finally, all the layers of the entire network are fine-tuned under supervision in accordance with the supervised learning method.
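The greedy layer-by-layer procedure above can be sketched as follows. This is an illustrative NumPy implementation assuming tied-weight autoencoder layers, sigmoid activations, plain gradient descent, and arbitrary layer sizes; the actual experiment's configuration is not specified at this level of detail:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, d_hid, lr=0.5, epochs=200):
    """Train one tied-weight autoencoder layer on data X (n_samples x d_in)
    by gradient descent on the squared reconstruction error."""
    d_in = X.shape[1]
    W = rng.normal(0, 0.1, (d_in, d_hid))
    b1, b2 = np.zeros(d_hid), np.zeros(d_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b1)                 # encode
        Y = sigmoid(H @ W.T + b2)               # decode (tied weights)
        dY = (Y - X) * Y * (1 - Y)              # output-layer residual
        dH = (dY @ W) * H * (1 - H)             # hidden-layer residual
        W -= lr * (X.T @ dH + dY.T @ H) / len(X)
        b1 -= lr * dH.mean(axis=0)
        b2 -= lr * dY.mean(axis=0)
    return W, b1

# Greedy layer-wise pretraining: each layer's code is the next layer's input
# (steps 1-3 above). Fine-tuning with labels (steps 4-5) is not shown.
X = rng.random((64, 16))
inputs, params = X, []
for d_hid in (12, 8):
    W, b = train_autoencoder(inputs, d_hid)
    params.append((W, b))
    inputs = sigmoid(inputs @ W + b)            # feed codes forward
```

After this loop, `params` initializes the stacked encoder, and a classifier trained on `inputs` (the deepest code) would complete the supervised stage.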
The overall cost function for the m training samples is expressed as

    J(W, b) = (1/m) Σ_{i=1..m} (1/2) ||h_{W,b}(x^(i)) - x^(i)||^2 + (λ/2) Σ (W_ij^(l))^2

where h_{W,b}(x) gives the activation value of the output for input x, and λ is the weight attenuation coefficient. The first term is the squared-error function between the input reconstructed by the encoder and the initial input. The second term is the regularization term. The cost function is minimized by taking the partial derivatives with respect to the network weight and bias parameters. The steps in this process are as follows:
The forward propagation algorithm is used to obtain the activation value of each node from the second layer to the output layer, and is given by

    a^(n+1) = f(W^(n) a^(n) + b^(n))

where b^(n) is the bias parameter, a^(1) is the neural value (input) of layer one, and W^(n) is the weight matrix connecting layers n and n+1.

The residual error of the i-th neuron in the output layer n_l is

    δ_i^(n_l) = -(x_i - a_i^(n_l)) f'(z_i^(n_l))

Then the residual errors of the neurons are determined back to the first hidden layer. The residual of the i-th node of the l-th layer is:

    δ_i^(l) = ( Σ_j W_ji^(l) δ_j^(l+1) ) f'(z_i^(l))

where W_ji^(l) is the weight connecting unit i of layer l with unit j of layer l+1.

Determine the partial derivatives

    ∂J/∂W_ij^(l) = a_j^(l) δ_i^(l+1),    ∂J/∂b_i^(l) = δ_i^(l+1)

and optimize the set of parameters of the network weights and node biases.
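The residual formulas above can be checked numerically. This sketch computes the residuals for a small two-layer network with a squared-error cost and verifies the partial derivative of the cost against a central finite difference; the network sizes and random initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One hidden layer; cost J = 0.5 * ||a3 - x||^2 (autoencoder-style target).
W1 = rng.normal(0, 0.5, (4, 6)); b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (6, 4)); b2 = np.zeros(6)
x = rng.random(6)

def cost(W1):
    a2 = sigmoid(W1 @ x + b1)
    a3 = sigmoid(W2 @ a2 + b2)
    return 0.5 * np.sum((a3 - x) ** 2)

# Backpropagation: residuals per the formulas above (f'(z) = a(1-a) for sigmoid).
a2 = sigmoid(W1 @ x + b1)
a3 = sigmoid(W2 @ a2 + b2)
d3 = (a3 - x) * a3 * (1 - a3)        # output-layer residual delta^(3)
d2 = (W2.T @ d3) * a2 * (1 - a2)     # hidden-layer residual delta^(2)
grad_W1 = np.outer(d2, x)            # dJ/dW1 = delta^(2) * a^(1)^T

# Central finite difference on one weight confirms the analytic gradient.
eps = 1e-6
Wp = W1.copy(); Wp[0, 0] += eps
Wm = W1.copy(); Wm[0, 0] -= eps
numeric = (cost(Wp) - cost(Wm)) / (2 * eps)
```

The agreement between `numeric` and `grad_W1[0, 0]` is the standard sanity check that the residual equations were applied correctly.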
In this project, the speech emotion database is taken from the Chinese Academy of Sciences Institute of Automation (CASIA). This database is a Chinese emotion corpus created by four actors (2 males and 2 females) and includes six kinds of emotion statements: anger, fear, sadness, neutral, surprise and happiness, with a total of 9600 statements.
The emotion recognition obtained using the DAE showed an enhancement in recognition of 4.85%, with a highest accuracy of 86.41%, in comparison to recognition from traditional features.
This means the features learned by the DAE enable better emotion recognition.
The other important aspect of the experiment is the improvement in the recognition result obtained by using the learned features in the DAE.
The recognition result was also obtained from the speech emotion data for one male group with respect to particular emotional statements, i.e., the angry emotion and the happy emotion.
Speech emotion recognition is a complex process but plays an important role in HCI.
The DAE technique is used to learn features and recognize emotion with greater accuracy than the available traditional voice recognition.
Owing to the lack of availability of emotional voice databases and the paucity of time, my results are far from ideal and require further improvement.
In higher studies, I would devote my time to using a better emotional voice database and exploring different processing techniques on the input of the DAE to determine better emotion features from the sample speech signal.
