
Initially, Yasmin M. Kassim introduced a new work to identify the malarial parasite. The work uses the RBCNet architecture, which combines U-Net and Faster R-CNN: U-Net is used in the first stage to detect cell clusters, and Faster R-CNN in the second stage to detect small blood cells within those clusters. The dataset used in this work is a collection of malaria smears with 200,000 labeled cells, and the work surpasses traditional cell detection methods. However, the work cannot recognize emotions correctly from short audio segments [10].
To increase the recognition rate, Pengcheng Wei proposed a new system called “A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model”. To improve speech emotion recognition, the system uses an improved stacked kernel sparse deep model algorithm, which is based on the auto-encoder, denoising auto-encoder, and sparse auto-encoder. The recognition rate of the system is 80.95%. Although the work gives a good recognition rate, the model is very complex because it uses many encoders [11].
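As an illustration of this family of models, the sketch below builds a stacked sparse denoising auto-encoder with a softmax emotion classifier on top. The layer sizes, noise level, sparsity penalty, and placeholder data are assumptions for illustration only, and the wavelet kernel component of the original classifier is omitted.

```python
# Sketch of a stacked sparse denoising auto-encoder for speech features
# (assumed sizes; the wavelet kernel classifier of the cited work is omitted).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_features, n_emotions = 384, 7                                  # assumed dimensions
X = np.random.rand(1000, n_features).astype("float32")           # placeholder features
y = keras.utils.to_categorical(np.random.randint(0, n_emotions, 1000))

def build_autoencoder(input_dim, hidden_dim):
    """One denoising, sparsity-regularised auto-encoder layer."""
    inp = keras.Input(shape=(input_dim,))
    noisy = layers.GaussianNoise(0.1)(inp)                       # denoising: corrupt the input
    code = layers.Dense(hidden_dim, activation="relu",
                        activity_regularizer=regularizers.l1(1e-4))(noisy)  # sparsity penalty
    recon = layers.Dense(input_dim)(code)
    return keras.Model(inp, recon), keras.Model(inp, code)

# Greedy layer-wise pre-training of two stacked auto-encoders
ae1, enc1 = build_autoencoder(n_features, 128)
ae1.compile("adam", "mse")
ae1.fit(X, X, epochs=5, verbose=0)
h1 = enc1.predict(X, verbose=0)

ae2, enc2 = build_autoencoder(128, 64)
ae2.compile("adam", "mse")
ae2.fit(h1, h1, epochs=5, verbose=0)

# Fine-tune the stacked encoders with a softmax emotion classifier on top
clf = keras.Sequential([enc1, enc2, layers.Dense(n_emotions, activation="softmax")])
clf.compile("adam", "categorical_crossentropy", metrics=["accuracy"])
clf.fit(X, y, epochs=5, verbose=0)
```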
To make the model simpler and to achieve good accuracy, Linhui Sun proposed a new system called “Decision tree SVM model with Fisher feature selection for speech emotion recognition”. This work introduces Fisher feature selection to extract the features that are used to train each SVM in the decision tree. The system uses two different datasets, a Chinese corpus and the Berlin (German) corpus. The work shows a recognition rate of 83.75% for the Chinese dataset and 87.55% for the Berlin dataset. Although the work gives a good recognition rate, it does not correctly distinguish between fear and sadness [12].
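For illustration, a minimal sketch of Fisher-ratio feature selection followed by a single SVM node is given below. The feature dimensionality, number of retained features, kernel, and placeholder data are assumptions, not values taken from the cited work.

```python
# Sketch: rank features by their Fisher ratio and train an SVM on the top-k,
# as one node of a decision-tree SVM would do (assumed sizes and kernel).
import numpy as np
from sklearn.svm import SVC

def fisher_score(X, y):
    """Between-class variance over within-class variance, per feature."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)

# Placeholder acoustic features and emotion labels
X = np.random.rand(500, 384)
y = np.random.randint(0, 6, 500)

top_k = np.argsort(fisher_score(X, y))[::-1][:60]   # keep the 60 most discriminative features
svm = SVC(kernel="rbf").fit(X[:, top_k], y)         # one SVM node of the decision tree
print(svm.score(X[:, top_k], y))
```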
Similarly, Xingfeng Li proposed a new work called “Improving multilingual speech emotion recognition by combining acoustic features in a three-layer model”. The work first extracts acoustic features from the speech dataset. After extracting the acoustic features, the work normalizes them using a speaker normalization method and selects a subset of the features using the Fisher Discriminant Ratio (FDR). With the selected features, the different emotion dimensions such as arousal, pleasure, and power are identified by training logistic model trees. However, the work achieves different accuracies for different languages. To improve the accuracy, Xingfeng Li introduces a technique called segment repetition with data augmentation. This technique yields a high accuracy of 98.16% after data augmentation. Even though the method gives high accuracy, it fails to classify the emotions based on gender [13].
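A minimal sketch of the speaker normalization step is given below, assuming it is a per-speaker z-score over each acoustic feature; the cited work's exact formulation may differ, and the data and sizes are placeholders.

```python
# Sketch of speaker normalisation: standardise each feature within each
# speaker's own utterances (assumed to be a per-speaker z-score).
import numpy as np

def speaker_normalize(X, speaker_ids):
    """Z-score each feature column separately for every speaker."""
    Xn = np.empty_like(X, dtype=float)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        mu = X[idx].mean(axis=0)
        sigma = X[idx].std(axis=0) + 1e-12
        Xn[idx] = (X[idx] - mu) / sigma
    return Xn

X = np.random.rand(200, 32)                  # placeholder acoustic features
speakers = np.random.randint(0, 10, 200)     # placeholder speaker labels
X_norm = speaker_normalize(X, speakers)
```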
Due to the evolution of recent technologies, authors have used deep learning techniques to recognize emotion from speech. Bagas Adi Prayitno proposed a new methodology called “Segment Repetition Based on High Amplitude to Enhance a Speech Emotion Recognition” to recognize emotion using deep learning techniques. The work uses the Berlin Emotional Speech Database (Berlin EMO-DB) and a Long Short-Term Memory (LSTM) network with 3 core layers and 2 recurrent layers for the classification of different emotions. The method does not classify the emotions correctly and gives a low accuracy of 66.18% [14].
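A minimal sketch of such an LSTM classifier over frame-level features is shown below; the layer sizes, input dimensions, number of emotion classes, and placeholder data are assumptions rather than the cited configuration.

```python
# Sketch of an LSTM emotion classifier over frame-level MFCC sequences
# (two recurrent layers followed by dense layers; all sizes assumed).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_frames, n_mfcc, n_emotions = 300, 13, 7                      # assumed input/output sizes
X = np.random.rand(64, n_frames, n_mfcc).astype("float32")     # placeholder sequences
y = keras.utils.to_categorical(np.random.randint(0, n_emotions, 64))

model = keras.Sequential([
    keras.Input(shape=(n_frames, n_mfcc)),
    layers.LSTM(128, return_sequences=True),   # first recurrent layer
    layers.LSTM(64),                           # second recurrent layer
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(n_emotions, activation="softmax"),
])
model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
```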
To identify the emotion based on gender, Ftoon Abu Shaqra proposed a new model called “Recognizing Emotion from Speech Based on Age and Gender Using Hierarchical Models”. The proposed model uses hierarchical classification models to examine the importance of identifying gender and age in the process of emotion recognition. The model uses the Toronto Emotional Speech Set (TESS) for emotion recognition. Although the work classifies the emotions based on gender, the system obtains a very low accuracy of 74% [15].
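The sketch below illustrates one possible hierarchical arrangement, where a gender classifier routes each sample to a gender-specific emotion classifier. The classifiers, feature sizes, and labels are placeholders, and the cited model's actual hierarchy (which also involves age) may differ.

```python
# Sketch of a hierarchical scheme: predict gender first, then apply the
# emotion classifier trained for that gender group (all data placeholders).
import numpy as np
from sklearn.svm import SVC

X = np.random.rand(400, 40)               # placeholder acoustic features
gender = np.random.randint(0, 2, 400)     # 0 = female, 1 = male (placeholder labels)
emotion = np.random.randint(0, 7, 400)    # placeholder emotion labels

gender_clf = SVC().fit(X, gender)                                        # first level: gender
emotion_clfs = {g: SVC().fit(X[gender == g], emotion[gender == g])      # second level: emotion
                for g in (0, 1)}

def predict(x):
    """Route a sample through the gender model, then the matching emotion model."""
    g = int(gender_clf.predict(x.reshape(1, -1))[0])
    return emotion_clfs[g].predict(x.reshape(1, -1))[0]

print(predict(X[0]))
```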
To effectively differentiate emotions from the given speech dataset, Mohit Shah proposed a new method called “Articulation constrained learning with application to speech emotion recognition”. A discriminative learning method is used to effectively recognize emotions by separating the features based on vowel arousal. The model distinguishes happiness from other emotions more accurately on the ElectroMagnetic Articulography (EMA) database and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. However, the method fails to recognize emotions from long audio speech [16].
To recognize the emotion from long audio speech, Jian-Hua Tao introduced a new method called “Semi-supervised Ladder Networks for Speech Emotion Recognition” to recognize emotions through a semi-supervised ladder network. This method has two encoders, one operating on a noise-corrupted version of the input and one on the clean input signal. The method is trained on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which has 12 hours of audio. The results show that the ladder network obtains 83.3%, which outperforms existing networks such as Denoising Autoencoders (DAE) and Variational Autoencoders (VAE). Moreover, the method has a higher classification rate for the angry and happy emotions [17].
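A greatly simplified sketch of the ladder-network idea is given below: a supervised loss on a noisy encoding of labeled data is combined with a single denoising reconstruction cost on unlabeled data. The full ladder network uses per-layer denoising costs and lateral connections, and all sizes, noise levels, and data here are assumptions.

```python
# Very simplified ladder-style training loop: supervised cross-entropy on a
# noisy encoding of labeled speech features plus a denoising reconstruction
# cost on unlabeled features (all sizes and data are placeholders).
import torch
import torch.nn as nn

n_features, n_emotions = 384, 4
encoder = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(),
                        nn.Linear(128, 64), nn.ReLU())
classifier = nn.Linear(64, n_emotions)
decoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                        nn.Linear(128, n_features))
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(classifier.parameters()) +
                       list(decoder.parameters()), lr=1e-3)

x_lab = torch.rand(32, n_features)                       # labeled batch (placeholder)
y_lab = torch.randint(0, n_emotions, (32,))
x_unl = torch.rand(32, n_features)                       # unlabeled batch (placeholder)

for step in range(50):
    opt.zero_grad()
    # Supervised path: corrupt the labeled input, encode, classify
    z_noisy = encoder(x_lab + 0.1 * torch.randn_like(x_lab))
    sup_loss = nn.functional.cross_entropy(classifier(z_noisy), y_lab)
    # Unsupervised path: reconstruct the clean input from a noisy encoding
    z_unl = encoder(x_unl + 0.1 * torch.randn_like(x_unl))
    recon_loss = nn.functional.mse_loss(decoder(z_unl), x_unl)
    (sup_loss + 0.5 * recon_loss).backward()
    opt.step()
```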
To effectively reduce the confusion between emotions and to improve the speech emotion recognition rate, Linhui Sun proposed a new method called “Speech Emotion Recognition Based on DNN-Decision Tree SVM Model”. The input signal is preprocessed with a pre-emphasis method, and MFCC features are extracted frame by frame. The extracted features are used to train a decision tree SVM and a DNN to classify the emotions. The results obtained from the method show that the average emotion recognition rate is 75.83% [18].
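A minimal sketch of this front end is shown below: pre-emphasis, frame-level MFCC extraction, and a single SVM standing in for the decision tree SVM / DNN back end. The frame lengths, MFCC order, and audio file names are assumptions, not details from the cited work.

```python
# Sketch of the front end: pre-emphasis, frame-level MFCCs, and one SVM as a
# stand-in for the decision-tree SVM / DNN back end (parameter values assumed).
import numpy as np
import librosa
from sklearn.svm import SVC

def extract_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)                 # pre-emphasis filter
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)        # 25 ms frames, 10 ms hop
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # utterance-level statistics

# Hypothetical file list and labels; replace with the real corpus
paths = ["angry_01.wav", "happy_01.wav"]
labels = [0, 1]
X = np.stack([extract_features(p) for p in paths])
clf = SVC(kernel="rbf").fit(X, labels)
```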
From the state of the art, emotion recognition still has some limitations in accurately predicting emotions. These limitations motivate the introduction of a new approach. The following section describes the motivation of the proposed work.
