Initially, Yasmin M. Kassim introduced a new work to identify the malaria parasite. The work uses the RBCNet architecture, which combines U-Net and Faster R-CNN: the U-Net first stage segments cell clusters, and the Faster R-CNN second stage detects small blood cells within those clusters. The dataset used in this work is a collection of malaria smears with 200,000 labeled cells, and the work surpasses traditional cell detection methods. However, the work cannot correctly recognize emotions from short audio segments [10].

To increase the recognition rate, Pengcheng Wei proposed a new system called “A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model”. To improve speech emotion recognition, the system uses an improved stacked kernel sparse deep-model algorithm based on an auto-encoder, a denoising auto-encoder, and a sparse auto-encoder. The recognition rate of the system is 80.95%. Although the work achieves a good recognition rate, the model is highly complex because it uses many encoders [11].

To simplify the model while retaining good accuracy, Linhui Sun proposed a new system called “Decision tree SVM model with Fisher feature selection for speech emotion recognition”. This work introduces Fisher feature selection to extract the features used to train each SVM in the decision tree. The system is evaluated on two datasets, a Chinese corpus and the Berlin (German) corpus, and achieves recognition rates of 83.75% on the Chinese dataset and 87.55% on the Berlin dataset. Although the work gives a good recognition rate, it does not correctly distinguish fear from sadness [12].

Similarly, Xingfeng Li proposed a new work called “Improving multilingual speech emotion recognition by combining acoustic features in a three-layer model”. The work first extracts acoustic features from the speech dataset, normalizes them with a speaker-normalization method, and selects a subset of them using the Fisher Discriminant Ratio (FDR). With the selected features, emotion dimensions such as arousal, pleasure, and power are identified by training logistic model trees. However, the work achieves different accuracies for different languages. To improve the accuracy, Xingfeng Li introduced a technique called segment repetition with data augmentation, which yields a high accuracy of 98.16% after augmentation. Even though the method gives high accuracy, it fails to classify emotions based on gender [13].

With the evolution of recent technologies, researchers have applied deep learning techniques to recognize emotion from speech. Bagas Adi Prayitno proposed a new methodology called “Segment Repetition Based on High Amplitude to Enhance a Speech Emotion Recognition”. The work uses the Berlin Emotional Speech Database (Berlin EMO-DB) and a Long Short-Term Memory (LSTM) network with 3 core layers and 2 recurrent layers to classify the different emotions; an illustrative sketch of such an LSTM classifier is given below. However, the method does not classify the emotions reliably and achieves a low accuracy of 66.18% [14].
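As a concrete illustration of this kind of recurrent classifier, the following minimal sketch stacks two LSTM layers over frame-level features and maps their output to the seven Berlin EMO-DB emotion classes through dense layers. It assumes Keras/TensorFlow; the layer widths, dropout rate, and input dimensions are illustrative assumptions, not the exact configuration reported in [14].

```python
# Minimal sketch of an LSTM speech-emotion classifier (assumes TensorFlow/Keras).
# Layer widths, dropout, and input dimensions are illustrative assumptions,
# not the configuration reported in [14].
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout

NUM_FRAMES = 300    # frames per utterance after padding (assumed)
NUM_FEATURES = 40   # frame-level features, e.g. 40 MFCCs (assumed)
NUM_EMOTIONS = 7    # Berlin EMO-DB defines seven emotion classes

model = Sequential([
    Input(shape=(NUM_FRAMES, NUM_FEATURES)),
    # Two recurrent layers read the frame sequence.
    LSTM(128, return_sequences=True),
    LSTM(64),
    # Dense layers map the sequence summary to emotion probabilities.
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(NUM_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training such a model requires padding or truncating every utterance to a fixed number of frames, which is one reason short segments remain difficult for sequence classifiers of this kind.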
To identify emotion based on gender, Ftoon Abu Shaqra proposed a new model called “Recognizing Emotion from Speech Based on Age and Gender Using Hierarchical Models”. The model uses hierarchical classification models to assess the necessity of identifying gender and age in the emotion-recognition process, and it is evaluated on the Toronto Emotional Speech Set (TESS). Although the work classifies emotions based on gender, it obtains a rather low accuracy of 74% [15].

To differentiate emotions effectively from a given speech dataset, Mohit Shah proposed a new method called “Articulation constrained learning with application to speech emotion recognition”. A discriminative learning method is used to recognize emotions by separating the features based on vowel arousal. The model distinguishes happiness from the other emotions more accurately on the ElectroMagnetic Articulography (EMA) database and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. However, the method fails to recognize emotions from long audio speech [16].

To recognize emotion from long audio speech, Jian-Hua Tao introduced a new method called “Semi-supervised Ladder Networks for Speech Emotion Recognition”, which recognizes emotions through a semi-supervised ladder network. The method has two encoders, one processing a noise-corrupted version of the input signal and one processing the clean signal. The method is trained on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which contains 12 hours of audio. The results show that the ladder network obtains 83.3%, outperforming existing networks such as Denoising Autoencoders (DAE) and Variational Autoencoders (VAE). Moreover, the method has a higher classification rate for the angry and happy emotions [17].

To reduce the confusion between emotions and improve the speech emotion recognition rate, Linhui Sun proposed a new method called “Speech Emotion Recognition Based on DNN-Decision Tree SVM Model”. The input signal is preprocessed with pre-emphasis, and MFCC features are extracted by framing; a minimal sketch of this front end is given at the end of this section. The extracted features are used to train a decision tree SVM and a DNN to classify the emotions. The results show an average emotion recognition rate of 75.83% [18].

As this state of the art shows, emotion recognition still has limitations in accurately predicting emotions. These limitations motivate the introduction of a new approach. The following section describes the motivation of the proposed work.
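For reference, the pre-emphasis and MFCC framing front end mentioned above can be sketched as follows. The sketch assumes the librosa library; the pre-emphasis coefficient (0.97) and the 25 ms frame / 10 ms hop settings are common defaults rather than values taken from [18], and extract_mfcc is a hypothetical helper name.

```python
# Minimal sketch of a pre-emphasis + framed MFCC front end (assumes librosa).
# The 0.97 coefficient and 25 ms / 10 ms frame settings are common defaults,
# not values reported in [18].
import numpy as np
import librosa

def extract_mfcc(path, sr=16000, alpha=0.97, n_mfcc=13):
    """Load audio, apply pre-emphasis, and compute frame-level MFCCs."""
    y, sr = librosa.load(path, sr=sr)
    # Pre-emphasis: y[t] - alpha * y[t-1] boosts the high-frequency content.
    y = np.append(y[0], y[1:] - alpha * y[:-1])
    # Framing is handled by the short-time transform inside librosa.feature.mfcc:
    # n_fft=400 and hop_length=160 give 25 ms frames with a 10 ms hop at 16 kHz.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
    return mfcc.T  # shape: (num_frames, n_mfcc)
```

The resulting frame-level feature matrix is the kind of input consumed by the decision tree SVM, DNN, and LSTM classifiers discussed above.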