
Deep Learning Techniques for Identifying Voices

Abstract

This research presents a thorough review of deep learning-based voice identification. We
examined the applicability of Convolutional Neural Networks (CNNs) in identifying and verifying
speakers by their voices, and we tested and identified the voice of a particular speaker.
Based on the study, we apply deep learning approaches to discuss algorithms for the problem
of voice identification. In the initial step of data preparation, we employed the MFCC algorithm
in the voice preprocessing process. A comparative examination of two classification methods
was performed to solve the problem. Furthermore, the current study developed a custom
ResNet (Residual neural network) model utilizing a well-known Python deep learning
framework and compared it with AlexNet, a convolutional neural network that has had a
significant influence on the field of machine learning, particularly in the application of deep
learning. The ResNet accuracy is 97.2039% and the AlexNet accuracy is 65.95%; of the two
models examined, the ResNet model produced the best results.

Keywords: Deep Learning, Voice identification, Convolutional Neural Network (CNN), AlexNet,
ResNet.

1. Introduction

Deep learning is an artificial intelligence (AI) structure that imitates the activity of the
human brain in processing information for identifying objects, recognizing speech,
interpreting languages, and ultimately making decisions based on the data collected (Bae et al., 2016).
Deep learning can learn without human oversight, drawing from information that is both
unstructured and unlabeled. Deep learning is a subfield of AI focused on algorithms inspired
by the structure and function of the brain, called artificial neural networks. Deep learning
algorithms attempt to reach conclusions similar to those a person would reach by continuously
examining information within a given logical structure (Zhou et al., 2020). To accomplish this,
deep learning uses a multi-layered arrangement of computations called neural networks. The design of
a neural network is modeled on the structure of the human brain. Just as humans use their brains
to recognize patterns and classify different kinds of data, neural networks can be
trained to perform the same tasks on data. The individual layers of a neural network can
also be thought of as filters that work from coarse to fine, improving the
probability of detecting and producing a correct outcome.

Deep learning has gained considerable importance in the recent decade owing to the
continuous growth of Artificial Intelligence (AI) technology. Deep learning refers to a
machine-learning programme that can locate objects, identify speech, translate and interpret
languages, and make decisions without human supervision. While deep learning techniques
have been successfully implemented in recognising characters and objects, their development in
identifying and interpreting voices is still in progress. Meanwhile, Deng (2018) says that the
importance surrounding the identification and interpretation of voices has been increasing in
recent years. The growing importance is the result of a rise in demand for voice-controlled
AI systems. Nowadays, AI technology has extended its applicability to smartphones, laptops,
speakers, and many other technological home appliances. The market has also become accustomed
to AI-powered personal assistants such as "OK Google" and "Alexa". Using such a personal
assistant, individuals can control different wireless appliances using only their voice. While
the AI's performance in voice recognition has been widely praised, it is also necessary to
address its issues in distinguishing the voices of individuals. The demand for discerning
voices is higher now than ever, since voice-based AI systems are included in personal security
and critical surveillance technologies. In this regard, the current study attempted to identify
voices using deep learning techniques. The study primarily focused on the utilisation of
convolutional neural networks, as they offer a non-linearity over an audio-based dataset that
makes it easier to control.
The human brain works in a similar way: each time a person encounters new data, the brain
attempts to compare it to other known items. A similar idea is used by deep neural networks,
where the system compares the data collected to previously stored information (Deng et al., 2013).
Deep learning is notable for its relevance in image recognition. However, another critical use
of the technology is in the speech recognition employed by digital assistants such as Amazon's
Alexa or in messaging through voice recognition. The benefit of deep learning in voice
identification comes from the adaptability and flexibility of deep neural networks, which have
lately become more accessible. Voice identification at its most basic level enables the
conversion of voice waves into individual letters, which in turn form full statements. A
critical challenge in identifying the right words is the variability of the voice produced for
the same word in different accents, such as "Hello" vs "Hellooo". When presented with an audible
sentence, voice recognition starts by transforming the voice waves using the Fast Fourier
Transform and concatenating structures from adjacent windows in order to produce a spectrogram
(Nguyen et al., 2019). The main aim is to reduce the dimensionality of the univariate voice
data, enabling prediction of the letters and words that come next. Computer-based processing
and recognition of human voices is referred to as speech recognition. It can be used to
validate user credentials in certain systems, as well as to give instructions to digital
assistants such as Siri, Cortana, or Google Assistant. These systems work by storing human
voices, which then train the voice identification system to listen for vocabulary and patterns
within the speech produced.

Another vital challenge in voice identification is the issue of latency; to transcribe
progressively, the model should predict words accurately without seeing the entire sentence.
Some deep learning models, such as bi-directional recurrent neural networks, benefit greatly
from using the entire sentence because of the additional context (Lee et al., 2017). The
approach to reducing latency is to build limited context into the model architecture by
allowing the neural network to access a small amount of data after a particular word
(Fayek et al., 2017). Finally, deep learning is still genuinely in its infancy, but it is
rapidly approaching state-of-the-art capability in voice identification. Studies indicate that
there is still a great deal of room for improvement in model design to reduce latency and
increase accuracy (Bashar, 2019).

Deep learning architectures include the AlexNet and ResNet models, generative adversarial
networks, and recurrent neural networks with long short-term memory, as well as unsupervised
generative models. Neural networks form the background from which AlexNet and ResNet were
presented and developed as discriminative models. The large-scale application of these
architectures to many tasks, such as automatic speech recognition, natural language processing,
voice identification, and three-dimensional object detection, has proven that they produce
improved results across different tasks.

1.1 Problem statement

While there has been significant development in research related to speech and music, there
has been only limited growth concerning the classification of voices.

Deep learning refers to an AI program that can find objects, recognize speech, translate and
interpret languages, and make choices without human supervision. While deep learning methods
have been successfully used in recognizing characters and objects, their advancement in
recognizing and interpreting voices is still in progress.

The significance of identifying and interpreting voices has been expanding lately. The growing
significance is the result of a rise in demand for voice-controlled AI systems. These days,
AI technology has expanded its relevance; it includes mobile phones, computers, speakers, and
many other smart home appliances. Generally, voice identification models are used in
classification algorithms to produce a distribution of potential phonemes for each frame.

Deep learning has advanced over the past few years with the increasing volume of information,
which has allowed machines to learn by mimicking human intelligence and to make decisions. In
the same way, voice recognition for individuals involves a machine consuming voices, speeches,
utterances, and phrases from an individual so that each time the individual speaks, the speech
is converted to voice waves and compared to previously stored datasets in order to recognize
the specific person. It is worth noting that identification of voices through deep learning
improves with time and with the amount of data fed into the system. Artificial intelligence
has made it possible for machine learning to evolve through the process of data processing.
With increased data, machines are now able to make more informed decisions. This concept is
applicable to the deep learning techniques used in identifying voices. With increasing amounts
of data, the voice identification process becomes more efficient over time and is able to
distinguish between different phrases even when they are said in different ways, as long as
the basic characteristics learned by the machine are maintained.

The problem statement above was arrived at after reviewing a considerable body of research in
the fields of deep learning and audio technologies. Earlier techniques involved combining
variables into features, thereby accelerating data processing. However, as such a technique
lacks a neural approach, it fails to discern between the voices of individuals. Meanwhile,
deep learning has made use of convolutional neural networks to classify images with high
accuracy. It is expected that, by applying these techniques, the accuracy of voice recognition
and classification can be increased.

1.2 Research aim

This research aims to apply deep learning techniques to identify the voices of individuals in
an environment and classify them using the Convolutional Neural Network models AlexNet and
ResNet, which are used in voice identification. Through the study of different deep learning
techniques, it seeks to identify the ways in which a system recognizes the different
characteristics of an individual voice and is able to differentiate one voice from another,
and to inspect the practicality of recently developed deep learning architectures in a
real-life environment with access to a large dataset, as well as the accuracy of the system
in voice identification.
1.3 Research Objectives
1. Explore ResNet and AlexNet models in identifying the voice.
2. Apply Convolutional Neural Network models to evaluate the performance of the ResNet and
AlexNet models in identifying and verifying the voice of a particular speaker.
3. Evaluate the models' results and compare them with other deep learning or machine
learning models.

1.4 Research Questions


1. How can the ResNet and AlexNet models identify the voice?
2. How can the performance of the ResNet and AlexNet models be improved?
3. What are the main factors affecting the performance of the models?

1.5 Summary
In the introduction section, we reviewed successfully implemented systems that utilize
deep learning techniques, introduced the problem statement of the study about identifying
voices using CNN models, described the aim and the research questions, and stated the
objectives of the research.
2. Related work
In this section we discuss studies related to our work and how they help our research:
Convolutional Neural Network studies; Recurrent Neural Network (RNN) studies; Deep Belief
Network (DBN) studies, a more complicated technique structure compared with the other
techniques, including text-independent speaker identification; Recursive Neural Network
(RvNN) studies, a hierarchical deep learning technique; and studies of the deep Boltzmann
machine, which is derived using Markov random fields and has different types of hidden
layers. We also review the AlexNet and ResNet algorithms. As seen from the review, since the
convolutional neural network offers better results than the other approaches, it can be
used in identifying and verifying speakers by their voices.
2.1 Recurrent neural network (RNN)

A branch of the neural network family is the RNN, which is based on sequential information;
its inputs and outputs are interdependent. The interdependency of the output and input is
helpful in forecasting the future state of the input and output. Memory is required for
storing the information, as in the case of a convolutional neural network. The information is
stored in sequential steps, as is usual in deep learning modeling, and it only works
efficiently for the steps used in backpropagation (Khalil et al., 2019). Sensitivity to
gradient vanishing is the major problem of the recurrent neural network, and it affects the
whole functioning of the RNN. The gradient decays during the training phase as it is
multiplied by large and small derivatives. Input provided at the initial steps is forgotten,
and sensitivity is reduced as time increases. In the recurrent connection, a block is placed
with the help of Long Short-Term Memory (LSTM), which reduces the issues caused by this
sensitivity to the gradient. The temporal state of the network is stored by each memory
block, and a gate unit manages the new information coming into the system (Bunrit et al., 2019).

2.2 Deep Belief Network (DBN):

The DBN is a more complicated structure compared with the other techniques. A cascade of
Restricted Boltzmann Machine (RBM) structures is required for building this technique; the
DBN is an extension of the Restricted Boltzmann Machine and is trained in a bottom-up manner.
Speech emotion recognition is carried out with the help of a Deep Belief Network because the
DBN has the ability to recognize patterns effectively and is not dependent on the size of the
parameters (Boles & Rad, 2017). Non-linearity is provided by each layer of the deep belief
network. The backpropagation algorithm is used to address the problem of slow speed during
the training period.
2.3 Recursive neural network (RvNN)

A recursive neural network (RvNN) is a hierarchical deep learning technique that operates on
a tree structure over the input sequence. The input is divided into smaller parts for a
better and easier understanding of the data. It is mostly used for speech emotion recognition
and for patterns in visual and audio input. Natural language processing is carried out with
the help of the RvNN, but it can also help in different types of speech recognition and SER
(speech emotion recognition). Natural image processing and natural image sentence
classification can be done with the help of the RvNN. A syntactic tree is built by computing
the overall score for each possible pair and merging them into pairs. The compositional
vector is formed by combining the pair with the highest score in the merging process.
Multiple units are generated by the recursive neural network when the merging of pairs is
completed (Fang et al., 2019).

3. Methodology

3.1 Overview
In this section we discuss the proposed methodology, data collection, data preparation and
MFCC (Mel-Frequency Cepstral Coefficients) feature extraction, and the custom models for
classification, which comprise the ResNet model and the AlexNet model.
The current research makes use of the quantitative research methodology. Qualitative
research methodology uses non-numerical data for understanding the underlying reasons
behind a selected subject (Taylor, 2005). Meanwhile, quantitative research methodology
encourages the use of numerical data; it can help in identifying patterns and making
predictions using statistical means (McEvoy and Richards, 2006). The current research
requires the use of numerical data and the recognition of patterns in the changes with regard to
voices and their classification. Hence, it becomes necessary to use the quantitative research
approach. The data required for carrying out the analysis is an existing voice dataset from
OpenSLR. The convolutional neural network (CNN) model is trained using the
spectrogram images of a specific set. The varying levels of accuracy in identifying the
specific set enable the current research to understand the level of applicability of
CNNs in identifying voices.
The following are the steps to implement the proposed methodology of voice recognition of
the speaker:

Step 1: Collect the voice dataset, which contains 2937 audio files.

Step 2: Perform feature extraction using the Librosa Python audio processing library to
compute the MFCC features of the voices.

Step 3: Divide the dataset into train and test sets at a ratio of 9:1.

Step 4: Set label IDs for recognition (a sketch of Steps 3 and 4 follows this list).

Step 5: Train the CNN model for about 50 epochs for better prediction.

Step 6: Save the best models; the saved model can then be used to test and recognize voices.
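As an illustration of Steps 3 and 4, the following is a minimal sketch, assuming each audio file path follows LibriSpeech's directory layout (where the top-level directory name is the speaker ID); the dataset path, helper variables, and use of scikit-learn's splitter are our own assumptions rather than the original pipeline:

```python
from pathlib import Path
from sklearn.model_selection import train_test_split

# Collect (file, speaker) pairs; in LibriSpeech the speaker ID is the first
# directory level, e.g. dev-clean/84/121123/84-121123-0000.flac
files = sorted(Path("LibriSpeech/dev-clean").rglob("*.flac"))
speakers = [f.relative_to("LibriSpeech/dev-clean").parts[0] for f in files]

# Step 4: assign a contiguous integer label ID to every speaker (40 classes)
label_ids = {spk: i for i, spk in enumerate(sorted(set(speakers)))}
labels = [label_ids[spk] for spk in speakers]

# Step 3: split the files 9:1 into training and test sets, keeping the class
# distribution roughly balanced via stratification
train_files, test_files, train_y, test_y = train_test_split(
    files, labels, test_size=0.1, stratify=labels, random_state=42)

print(len(label_ids), "speakers;", len(train_files), "train /", len(test_files), "test files")
```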

In voice identification, the main process is speech pre-processing. Spectrograms are one way
to represent wave files. We selected MFCC (Mel-Frequency Cepstral Coefficients) as the tool for
extracting voice dynamics features. Voice signals in the time domain change very quickly and
dramatically, but if we convert speech signals from the time domain to the frequency domain,
then the corresponding spectrum can be clearly defined. We separate the signals into frames
and apply a window function to increase the continuity of the voice signals within each frame.
The Discrete Cosine Transform (DCT) is used for quantitative evaluation of the spectral energy
data into data units that can be analyzed by MFCC (Nasr et al., 2018).

Figure 1: MFCC (mel-frequency cepstral coefficients) [Zinchenko et al., 2017]



3.2 Proposed Methodology

In this section we present the methodology of our work. It involves data collection and
pre-processing, the ResNet and AlexNet models, and the classification evaluations; Section 4
presents the results.

Figure 2 depicts the structure of the proposed methodology: data collection and
pre-processing (voice data, Discrete Cosine Transform (DCT), MFCC feature extraction), the
ResNet and AlexNet models, the evaluations (precision, recall, F1-score, support, accuracy),
and the results (result and visualization).

Figure 2. Proposed Methodology structure


3.2.1 Data preparation

We collected a raw human voice dataset from the LibriSpeech ASR corpus (Vassil Panayotov et al.,
2015). LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech,
prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read
audiobooks from the LibriVox project and has been carefully segmented and aligned.

Table-1. Sample of speaker’s dataset

Our data, the "dev-clean.tar.gz" archive, contains 2937 audio files of people and is treated
as a voice classification task with 40 classes.
3.2.2 MFCC feature extraction

In voice identification, the main process is speech pre-processing. We used MFCC (Mel-Frequency
Cepstral Coefficients) as the tool for extracting voice dynamics features, with acoustic
features (such as MFCCs or spectra) as inputs and speaker IDs as the target variable.

Voice signals in the time domain change very quickly and dramatically, but if we convert speech
signals from the time domain to the frequency domain, then the corresponding spectrum can
be clearly defined. We separate the signals into frames and apply a window function to
increase the continuity of the voice signals within each frame. The DCT is used for quantitative
evaluation of the spectral energy data into data units that can be analyzed by MFCC. This is the
process of converting the log Mel spectrum into the time domain using the DCT. The result of
the conversion is called the Mel Frequency Cepstral Coefficients, and the set of coefficients is
called an acoustic vector. In this way, each input utterance is converted into a sequence of
acoustic vectors.

MFCC is a representation of speech signals based on human perception. MFCC uses the
Discrete Cosine Transform. The principal difference is that the bands are placed in a logarithmic
manner, according to the Mel scale. In this way, speech is modeled by a more human-like
response, permitting more effective signal processing.

Figure -3. MFCC Feature Extraction.

The MFCC computation starts with dividing the signal into frames. Thereafter, the
spectrum is filtered using thirteen triangular windows corresponding to the Mel frequency scale.
Logarithmic functions are applied to the energy computed in the Mel frequency bands, and the
Discrete Cosine Transform (DCT) is applied to each log energy. Finally, the MFCCs correspond to
the amplitude spectrum obtained after the DCT. In this study, the signal was divided into frames
of 30 milliseconds with a frame shift of 10 milliseconds. For each frame, thirteen coefficients
are computed to represent the acoustic signal, based on the performance obtained in previous
studies.
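For reference, a standard textbook form of the Mel mapping and of the DCT used in this computation is given below; these formulas are not taken from the paper, and the exact constants in librosa's implementation may differ slightly:

$$\mathrm{mel}(f) = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
c_n = \sum_{m=1}^{M} \log(E_m)\,\cos\!\left[\frac{\pi n}{M}\left(m - \tfrac{1}{2}\right)\right],$$

where $E_m$ is the energy in the $m$-th Mel band, $M$ is the number of filter bank channels, and $c_n$ is the $n$-th cepstral coefficient.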

Figure 4: Computing MFCC features

Although deep learning eliminates the need for hand-engineered features, we have to choose a
representation model for our data. Instead of directly using the voice file as an amplitude-vs-
time signal, we use a log-scaled mel-spectrogram with 128 components (bands) covering the
audible frequency range (0-22050 Hz), using a window size of 23 ms (1024 samples at 44.1 kHz)
and a hop size of the same duration. This conversion takes into account the fact that the human
ear hears sound on a log scale, and closely spaced frequencies are not well distinguished by the
human cochlea; the effect becomes stronger as the frequency increases. Accordingly, we only take
into account the power in different frequency bands. Before being fed into the MFCC computation
pipeline, the raw audio files are resampled, normalized, and padded to a fixed size of 64000
samples. We used the well-known audio processing library "librosa 0.7.2" to compute the MFCC
features of the audio files. The sampling rate is set to 16000 and the number of MFCCs is set to
24. Finally, we produced a feature file (data.pkl) containing the extracted features that we
used for training and recognition with the ResNet and AlexNet models.
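A minimal sketch of this feature-extraction step, assuming the stated settings (16 kHz resampling, peak normalization, padding to 64000 samples, 24 MFCCs) and using librosa and pickle; the dataset path, function name, and caching format are illustrative rather than the paper's exact script:

```python
import pickle
from pathlib import Path
import numpy as np
import librosa

SR, FIXED_LEN, N_MFCC = 16000, 64000, 24

def extract_mfcc(path):
    # Resample to 16 kHz while loading
    y, sr = librosa.load(path, sr=SR)
    # Peak-normalize the waveform
    y = y / (np.max(np.abs(y)) + 1e-9)
    # Pad (or truncate) to a fixed length of 64000 samples (4 s at 16 kHz)
    y = np.pad(y, (0, max(0, FIXED_LEN - len(y))))[:FIXED_LEN]
    # Compute 24 MFCCs -> array of shape (24, n_frames)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)

# Extract features for every file and cache them for training/recognition
files = sorted(Path("LibriSpeech/dev-clean").rglob("*.flac"))
features = {str(f): extract_mfcc(f) for f in files}
with open("data.pkl", "wb") as fh:
    pickle.dump(features, fh)
```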

3.2.3 CNN models for classification


3.2.3.1 ResNet model
As a classification model, the current research created a custom ResNet (Residual
neural network) model using the well-known Python deep learning framework "PyTorch 1.2".
PyTorch is an open source machine learning library based on the Torch library used
for applications such as computer vision and natural language processing, primarily
developed by Facebook's AI Research lab.

Figure 5: The ResNet block model


Figure 5 shows that the ResNet consists of many CNN layers, batch norm layers, max pooling layers,
and ReLU layers. The core idea of ResNet is the introduction of a so-called "identity shortcut
connection" that skips one or more layers, which makes it very effective at classification tasks.

The custom ResNet model in this research consists of 11 ResNet blocks, where each ResNet block
consists of 3 CNN layers together with batch norm and max pooling layers.
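A minimal PyTorch sketch of one such residual block follows, under our own assumptions about details the text leaves open (2-D convolutions over the MFCC matrix, 3×3 kernels, a 1×1 convolution on the shortcut when the channel count changes); it illustrates the described structure and is not the paper's exact code:

```python
import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Three conv + batch-norm layers with an identity shortcut, then max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.BatchNorm2d(out_ch),
        )
        # Identity shortcut; a 1x1 conv only adapts the channel count when needed
        self.shortcut = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=2, ceil_mode=True)

    def forward(self, x):
        out = torch.relu(self.body(x) + self.shortcut(x))  # skip connection
        return self.pool(out)

# Example: an MFCC input treated as a 1-channel "image" of shape (batch, 1, 24, frames)
x = torch.randn(4, 1, 24, 126)
print(ResNetBlock(1, 32)(x).shape)  # -> torch.Size([4, 32, 12, 63])
```

The full model would stack 11 such blocks followed by a classification head; the channel widths shown here are illustrative.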

3.2.3.2 AlexNet model

The AlexNet convolutional neural network model is shown in Figure 6.

Figure -6. Architecture of AlexNet

As shown in Figure 6, AlexNet is a convolutional neural network which has had a large impact on
the field of machine learning, specifically in the application of deep learning to machine
vision. It consists of 11×11, 5×5, and 3×3 convolutions, max pooling, dropout, data
augmentation, ReLU activations, and SGD with momentum, with ReLU activations attached after
every convolutional and fully-connected layer. AlexNet was trained simultaneously on two Nvidia
GeForce GTX 580 GPUs, which is why the original network is split into two pipelines.
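As an illustration only (the paper does not specify its exact AlexNet implementation), the following sketch shows one common way to apply an AlexNet-style classifier to the 40-speaker task with torchvision, assuming the MFCC matrix is replicated to three channels and resized to AlexNet's expected input size:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# AlexNet with a 40-way output layer for the 40 speaker classes
alexnet = models.alexnet(num_classes=40)

def forward_mfcc(mfcc_batch):
    """mfcc_batch: tensor of shape (batch, 24, frames) holding MFCC features."""
    x = mfcc_batch.unsqueeze(1)            # (batch, 1, 24, frames)
    x = x.repeat(1, 3, 1, 1)               # replicate to 3 channels
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    return alexnet(x)                      # logits over 40 speakers

logits = forward_mfcc(torch.randn(4, 24, 126))
print(logits.shape)  # -> torch.Size([4, 40])
```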
3.4. Performance Evaluation
This study evaluates the classification performance of the two models using precision, recall,
F1-score, and accuracy.

Figure 7: Classification evaluations metrics equations

The classification evaluation metrics include precision, whose equation is given in Figure 7;
it measures the positive patterns that are correctly predicted out of the total predicted
patterns in a positive class. Recall is used to measure the fraction of positive patterns
that are correctly classified, and the accuracy metric measures the ratio of correct predictions
over the total number of instances evaluated. The F-score is derived from the two measures
precision and recall; this metric represents the harmonic mean of the recall and precision
values (M and M.N, 2015).
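In standard notation, with TP, FP, FN, and TN denoting true/false positives and negatives, these are the usual textbook forms of the equations referenced in Figure 7:

$$\text{Precision} = \frac{TP}{TP + FP},\quad
\text{Recall} = \frac{TP}{TP + FN},\quad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\quad
F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}.$$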

This research implements a voice classification model using a novel structure of convolutional
neural networks and the MFCC features of the waveform, namely the ResNet and AlexNet models.

We compare our work with the paper of Orken Mamyrbayev et al. (2019), which discussed
classification algorithms for the problem of personality identification by voice using machine
learning methods.

4. Results and Discussion


4.1 Overview

This section describes the experimental results and the performance evaluation, which show the
effectiveness of the algorithms; it presents the results of the experiments and a comparison
between the results using graphs and tables, and discusses them. We implemented and evaluated
the methods for the problem of speaker recognition and conducted a comparison study.
Experiments revealed that the best results were obtained using a custom deep convolutional
neural network based on ResNet, which we compared with AlexNet. The outcomes of the suggested
system's implementation are shown in this section. Two techniques were compared: both used MFCC
to extract features from the audio data and then used ResNet or AlexNet, respectively, for
training and recognition.

4.2 Training Data

The original dataset is split into training and validation sets at a ratio of 9:1.

We used a classification loss function as the objective function. The optimization algorithm is
a stochastic gradient descent variant that has lately gained popularity for deep learning
applications in computer vision and natural language processing. During training, the training
history and the most accurate MFCC-based model were retained for further inference on raw
audio files.

No | Hyperparameter | Value
1  | Learning rate  | 0.0005
2  | Batch size     | 16
3  | Num of epochs  | 50

Table-2. Hyperparameters for training of data

Table 2 shows the hyperparameters used for training on the audio dataset. The learning rate is
0.0005; the batch size is 16, and the size of a batch must be greater than or equal to one and
less than or equal to the number of samples in the training dataset. The number of epochs is 50,
which is the number of complete passes through the training dataset. A training-loop sketch
using these values is shown below.
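The following is a minimal training-loop sketch with these hyperparameters (learning rate 0.0005, batch size 16, 50 epochs) and best-model saving; the optimizer (Adam, as one popular SGD variant), the cross-entropy loss, the small stand-in network, and the random placeholder tensors are our assumptions, since the paper does not name them explicitly:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder MFCC feature tensors and speaker labels (2937 files split 9:1 -> 2643/294)
train_ds = TensorDataset(torch.randn(2643, 1, 24, 126), torch.randint(0, 40, (2643,)))
val_ds   = TensorDataset(torch.randn(294, 1, 24, 126), torch.randint(0, 40, (294,)))
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)
val_dl   = DataLoader(val_ds, batch_size=16)

# Tiny stand-in for the actual ResNet/AlexNet classifiers described above
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 40))
criterion = nn.CrossEntropyLoss()                              # assumed loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)    # assumed SGD variant

best_acc = 0.0
for epoch in range(50):
    model.train()
    for x, y in train_dl:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    # Validation pass; keep the most accurate model ("save the best models")
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_dl:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    acc = correct / total
    if acc > best_acc:
        best_acc = acc
        torch.save(model.state_dict(), "best_model.pt")
```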
4.3 Resnet model Results

The ResNet (Residual neural network) model is used as a classification model. Its training is
summarized by the number of epochs, the loss value, and the accuracy; Figure 8 plots the
training loss, training accuracy (train_acc), and validation accuracy (val_acc) of the
ResNet model.

Figure-8. Training curves (training loss, train_acc and val_acc) of the ResNet model

As can be seen in Figure 8, the graph shows the accuracy of the ResNet model in each of the 50
epochs (iterations) along with the loss. The X-axis of the graph is the epoch and the Y-axis is
the result percentage. The minimum loss value is 0.0259, the best training accuracy is 98.64%,
and the best validation accuracy is 87.86%. The validation accuracy is still lower than the
training accuracy, which means our model is somewhat over-fitted.

The ResNet (Residual neural network) model is used as a classification model. Table 4 reports
the classification evaluation metrics: precision describes the positive patterns that are
correctly predicted out of the total predicted patterns in a positive class; the recall metric
is used to calculate the proportion of positive patterns that are correctly classified; the
accuracy metric computes the proportion of correct predictions divided by the total number of
instances evaluated; and the F-score is derived from the two measures precision and recall,
representing their harmonic mean.
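Reports of this kind (per-speaker precision, recall, F1-score, and support) can be produced with scikit-learn's classification_report; a short sketch with hypothetical predicted and true label arrays:

```python
from sklearn.metrics import classification_report, accuracy_score

# y_true / y_pred: integer speaker label IDs for the validation files (illustrative values)
y_true = [0, 0, 1, 1, 2, 2, 2, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 3]

print("accuracy:", accuracy_score(y_true, y_pred))
# Per-class precision, recall, F1-score and support, plus macro/weighted averages
print(classification_report(y_true, y_pred, digits=3))
```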

4.4 AlexNet model results

The AlexNet model is used as a classification model, and its evaluation uses the same
classification metrics. Figure 9 plots the training loss, training accuracy (train_acc), and
validation accuracy (val_acc) of the AlexNet model.

Figure-9. Training curves (training loss, train_acc and val_acc) of the AlexNet model
As can be seen in Figure 9, the graph shows the accuracy of the AlexNet model in each of the 50
epochs (iterations) along with the loss. The X-axis of the graph is the epoch and the Y-axis is
the result percentage. The minimum loss value is 1.064, the best training accuracy is 65.95%,
and the best validation accuracy is 54.04%. The validation accuracy is still lower than the
training accuracy.

Table 5 shows the classification report for testing speaker ID 1911 with the AlexNet model to
identify the voice of ID 1911; the accuracy is 39%, which is the percentage of true predictions
made by the model when compared to the actual classifications in the dataset. Precision and
recall, on the other hand, are two assessment measures that are determined using the confusion
matrix. Overall, the AlexNet model accuracy for this speaker is 39%.
4.5 Discussion

We answered the research questions concerning the effectiveness of CNN models, especially the
ResNet model, in identifying voice. In this section, we present the classification accuracy of
the ResNet and AlexNet models and show graphically the averages of precision, recall, and
F1-score. We then compare our work with a related work. Figure 9 above shows the relation
between the training loss and accuracy of the AlexNet model.

Figure 10 summarizes the average classification metrics of the AlexNet model: average precision
0.4257, average recall 0.4129, and average F1-score 0.3712.

Figure - 10. Accuracy of AlexNet model classification


From Figure 10, we find that the AlexNet model can help with the problem of identifying voices;
the AlexNet classification algorithm for the problem of identification by voice using deep
learning methods achieves a precision of 42.5%, a recall of 41.3%, and an F1-score of 37.1%.

Figure 11 summarizes the average classification metrics of the ResNet model: average precision
0.9136, average recall 0.8998, and average F1-score 0.8974.

Figure - 11. Accuracy of ResNet model classification

From Figure 11, we find that the ResNet model helps considerably with the problem of identifying
voices; the ResNet classification algorithm for the problem of identification by voice using
deep learning methods achieves a precision of 91.3%, a recall of 89.9%, and an F1-score of 89.7%.
Classification algorithms                             | Accuracy | Machine or deep learning methods | Sharing algorithm
Our work: ResNet                                      | 97.2039% | Deep learning methods            | MFCC algorithm in the speech preprocessing process
Our work: AlexNet                                     | 65.95%   | Deep learning methods            | MFCC algorithm in the speech preprocessing process
Orken Mamyrbayev et al. (2019): support vector method | 83%      | Machine learning methods         |
Orken Mamyrbayev et al. (2019): Robust scaler method  | 90%      | Machine learning methods         |

Table – 3: Comparison between our work and a related work


Orken Mamyrbayev et al. (2019) discussed classification algorithms for the problem of
personality identification by voice using machine learning methods.

In Table 3, we compare our work with the work of Orken Mamyrbayev et al. The two works share one
algorithm, the MFCC algorithm in the speech preprocessing process, but differ in the
classification algorithms: our work used the deep learning methods ResNet, with an accuracy of
97.2039%, and AlexNet, with an accuracy of 65.95%, whereas Orken Mamyrbayev et al. used machine
learning methods (the support vector method with an accuracy of 83% and the robust scaler method
with an accuracy of 90%). Our work performs better using ResNet, with an accuracy of 97.2039%,
and worse using AlexNet, with an accuracy of 65.95%.

5. Conclusion
Our research has presented a thorough review of deep learning-based voice identification. We
examined the applicability of Convolutional Neural Networks (CNNs) in identifying and verifying
speakers by their voices, and we tested and identified the voice of a particular speaker. Based on
the study, we applied deep learning approaches to address algorithms for the problem of voice
identification. In the initial step of data preparation, we employed the MFCC algorithm in the
voice preprocessing process. A comparative examination of two classification methods was performed
to solve the problem. Furthermore, the current study developed a custom ResNet (Residual neural
network) model utilizing a well-known Python deep learning framework and compared it with AlexNet,
a convolutional neural network that has had a significant influence on the field of machine
learning, particularly in the application of deep learning. The ResNet accuracy is 97.2039% and
the AlexNet accuracy is 65.95%; of the two models examined, the ResNet model produced the best results.
References

Andrei, V., Cucu, H., & Burileanu, C. (2017). Detecting Overlapped Speech on Short Timeframes Using Deep Learning. Interspeech
2017.

Andrew. (2016). Nuts and bolts of building AI applications using Deep Learning. NIPS Keynote Talk.

Bae, H. S., Lee, H. J., & Lee, S. G. (2016). Voice recognition based on adaptive MFCC and deep learning. In 2016 IEEE 11th
Conference on Industrial Electronics and Applications (ICIEA) (pp. 1542-1546). IEEE.

Bashar, A. (2019). Survey on evolving deep learning neural network architectures. Journal of Artificial Intelligence, 1(02), 73-82.

Boles, A., & Rad, P. (2017). Voice biometrics: Deep learning-based voiceprint authentication system. 2017 12Th System of
Systems Engineering Conference (Sose).

Bunrit, S., Inkian, T., Kerdprasop, N., & Kerdprasop, K. (2019). Text-Independent Speaker Identification Using Deep Learning
Model of Convolution Neural Network. International Journal Of Machine Learning And Computing, 9(2), 143-148.

Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., & Blaschke, T. (2018). The rise of deep learning in drug discovery. Drug discovery
today, 23(6), 1241-1250.

Deng, L. (2018). Artificial intelligence in the rising wave of deep learning: The historical path and future outlook
[perspectives]. IEEE Signal Processing Magazine, 35(1), 180-177.

Deng, L., & Liu, Y. (Eds.). (2018). Deep learning in natural language processing. Springer.

Deng, L., Hinton, G., & Kingsbury, B. (2013). New types of deep neural network learning for speech recognition and related
applications: An overview. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 8599-8603).
IEEE.

Fang, S., Tsao, Y., Hsiao, M., Chen, J., Lai, Y., Lin, F., & Wang, C. (2019). Detection of Pathological Voice Using Cepstrum Vectors: A
Deep Learning Approach. Journal Of Voice, 33(5), 634-641.

Fayek, H. M., Lech, M., & Cavedon, L. (2017). Evaluating deep learning architectures for Speech Emotion Recognition. Neural
Networks, 92, 60-68.

Grover, S., Bhartia, S., Akshama, Yadav, A., & K.R., S. (2011). Predicting Severity Of Parkinson’s Disease Using Deep
Learning. Procedia Computer Science, 132, 1788-1794.

Han, Y., Kim, J., & Lee, K. (2016). Deep convolutional neural networks for predominant instrument recognition in polyphonic
music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1), 208-221.

Heaven, D. (2019). Why deep-learning AIs are so easy to fool. Nature, 574(7777), 163-166.
Hegde, S., Shetty, S., Rai, S., & Dodderi, T. (2019). A Survey on Machine Learning Approaches for Automatic Detection of Voice
Disorders. Journal of Voice, 33(6), 947.e11-947.e33.

Khalil, R., Jones, E., Babar, M., Jan, T., Zafar, M., & Alhussain, T. (2019). Speech Emotion Recognition Using Deep Learning
Techniques: A Review. IEEE Access, 7, 117327-117345.

Lee, J. G., Jun, S., Cho, Y. W., Lee, H., Kim, G. B., Seo, J. B., & Kim, N. (2017). Deep learning in medical imaging: general overview.
Korean journal of radiology, 18(4), 570-584.

Lemley, J., Bazrafkan, S., & Corcoran, P. (2017). Deep Learning for Consumer Devices and Services: Pushing the limits for machine
learning, artificial intelligence, and computer vision. IEEE Consumer Electronics Magazine, 6(2), 48-56.

Liu, H., Fang, T., Zhou, T., Wang, Y., & Wang, L. (2018). Deep learning-based multimodal control interface for human-robot
collaboration. Procedia CIRP, 72, 3-8.

Malik, M., Adavanne, S., Drossos, K., Virtanen, T., Ticha, D., & Jarina, R. (2017). Stacked convolutional and recurrent neural
networks for music emotion recognition.

Marcus, G. (2018). Deep learning: A critical appraisal.

McEvoy, P., & Richards, D. (2006). A critical realist rationale for using a combination of quantitative and qualitative
methods. Journal of research in nursing, 11(1), 66-78.

Nguyen, G., Dlugolinsky, S., Bobák, M., Tran, V., García, Á. L., Heredia, I., ... & Hluchý, L. (2019). Machine learning and deep
learning frameworks and libraries for large-scale data mining: a survey. Artificial Intelligence Review, 52(1), 77-124.

Parcollet, T., Zhang, Y., Morchid, M., Trabelsi, C., Linarès, G., De Mori, R., & Bengio, Y. (2018). Quaternion convolutional neural
networks for end-to-end automatic speech recognition.

Pons, J., Slizovskaia, O., Gong, R., Gómez, E., & Serra, X. (2017, August). Timbre analysis of music audio signals with convolutional
neural networks. In 2017 25th European Signal Processing Conference (EUSIPCO) (pp. 2744-2748). IEEE.

Price, M., Glass, J., & Chandrakasan, A. (2018). A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural
Networks. IEEE Journal of Solid-State Circuits, 53(1), 66-75.

Qazi, K., Nawaz, T., Mehmood, Z., Rashid, M., & Habib, H. (2018). A hybrid technique for speech segregation and classification
using a sophisticated deep neural network. PLOS ONE, 13(3), e0194151.

Samek, W., Wiegand, T., & Müller, K. R. (2017). Explainable artificial intelligence: Understanding, visualizing and interpreting
deep learning models. arXiv preprint arXiv:1708.08296.

Taylor, G. R. (Ed.). (2005). Integrating quantitative and qualitative methods in research. University press of America.
Wlodarczak, P., Soar, J., & Ally, M. (2015). Multimedia data mining using deep learning. 2015 Fifth International Conference on
Digital Information Processing and Communications (ICDIPC).

Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Bengio, C. L. Y., & Courville, A. (2017). Towards end-to-end speech recognition with
deep convolutional neural networks. arXiv preprint arXiv:1701.02720.

Zhang, Z., Xu, Y., Yu, M., Zhang, S. X., Chen, L., & Yu, D. (2020). ADL-MVDR: All deep learning MVDR beamformer for target
speech separation. arXiv preprint arXiv:2008.06994.

Zhou, R., Liu, F., & Gravelle, C. W. (2020). Deep learning for modulation recognition: a survey with a demonstration. IEEE Access,
8, 67366-67376.

Nagrani, A., Chung, J.S. and Zisserman, A., 2017. Voxceleb: a large-scale speaker identification dataset. arXiv preprint
arXiv:1706.08612.

Ravanelli, M. and Bengio, Y., 2018, December. Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken
Language Technology Workshop (SLT) (pp. 1021-1028). IEEE.

Lu, W.K. and Zhang, Q., 2009. Deconvolutive short-time Fourier transform spectrogram. IEEE Signal Processing Letters, 16(7),
pp.576-579.

Vassil Panayotov, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur, Open Speech and Language Resources. (2015). OpenSLR
(Version 1) [Dev-clean]. http://www.openslr.org/12/

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances
in Neural Information Processing Systems (pp. 1097–1105). https://doi.org/10.1145/3065386

M, H. and M.N, S., 2015. A Review on Evaluation Metrics for Data Classification Evaluations. International Journal of Data Mining
& Knowledge Management Process, 5(2), pp.3-5.

Nasr, M., Abd-Elnaby, M., El-Fishawy, A., El-Rabaie, S., & Abd El-Samie, F. (2018). Speaker identification based on normalized
pitch frequency and Mel Frequency Cepstral Coefficients. International Journal Of Speech Technology, 21(4).

Mamyrbayev, O., Mekebayev, N., Turdalyuly, M., Oshanova, N., Medeni, T., & Yessentay, A. (2019). Voice Identification Using
Classification Algorithms. Intelligent System And Computing.

Zinchenko, K.; Wu, C.Y.; Song, K.T. A study on speech recognition control for a surgical robot. IEEE Trans. Ind. Inform. 2017, 13,
607–615.
