1 Introduction
Parkinson’s disease was first described by James Parkinson in 1817, who called it
“paralysis agitans” (Váradi, 2020). It is considered one of the world’s fastest-growing
neurological disorders after Alzheimer’s disease (Changqin Quan, Ren, Luo, Chen, & Ling,
2022). It affects 1%-2% of people over the age of 60. It is an incurable disease that is more
common in men than in women, as estrogen is thought to have a protective effect in women.
In addition, nicotine, which is associated with monoamine oxidase B (MAO-B), and caffeine
in coffee, which is associated with the adenosine A2a receptor, have potential protective
effects against Parkinson’s disease. The signs and symptoms of Parkinson’s disease become
more obvious as the disease progresses, reducing the quality of life of both patients and their
family members. According to Váradi (2020), about 60,000 people in the United States are
diagnosed with Parkinson’s disease each year, and it causes a significant socioeconomic cost
of approximately
messages to the basal ganglia, the part of the brain that controls movement and coordination.
The motor symptoms of Parkinson’s disease become noticeable only after about 70% of neurons
have been lost, which can be more than 20 years after the onset of the neurodegenerative
process (Semenova et al., 2023). The long prodromal period is therefore the main reason for
late diagnosis of the disease and, in turn, for delayed treatment of the symptoms.
2.2.1 Different stages of Parkinson’s disease
The stages of Parkinson’s disease are dependent on the degree of progression of the
Rating Scale (MDS-UPDRS) is the scale used to measure the progression of Parkinson’s
disease. Neurologists use the MDS-UPDRS to assess the motor and non-motor symptoms of
the patient. The MDS-UPDRS has four parts. Part 1 covers the non-motor experiences of daily
living and assesses the impact of Parkinson’s disease on the patient’s daily activities, mood,
and cognition. Part 2 covers the motor experiences of daily living and evaluates motor function
during daily activities such as speaking, writing, eating, and swallowing. Part 3 is the motor
examination, which assesses motor symptoms such as tremors, rigidity, bradykinesia (slowness
of movement), and postural instability. Part 4 addresses the motor complications (side effects)
related to the
The stages of Parkinson’s disease can be divided into preclinical, prodromal, and
clinical stages. During the preclinical stage, neurodegeneration begins in the substantia nigra
of the brain without producing any signs. The disease then proceeds to the prodromal stage,
which lasts more than 10 years and in which non-motor symptoms start to appear because of
the continuous neuronal loss (Váradi, 2020). Next, the disease reaches the early stage. By this
stage, approximately 40% to 60% of the dopamine in the brain cells has been lost and motor
symptoms such as bradykinesia (slowed movements), rigidity, and tremors have appeared.
Because non-motor symptoms appear in the prodromal stage, earlier than motor symptoms,
they are useful indicators for early detection of the disease. As Parkinson’s disease
deteriorates further, it proceeds to the mid-stage, which shows non-motor symptoms such as
orthostatic hypotension and some urinary symptoms, and motor symptoms such as axial
deformities and dyskinesias. In the late stage of the disease, motor symptoms include postural
instability, and non-motor symptoms include hallucinations and dementia.
The symptoms of Parkinson’s disease can be divided into motor and non-motor
symptoms. Motor symptoms are related to movement and include resting tremor, muscle
rigidity, bradykinesia (slowed movements), and postural instability. Patients with bradykinesia
cannot generate enough energy for muscle movements and therefore cannot move quickly. As
a result, patients with bradykinesia have a slower reaction time and
rhythmic muscle contraction and relaxation occurs usually in the hands, lips, chin, and jaws.
In addition, inflexibility of the limbs and neck is the main cause of rigidity, and the resulting
muscle stiffness limits the patient’s range of motion. Motor symptoms are more
Non-motor symptoms include sleep disruptions and difficulty in swallowing,
chewing, and speaking (Er, Isik, & Isik, 2021). In terms of psychotic symptoms, the patient
may develop impulse control disorders followed by hallucinations. Examples of sleep
disruptions include restless legs syndrome, REM behavior disorder, and sleep apnea (Váradi,
2020). Non-motor symptoms also include possible cognitive impairment, such as impaired
judgment and identity confusion. Other indications include very quiet speech, an
expressionless face, shaky handwriting, and difficulty standing up from a chair. For example,
finger tremors change the handwriting of a patient with Parkinson’s disease, which tends to
become small and cramped.
Paralinguistic features such as body language, gestures, facial expressions, tone, and
pitch of voice are part of the speech signals. Patients with Parkinson’s disease have difficulty
with daily activities such as greeting others, reciprocating and initiating social interaction,
and using suitable body language (Narasimha Rao & Meher, 2024). Up to 90% of people with
Parkinson’s disease develop voice and speech disorders (Di Pietro et al., 2022). Moreover,
speech disorders due to Parkinson’s disease can be detected as early as five years before the
appearance of other motor dysfunctions (Narendra, Schuller, & Alku, 2021). The effect on the
For example, patients with Parkinson’s disease tend to have impaired pronunciation of
vowels, phrases, and terms (Er et al., 2021). Mechanically, speech signals reflect the flow of
air from the lungs and the state of the glottis, which are governed by air volume, airflow, and
laryngeal muscle activation. Impaired vibration of the vocal cords affects the fundamental
frequency, voice quality, and voice intensity (Narasimha Rao & Meher, 2024). Ventilation
dysfunction is the main cause of the underlying voice impairment in patients with Parkinson’s
disease. The speech characteristics of patients with Parkinson’s disease include reduced vocal
tract volume and tongue flexibility, inappropriate pauses, impaired voice quality, and reduced
pitch range and voice intensity (Narendra et al., 2021). Moreover, the fundamental frequency
range is reduced, producing a monotone voice, and vowels and consonants may be pronounced
imprecisely. For example, abnormal vowel articulation is a sign of speech disorder in the early
stages of Parkinson’s disease, as the fine muscles that control voice production could be
affected more intensely (Cordella, Paffi, & Pallotti, 2021). The speech disorders due to
Parkinson’s disease are closely associated with pulmonary dysfunction, as rigidity in the
respiratory muscles could reduce the ability to generate enough expiratory pressure for voice
production or airway clearance. Currently, there is no single definitive method for diagnosing
Parkinson’s disease. MRI scans help in the diagnosis because they provide details of the
subcortical structures of the human brain, but these details are difficult to analyse by eye
(Bhatele, 2020). Therefore, an accurate and timely diagnosis method for
2.2.4 Conventional approach in diagnosis of Parkinson’s disease
Currently, the diagnosis of Parkinson’s disease is based solely on clinical criteria and
the presence of symptoms such as bradykinesia (slowness of movement), rigidity, and
tremor. Advances in imaging, genetics, and biomarkers have not improved the accuracy of
diagnosis, especially in the early stages of the disease. Magnetic resonance imaging (MRI)
aids the diagnosis of Parkinson’s disease because it shows accurate neuroanatomical
biomarkers. MRI scans show the subcortical structures of the human brain in detail, but they
are difficult to analyse by eye. Positron emission tomography (PET) and single photon
emission computed tomography (SPECT) are used to quantify the loss of nigrostriatal
dopaminergic fibers in Parkinson’s disease and to detect dopaminergic dysfunction in patients
(Guatelli, Aubin, Mora, Naranjo-Torres, & Mora-Olivari, 2023). Both PET and SPECT scans
can provide useful
Parkinson’s disease (Sigcha et al., 2023). Drugs used to relieve the signs of Parkinson’s
disease include levodopa and dopamine agonists. Because patients with Parkinson’s disease
have a reduced amount of dopamine in the substantia nigra of the brain, levodopa acts as a
dopamine replacement agent that provides good control of motor symptoms in the early
stages of the disease. For instance, levodopa is highly effective in reducing bradykinetic
symptoms but does not stop the progression of the disease. In the long run, the use of these
drugs can trigger side effects such as motor fluctuations and dyskinesias (Sigcha et al., 2023).
medications used to replicate the actions of chemical messengers in the brain. These drugs have
conjunction with levodopa, and they can be used in the initial phases of the disease or to
extend the effectiveness of levodopa. However, dopamine agonists cause more side effects
than levodopa. Examples of side effects include the feeling of drowsiness
Apart from medication, surgery is another approach to alleviating the symptoms of
Parkinson’s disease. Surgical options include pallidotomy, thalamotomy, and deep brain
stimulation (DBS) ("Parkinson’s Disease," 2023). Pallidotomy is recommended for individuals
with severe Parkinson’s symptoms or those who exhibit
insertion of a wire probe into the globus pallidus, a brain region approximately a quarter inch
in size that is responsible for movement control. Introducing lesions to the globus pallidus can
restore normal movement in patients with Parkinson’s disease.
and the gradual decline of spontaneous movement. Thalamotomy is a surgery that uses
radiofrequency energy currents to destroy a small, specific portion of the thalamus. It benefits
patients with debilitating tremors in the hand or arm, particularly those with essential tremor.
Deep brain stimulation (DBS) is a safer alternative to both pallidotomy and thalamotomy. It
uses small implanted electrodes to deliver electrical impulses to either the subthalamic nucleus
or the globus pallidus, brain regions that are responsible for motor function. Magnetic
resonance imaging (MRI) and neurophysiological mapping guide the implantation of the
electrodes into the brain. The electrodes are connected by wires to an implantable pulse
generator (IPG) resembling a pacemaker, which is positioned beneath the skin under the
collarbone. Electrodes are placed on one side of the brain. For example, an electrode on the
left side of the brain controls symptoms on the right side of the body, and vice versa. In some
cases, patients may
Parkinson’s disease diagnosis approaches can be used. Deep learning is a subset of machine
learning (ML) that uses complex algorithms and deep neural networks to train a model ("The
Best Introduction to Deep Learning - A Step by Step Guide," 2023). Deep learning relies
heavily on artificial neural networks (ANN) such as deep neural networks (DNN), which
consist of multiple layers of interconnected nodes. Given large pools of data, deep learning
stands out for its capacity to learn intricate patterns and relationships within datasets
automatically, without explicit programming. However, a large amount of data is required to
achieve high accuracy. Applications of deep learning include
systems. Popular deep learning architectures include Convolutional Neural Networks (CNN),
Recurrent Neural Networks (RNN), and Deep Belief Networks (DBN).
In a fully connected deep neural network, the architecture consists of an input layer
and one or more hidden layers that are connected sequentially. Neurons are interconnected
between layers: each neuron receives input either from the neurons of the preceding layer or
directly from the input layer. The output of one neuron acts as the input for neurons in
subsequent layers until the final layer produces the output of the network. Neural network
undergoes non-linear
2.3.1 Convolutional Neural Network (CNN)
extensively utilized for the classification of images, audio, or videos. It stands out as a
specialized form of artificial neural network designed specifically for processing and analyzing
visual data. The applications of CNN involve image recognition, object detection, and the
representation autonomously and adaptively from input images. The structure of CNN consists
of different layers such as convolutional layers, pooling layers, and fully connected layers.
Convolutional layers employ filters or kernels to process input data and allow the network to
discern spatial patterns and features across different areas of an image. Pooling layers are
responsible for minimizing the spatial dimensions of the data through down-sampling and
focusing more on the most pertinent information. Subsequently, fully connected layers will be
learned features to formulate predictions or classifications. For instance, the first layer of a
CNN extracts basic features of an image such as horizontal and diagonal edges (Mandal,
2023). Its output is passed to the next layers, which detect more complicated features such as
corners or combinations of edges. As outputs pass through successive layers, the network can
detect increasingly complex features such as objects or faces. In the final layer, the
classification output is given as confidence scores (between 0 and 1) that specify the class of
the input.
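The convolution and pooling steps described above can be illustrated with a minimal sketch in plain Python. The 6x6 toy image, the hand-written horizontal-edge kernel, and the function names `conv2d`, `relu`, and `max_pool` are assumptions made for illustration; in a trained CNN the kernel values are learned rather than written by hand.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Down-sample by keeping the largest value in each size x size block."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Hypothetical 6x6 image: dark top half (0) and bright bottom half (1),
# which contains a single horizontal edge.
image = [[0] * 6 for _ in range(3)] + [[1] * 6 for _ in range(3)]

# Hand-written horizontal-edge kernel (illustrative, not a learned filter).
horizontal_edge = [[-1, -1, -1],
                   [0, 0, 0],
                   [1, 1, 1]]

feature_map = relu(conv2d(image, horizontal_edge))  # 4x4 map, strong response at the edge
pooled = max_pool(feature_map)                      # 2x2 map after down-sampling
```

The feature map responds only in the rows where the kernel straddles the dark-to-bright transition, and pooling halves each spatial dimension while keeping that response.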
2.3.2 Transformer
Transformers are a category of neural networks that learn context and meaning by
analyzing relationships in sequential data. Transformers are widely used in the fields of
natural language processing (NLP) and computer vision. Transformers use a set of evolving
mathematical techniques, known as attention or self-attention, to detect how distant data
elements influence and depend on one another. In this way, transformers weigh the
significance of each part of the input data differently (Chauhan, March 15, 2022).
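As a rough illustration of how self-attention weighs each part of the input, the sketch below assumes the common scaled dot-product formulation, which the text does not spell out. To keep it short, the queries, keys, and values are all the raw token vectors, with no learned projection matrices; the function names and the toy tokens are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens.

    Simplifying assumption: no learned projections, single head.
    """
    dim = len(tokens[0])
    all_weights = []
    outputs = []
    for query in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the weighted average of all token vectors.
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return outputs, all_weights


# Hypothetical 3-token sequence: the first and last tokens are identical.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
```

In this toy sequence the first token gives the distant third token the same weight as itself and a smaller weight to its immediate neighbour, which is the sense in which attention relates distant elements regardless of position.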
Transformers are deep learning models that work with sequences. The architecture of a
transformer is made up of three components: the encoder, the decoder, and the attention
mechanism. The encoder transforms an input sequence into state representation
selectively concentrate on pertinent aspects of the sequential input stream. The attention
mechanism helps to contextualize the input data. The decoder decodes the state representation
vector to produce the desired target output sequence. First, the inputs and outputs are
embedded into an n-dimensional space, so the inputs must be encoded. Because the
transformer model contains no recurrent neural network to remember the order in which the
sequence is fed into the model, it is important to assign each word or sequence part a relative
position. The relative positions will be
Figure 4: The architecture of the Transformer.
In Figure 4, which displays the architecture of the transformer, the encoder is the
left half of the architecture and the decoder is the right half of the
representations, which is then fed into a decoder, while the decoder receives the output from
the encoder along with the decoder output at the previous time step, and generates an output
2.3.3 VGG19
VGG19, named after the Visual Geometry Group, is a deep Convolutional Neural
Network (CNN) with 19 layers that is often used as a transfer learning model. VGG19 is made up
VGG19 is a deep CNN that is used to classify images (Kaushik, 2023). It employs small 3x3
convolutional filters throughout the network, along with max-pooling layers for spatial
down-sampling. This uniform structure makes the model easier to understand and implement,
which has contributed significantly to its widespread adoption. In image recognition tasks,
VGG19 has shown remarkable proficiency in learning intricate hierarchical features from
images.
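The effect of stacking 3x3 convolutions and max-pooling layers can be traced with a short sketch. The layer layout below is the commonly published VGG19 configuration and is an assumption here, since the text above does not list it; likewise the 224x224 RGB input, padding of 1, and 2x2 pooling are assumed defaults, and `trace_shapes` is a hypothetical helper.

```python
# Commonly published VGG19 layout (assumption, not taken from the text above):
# numbers are output channels of 3x3 convolutions, "M" marks a 2x2 max-pooling layer.
VGG19_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
              512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]
NUM_FC_LAYERS = 3  # fully connected layers after the convolutional part


def trace_shapes(size, config, kernel=3, padding=1, pool=2):
    """Return (channels, height, width) after every layer in the config."""
    shapes = []
    channels = 3  # assumed RGB input
    for item in config:
        if item == "M":
            size = size // pool                      # pooling halves height and width
        else:
            size = size + 2 * padding - kernel + 1   # 3x3 conv with padding 1 keeps the size
            channels = item
        shapes.append((channels, size, size))
    return shapes


shapes = trace_shapes(224, VGG19_CONV)
conv_layers = sum(1 for item in VGG19_CONV if item != "M")
weight_layers = conv_layers + NUM_FC_LAYERS
```

Under these assumptions the 16 convolutional layers plus 3 fully connected layers give the 19 weight layers in the name, and the five pooling steps reduce a 224x224 input to a 7x7 map with 512 channels.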
A Deep Neural Network (DNN), also known as a deep net, is a neural network
characterized by its high complexity. A DNN consists of multiple stacked layers: an input
layer, an output layer, and at least one hidden layer between them. DNNs can handle
unlabelled and unstructured data and are often used in the field of computer vision. They
often have a complex hidden layer structure with a wide variety of layers, such as
convolutional layers, max-pooling layers, dense layers, and other specialized layers
("Introduction to Deep Neural Networks," July, 2023). The extra layers enhance the ability of
the model to comprehend problems thoroughly and to deliver good solutions for intricate
projects. Compared with a basic Artificial Neural Network (ANN), a DNN has more layers,
each with additional complexity. The increased depth lets the model process inputs more
comprehensively. Applications of DNNs include object detection, language translation tasks
with BERT
such as VGG-19, ResNet-50, EfficientNet, and other similar networks for image processing
projects.
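The layered structure described above (input, hidden, and output layers with non-linear activations) can be sketched as a tiny forward pass. The weights, biases, layer sizes, and helper names below are hypothetical values chosen for illustration; a real DNN learns its weights from data.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: each output unit is a weighted sum plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def forward(inputs, layers):
    """Pass the input through each (weights, biases, activation) layer in turn."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = activation(dense(activations, weights, biases))
    return activations


# Hypothetical network: 2 inputs -> 2 hidden units (ReLU) -> 1 output (sigmoid).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, -1.0]], [0.0], sigmoid),
]
output = forward([2.0, 1.0], layers)
```

The hidden layer turns the input into `[1.0, 1.5]`, and the output unit squashes the weighted sum of those values into a score between 0 and 1. Adding more `(weights, biases, activation)` entries to `layers` is what makes the network deeper.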
2.4 Related work with Deep Learning Approaches used for detection of Parkinson’s
Disease
A Convolutional Neural Network (CNN) can capture complex patterns through multiple
layers of convolutional and pooling operations. For instance, speech signals consist of distinct
frequency bands and time segments, which a CNN can learn automatically from the input
data. Moreover, speech signals can be represented as spectrograms, which are visual
representations of the signal over time and frequency. Because a CNN can analyze
spectrogram data, it is a suitable fit for speech signal analysis.
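To make the spectrogram representation concrete, the following sketch computes a magnitude spectrogram with a short-time discrete Fourier transform in plain Python. The frame length, hop size, Hann window, sample rate, and the synthetic 1000 Hz tone are all assumptions for illustration; practical pipelines would use an FFT library and real recordings.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of the DFT for the non-negative frequency bins of one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len=64, hop=32):
    """Split the signal into overlapping windowed frames and transform each one.

    Returns a list indexed as result[time_frame][frequency_bin].
    """
    # Hann window to reduce spectral leakage at the frame edges.
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1))
        for t in range(frame_len)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append(dft_magnitudes(frame))
    return frames


# Hypothetical test signal: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64  # bin index converted back to Hz
```

Each row of `spec` is one time frame and each column one frequency band, so the result can be treated as a 2D image and passed to a CNN in the same way as a picture.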
According to the study of Goyal, Khandnor, and Aseri (2021), a hybrid approach with the use
with the use of CNN as a classifier. The CNN can capture the underlying patterns at each
decomposition level of the speech signals. Two datasets are used: dataset D1, which consists
of 573 single-channel recordings (190 patients with Parkinson’s disease and 383 healthy
controls), and dataset D2, which consists of 374 dual-channel recordings. After the speech
recordings are collected from the subjects, noise reduction is applied and the crucial parts of
the audio are manually segmented. The RSSD technique is used to extract the resonance-based
components of the signals, and the signal is then decomposed into high-resonance components
through TQWT. All the selected features are converted into Power Spectral Density (PSD)
images to train the CNN classifier, and the results are analyzed. The study showed that the
CNN classifier outperformed other classification techniques such as K-Nearest Neighbour
(KNN) and Support Vector Machine (SVM). The CNN had an accuracy of 98.12%, a recall of
0.97, a precision of 0.96, an F1-score of 0.97, and a G-Mean of 0.98. KNN, on the other hand,
had an accuracy of only 60.90%, a recall of 0.70, a precision and F1-score of 0.41, and a
G-Mean of 0.54. SVM had an accuracy of 97.67%, slightly lower than the CNN, with a recall
of 0.94, a precision of 0.98, and an F1-score and G-Mean of 0.96. The main pitfall of the
proposed hybrid approach with the CNN classifier is the small sample size; a large sample size
is necessary for generalization to clinical use. Moreover, the classification of Parkinson’s
disease should not be limited to binary classification, which determines only whether a person
has the disease; the detection approach should also include the function of identifying the
In the study of Changqin Quan et al. (2022), an end-to-end deep learning model is used
to detect Parkinson’s disease from speech signals. Time series dynamic features are extracted
using a time-distributed two-dimensional convolutional neural network (2D-CNN), and a
one-dimensional CNN (1D-CNN) is then used to find the dependencies among them. Two
databases are used in this study. The first database was obtained at the GYENNO SCIENCE
Parkinson’s Disease Research Center and consists of 15 healthy controls and 30 patients with
Parkinson’s disease. Its speech tasks are the sustained vowel “a” and the reading of a short
sentence. The second database is obtained from PC-GITA, which was collected from 100
people (50 healthy controls and 50 patients with Parkinson’s disease). Its speech tasks include
three repetitions of the sustained Spanish vowels “a” and “u”, reading of different words,
reading of simple sentences, and reading of complex sentences. The analysis of speech signals
is based on phonation (vowel task), articulation (vowel and word tasks), and prosody (simple
and complex sentence tasks). The training (validation) and testing sets are split in a ratio of
6:4 for the first database and 8:2 for the second database. Compared with other end-to-end
deep learning models (MLP, FCN, ResNet, Time-CNN, Encoder, and CNN-LSTM) on the
sustained vowel “a” task of the first database, the proposed method (time-distributed 2D-CNNs
and 1D-CNN) had the highest accuracy of 81.56%, with an F-score of 87.66%, a specificity of
98.33%, a sensitivity of 79.17%, and an MCC of 0.5847. For the short-sentence reading task
of the first database, the proposed model had the highest accuracy of 75.33%, with an F-score
of 83.62%, a specificity of 93.33%, a sensitivity of 76.71%, and an MCC of 0.3782. For the
second database, the simple and complex sentence reading tasks reached the highest accuracy
of 92%. For the simple sentence task, the F-score is 91.04%, the specificity is 94%, the
sensitivity is 91.21%, and the MCC is 0.8560. For the complex sentence task, the F-score is
92.68%, the specificity is 97%, the sensitivity is 78.40%, and the MCC is 0.7234. Although
the proposed model can maintain the interpretability of the data analysis by ranking the
importance of input features in the prediction of Parkinson’s disease, the dataset
Moreover, in the study of Fang, Gong, Zhang, Sui, and Li (2021), a 6-layer
Convolutional Neural Network is applied to classify Parkinson’s disease. The study compares
it with other deep learning models such as Long Short-Term Memory (LSTM) and end-to-end
systems. The speech data was obtained from 34 patients with Parkinson’s disease and 34
healthy controls, recorded via different phones and DV. The subjects performed a text reading
task, and MFCCs are the speech features used for analysis. The MFCC matrices are used
directly, without manual extraction, to avoid losing time-variant information. The 6-layer CNN
showed the best performance, achieving an AUC of 0.984 and an accuracy of 0.938 (93.8%).
For further improvement, more specifically designed neural networks are required for better
accuracy. Apart from that, the recording apparatus should be standardized for constant
According to Narendra et al. (2021), raw speech signals without pre-processing are
fed into a CNN and classified using a multi-layer perceptron (MLP). The CNN extracts
relevant information and passes it to the MLP for estimation. The three techniques used to
compute the time-domain waveforms are GIF (IAIF and QCP analysis) and the ZFF method.
The speech signals are obtained from the PC-GITA database, with a total of 50 patients with
Parkinson’s disease and 50 healthy controls. The speech tasks performed by the subjects are
sustained phonation, reading words aloud, diadochokinetic exercises, reading sentences
aloud, reading a text, and giving a monologue. The speech signals are analyzed using baseline
features (articulation, phonation, and prosody features) along with glottal features. The
highest accuracy, 68.56%, was obtained with the time-domain waveforms processed using
QCP, with a sensitivity of 63.40% and a specificity of 73.73%. To improve the accuracy of the
the Machine Learning (ML) method. In the study of Celik and Başaran (2023), the input
speech signal is first passed through the SkipConNet module to extract the important feature
vectors, and Random Forest (RF), an ML method, is then used to produce the final results. The
first dataset was collected from the University of Colorado National Center for Voice and
Speech and the University of Oxford and consists of 31 individuals (8 healthy controls and 23
Parkinson’s patients), while the second dataset was collected at the
of 252 individuals (64 healthy controls and 188 Parkinson’s patients). The speech features used
for analysis include baseline features, time-frequency features, MFCCs, wavelet
transform-based features, vocal fold features, and TQWT features. For the first database, the
proposed approach (SkipConNet + RF) reached the highest accuracy of 99.11%, with a
precision of 0.99, a recall of 0.99, an F1-score of 0.99, a specificity of 98.77%, and an AUC
of 98.77%. For the second database, it reached the highest accuracy of 98.30%, with a
precision of 0.99, a recall of 0.96, an F1-score of 0.97, a specificity of 95.83%, and an AUC
of 95.83%. The second database was also used in a similar study by Gunduz (2019), in which
a CNN is used to detect Parkinson’s disease through vocal signals; it achieved an accuracy of
only 86.90% and an F1-score of 0.917, which shows a lower percentage as
1D CNN and 2D CNN are used along with pre-trained models such as Wav2Vec2.0, BERT,
and BETO to distinguish between healthy people and patients with Parkinson’s disease. In terms
CNN only achieved 72.6%. In terms of sensitivity, the 2D CNN had a value of 81.3 while the
1D CNN achieved only 53.8. In terms of specificity, the 2D CNN had a value of 87.6 while the
1D CNN had a greater value of 92.5. In terms of the F1-score, the 2D CNN reached a value of
84.3 while 1D
The study of Costantini et al. (2023) focuses on the use of Artificial Intelligence (AI)
for the voice assessment of patients with Parkinson’s disease. Its aim is to distinguish healthy
individuals, early untreated Parkinson’s disease patients, and mid-advanced Parkinson’s
disease patients treated with levodopa through the analysis of their voices. Data is collected
from 266 healthy controls and 160 Parkinson’s patients (72 newly diagnosed subjects and 88
subjects with medium-to-advanced impairment). A Convolutional Neural Network (CNN) is
the deep learning approach used in the study, and its results are compared with traditional
Machine Learning (ML) models such as KNN, SVM, and Naïve Bayes, using augmented
Mel-spectrograms as the speech features. The study found that the traditional ML models,
such as KNN and SVM, performed better in three out of four binary classification tasks, with
a mean accuracy of 81.75% compared to 69.75% reached by the CNN. However, the CNN
showed a slight advantage in the multiclass tasks over the three classes, with a 61% mean
accuracy versus the 59.5% reported by the traditional ML methods. To further improve the
accuracy of the CNN, a larger dataset should be used for more reliable and accurate results.
Moreover, additional speech tasks need to be carried out to obtain an optimal result.
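The studies reviewed in this section report accuracy, sensitivity (recall), specificity, precision, F1-score, G-Mean, and MCC. As a minimal sketch of how these are derived from a binary confusion matrix, the function below uses the standard textbook definitions; G-Mean is assumed here to be the geometric mean of sensitivity and specificity, and the counts in the example are hypothetical, not taken from any of the cited studies.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Common binary classification metrics from confusion-matrix counts.

    tp/fn: patients correctly/incorrectly classified,
    tn/fp: healthy controls correctly/incorrectly classified.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)            # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Assumed definition: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(sensitivity * specificity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
        "mcc": mcc,
    }


# Hypothetical test set: 50 patients and 50 healthy controls.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

With these made-up counts the accuracy is 0.85 while the sensitivity is only 0.80, which shows why the reviewed studies report several metrics rather than accuracy alone, particularly when the classes are imbalanced.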
The research of Khaskhoussy and Ayed (2023) evaluates Support Vector Machines
(SVM) and Convolutional Neural Networks (CNN) for classifying data from speech tasks.
Two types of input data are used: raw speech signal values and i-vector features of different
dimensions. Three systems are used for the diagnosis of Parkinson’s disease through voice
analysis. The first uses a CNN for deep feature extraction from raw speech signals and an
MLP for classification. The second uses the deep features obtained by the CNN for
classification with SVMs employing different kernels. The third uses i-vectors obtained from
Mel-frequency cepstral coefficients (MFCC) as features and SVMs with different kernels for
classification. The systems are evaluated both with and without cross-validation. The results
show that the hybridization of CNN features and SVM performs well in the detection of
Parkinson’s disease, with i-vectors of dimension 200 giving the best accuracies and F-scores
in discriminating between patients with Parkinson’s disease and healthy controls. The study
also compares the proposed systems with other related works on PD detection through speech
analysis and reports that they outperform existing approaches, achieving an accuracy of 100%.
To further improve the study, other deep learning techniques should be used with more
extensive experimentation with larger
recordings were utilized along with Extreme Learning Machine (ELM) random weight neural
networks for the detection of Parkinson’s disease. This study emphasizes the potential for non-
invasive and early diagnosis of the disease using voice alterations due to muscle stiffness in
patients. The database was collected from 55 patients with Parkinson’s disease and 64
healthy controls. The experiments compare CNN and ELM in terms of accuracy, training time,
sensitivity, and specificity. CNN displayed a higher accuracy, but it requires a longer training
time and more resources. This research outlined the process used for the experiments, such as
the use of different types of spectrograms, data augmentation, and the application of CNN and
ELM for classification. Transfer learning is applied to a pre-trained CNN because the dataset
is small. ELM utilized the features extracted from CNN
In summary, the study highlights the possibility of using spectrograms of voice recordings
and ELM for the early and non-invasive detection of Parkinson’s disease. ELM achieved an
accuracy comparable to that of CNN with a reduced training time. ELM is a viable option
The datasets contain very few patients, which makes it difficult to train deep learning
models from scratch. Moreover, the study emphasizes the limitations of traditional CNN data
augmentation techniques. The performance of CNN and ELM is comparable, with accuracies of
83.91% for CNN and 81.74% for ELM.
The results also showed that an increase in the number of samples led to better performance
values, with the experiment considering color spectrograms of sound fragments showing the
best result. Additionally, the study found that the ELM had lower training times compared to
CNN. The study also used 10-fold cross-validation to ensure the objectivity of the experimental
results. Overall, the results indicate the potential of using ELM for the non-invasive and early
The study conducted by Yao, Chi, and Khishe (2022) concerns the application of deep
convolutional neural networks (DCNN) to the diagnosis of pathological speech related to
Parkinson’s disease and cleft lip and palate. The best DCNN architecture is selected
automatically using the whale optimization algorithm (WOA). The study highlights the use of
unsupervised representation learning and the challenges in selecting the best structure for
DCNN. The dataset is drawn from the CIEMPIESS corpus, with a total of 16717 sound recordings
and 700 utterances taken out of the entire corpus. The proposed model achieved a highest
accuracy of 95.77% and a precision of up to 96.33%. The results showed that the
proposed IPWOA model achieved higher accuracy than hand-crafted models. In summary, the
Constraints of the proposed model, such as the complexity of the DCNN structures, may make
learning from the datasets challenging. Moreover, selecting the best structure for DCNN can
be difficult due to the high complexity associated with these techniques.
2.4.2 Transformer
Since the detection of Parkinson’s disease often involves the analysis of time-series data
such as speech signals, transformer-based deep learning models are well suited to processing
such sequential data. These models capture long-range dependencies and temporal relationships
well, which allows them to model the dynamic nature of Parkinson’s symptoms effectively.
Based on the study of Nijhawan et al. (2023),
retrieving dysphonia measures from the voice recording of the subjects. The dataset is sourced
from the UCI Machine Learning Repository and it comprises patient voice features presented
in a comma-separated values (CSV) sheet format. The dataset consists of records from 188
Parkinson’s patients (107 men, 81 women) aged 33 to 87, while the healthy group comprises
64 individuals (23 men, 41 women). Vocal features utilized for Parkinson’s disease
classification include wavelet transform-based features, baseline features, vocal fold features,
TQWT features, and MFCC features. Despite the dataset containing 753 unique vocal features
Parkinson's and non-Parkinson's records. Employing a stratified k-fold strategy, the dataset is
partitioned into ten folds for training and testing. The model's functionality incorporates
mechanisms such as feature selection within a trainable neural network (NN) model and is
categorized into three major blocks that are made up of feature embedder, transformer block,
and MLP (Multilayer Perceptron) head. The feature selection step is crucial for minimizing the
complexity of the transformer-based model and enhancing the overall accuracy. XGBoost,
a Gradient Boosting Decision Trees (GBDT) framework, is chosen for feature
selection due to the dataset's complexity and high feature count. XGBoost ranks features by
importance score, indicating how strongly each one influences the Parkinson's disease
classification. The top features identified are utilized to train the proposed network.
Additionally, support vector classifier
(SVC) feature scores and permutation feature scores are used for comparison. The proposed
vocal tab transformer network comprises a feature embedder block, transforming each feature
encoder block, converting each input vector into a highly contextualized vector representation.
The representation vector is utilized by the MLP head for predicting Parkinson’s disease
scores. The proposed approach, utilizing dysphonia measures, outperforms the current state-
in precision and recall scores. The average ROC-AUC for the proposed network is 0.8574.
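The stratified tenfold partition described above can be illustrated with a minimal sketch in plain Python. It assumes one label per subject (188 Parkinson's, 64 healthy), which is a simplification of the real dataset; the helper name `stratified_k_fold` and the fixed seed are hypothetical and are not taken from Nijhawan et al. (2023).

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


# Assumption: one label per subject, 1 = Parkinson's (188), 0 = healthy (64).
labels = [1] * 188 + [0] * 64
splits = list(stratified_k_fold(labels, k=10))
```

Each test fold then holds 18 or 19 Parkinson's subjects and 6 or 7 healthy subjects, so the roughly 3:1 class imbalance of the full dataset is preserved in every fold. When a subject contributes several recordings, the split should be made per subject rather than per recording, so that the same voice never appears in both the training and the test set.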
2.4.3 VGG19
employed directly for the detection of Parkinson's disease for non-image data such as speech
signals. However, there is a study performed by Bhatele (2020) that employed VGG19 deep
transfer learning architecture in the detection of Parkinson’s disease through the MRI scans of
patients with Parkinson’s disease. The dataset used for Parkinson’s disease is from Parkinson’s
Progression Markers Initiative (PPMI) databases. The dataset consists of 50 patients with
Parkinson’s disease (20 females and 30 males) and 50 healthy controls (26 females and 24
males). Firstly, MRI scans in the form of DICOM (Digital Imaging and Communications in
Medicine) are converted into the PNG (Portable Network Graphics) format. Then, the process
of training of VGG19 occurs along with the modifications of the last three layers of the
divided into 5 blocks, and 5 max pooling functions are used to join these blocks. The input
size is kept at 224*224*3 and the filter size is kept at 3*3 in each layer to keep the number
of trainable parameters manageable. Lastly, three final layers (Flatten, dropout, and dense)
are added to the architecture. The proposed VGG19 approach gained an accuracy of 90%,
precision of 80%, sensitivity of 83%, and F1 score of 79%. By comparison, a similar study
conducted by Sivaranjini and Sujatha (2019), which utilizes the AlexNet model, reported a
lower accuracy of 88.9% and a sensitivity of 89.3%. This suggests that VGG19 in the
study of Bhatele (2020) offers greater accuracy, efficiency, and adaptability in the classification
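The effect of the five pooling stages on the 224*224*3 input can be traced with a short calculation. The sketch below assumes the standard VGG19 layout (2, 2, 4, 4, and 4 convolutions per block with 64, 128, 256, 512, and 512 filters, "same" padding, and 2*2 max pooling with stride 2); the source only states the number of blocks, the input size, and the 3*3 filter size, so the per-block details and the function name `trace_vgg19` are assumptions for illustration.

```python
# Assumed standard VGG19 layout: (number of 3x3 conv layers, output channels) per block.
VGG19_BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]


def trace_vgg19(height=224, width=224, channels=3, kernel=3):
    """Return the output shape and conv parameter count of the feature extractor."""
    params = 0
    for n_convs, out_channels in VGG19_BLOCKS:
        for _ in range(n_convs):
            # 'same' padding keeps height and width; weights plus one bias per filter.
            params += kernel * kernel * channels * out_channels + out_channels
            channels = out_channels
        # 2x2 max pooling with stride 2 halves each spatial dimension.
        height, width = height // 2, width // 2
    return (height, width, channels), params


shape, conv_params = trace_vgg19()
flattened = shape[0] * shape[1] * shape[2]
print(shape, flattened, conv_params)  # (7, 7, 512) 25088 20024384
```

Under these assumptions the Flatten layer receives a 7*7*512 feature map, that is, a vector of 25088 values, which is what the added dropout and dense layers operate on.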
As explained by Rahman, Hasan, Sarkar, and Khan (2023), Machine Learning (ML)
and deep learning (DL) are the two approaches used for the prediction and categorization of
healthy controls and patients with Parkinson’s disease at an early stage using speech signals.
The dataset used is from the UCI repository for Parkinson’s disease. Rahman et al. (2023)
highlighted the prevalence of speech problems in patients with Parkinson’s disease and the
importance of early diagnosis for improving patient outcomes. Comparisons were made
between multiple Machine Learning (ML) models such as Extreme Gradient Boosting, AdaBoost,
Random Forest, and Support Vector Machine, along with deep neural network (DNN)
models. The results showed that the Extreme Gradient Boosting classifier achieved the
highest classification accuracy of 92.18% among the ML classifiers, and the three-layer DNN
(DNN2) achieved the best accuracy of 95.41% among the DL techniques. In this study, the deep
neural networks therefore performed better than the classical machine learning models. The
evaluation metrics used in the study include accuracy, precision, recall, F1 score, and AUC to
quantify the performance and efficacy of the classifiers. As a further improvement, a larger
dataset needs to be used along with cutting-edge deep learning techniques. For example,
amounts of medical data, resource efficiency, security, and privacy need to be managed to make
Machine Learning (ML)
In their study, Bhatt, Jayanthi, and Kumar (2023) highlighted the use of high-resolution
superlet transform-based techniques for the detection of Parkinson’s disease using speech
signals. The Superlet Transform (SLT) is used to convert the speech signals into 2D
spectrograms, and a Deep Neural Network (DNN) is then used for classification. The speech
tasks used for evaluation are sustained vowels, modulated vowels, DDK analysis, and isolated
words for the PC-GITA dataset, and vowels for the ItalianPVS dataset. The results show that
the proposed method outperforms existing techniques for the detection of Parkinson’s disease,
with an overall accuracy of 92% on the PC-GITA dataset and 96% on the ItalianPVS dataset.
VGG16 with SLT shows the best performance on the PC-GITA dataset as compared with other DNN
models such as InceptionResNetV2 and ResNet50v2. The proposed framework using SLT and DNN
provides a non-intrusive, low-cost, and remote method for the detection of Parkinson’s
disease, which leverages the non-motor symptoms of PD evident in speech signals. The method
shows
2.4.6 Others
The study of Zahid et al. (2020) concerns the development of a computer-aided diagnostic
system for Parkinson’s disease using speech signals. The research emphasizes the use of
spectrogram-based deep features and acoustic features for the detection of Parkinson’s
disease. Three methods are examined: a transfer learning-based approach, deep feature
extraction using machine learning classifiers, and simple acoustic feature evaluation using
machine learning classifiers. The dataset used is the Spanish PC-GITA dataset with 50 patients
with Parkinson’s disease and 50 healthy controls. The speech features used include
articulation, prosody, and phonation. The proposed framework, which is the deep feature
extraction method, outperformed the other approaches with an accuracy of 99.3%. The results
showed that deep features are suitable for distinguishing healthy people from patients with
Parkinson’s disease. In summary, the research displayed promising results for the development
of a computer-aided diagnostic system for Parkinson’s disease using speech signals, with a
focus on deep feature extraction and machine learning classification. The proposed approach
shows potential for accurate and early detection of
framework for the identification of Parkinson’s disease from speech signals was developed.
The proposed framework consists of a pre-processing step for the speech signal using the
Determinate Haar Wavelet (DHW) transformation technique, a feature extraction step using
the Statistical Time Frequency Renyi (STFR) model, a feature optimization step using the
Adaptive Intelligent Polar Bear (AIPB) optimization algorithm, and the use of Quantized
Contempo Neural Network (QCNN) algorithm for the detection of Parkinson’s disease.
Different speech signal datasets were used, and the proposed framework was compared with
existing Machine Learning (ML) and Deep Learning (DL) techniques for the diagnosis of
Parkinson’s disease. AIPB + QCNN outperformed ML techniques such as Decision Trees (DT),
Multilayer Perceptron (MLP), K-Nearest Neighbors (KNN), Gaussian Naive Bayes (GNB), and
Support Vector Machine (SVM) in terms of accuracy, sensitivity, specificity, and F1-score,
and its detection accuracy and ROC characteristics improved on those of existing approaches.
This research provides a novel and effective approach for automated diagnosis of Parkinson’s
disease using speech signals. For example, the AIPB + QCNN model
98.8%. The results showcased its superior performance in accurately detecting PD from speech
signals. As a further improvement, the framework could be extended to classify the
different stages of Parkinson’s disease. The drawbacks of the proposed framework include
the possibility of overfitting due to the increased dimensionality. Moreover, the increased
complexity of the system raises the computational expense.
The study by C. Quan, Ren, and Luo (2021) focuses on the detection of
Parkinson’s Disease (PD) using dynamic speech features. To capture the time-series
characteristics of speech signals for the detection of Parkinson’s disease, the authors proposed
a framework that combines Bidirectional long short-term memory (LSTM) models with
dynamic articulation transition features. The dataset was collected from 45 subjects (15 healthy
controls and 30 Parkinson’s patients) who are volunteers at the GYENNO SCIENCE Parkinson
Disease Research Center. A comparison was carried out between Deep Learning (DL)
models and Machine Learning (ML) models using static speech features. The research explored
both motor and non-motor symptoms of Parkinson’s disease and focused on speech
disturbances that are common among the patients. The study found that the dynamic speech
between healthy control (HC) speakers and patients of Parkinson’s disease. The proposed
method using Bidirectional LSTM models and dynamic features yields remarkable
improvements in the detection accuracy over traditional Machine Learning (ML) models using
static features. The research emphasizes the potential of the proposed Deep Learning-based
method for the detection of Parkinson’s disease. Under 10-fold cross-validation, the method
improves the detection accuracy over traditional Machine Learning (ML)
methods using static features and end-to-end Deep Learning using CNN models. The
bidirectional LSTM model using dynamic speech features achieved an accuracy of 84.29% for
over traditional Machine Learning (ML) models. The study also compared the performance of
Deep Learning (DL) models with different speech inputs, showing that both CNN and
Bidirectional LSTM model demonstrating the highest accuracy. Therefore, the proposed
framework has shown promising accuracy in the detection of Parkinson’s disease, particularly
when utilizing dynamic speech features and DL models. The pitfalls of the study include the
potential bias in performance evaluation due to the use of leave-one-out cross-validation and
the need for further exploration of more complex network architectures. Additionally, the study
did not directly compare its results with other related studies, which could affect the
Next, Narasimha Rao and Meher (2024) presented a novel automated model for the
diagnosis of multiple diseases using speech or voice signals. The proposed model, termed
GoogleNet, Radial Basis Function (RBF), and Gated Recurrent Unit (GRU). Dataset 1 was
collected from the Kaggle website, dataset 2 was obtained from GitHub, and dataset 3 was
obtained through the given link. First, the signals are decomposed using the Empirical
Wavelet Transform (EWT) and then fed into the classification models. The model aims to
address the limitations of existing approaches and improve the accuracy and efficiency of
disease diagnosis. Various datasets were used for validation, and the proposed model was
compared with existing conventional approaches. The heuristic
learning models. The results showed that the developed model achieved high accuracy and
voice signals. The proposed model has three classification frameworks: STFT
+ deep features 1 + ResNet and GoogleNet, STFT + weighted features 2 + ORGRU, and STFT
+ deep features 3 + ORGRU. Different feature extraction and classification techniques are
used for each framework. The performance metrics of the proposed model for diagnosing
multiple diseases using speech or voice signals include accuracy, F1-score, false negative rate
(FNR), false positive rate (FPR), Matthews Correlation Coefficient (MCC), precision,
sensitivity, and specificity. The proposed ORG-RGRU model achieved high accuracy values
of 95.75% for dataset 1, 95.72% for dataset 2, and 95.74% for dataset 3. In future work, efforts
will be made to reduce the computational burden of multiple disease diagnosis using the
developed model.
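All of the performance metrics listed above can be derived from the four counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `classification_metrics` and the example counts are hypothetical and are not taken from Narasimha Rao and Meher (2024).

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute the listed metrics from binary confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    both classes are predicted at least once).
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall, true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "fnr": fn / (fn + tp),
        "fpr": fp / (fp + tn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical counts for illustration only; they are not results from the study.
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
```

Because FNR equals one minus sensitivity and FPR equals one minus specificity, reporting all eight metrics adds no information beyond the confusion matrix itself, but MCC is useful on imbalanced datasets, where accuracy alone can look high even for a classifier that favours the majority class.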
In the study of Ma et al. (2021), a novel deep dual-side learning ensemble model is
proposed for the speech recognition task of Parkinson’s disease. The primary focus is
on the early diagnosis of Parkinson’s disease through Machine Learning (ML)-based speech
data analysis. The proposed model combines deep feature learning and deep sample learning
to enhance accuracy. An embedded stack group sparse autoencoder for deep feature
learning and an iterative mean clustering algorithm were introduced. The model achieved
high accuracy rates of 98.4% and 99.6% on two representative PD speech datasets,
Parkinson’s disease speech recognition has been demonstrated in this study. The study
emphasizes the importance of early diagnosis of Parkinson’s disease due to the increasing
prevalence of the disease. It discusses the significance of non-invasive and efficient detection
methods and the potential of speech data analysis for PD diagnosis and rehabilitation
assessment. The proposed algorithm is validated using two public Parkinson’s disease speech
datasets, and the results demonstrate its effectiveness in improving classification accuracy.
subjects with Parkinson’s disease and each subject had 9 speech samples of different
pronunciation tasks. Dataset 2 is the Sakar dataset that contains 40 subjects with 6 women and
14 men as PD patients. Moreover, the study compares the proposed algorithm with existing
relevant algorithms and highlights its superior performance in terms of accuracy, sensitivity,
and specificity. The proposed algorithm outperforms other methods, showcasing its
effectiveness in the speech recognition of Parkinson’s disease.
The deep dual-side learning integration model combines deep feature learning and deep sample
advancement in the field. A limitation of this study is the small sample size, which
restricts the number of layers in the deep sample space. Furthermore, the comparison
with control groups indicates that the number of samples is small, and the deep features may
The study of Er et al. (2021) addresses the detection of Parkinson’s disease (PD) from
speech signals with a new approach based on pre-trained deep networks and Long Short-Term
Memory (LSTM), using mel-spectrograms obtained from speech signals denoised with
Variational Mode Decomposition (VMD). The study utilized the PC-GITA dataset, which
consists of speech recordings from 50 patients with Parkinson’s disease and 50 healthy controls.
The proposed method involves several steps: noise removal with VMD, extraction of
mel-spectrograms, feature extraction using pre-trained deep networks (ResNet-18, ResNet-50,
and ResNet-101), and classification using LSTM to capture sequential information from the
extracted features. The results showed that the proposed method achieved high accuracy rates,
ranging from 94.26% to 98.61%, depending on the specific model architecture, batch size, and
learning rate. In conclusion, the study demonstrated that the proposed approach, which
combined CNN and LSTM using mel-spectrograms and VMD for PD detection, outperformed other
methods in the literature. The findings suggest that the proposed method has the potential to
improve the accuracy of PD diagnosis using speech signals. In terms of drawbacks, the PC-GITA
dataset may not fully
represent the diversity of speech signals in different populations. Additionally, the study
focuses on the detection of Parkinson's disease using speech signals and deep learning
approaches, which may not account for other potential factors or biomarkers that could
Across the reviewed studies, several datasets are used, including open-access databases
such as the speech dataset from the UCI Machine Learning Repository, PC-GITA, the dataset
from the GYENNO SCIENCE Parkinson’s Disease Research Center, the ItalianPVS dataset,
datasets from the Kaggle and GitHub websites, and the CIEMPIESS corpus, as well as some
self-collected databases. The speech dataset employed in
this study to investigate Parkinson's disease comprises a diverse assortment of audio recordings
from individuals with Parkinson's and a control group of healthy subjects. The dataset was
acquired from a reputable repository specializing in medical and healthcare datasets, ensuring
its reliability and alignment with the research objectives. Each participant contributed speech
samples spanning various linguistic tasks and vocal exercises. To uphold ethical standards, the
processing steps involved extracting acoustic features and normalizing speech signals to
potential biases and confounding factors in the dataset. The dataset's inclusivity of detailed
characteristics associated with Parkinson's disease. Its availability and adherence to ethical
2.5 Summary
non-motor symptoms, emphasizing the critical importance of early and precise diagnosis for
detect Parkinson’s disease are unable to detect it during the early stages of disease
progression. Recent investigations have delved into the capabilities of deep learning
models to automate the categorization of Parkinson's disease, utilizing various data modalities
such as speech signals. Therefore, an intelligent system that can detect Parkinson’s disease
through early signs or symptoms is vital for enabling earlier treatment to alleviate the
condition. The deep learning models reviewed include Convolutional Neural
Network (CNN), Transformer, VGG19, and Deep Neural Network (DNN). Firstly, CNN has
demonstrated its effectiveness in handling speech signals across various applications, such as
speech recognition. These networks excel at capturing local patterns and temporal
dependencies within audio spectrograms. Through the utilization of convolutional layers, CNN
can autonomously learn and extract pertinent features from diverse frequency bands in speech
signals, making them well-suited for tasks where the hierarchical representation of audio
features is pivotal. Next, transformer models have proven their adaptability to tasks related to
speech. The self-attention mechanism inherent in Transformers facilitates the capture of long-
range dependencies in speech signals, thereby enhancing the modeling of context and
relationships among different segments of the audio sequence. Moreover, VGG19 has been
successfully applied to the detection of Parkinson’s disease using MRI scan images. In
medical imaging, particularly with MRI scans, VGG19's deep layer structure proves
advantageous in extracting hierarchical features from the intricate details present in medical
images. VGG19 has demonstrated its ability to capture and interpret complex patterns and
structures within medical imagery. The strength of VGG19 lies in its capacity to discern
intricate patterns and subtle details within MRI scans, contributing to the accurate detection of
anomalies or specific medical conditions. However, it's crucial to consider the domain gap
between the original training data of VGG19 and the medical imaging data, necessitating
careful fine-tuning and adaptation strategies for optimal performance. Next, DNN exhibited
signals, making them adaptable to tasks such as speech recognition, speaker identification, and
emotion analysis. DNN architectures, particularly recurrent and convolutional variants, possess
the capability to capture temporal dependencies and spectral features, enabling a more
and DNN - offers distinct advantages in processing speech signals. Their efficacy is contingent
upon the specific task at hand, be it speech recognition, emotion analysis, or speaker
identification. Researchers often select or adapt these models based on the inherent
characteristics of the speech data and the specific requirements of the target applications.
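To make the self-attention mechanism mentioned in this summary concrete, the following minimal sketch computes scaled dot-product self-attention over a toy sequence of feature frames. It is a deliberate simplification: the queries, keys, and values are the raw frames themselves (identity projections), whereas real Transformers use learned projections, multiple heads, and positional encodings. The function names and the toy frames are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity query/key/value projections."""
    dim = len(frames[0])
    outputs, weights = [], []
    for query in frames:
        # Compare this frame with every frame in the sequence, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in frames]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all frames.
        outputs.append([sum(w * frame[d] for w, frame in zip(row, frames)) for d in range(dim)])
    return outputs, weights


# Toy "feature frames": the first and last frames are similar, the middle ones differ.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frames)
```

In this toy sequence the first frame attends as strongly to the last frame as to itself, and more strongly than to its immediate neighbour, because the attention weight depends on similarity rather than on distance in the sequence. This is the property that allows long-range dependencies in a speech signal to be captured.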
List of reviewed articles and journals with deep learning models used for the detection of Parkinson’s disease using speech signals with accuracy.
Table 2.1: Summary of deep learning models used for the detection of Parkinson’s disease using speech signals.
(Nijhawan et al., Transformer- UCI Machine Learning Repository Dysphonia measures It outperforms the current SOTA GBDT-
2023) based model: collected at the Dept. of Neurology based solution by at least 1% Area Under
XGBoost of in the Faculty of Medicine, the Receiver Operating Characteristic
GBDT. Istanbul University. It has 188 Curve (AUC) score. The average ROC-
patients with Parkinson’s disease AUC for the proposed network is 0.8574.
and 64 healthy controls.
(Goyal, Khandnor, CNN Dataset D1 was from the “Mobile Resonance-based Accuracy using the combination features:
& Aseri, 2021) Device Voice Recordings at features (TQWT: 100% Validation accuracy: 99.37%.
King’s College London with 16 Tunable Q-factor
patients with Parkinson’s disease wavelet transform)
and 21 healthy controls.
34
people who are Parkinson’s pitch perturbation
disease patients and 50 healthy quotient, energy), The read-text dataset shows an accuracy
controls. prosody (duration, F0 of 91%.
contour, energy
contour), and The monologue dataset shows an accuracy
articulation (MFCCs). of 86.36%.
35
Parkinson’s disease and 34 healthy
controls via different phones and
DVs.
(Narendra et al., End-to-end PC-GITA database. Baseline and glottal The accuracy of Baseline + Glottal (IAIF)
2021) approach (CNN + features (articulation, is 67%, while the accuracy of baseline +
MLP) phonation, and prosody glottal (QCP) is 67.93%.
features.)
(Pragadeeswaran AIPB + QCNN UCI Parkinson speech dataset. Monophonic speech, The detection accuracy of AIPB + QCNN is
& Kannimuthu, short sentences, and 98.5%, sensitivity of 98.6%, and specificity
2024) dynamic speech of 98.5%.
features.
(Celik & Başaran, SkipConNet + RF The first dataset (PDO_Dataset) PDO_Dataset: It The accuracy of PDO_Dataset using
2023) based on CNN and was created in collaboration with extracts 23 features SkipConNet + RF is 99.11% while the
random forest the University of Colorado from each sound accuracy of PD_Dataset has an accuracy of
(RF). National Center for Voice and recording (no mention 98.30%.
Speech and the University of of exact speech
Oxford. This dataset consists of features).
195 speech recordings from 31
individuals (8 healthy controls and PD_Dataset: It extracts
23 Parkinson’s patients). 6 different signal
processing techniques
The other dataset (PD_Dataset) is such as baseline
collected at the Istanbul University features, time-frequency
Cerrahpasa Faculty of Medicine, features, MFCCs,
Department of Neurology. It wavelet transform-
consists of 252 individuals with 64 based features, vocal
healthy controls and 188 with fold features, and
Parkinson’s disease. TQWT features.
36
(Bhatt et al., 2023) SLT and DNN. PC-GITA dataset and ItalianPVS Features used are The accuracy of VGG-16 with SLT
dataset. phonation, articulation, achieves an accuracy of 92% while it
and prosody features. achieves 96% accuracy on the ItalianPVS
dataset.
(C. Quan et al., Bidirectional The dataset was collected from 45 Phonation (jitter, The accuracy of articulation features using
2021) Long-short term subjects (15 healthy controls and temporal perturbation of Bidirectional LSTM is 77.36% for input
memory (LSTM) 30 Parkinson’s patients) who are the fundamental monophonic speech /a/ and it has an
model to capture volunteers at the GYENNO frequency, shimmer, accuracy of 84.29% for input short
time-series SCIENCE Parkinson Disease amplitude perturbation sentence speech.
dynamic features Research Center. quotient, pitch
of a speech signal. perturbation quotient),
and articulation features
(vocal formants, vowel
space area, vocal
pentagon area, formant
centralization ratio).
(Escobar-Grisales 1D CNN and 2D Data is collected from 165 Speech features: The The accuracy of speech modality yielded an
et al., 2023) CNN. Colombian Spanish native participant was asked to accuracy of 88% and outperformed all
speakers, wherein 80 of them describe a regular day in language representations, including the
suffer from Parkinson’s disease. his/her life for multi-modal approach.
approximately 90 days.
(Costantini et al., CNN. Data is collected from 266 healthy Mel spectrograms. The mean accuracy of CNN is 69.75% and
2023) control and 160 Parkinson’s 53% for the mid-advanced level. The
patients (72 subjects are newly accuracy of CNN is 61% of the mean
diagnosed, 88 subjects with accuracy for multiclass tasks.
medium-to-advanced impairment
patients).
(Khaskhoussy & CNN + MLP, and Data is collected by Sakar et al. It Phonetic-prosodic The proposed systems outperform existing
Ayed, 2023) CNN + SVM. consists of 1208 voice recordings features. approaches, achieving an accuracy of
in WAV format. 100%.
37
(Narasimha Rao & ORG-RGRU Dataset 1: It is collected from the MFCC, Cepstral and The proposed model shows an accuracy of
Meher, 2024) Kaggle website “https Spectral features, 95%.
://www.kaggle.com/datasets/dipa principle speech
yanbiswas/parkinsons-disease- features, and pitch
speech-signal-features” features (zero frequency
response filter).
Dataset 2: It is represented by
“https://github.com/Mak-
Sim/Troparion/tree/master/SPA20
19″.
(Ma et al., 2021) A deep dual-side Dataset 1: Dataset 1: Continuous The accuracy reaches 98.4% and 99.6% for
learning ensemble LSVT_voice_rehabilitation vowel sounds the two respective datasets.
model is dataset. This dataset consists of 14
developed. subjects with Parkinson’s disease
Dataset 2: 26 Turkish
and each subject had 9 speech pronunciation tasks
samples of different pronunciation
including continuous
tasks. vowels, numbers,
words, and short
Dataset 2: Sakar dataset that sentences.
contains 40 subjects with 6 women
and 14 men as PD patients.
(Guatelli et al., CNN and hybrid The database is obtained from Sustained phonation of The accuracy of ELM and CNN is similar
2023) CNN-ELM Giuliano. This dataset consists of the letter “a”. and reaches a maximum of 91.30%.
(Extreme 55 people suffering from
Learning Parkinson’s disease and 64 healthy
Machine). controls.
38
(Er et al., 2021) Pre-trained deep
PC-GITA Spanish dataset. The Mel spectrogram The highest classification performance is
dataset consists of 50 Parkinson’s
networks and long obtained from RESNET-101 + LSTM
short-term patients and 50 healthy controls. model with VMD as 98.61%.
memory (LSTM).The details of the dataset include
recording monologues and vowels
and reading the text.
(Yao et al., 2022) DCNN. CIEMPIESS corpus dataset with a Auditory The accuracy is up to 95.77%.
total of 16717 sound recordings. characteristics, such as
spectro-temporal
sparsity, frequency
masking, time masking,
pitch shifting, and time
stretching.
(Gemci & Ibrikci, Feed-forward UCI Machine Learning Repository Jitter variants, shimmer The results are classified with 100%
2019) Neural Network variants, fundamental accuracy using an 80-20% train-test data
(FFNN) frequency, baseline partition and 30 epoch numbers.
feature, harmonicity,
recurrent period density
entropy (RPDE), pitch
period entropy (PPE),
intensity parameters,
formant frequencies,
glottis quotient (GQ),
vocal fold features, mel
frequency cepstral
coefficients (MFCCs),
wavelet transform-
based features, and
39
tunable Q-factor
wavelet transform
(TWQT) features.
(Huseyn, 2020) | Feed-forward Neural Network (FFNN). | UCI Machine Learning Repository. | Jitter variants, shimmer variants, fundamental frequency, baseline feature, harmonicity, recurrent period density entropy (RPDE), pitch period entropy (PPE), intensity parameters, formant frequencies, glottis quotient (GQ), vocal fold features, mel frequency cepstral coefficients (MFCCs), wavelet transform-based features, and tunable Q-factor wavelet transform (TQWT) features. | Classification reaches 100% accuracy with an 80-20% train-test data partition and 30 epochs.
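The 80-20% train-test partition reported in the last two rows can be illustrated with a minimal sketch. The function names, the fixed seed, and the toy data below are assumptions made for illustration only; they are not taken from Gemci & Ibrikci (2019) or Huseyn (2020), whose exact splitting procedure is not described in the table.

```python
import random


def train_test_split_80_20(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle the indices once, then cut them into train and test partitions."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if not actual:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical toy data: 10 feature vectors with alternating labels
# (1 = Parkinson's disease, 0 = healthy control).
features = [[float(i), float(i) * 0.5] for i in range(10)]
targets = [i % 2 for i in range(10)]

train_set, test_set = train_test_split_80_20(features, targets)
print(len(train_set), len(test_set))  # 8 2
```

The reported accuracy would then be computed only on the held-out 20%. When a dataset holds several recordings per speaker, splitting by subject rather than by recording is one way to keep the same voice out of both partitions.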
The Best Introduction to Deep Learning - A Step by Step Guide. (2023, July 21). Retrieved from
https://www.simplilearn.com/tutorials/deep-learning-tutorial/introduction-to-deep-learning
Bhatele, K. R. (2020). Classification of Neurodegenerative Diseases Based on VGG 19 Deep Transfer
Learning Architecture: A Deep Learning Approach. Bioscience Biotechnology Research
Communications, 13, 1972-1980. doi:10.21786/bbrc/13.4/51
Bhatt, K., Jayanthi, N., & Kumar, M. (2023). High-resolution superlet transform based techniques for
Parkinson's disease detection using speech signal. Applied Acoustics, 214, 109657.
doi:10.1016/j.apacoust.2023.109657
Celik, G., & Başaran, E. (2023). Proposing a new approach based on convolutional neural networks and
random forest for the diagnosis of Parkinson's disease from speech signals. Applied Acoustics,
211, 109476. doi:10.1016/j.apacoust.2023.109476
Chauhan, N. S. (2022, March 15). Transformer Neural Network in Deep Learning: Explained. Retrieved
from https://www.theaidream.com/post/transformer-neural-network-in-deep-learning-explained#:~:text=Transformers%20are%20the%20current%20state,these%20is%20a%20machine%20translation.
Cordella, F., Paffi, A., & Pallotti, A. (2021, June 23-25). Classification-based screening of
Parkinson’s disease patients through voice signal. Paper presented at the 2021 IEEE
International Symposium on Medical Measurements and Applications (MeMeA).
Costantini, G., Cesarini, V., Di Leo, P., Amato, F., Suppa, A., Asci, F., . . . Saggio, G. (2023). Artificial
Intelligence-Based Voice Assessment of Patients with Parkinson's Disease Off and On
Treatment: Machine vs. Deep-Learning Comparison. Sensors (Basel, Switzerland), 23.
Di Pietro, D. A., Olivares, A., Comini, L., Vezzadini, G., Luisa, A., Petrolati, A., . . . Vitacca, M. (2022).
Voice Alterations, Dysarthria, and Respiratory Derangements in Patients With Parkinson's
Disease. Journal of Speech, Language, and Hearing Research, 65(10), 3749-3757. doi:10.1044/2022_jslhr-21-00539
Er, M. B., Isik, E., & Isik, I. (2021). Parkinson’s detection based on combined CNN and LSTM using
enhanced speech signals with Variational mode decomposition. Biomedical Signal Processing
and Control, 70, 103006. doi:10.1016/j.bspc.2021.103006
Escobar-Grisales, D., Ríos-Urrego, C. D., & Orozco-Arroyave, J. R. (2023). Deep Learning and Artificial
Intelligence Applied to Model Speech and Language in Parkinson’s Disease.
Diagnostics, 13(13). doi:10.3390/diagnostics13132163
Fang, H., Gong, C., Zhang, C., Sui, Y., & Li, L. (2021). Parkinsonian Chinese Speech Analysis towards
Automatic Classification of Parkinson's Disease.
Gemci, F., & Ibrikci, T. (2019). Using Deep Learning Algorithm to Diagnose Parkinson Disease
with High Accuracy.
Goyal, J., Khandnor, P., & Aseri, T. C. (2021). A Hybrid Approach for Parkinson’s Disease diagnosis with
Resonance and Time-Frequency based features from Speech signals. Expert Systems with
Applications, 182, 115283. doi:10.1016/j.eswa.2021.115283
Guatelli, R., Aubin, V., Mora, M., Naranjo-Torres, J., & Mora-Olivari, A. (2023). Detection of Parkinson’s
disease based on spectrograms of voice recordings and Extreme Learning Machine random
weight neural networks. Engineering Applications of Artificial Intelligence, 125, 106700.
doi:10.1016/j.engappai.2023.106700
Gunduz, H. (2019). Deep Learning-Based Parkinson’s Disease Classification Using Vocal Feature Sets.
IEEE Access, 7, 115540-115551. doi:10.1109/ACCESS.2019.2936564
Huseyn, E. (2020). Diagnosing Parkinson's Disease using Deep Learning Algorithms.
Introduction to Deep Neural Networks. (2023, July). Retrieved from
https://www.datacamp.com/tutorial/introduction-to-deep-neural-networks
Kaushik, A. (2023). Understanding the VGG19 Architecture. Retrieved from
https://iq.opengenus.org/vgg19-architecture/
Khaskhoussy, R., & Ayed, Y. B. (2023). Improving Parkinson’s disease recognition through voice
analysis using deep learning. Pattern Recognition Letters, 168, 64-70.
doi:10.1016/j.patrec.2023.03.011
Ma, J., Zhang, Y., Li, Y., Zhou, L., Qin, L., Zeng, Y., . . . Lei, Y. (2021). Deep dual-side learning ensemble
model for Parkinson speech recognition. Biomedical Signal Processing and Control, 69, 102849.
doi:10.1016/j.bspc.2021.102849
Mandal, M. (2023, November 16). Introduction to Convolutional Neural Networks (CNN).
Retrieved from https://www.analyticsvidhya.com/blog/2021/05/convolutional-neural-networks-cnn/
Narasimha Rao, P. V. L., & Meher, S. (2024). ORG-RGRU: An automated diagnosed model for multiple
diseases by heuristically based optimized deep learning using speech/voice signal. Biomedical
Signal Processing and Control, 88, 105493. doi:10.1016/j.bspc.2023.105493
Narendra, N. P., Schuller, B., & Alku, P. (2021). The Detection of Parkinson's Disease From Speech
Using Voice Source Information. IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 29, 1925-1936. doi:10.1109/TASLP.2021.3078364
Nijhawan, R., Kumar, M., Arya, S., Mendirtta, N., Kumar, S., Towfek, S. K., . . . Abdelhamid, A. A. (2023).
A Novel Artificial-Intelligence-Based Approach for Classification of Parkinson’s Disease
Using Complex and Large Vocal Features. Biomimetics, 8(4). doi:10.3390/biomimetics8040351
Parkinson’s Disease. (2023). Retrieved from
https://www.aans.org/en/Patients/Neurosurgical-Conditions-and-Treatments/Parkinsons-Disease
Pragadeeswaran, S., & Kannimuthu, S. (2024). An Adaptive Intelligent Polar Bear (AIPB) Optimization-
Quantized Contempo Neural Network (QCNN) model for Parkinson’s disease diagnosis using
speech dataset. Biomedical Signal Processing and Control, 87, 105467.
doi:10.1016/j.bspc.2023.105467
Quan, C., Ren, K., & Luo, Z. (2021). A Deep Learning Based Method for Parkinson’s Disease Detection
Using Dynamic Features of Speech. IEEE Access, 9, 10239-10252.
doi:10.1109/ACCESS.2021.3051432
Quan, C., Ren, K., Luo, Z., Chen, Z., & Ling, Y. (2022). End-to-end deep learning approach for Parkinson’s
disease detection from speech signals. Biocybernetics and Biomedical Engineering, 42(2),
556-574. doi:10.1016/j.bbe.2022.04.002
Rahman, S., Hasan, M., Sarkar, A., & Khan, F. (2023). Classification of Parkinson’s Disease using Speech
Signal with Machine Learning and Deep Learning Approaches. European Journal of Electrical
Engineering and Computer Science, 7, 20-27. doi:10.24018/ejece.2023.7.2.488
Semenova, E. I., Partevian, S. A., Shulskaya, M. V., Rudenok, M. M., Lukashevich, M. V., Baranova, N.
M., . . . Alieva, A. K. (2023). Analysis of ADORA2A, MTA1, PTGDS, PTGS2, NSF, and HNMT Gene
Expression Levels in Peripheral Blood of Patients with Early Stages of Parkinson's Disease.
BioMed Research International, 1-8. doi:10.1155/2023/9412776
Sigcha, L., Borzì, L., Amato, F., Rechichi, I., Ramos-Romero, C., Cárdenas, A., . . . Olmo, G. (2023). Deep
learning and wearable sensors for the diagnosis and monitoring of Parkinson’s disease: A
systematic review. Expert Systems with Applications, 229, 120541.
doi:10.1016/j.eswa.2023.120541
Sivaranjini, S., & Sujatha, C. M. (2019). Deep learning based diagnosis of Parkinson’s disease using
convolutional neural network. Multimedia Tools and Applications, 79, 15467-15479.
Váradi, C. (2020). Clinical Features of Parkinson’s Disease: The Evolution of Critical Symptoms. Biology,
9(5). doi:10.3390/biology9050103
Yao, D., Chi, W., & Khishe, M. (2022). Parkinson’s disease and cleft lip and palate of pathological speech
diagnosis using deep convolutional neural networks evolved by IPWOA. Applied Acoustics, 199,
109003. doi:10.1016/j.apacoust.2022.109003
Zahid, L., Maqsood, M., Durrani, M. Y., Bakhtyar, M., Baber, J., Jamal, H., . . . Song, O. Y. (2020). A
Spectrogram-Based Deep Feature Assisted Computer-Aided Diagnostic System for Parkinson’s
Disease. IEEE Access, 8, 35482-35495. doi:10.1109/ACCESS.2020.2974008