
2.1 Introduction

Parkinson’s disease was first described by James Parkinson in 1817 as “paralysis agitans”
(Váradi, 2020). It is considered one of the world’s fastest-growing neurological disorders
after Alzheimer’s disease (Changqin Quan, Ren, Luo, Chen, & Ling, 2022) and affects 1%-2% of
people over the age of 60. The disease is incurable and is more common in men than in women,
as estrogen in women is thought to have a protective effect. Apart from that, nicotine, which
has been linked to monoamine oxidase B (MAO-B) inhibition, and caffeine in coffee, which acts
on adenosine A2A receptors, have potential protective effects against Parkinson’s disease.
The signs and symptoms of Parkinson’s disease become more obvious as the disease progresses,
which decreases the quality of life of both patients and their family members. According to
Váradi (2020), about 60,000 people in the United States are diagnosed with Parkinson’s
disease each year, at a socioeconomic cost of approximately 20 billion dollars per year.

2.2 Overview of Parkinson’s disease

Parkinson’s disease is a progressive disorder caused by the loss of dopamine, a
neurotransmitter in the brain. Dopamine is responsible for sending messages to the basal
ganglia, the part of the brain that controls movement and coordination. The motor symptoms of
Parkinson’s disease become noticeable only after about 70% of the neurons have been lost,
which can be more than 20 years after the onset of the neurodegenerative process (Semenova et
al., 2023). This long prodromal period is therefore the dominant reason for late diagnosis of
the disease, which in turn delays treatment to reduce the symptoms.

2.2.1 Different stages of Parkinson’s disease

The stages of Parkinson’s disease depend on the degree of progression of the disease. The
Movement Disorder Society-sponsored revision of the Unified Parkinson’s Disease Rating Scale
(MDS-UPDRS) is the scale used by neurologists to measure this progression and to assess the
motor and non-motor symptoms of the patient. MDS-UPDRS has four parts. Part 1 covers the
non-motor experiences of daily living and assesses the impact of Parkinson’s disease on the
daily activities, mood, and cognition of the patient. Part 2 covers the motor experiences of
daily living and evaluates motor function during daily activities such as speaking, writing,
eating, or swallowing. Part 3 is the motor examination, which assesses motor symptoms such as
tremors, rigidity, bradykinesia (slowness of movement), and postural instability. Part 4
addresses the motor complications (side effects) related to the treatment of the disease,
such as motor fluctuations and dyskinesias.

The stages of Parkinson’s disease can be divided into preclinical, prodromal, and clinical
stages. During the preclinical stage, neurodegeneration begins in the substantia nigra of the
brain without displaying any signs. The disease then proceeds to the prodromal stage, which
lasts more than 10 years and in which non-motor symptoms start due to the continuous neuronal
loss (Váradi, 2020). Because non-motor symptoms appear in the prodromal stage, earlier than
motor symptoms, they are useful indicators for early detection of the disease. Next, the
disease reaches the early stage, in which approximately 40% to 60% of the dopamine in the
brain cells has been lost and motor symptoms such as bradykinesia (slowed movements),
rigidity, and tremors appear. With further deterioration, the disease proceeds to the
mid-stage, which shows non-motor symptoms such as orthostatic hypotension and some urinary
symptoms,

and motor symptoms such as axial deformities and dyskinesias. In the late stage of the
disease, motor symptoms include postural instability, and non-motor symptoms include
hallucinations and dementia.

Figure 1: Timeline of clinical signs for different stages of Parkinson’s disease.

2.2.2 General symptoms of Parkinson’s disease

The symptoms of Parkinson’s disease can be divided into motor and non-motor symptoms. Motor
symptoms are related to movement and include resting tremor, muscle rigidity, bradykinesia
(slowed movements), and postural instability. Patients with bradykinesia are not able to
generate enough energy for muscle movements and therefore cannot produce fast movements. As a
result, they have slower reaction times and increased difficulty in performing multiple tasks
simultaneously. A resting tremor is a rhythmic muscle contraction and relaxation that usually
occurs in the hands, lips, chin, and jaw. Rigidity arises from inflexibility of the limbs and
neck, and the muscle stiffness limits the range of motion. Motor symptoms are more obvious
than non-motor symptoms.

Non-motor symptoms include disruptions in sleeping and difficulty in swallowing, chewing, and
speaking (Er, Isik, & Isik, 2021). Examples of sleep disruptions are restless legs syndrome,
REM behavior disorder, and sleep apnea (Váradi, 2020). In terms of psychotic symptoms, the
patient may develop impulse control disorders followed by hallucinations. Non-motor symptoms
also include potential cognitive impairment such as impaired judgment and identity confusion.
Other indications include very quiet speech, an expressionless face, shaky handwriting, and
difficulty in standing up from a chair. For example, finger tremors change the handwriting of
a patient with Parkinson’s disease, which tends to become small and cramped. This change in
handwriting is known as micrographia.

Figure 2: Handwriting of a patient with Parkinson’s disease.

2.2.3 Speech changes associated with Parkinson’s disease

Paralinguistic features such as body language, gestures, facial expressions, tone, and pitch
of voice are part of the speech signal. Patients with Parkinson’s disease have difficulties
in daily activities such as greeting each other, reciprocating and initiating social
interaction, and using suitable body language (Narasimha Rao & Meher, 2024). Up to 90% of
people with Parkinson’s disease develop voice and speech disorders (Di Pietro et al., 2022).
Moreover, speech disorders due to Parkinson’s disease can be detected as early as five years
before the appearance of other motor dysfunctions (Narendra, Schuller, & Alku, 2021). The
effect on the speech of a patient with Parkinson’s disease is described in terms of prosody,
articulation, and phonation.

For example, patients with Parkinson’s disease tend to have impaired pronunciation of vowels,
phrases, and terms (Er et al., 2021). Mechanically, speech signals reflect the flow of air
from the lungs and the state of the glottis, which are governed by air volume and laryngeal
muscle activation. Impaired vibration of the vocal cords affects the fundamental frequency,
voice quality, and voice intensity (Narasimha Rao & Meher, 2024). Ventilation dysfunction is
the main cause of the underlying voice impairment in patients with Parkinson’s disease. The
speech characteristics of patients with Parkinson’s disease include reduced vocal tract
volume and tongue flexibility, inappropriate pauses, impairments in voice quality, and a
reduction in pitch range and voice intensity (Narendra et al., 2021). Moreover, the reduced
fundamental frequency range results in a monotone voice, and vowels and consonants may be
pronounced imprecisely. For example, abnormal vowel articulation is a sign of speech disorder
in the early stages of Parkinson’s disease, as the fine muscles that control voice production
could be affected more intensely (Cordella, Paffi, & Pallotti, 2021). The speech disorders
due to Parkinson’s disease are closely associated with pulmonary dysfunction, as rigidity in
the respiratory muscles could reduce the ability to generate enough expiratory pressure for
voice production or airway clearance. Currently, there is no definitive method for diagnosing
Parkinson’s disease. MRI scans help in the diagnosis because they provide details about the
subcortical structures of the human brain, but these details are difficult to analyse by eye
(Bhatele, 2020). Therefore, it is crucial to develop an accurate and timely diagnosis method
for Parkinson’s disease.

2.2.4 Conventional approach in diagnosis of Parkinson’s disease

Currently, the diagnosis of Parkinson’s disease is based mainly on clinical criteria and the
presence of symptoms such as bradykinesia (slowness in movements), rigidity, and tremor.
Advances in imaging, genetics, and biomarkers have not improved the accuracy of diagnosis,
especially in the early stages of the disease. Magnetic resonance imaging (MRI) is used to
aid the diagnosis of Parkinson’s disease as it shows accurate neuro-anatomic biomarkers. MRI
scans can show the subcortical structures of the human brain in detail, but they are
difficult to analyse by eye. Positron emission tomography (PET) and single photon emission
computed tomography (SPECT) are used to quantify the loss of nigrostriatal dopaminergic
fibers in Parkinson’s disease and to detect dopaminergic dysfunction in patients (Guatelli,
Aubin, Mora, Naranjo-Torres, & Mora-Olivari, 2023). Both PET and SPECT scans can provide
useful images of the progression of Parkinson’s disease.

2.2.5 Current treatments in improving the symptoms of Parkinson’s disease

Pharmacotherapy and surgery are the approaches used to alleviate the symptoms of Parkinson’s
disease (Sigcha et al., 2023). Examples of drugs used to improve the signs of Parkinson’s
disease are levodopa and dopamine agonists. As patients with Parkinson’s disease have a
reduced amount of dopamine in the substantia nigra of the brain, levodopa acts as a dopamine
replacement agent that provides good control of motor symptoms in the early stages of the
disease. For instance, levodopa is highly effective in reducing bradykinetic symptoms but
does not stop the progression of the disease. In the long run, the use of these drugs could
trigger side effects such as motor fluctuations and dyskinesias (Sigcha et al., 2023).
Dopamine agonists such as bromocriptine, pergolide, pramipexole, and ropinirole are
medications that mimic the action of dopamine, a chemical messenger in the brain. Dopamine
agonists can be prescribed independently or in

conjunction with levodopa, and they can be used in the initial phases of the disease or to
extend the effectiveness of levodopa. However, dopamine agonists tend to cause more side
effects than levodopa. Examples of side effects include drowsiness, dizziness, nausea,
vomiting, and dry mouth.

Apart from medications, surgery is another approach to alleviate the symptoms of Parkinson’s
disease. Examples of surgeries are pallidotomy, thalamotomy, and deep brain stimulation (DBS)
("Parkinson’s Disease," 2023). Pallidotomy is recommended for individuals with severe
Parkinson’s symptoms or those who exhibit resistance to medications such as levodopa. The
procedure involves the insertion of a wire probe into the globus pallidus, a brain region
approximately a quarter inch in size that is responsible for movement control. Introducing
lesions to the globus pallidus can help restore normal movements in patients with Parkinson’s
disease. In sum, pallidotomy can alleviate medication-induced dyskinesias, tremors, muscle
rigidity, and the gradual decline of spontaneous movement. Thalamotomy is a surgery that uses
radiofrequency energy currents to eliminate a small, specific portion of the thalamus. It is
advantageous for patients with debilitating tremors in the hand or arm, particularly those
with essential tremors. Deep brain stimulation (DBS) presents a safer alternative to both
pallidotomy and thalamotomy. It uses small implanted electrodes to deliver electrical
impulses to either the subthalamic nucleus or the globus pallidus, which are brain regions
responsible for motor function. Magnetic resonance imaging (MRI) and neurophysiological
mapping are used to guide the implantation of the electrodes into the brain. The electrodes
are connected by wires to an implantable pulse generator (IPG) resembling a pacemaker, which
is positioned beneath the skin under the collarbone. Electrodes are placed on one side of the
brain. For example, an electrode on the left side of the brain

controls symptoms on the right side of the body, and vice versa. In some cases, patients may

require stimulators on both sides of the brain.

2.3 Introduction to Deep Learning

To increase the classification accuracy of Parkinson’s disease, deep-learning-based diagnosis
approaches can be used. Deep learning is a subset of machine learning (ML) that utilizes
complex algorithms and deep neural networks to train a model ("The Best Introduction to Deep
Learning - A Step by Step Guide," 2023). It relies on artificial neural networks (ANN) with
multiple layers of interconnected nodes, known as deep neural networks (DNN). Given large
pools of data, deep learning stands out for its capacity to automatically learn intricate
patterns and relationships within datasets without explicit programming. However, a large
amount of data is required to achieve high accuracy. Applications of deep learning include
natural language processing, speech recognition, image recognition, and recommendation
systems. Examples of popular deep learning architectures are Convolutional Neural Networks
(CNN), Recurrent Neural Networks (RNN), and Deep Belief Networks (DBN).

In a fully connected deep neural network, the architecture consists of an input layer, one or
more hidden layers, and an output layer that are sequentially connected. Each neuron is
connected to the neurons of the adjacent layers. Neurons in each layer receive input from the
neurons of the preceding layer or, in the first hidden layer, directly from the input layer.
The output generated by one neuron acts as the input for neurons in subsequent layers until
the final layer produces the output of the network. The data undergo non-linear
transformations as they pass through the different layers.
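The layer-by-layer flow described above can be illustrated with a minimal sketch in plain Python. The weights, biases, layer sizes, and the choice of ReLU and sigmoid activations are hypothetical values chosen for illustration only; they are not taken from any cited model.

```python
import math

def relu(values):
    """Non-linear activation that keeps positive values and zeroes the rest."""
    return [max(0.0, v) for v in values]

def sigmoid(values):
    """Non-linear activation that squashes each value into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: every output neuron sees every input."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Pass the input through each (weights, biases, activation) layer in order."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = activation(dense(activations, weights, biases))
    return activations

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], relu),
    ([[1.0, -1.0]], [0.0], sigmoid),
]
output = forward([1.0, 2.0, 3.0], layers)
```

In this sketch the output of each layer becomes the input of the next, and the single value in `output` lies between 0 and 1 because of the final sigmoid.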

2.3.1 Convolutional Neural Network (CNN)

Convolutional Neural Network (CNN) is a prominent deep learning model that is

extensively utilized for the classification of images, audio, or videos. It stands out as a

specialized form of artificial neural network designed specifically for processing and analyzing

visual data. The applications of CNN involve image recognition, object detection, and the

classification of images. CNN exhibits the capability to learn hierarchical feature

representation autonomously and adaptively from input images. The structure of CNN consists

of different layers such as convolutional layers, pooling layers, and fully connected layers.

Convolutional layers employ filters or kernels to process input data and allow the network to

discern spatial patterns and features across different areas of an image. Pooling layers are

responsible for minimizing the spatial dimensions of the data through down-sampling and

focusing more on the most pertinent information. Subsequently, fully connected layers use the
learned features to formulate predictions or classifications. For instance, the first layer
of a CNN extracts the basic features of an image, such as horizontal and diagonal edges
(Mandal, 2023). The output of the first layer is then passed to the next layers to detect
more complicated features such as corners or combinations of edges. As outputs pass through
successive layers, the network can detect more complex features such as objects or faces. In
the final layer, the classification output is given as confidence scores (between 0 and 1)
that specify the class of the input.

Figure 3: Visual representation of the application of Convolutional Neural Network (CNN).
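The convolution and pooling operations described in this section can be sketched in plain Python as follows. The toy image and the hand-set vertical-edge filter are assumptions made for illustration; in a trained CNN the filter values are learned from data.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling that keeps the strongest response per window."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# Toy 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical hand-set vertical-edge filter.
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, vertical_edge)  # 4x4 map, non-zero only at the edge
pooled = max_pool(feature_map)              # 2x2 map after down-sampling
```

The feature map responds only where the edge lies, and pooling halves each spatial dimension while keeping that response.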

2.3.2 Transformer

Transformers are a category of neural networks that learn context and meaning by analyzing
sequential data. They are widely used in natural language processing (NLP) and computer
vision. Transformers use an evolving set of mathematical techniques, known as attention or
self-attention, that detect how distant data elements influence and depend on each other. In
this way, transformers differentially weigh the significance of each part of the input data
(Chauhan, March 15, 2022). The architecture of a transformer is made up of three components:
the encoder, the decoder, and the attention mechanism. The encoder transforms an input
sequence into state representation vectors. The attention mechanism enables the model to
concentrate selectively on the pertinent parts of the sequential input and thereby helps to
contextualize the input data. The decoder decodes the state representation vectors to produce
the desired target output sequence. At the beginning, inputs and outputs are embedded into an
n-dimensional space, so the inputs need to be encoded. Because the transformer contains no
recurrent neural network to remember the order in which the sequence is fed into the model,
each word or part of the sequence must be assigned a relative position. These positions are
incorporated into the embedded representation of each word.
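As a rough illustration of how position information and attention weights interact, the sketch below adds a sinusoidal positional encoding to a few embedding vectors and then applies scaled dot-product self-attention. Several simplifications are assumed: the sinusoidal scheme is only one common choice of position encoding, the embeddings are hypothetical values, and the learned query, key, and value projections of a real transformer are omitted.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding of a position (one common choice, assumed here)."""
    return [math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(position / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)  # how much this token attends to each token
        all_weights.append(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(len(query))])
    return outputs, all_weights

# Hypothetical 4-dimensional embeddings for a three-token sequence.
embeddings = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
# Add position information so that token order is not lost.
inputs = [[e + p for e, p in zip(vec, positional_encoding(pos, 4))]
          for pos, vec in enumerate(embeddings)]
contextual, attention_weights = self_attention(inputs)
```

Each row of `attention_weights` sums to 1 and states how strongly one token draws on every other token when its contextual vector is formed.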

Figure 4: The architecture of the Transformer.

As shown in Figure 4, the encoder forms the left half of the architecture and the decoder
forms the right half. The encoder maps an input sequence to a sequence of continuous
representations, which is then fed into the decoder. The decoder receives the output of the
encoder along with its own output from the previous time step and generates an output
sequence (Chauhan, March 15, 2022).

2.3.3 VGG19

VGG19, named after the Visual Geometry Group, is a Convolutional Neural Network (CNN) with 19
weight layers that is often used as a deep transfer learning model. VGG19 is made up of 16
convolutional layers and 3 fully connected layers, together with 5 MaxPool layers and 1
SoftMax layer.

VGG19 is a deep CNN that is used to classify images (Kaushik, 2023). It employs small 3x3
convolutional filters throughout the network, along with max-pooling layers for spatial
down-sampling. This uniform structure simplifies both the comprehension and implementation of
the model, contributing significantly to its widespread adoption. In image recognition tasks,
VGG19 has shown remarkable performance in learning intricate hierarchical features from
images.

Figure 5: The architecture of VGG19.
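The layer counts quoted for VGG19 can be checked with a short sketch. The configuration list below follows the commonly cited VGG19 layout, and the 224x224 input size and 1000-class output are assumptions taken from the usual ImageNet setting rather than from this section.

```python
# Commonly cited VGG19 layout: numbers are the output channels of 3x3
# convolutions and "M" marks a 2x2 max-pooling layer.
VGG19_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
                512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]
FULLY_CONNECTED = [4096, 4096, 1000]  # the last layer feeds the SoftMax

conv_layers = sum(1 for item in VGG19_CONFIG if item != "M")
pool_layers = VGG19_CONFIG.count("M")
weight_layers = conv_layers + len(FULLY_CONNECTED)

def output_size(input_size=224):
    """Spatial size after the convolutional stack.

    Padded 3x3 convolutions keep the size; each max-pooling layer halves it.
    """
    size = input_size
    for item in VGG19_CONFIG:
        if item == "M":
            size //= 2
    return size
```

Under these assumptions the 16 convolutional and 3 fully connected layers give the 19 weight layers in the name, and the five pooling layers reduce a 224x224 input to a 7x7 feature map.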

2.3.4 Deep Neural Network (DNN)

A Deep Neural Network (DNN), also known as a deep net, is a neural network characterized by
its elevated complexity. A DNN consists of an input layer, an output layer, and at least one
hidden layer between them. DNNs can handle unlabelled and unstructured data and are often
used in the field of computer vision. They often have a complex hidden layer structure with a
wide variety of layers, such as convolutional layers, max-pooling layers, dense layers, and
other unique layers ("Introduction to Deep Neural Networks," July, 2023). The extra layers
enhance the ability of the model to comprehend problems thoroughly and to deliver good
solutions for intricate projects. Compared with a conventional Artificial Neural Network
(ANN), a DNN has more layers with additional complexity in

each of the layers. The increased depth allows the model to process inputs more
comprehensively. Applications of deep neural networks include object detection, language
translation tasks with BERT (Bidirectional Encoder Representations from Transformers) models,
and image processing projects with transfer learning models such as VGG-19, ResNet-50,
EfficientNet, and other similar networks.

Figure 6: The architecture of Deep Neural Network (DNN)

2.4 Related work with Deep Learning Approaches used for detection of Parkinson’s Disease

2.4.1 Convolutional Neural Network

A Convolutional Neural Network (CNN) can capture complex patterns through multiple layers of
convolutional and pooling operations. For instance, speech signals consist of distinct
frequency bands and time segments, and a CNN can learn these features automatically from the
input data. Moreover, speech signals can be represented as spectrograms, which are visual
representations of the signal over time and frequency. CNNs can analyze spectrogram data
directly, which makes them well suited to speech signal analysis.
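To make the spectrogram representation concrete, the following sketch computes a magnitude spectrogram with a short-time Fourier transform written in plain Python. The frame size, hop length, Hann window, and the synthetic 1000 Hz tone standing in for a voice recording are all assumptions for illustration; practical pipelines would normally rely on an optimized signal-processing library.

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        # Naive discrete Fourier transform, keeping the non-negative frequencies.
        magnitudes = [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                              for n in range(frame_size)))
                      for k in range(frame_size // 2 + 1)]
        frames.append(magnitudes)
    return frames

# Synthetic stand-in for a voice recording: a 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = spectrogram(tone)

# Frequency (Hz) of the strongest bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

The resulting two-dimensional array (time frames by frequency bins) is the kind of image-like input that a CNN can process.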

According to the study of Goyal, Khandnor, and Aseri (2021), a hybrid approach combining
resonance-based and time-frequency-based signal analysis was implemented with a CNN as the
classifier. The CNN can capture the underlying patterns at each decomposition level of the
speech signals. Two datasets were used: dataset D1, which consists of 573 single-channel
recordings (190 patients with Parkinson’s disease and 383 healthy controls), and dataset D2,
which consists of 374 double-channel recordings. After the speech recordings are collected
from the subjects, noise reduction is performed and the crucial parts of the audio are
manually segmented. The RSSD technique is used to extract the resonance-based components of
the signals, and the signal is then decomposed into high-resonating components through TQWT.
All the selected features are converted into Power Spectral Density (PSD) images to train the
CNN classifier. The study reported that the CNN classifier outperformed other classification
techniques such as K-Nearest Neighbour (KNN) and Support Vector Machine (SVM). The CNN had an
accuracy of 98.12%, a recall of 0.97, a precision of 0.96, an F1-score of 0.97, and a G-Mean
of 0.98. In contrast, KNN had an accuracy of only 60.90%, a recall of 0.70, a precision and
F1-score of 0.41, and a G-Mean of 0.54. SVM had an accuracy of 97.67%, which is slightly
lower than that of the CNN, a recall of 0.94, a precision of 0.98, and an F1-score and G-Mean
of 0.96. The main limitation of the proposed hybrid approach with the CNN classifier is the
small sample size, as a large sample is necessary for generalization in clinical use.
Moreover, the classification of Parkinson’s disease should not be limited to binary
classification, which determines only whether a person has the disease. The detection
approach should also identify the stage of Parkinson’s disease according to the severity of
the voice disorder.
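The evaluation metrics quoted in these studies can all be derived from the four entries of a binary confusion matrix. The sketch below assumes the usual definitions, with Parkinson’s disease as the positive class and G-Mean taken as the geometric mean of recall and specificity; the counts are hypothetical and are not taken from any cited study.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Binary metrics where the positive class is Parkinson's disease."""
    recall = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }

# Hypothetical confusion-matrix counts for illustration only.
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
```

Because accuracy alone can look favourable on imbalanced data, reporting recall, precision, F1-score, and G-Mean together gives a fuller picture of a classifier.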

Changqin Quan et al. (2022) proposed an end-to-end deep learning model for the detection of
Parkinson’s disease from speech signals. Time series dynamic

features are extracted using a time-distributed two-dimensional convolutional neural network
(2D-CNN), and then a one-dimensional CNN (1D-CNN) is used to find the dependencies among
them. Two databases were used in this study. The first database was obtained at the GYENNO
SCIENCE Parkinson’s Disease Research Center and consists of 15 healthy controls and 30
patients with Parkinson’s disease. Its speech tasks are the sustained vowel “a” and the
reading of a short sentence. The second database is PC-GITA, which was collected from 100
people (50 healthy controls and 50 patients with Parkinson’s disease). Its speech tasks are
three repetitions of the sustained Spanish vowels “a” and “u”, reading of different words,
reading of simple sentences, and reading of complex sentences. The analysis of speech signals
is based on phonation (vowel task), articulation (vowel and word tasks), and prosody (simple
and complex sentence tasks). The ratio of the training (validation) set to the testing set is
6:4 for the first database and 8:2 for the second database. On the sustained vowel “a” task
of the first database, the proposed method (time-distributed 2D-CNNs and 1D-CNN) was compared
with other end-to-end deep learning models such as MLP, FCN, ResNet, Time-CNN, Encoder, and
CNN-LSTM, and had the highest accuracy of 81.56%, with an F-score of 87.66%, a specificity of
98.33%, a sensitivity of 79.17%, and an MCC of 0.5847. For the short sentence reading task of
the first database, the proposed model achieved the highest accuracy of 75.33%, with an
F-score of 83.62%, a specificity of 93.33%, a sensitivity of 76.71%, and an MCC of 0.3782.
For the second database, the simple and complex sentence reading tasks reached the highest
accuracy of 92%. For the simple sentence task, the F-score is 91.04%, the specificity is 94%,
the sensitivity is 91.21%, and the MCC is 0.8560. For the complex sentence task, the F-score
is 92.68%, the specificity is 97%, the sensitivity is 78.40%, and the MCC is 0.7234. Although
the proposed model can maintain the interpretability of the data analysis

by ranking the importance of input features in the prediction of Parkinson’s disease, the dataset

is not large enough to train a robust deep-learning model.
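Sensitivity, specificity, and the Matthews correlation coefficient (MCC) reported above follow from the confusion matrix in the standard way. The sketch below assumes those standard definitions; the counts describe a hypothetical imbalanced test set of 30 patients and 15 controls and are not results from the cited study.

```python
import math

def sensitivity_specificity_mcc(tp, fp, tn, fn):
    """Return sensitivity, specificity, and MCC for a binary classifier."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, mcc

# Hypothetical counts: 30 patients (24 detected) and 15 controls (12 cleared).
sensitivity, specificity, mcc = sensitivity_specificity_mcc(tp=24, fp=3, tn=12, fn=6)
```

MCC ranges from -1 to 1 and uses all four confusion-matrix entries, which is why it can stay moderate even when sensitivity and specificity both look high on a small, imbalanced dataset.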

Figure 7: Spectrogram of speech signals.

Moreover, in the study of Fang, Gong, Zhang, Sui, and Li (2021), a 6-layer Convolutional
Neural Network was applied to the classification of Parkinson’s disease and compared with
other deep learning models such as Long Short-Term Memory (LSTM) and end-to-end systems. The
speech data were obtained from 34 patients with Parkinson’s disease and 34 healthy controls
and were recorded via different phones and DV. The task performed by the subjects is text
reading, and Mel-frequency cepstral coefficients (MFCCs) are the speech features used for
analysis. The MFCC matrices are used directly, without manual feature extraction, to avoid
the loss of time-variant information. The 6-layer CNN showed the best performance, with an
AUC of 0.984 and an accuracy of 0.938 (93.8%). For further improvement, more specifically
designed neural networks are required. Apart from that, the recording apparatus should be
standardized to obtain consistent speech quality from the subjects.

According to Narendra et al. (2021), raw speech signals without pre-processing are fed into a
CNN and classified using a multi-layer perceptron (MLP). The CNN extracts relevant
information and passes it to the MLP for estimation. The time-domain waveforms are computed
with three techniques: two glottal inverse filtering (GIF) methods, namely IAIF and QCP
analysis, and the ZFF method. The speech signals are obtained from the PC-GITA database, with
a total of 50 patients with Parkinson’s disease and 50 healthy controls. The speech tasks
performed by the subjects are sustained phonation, reading words aloud, diadochokinetic
exercises, reading sentences aloud, reading a text, and giving a monologue. The speech
signals are analysed using baseline features (articulation, phonation, and prosody features)
along with glottal features. The highest accuracy, 68.56%, was obtained with the time-domain
waveforms processed using QCP, with a sensitivity of 63.40% and a specificity of 73.73%. To
improve the accuracy of the approach, larger amounts of training data are crucial.

Furthermore, Convolutional Neural Networks (CNN) have also been applied in conjunction with
Machine Learning (ML) methods. In the study of Celik and Başaran (2023), the input speech
signal is first passed through the SkipConNet module to extract the important feature
vectors, and Random Forest (RF), an ML method, then produces the final results. The first
dataset was collected from the University of Colorado National Center for Voice and Speech
and the University of Oxford and consists of 31 individuals (8 healthy controls and 23
Parkinson’s patients). The second dataset was collected at the Istanbul University Cerrahpasa
Faculty of Medicine, Department of Neurology, and consists of 252 individuals (64 healthy
controls and 188 Parkinson’s patients). The speech features used for analysis included
baseline features, time-frequency features, MFCCs, wavelet transform-based features, vocal
fold features, and TQWT features. The proposed approach (SkipConNet + RF) reached the highest
accuracy of 99.11%, a precision of 0.99, a recall of 0.99, an F1-score of 0.99, a specificity
of 98.77%, and an AUC of 98.77% for the first database. For the second

database, the proposed approach (SkipConNet + RF) reached the highest accuracy of 98.30%, a
precision of 0.99, a recall of 0.96, an F1-score of 0.97, a specificity of 95.83%, and an AUC
of 95.83%. The second database was also used in a similar study by Gunduz (2019), in which a
CNN was used for the detection of Parkinson’s disease through vocal signals. That study
achieved an accuracy of only 86.90% and an F1-score of 0.917, which is lower than the
proposed approach (SkipConNet + RF).

In the research of Escobar-Grisales, Ríos-Urrego, and Orozco-Arroyave (2023), 1D CNN and 2D
CNN models are used along with pre-trained models such as Wav2Vec2.0, BERT, and BETO to
distinguish between healthy people and patients with Parkinson’s disease. The 2D CNN
performed better in terms of accuracy (84.4% versus 72.6% for the 1D CNN), sensitivity (81.3
versus 53.8), and F1-score (84.3 versus 68.0), whereas the 1D CNN had the higher specificity
(92.5 versus 87.6).

The study of Costantini et al. (2023) focuses on the use of Artificial Intelligence (AI) for
the voice assessment of patients with Parkinson’s disease. The aim is to distinguish healthy
individuals, early untreated Parkinson’s disease patients, and mid-advanced Parkinson’s
disease patients treated with levodopa through the analysis of their voices. Data were
collected from 266 healthy controls and 160 Parkinson’s patients (72 newly diagnosed and 88
with medium-to-advanced impairment). A Convolutional Neural Network (CNN) is the deep
learning approach used for the diagnosis of Parkinson’s disease in the study. The results are
compared with traditional Machine Learning (ML) models such as KNN, SVM, and Naïve Bayes,
using augmented Mel-spectrograms as the speech features. The study found that the traditional
ML models, such as KNN and SVM, provided a higher

18
performance in three out of four binary classification tasks, with a mean accuracy of 81.75%

compared to 69.75% reached by the CNN. However, the CNN approach showed a slight

advantage for the multiclass tasks, with a 61% mean accuracy versus the 59.5% reported by

the traditional ML methods for the three classes. In summary, the traditional Machine Learning

(ML) approach demonstrated higher performance in most binary classification tasks, while the

CNN approach showed a slight advantage in multiclass tasks. To further improve the accuracy

of CNN, the sample size needs to be increased or a larger dataset should be used for more

reliable and accurate results. Moreover, additional speech tasks need to be carried out to have

an optimal result.

Khaskhoussy and Ayed (2023) evaluated a Support Vector Machine (SVM) and a Convolutional Neural Network (CNN) for classifying data from speech tasks. Two types of input data were used: raw speech signal values and i-vector features of different dimensions. Three systems were built for the diagnosis of Parkinson’s disease through voice analysis. The first system uses a CNN for deep feature extraction from raw speech signals and an MLP for classification. The second system classifies the deep features obtained by the CNN with SVMs employing different kernels. The third system uses i-vectors obtained from Mel-frequency cepstral coefficients (MFCC) as features and classifies them using SVMs with different kernels. The systems were evaluated both with and without cross-validation. The results show that combining CNN features with an SVM performs well in the detection of Parkinson’s disease, with the i-vectors of dimension 200 giving the best accuracies and F-scores in discriminating between Parkinson’s disease patients and healthy controls. The study also compares the proposed systems with related works on PD detection through speech analysis and reports that they outperform existing approaches, achieving an accuracy of 100%. As future work, other deep learning techniques, more extensive experimentation with larger datasets, and additional types of features could improve the accuracy of diagnosis.

Guatelli et al. (2023) used spectrograms of voice recordings together with Extreme Learning Machine (ELM) random-weight neural networks for the detection of Parkinson’s disease. The study emphasizes the potential for non-invasive and early diagnosis of the disease from voice alterations caused by muscle stiffness in patients. The database was collected from 55 patients with Parkinson’s disease and 64 healthy controls. The experiments covered different types of spectrograms, data augmentation, and the application of CNN and ELM for classification, and the two classifiers were compared in terms of accuracy, training time, sensitivity, and specificity. Because the dataset is small, transfer learning was applied to a pre-trained CNN, and the ELM used the features extracted by the CNN. The CNN reached a slightly higher accuracy but required a longer training time and more resources. Overall the two were comparable, with accuracies of 83.91% for CNN and 81.74% for ELM.

In summary, the study highlights the possibility of using spectrograms of voice recordings and ELM for the early and non-invasive detection of Parkinson’s disease. ELM reached an accuracy comparable to that of CNN with a reduced training time, which makes it a viable option for the diagnosis of Parkinson’s disease. In terms of pitfalls, the datasets contain very few patients, which makes it difficult to train deep learning models from scratch, and the study also points out the limitations of traditional CNN data augmentation techniques. The results also showed that increasing the number of samples led to better performance, with the experiment using color spectrograms of sound fragments giving the best result. The study used 10-fold cross-validation to ensure the objectivity of the experimental results.

Yao, Chi, and Khishe (2022) studied the application of deep convolutional neural networks (DCNN) to the diagnosis of pathological speech related to Parkinson's disease and cleft lip and palate. The best DCNN architecture is selected automatically using the whale optimization algorithm (WOA). The study highlights the use of unsupervised representation learning and the challenges of selecting the best structure for a DCNN. The dataset is drawn from the CIEMPIESS corpus, which has a total of 16717 sound recordings, from which 700 utterances were taken. The proposed IPWOA model achieved the highest accuracy of 95.77% and a precision of up to 96.33%. It outperformed hand-crafted models and other well-known algorithms in classifying disordered speech signals, which demonstrates its effectiveness in the automatic selection of DCNN architectures. As for constraints, the complexity of DCNN structures can make learning from the datasets challenging, and selecting the best structure remains difficult because of the high complexity of these techniques.

2.4.2 Transformer

Since the detection of Parkinson’s disease often involves the analysis of time-series data such as speech signals, transformer-based deep learning models are well suited to processing this sequential data. Such models can capture long-range dependencies and temporal relationships, which allows them to model the dynamic nature of Parkinson’s symptoms effectively. In the study of Nijhawan et al. (2023), a transformer-based method was applied to the detection of Parkinson’s disease using dysphonia measures retrieved from the subjects’ voice recordings. The dataset is sourced

from the UCI Machine Learning Repository and it comprises patient voice features presented

in a comma-separated values (CSV) sheet format. The dataset consists of records from 188

Parkinson’s patients (107 men, 81 women) aged 33 to 87, while the healthy group comprises

64 individuals (23 men, 41 women). The vocal features used for Parkinson’s disease classification include wavelet transform-based features, baseline features, vocal fold features, TQWT features, and MFCC features. Although the dataset contains 753 unique vocal features with balanced gender representation, it is limited by the imbalance between Parkinson's and non-Parkinson's records. Using a stratified k-fold strategy, the dataset is partitioned into ten folds of training and testing sets. The model's functionality incorporates

mechanisms such as feature selection within a trainable neural network (NN) model and is

categorized into three major blocks that are made up of feature embedder, transformer block,

and MLP (Multilayer Perceptron) head. The feature selection step is crucial for minimizing the

complexity of the transformer-based model and enhancing the overall accuracy. XGBoost, a Gradient Boosting Decision Trees (GBDT) framework, is chosen for feature selection because of the dataset's complexity and high feature count. XGBoost ranks the features by importance score according to their influence on the Parkinson's disease classification, and the top-ranked features are used to train the proposed network. Support vector classifier (SVC) feature scores and permutation feature scores are also used for comparison. The proposed

vocal tab transformer network comprises a feature embedder block, transforming each feature

into N-dimensional embeddings. These features are further processed in a transformer-based

encoder block, converting each input vector into a highly contextualized vector representation.

The representation vector is utilized by the MLP head for predicting Parkinson’s disease

classification. Performance comparison metrics include ROC-AUC, precision, and recall

scores. The proposed approach, utilizing dysphonia measures, outperforms the current state-

of-the-art GBDT-based solution by at least 1% in AUC score, accompanied by improvements

in precision and recall scores. The average ROC-AUC for the proposed network is 0.8574.
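The stratified k-fold strategy mentioned above keeps the class ratio roughly constant in every fold, which matters for an imbalanced dataset such as this one. The following is a minimal sketch of the idea in plain Python, not the procedure of Nijhawan et al. (2023); the function name `stratified_kfold`, the label encoding, and the simplification of one sample per subject are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Class counts follow the subject counts quoted above (188 PD, 64 healthy).
# Assumed encoding: 1 = Parkinson's disease, 0 = healthy control.
labels = [1] * 188 + [0] * 64
splits = list(stratified_kfold(labels, k=10))
for fold_no, (train_idx, test_idx) in enumerate(splits):
    n_pd = sum(labels[i] for i in test_idx)
    print(f"fold {fold_no}: {n_pd} PD, {len(test_idx) - n_pd} healthy in test set")
```

With these counts each test fold holds 18 or 19 Parkinson's samples and 6 or 7 healthy samples, so no fold is left without healthy controls.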

2.4.3 VGG19

VGG19 is a well-known architecture for image classification, so it is not typically applied directly to non-image data such as speech signals. However, Bhatele (2020) employed a VGG19 deep transfer learning architecture to detect Parkinson’s disease from MRI scans of patients. The dataset is from the Parkinson’s Progression Markers Initiative (PPMI) databases and consists of 50 patients with Parkinson’s disease (20 females and 30 males) and 50 healthy controls (26 females and 24 males). First, the MRI scans are converted from DICOM (Digital Imaging and Communications in Medicine) format into PNG (Portable Network Graphics) format. VGG19 is then trained with its last three layers modified for the classification of Parkinson’s disease. The 19 layers are divided into 5 blocks, which are joined by 5 max pooling operations. The input size is kept at 224 × 224 × 3 and the filter size at 3 × 3 in each layer to limit the number of trainable parameters. Lastly, three layers (flatten, dropout, and dense) are added to the end of the architecture. The proposed VGG19 approach achieved an accuracy of 90%, a precision of 80%, a sensitivity of 83%, and an F1 score of 79%. In comparison, a similar study by Sivaranjini and Sujatha (2019) using the AlexNet model reported a lower accuracy of 88.9% but a higher sensitivity of 89.3%. This suggests that the VGG19 approach of Bhatele (2020) offers better overall accuracy on this classification problem, although not better sensitivity.
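The effect of the five pooling stages on a 224 × 224 × 3 input can be checked with a little arithmetic. The sketch below assumes the standard VGG19 configuration (16 convolutional layers in five blocks of 2, 2, 4, 4, and 4 layers, followed by 3 fully connected layers), which may differ from the modified network of Bhatele (2020); `trace_vgg19` is a hypothetical helper name.

```python
# (number of 3x3 conv layers, output channels) for each of the five VGG19 blocks
VGG19_BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]


def trace_vgg19(height=224, width=224, channels=3):
    """Return the output shape after each block and the total conv parameter count."""
    shapes = []
    params = 0
    for n_convs, out_channels in VGG19_BLOCKS:
        for _ in range(n_convs):
            # 3x3 kernel weights plus one bias per filter; 'same' padding
            # leaves the height and width unchanged.
            params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
        # 2x2 max pooling with stride 2 halves each spatial dimension.
        height, width = height // 2, width // 2
        shapes.append((height, width, channels))
    return shapes, params


shapes, conv_params = trace_vgg19()
for block_no, shape in enumerate(shapes, start=1):
    print(f"after block {block_no}: {shape}")
print("convolutional parameters:", conv_params)
```

Under these assumptions the final feature map is 7 × 7 × 512, so a flatten layer placed after the last block would output 25,088 values to the dropout and dense layers.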

2.4.4 Deep Neural Network (DNN)

As explained by Rahman, Hasan, Sarkar, and Khan (2023), Machine Learning (ML)

and deep learning (DL) are the two approaches used for the prediction and categorization of

healthy controls and patients with Parkinson’s disease at an early stage using speech signals.

The dataset used is from the UCI repository for Parkinson’s disease. Rahman et al. (2023)

highlighted the prevalence of speech problems in patients with Parkinson’s disease and the

importance of early diagnosis for improving patient outcomes. Several Machine Learning (ML) models, namely Extreme Gradient Boosting, AdaBoost, Random Forest, and Support Vector Machine, were compared with deep neural network (DNN) models. The Extreme Gradient Boosting classifier achieved the highest classification accuracy of 92.18% among the ML classifiers, while the three-layer DNN (DNN2) achieved the best accuracy of 95.41% among the DL techniques, so the DNN outperformed the ML classifiers in this study. The evaluation metrics include accuracy, precision, recall, F1 score, and AUC to quantify the performance of the classifiers. As further improvements, a larger dataset should be used along with cutting-edge deep learning techniques, and issues such as the volume of medical data, resource efficiency, security, and privacy need to be managed to make ML-based and DL-based solutions more useful.

Bhatt, Jayanthi, and Kumar (2023) proposed a high-resolution superlet transform-based technique for the detection of Parkinson’s disease from speech signals. The Superlet Transform (SLT) converts the speech signals into 2D spectrograms, which a Deep Neural Network (DNN) then classifies. The speech tasks used for evaluation are sustained vowels, modulated vowels, DDK analysis, and isolated words for the PC-GITA dataset, and vowels for the ItalianPVS dataset. The results show that the proposed method outperforms existing techniques for the detection of Parkinson’s disease, with an overall accuracy of 92% on the PC-GITA dataset and 96% on the ItalianPVS dataset. VGG16 with SLT performs best on the PC-GITA dataset compared with other DNN models such as InceptionResNetV2 and ResNet50v2. The proposed framework provides a non-intrusive, low-cost, and remote method for the detection of Parkinson’s disease that leverages the non-motor symptoms of PD evident in speech signals, and it shows promise for accurate and early detection.

2.4.6 Others

Zahid et al. (2020) developed a computer-aided diagnostic system for Parkinson's disease using speech signals. The research emphasizes the use of spectrogram-based deep features and acoustic features for the detection of Parkinson's disease. Three methods were investigated: a transfer learning-based approach, deep feature extraction followed by machine learning classifiers, and simple acoustic feature evaluation using machine learning classifiers. The dataset is the Spanish PC-GITA dataset, with 50 Parkinson’s disease patients and 50 healthy controls. The speech features used are articulation, prosody, and phonation. The deep feature extraction method outperformed the other approaches with an accuracy of 99.3%, showing that deep features are suitable for distinguishing healthy people from patients with Parkinson’s disease. In summary, the approach shows potential for accurate and early detection of Parkinson's disease based on speech impairments.

As mentioned by Pragadeeswaran and Kannimuthu (2024), an automated diagnosis

framework for the identification of Parkinson’s disease from speech signals was developed.

The proposed framework consists of a pre-processing step for the speech signal using the

Determinate Haar Wavelet (DHW) transformation technique, a feature extraction step using

the Statistical Time Frequency Renyi (STFR) model, a feature optimization step using the

Adaptive Intelligent Polar Bear (AIPB) optimization algorithm, and the use of Quantized

Contempo Neural Network (QCNN) algorithm for the detection of Parkinson’s disease.

Different speech signal datasets were used, and the proposed framework was compared with existing Machine Learning (ML) and Deep Learning (DL) techniques for the diagnosis of Parkinson’s disease. The AIPB + QCNN model outperformed other ML techniques such as Decision Trees (DT), Multilayer Perceptron (MLP), K-Nearest Neighbors (KNN), Gaussian Naive Bayes (GNB), and Support Vector Machine (SVM) in terms of accuracy, sensitivity, specificity, and F1-score, achieving an accuracy of 98.5%, a sensitivity of 98.7%, a specificity of 98.9%, and an F1-score of 98.8%. The detection accuracy and ROC characteristics also improved compared to existing approaches. As a further improvement, the framework could be extended to classify the different stages of Parkinson’s disease. The drawbacks of the proposed framework include the possibility of overfitting due to the increased dimensionality, as well as the increased system complexity and computational expense.

C. Quan, Ren, and Luo (2021) focused on the detection of Parkinson's Disease (PD) using dynamic speech features. To capture the time-series characteristics of speech signals, the authors proposed a framework that combines Bidirectional long short-term memory (LSTM) models with dynamic articulation transition features. The dataset was collected from 45 subjects (15 healthy controls and 30 Parkinson’s patients) who were volunteers at the GYENNO SCIENCE Parkinson Disease Research Center. The research considered both motor and non-motor symptoms of Parkinson’s disease and focused on the speech disturbances that are common among patients. The study found that dynamic speech features, particularly articulation transition characteristics, differ significantly between healthy control (HC) speakers and Parkinson’s disease patients. Using 10-fold cross-validation, the proposed method was compared with traditional Machine Learning (ML) models using static features and with end-to-end Deep Learning using CNN models. The Bidirectional LSTM model using dynamic speech features achieved an accuracy of 84.29%, a substantial improvement over the traditional ML models. The study also compared Deep Learning (DL) models with different speech inputs, showing that both CNN and Bidirectional LSTM improved classification accuracy significantly, with the Bidirectional LSTM model achieving the highest accuracy. The pitfalls of the study include the potential bias in performance evaluation due to the use of leave-one-out cross-validation and the need to explore more complex network architectures. Additionally, the study did not directly compare its results with other related studies, which makes it harder to judge the generalizability of the findings.

Next, Narasimha Rao and Meher (2024) presented a novel automated model for the

diagnosis of multiple diseases using speech or voice signals. The proposed model, termed

ORG-RGRU, implements a three-stage classification framework using optimized ResNet,

GoogleNet, Radial Basis Function (RBF), and Gated Recurrent Unit (GRU). Dataset 1 was

collected from the Kaggle website, dataset 2 was obtained from GitHub, and Dataset 3 was

obtained through the given link. Firstly, the signals will be decomposed using Empirical

Wavelet Transform (EWT) and then fed into the classification models. The model aims to

address the limitations of existing approaches and improve the accuracy and efficiency of

disease diagnosis. Various datasets were used for validation and a comparison was made

between the proposed model with the existing conventional approaches. The heuristic

optimization algorithm (ACP-AVOA) is used to optimize the hyperparameters of the deep

learning models. The results showed that the developed model achieved high accuracy and

Matthews Correlation Coefficient (MCC) in diagnosing multiple diseases using speech or

voice signals. The proposed model has three classification frameworks: STFT + deep features 1 + ResNet and GoogleNet, STFT + weighted features 2 + ORGRU, and STFT + deep features 3 + ORGRU. Different feature extraction and classification techniques are

used for different frameworks. The performance metrics of the proposed model for diagnosing

multiple diseases using speech or voice signals include accuracy, F1-score, false negative rate

(FNR), false positive rate (FPR), Matthews Correlation Coefficient (MCC), precision,

sensitivity, and specificity. The proposed ORG-RGRU model achieved high accuracy values

of 95.75% for dataset 1, 95.72% for dataset 2, and 95.74% for dataset 3. In future work, efforts

will be made to solve the computational burden of multiple disease diagnosis using the

developed model.

In the study of (Ma et al., 2021), a novel deep dual-side learning ensemble model has

been proposed for the speech recognition task of Parkinson’s disease. The preliminary focus is

on the early diagnosis of Parkinson’s disease through Machine Learning (ML)-based speech

data analysis. The proposed model combined deep feature learning and deep sample learning

for accuracy enhancement. An embedded stack group sparse autoencoder for deep feature

learning and an iterative mean clustering algorithm were introduced. The model achieved a

high accuracy rate of 98.4% and 99.6% on two representative PD speech datasets,

outperforming existing algorithms. The effectiveness of deep dual-side learning for

Parkinson’s disease speech recognition has been demonstrated in this study. The paper emphasizes the importance of early diagnosis of Parkinson’s disease given its increasing prevalence, the significance of non-invasive and efficient detection methods, and the potential of speech data analysis for PD diagnosis and rehabilitation assessment. The proposed algorithm is validated on two public Parkinson’s disease speech datasets. Dataset 1 is the LSVT_voice_rehabilitation dataset, which consists of 14 subjects with Parkinson’s disease, each with 9 speech samples of different pronunciation tasks. Dataset 2 is the Sakar dataset, which contains 40 subjects, of whom 20 (6 women and 14 men) are PD patients. Compared with existing relevant algorithms, the proposed algorithm performs better in terms of accuracy, sensitivity, and specificity. The limitation of this study is the small sample size, which restricts the number of layers in the deep sample space. Furthermore, the comparison with control groups indicates that the number of samples is small and the deep features may be too large, leading to potential issues such as redundancy.

Er et al. (2021) proposed a new approach for detecting Parkinson’s disease (PD) from speech signals, based on pre-trained deep networks and Long Short-Term Memory (LSTM) applied to mel-spectrograms obtained from speech signals denoised with Variational Mode Decomposition (VMD). The study used the PC-GITA dataset, which consists of speech recordings from 50 patients with Parkinson’s disease and 50 healthy controls. The proposed method involves several steps: noise removal using VMD, extraction of mel-spectrograms, feature extraction using pre-trained deep networks (ResNet-18, ResNet-50, and ResNet-101), and classification using an LSTM that models the sequential information in the extracted features. The proposed method achieved high accuracy rates, ranging from 94.26% to 98.61% depending on the model architecture, batch size, and learning rate, and outperformed other methods in the literature. The findings suggest that it has the potential to improve the accuracy of PD diagnosis using speech signals. In terms of drawbacks, the PC-GITA dataset may not fully represent the diversity of speech signals in different populations. Additionally, the study relies only on speech signals and deep learning approaches, which may not account for other potential factors or biomarkers that could contribute to the diagnosis of the disease.
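Several of the reviewed studies feed mel-spectrograms to the network. As a rough illustration of what the mel scale does, the sketch below converts between hertz and mel and places filter-band edges at equal mel spacing. It assumes the common 2595 * log10(1 + f / 700) convention; the cited studies do not state which variant they used, and the function names and the 0 to 8000 Hz range are illustrative assumptions.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in hertz to mel (2595 * log10(1 + f / 700) convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min=0.0, f_max=8000.0):
    """Return n_bands + 2 edge frequencies (Hz) equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Assumed settings for illustration: 10 bands between 0 Hz and 8000 Hz.
edges = mel_band_edges(n_bands=10)
print([round(edge, 1) for edge in edges])
```

The edges are packed closely at low frequencies and spread out at high frequencies, which is why a mel-spectrogram devotes more resolution to the lower part of the speech spectrum.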

2.4.7 Dataset used

The reviewed studies used several datasets, including open-access databases such as the speech dataset from the UCI Machine Learning Repository, PC-GITA, the GYENNO SCIENCE Parkinson’s Disease Research Center dataset, the ItalianPVS dataset, datasets from the Kaggle and GitHub websites, and the CIEMPIESS corpus, as well as some self-collected databases. The speech dataset employed in

this study to investigate Parkinson's disease comprises a diverse assortment of audio recordings

from individuals with Parkinson's and a control group of healthy subjects. The dataset was

acquired from a reputable repository specializing in medical and healthcare datasets, ensuring

its reliability and alignment with the research objectives. Each participant contributed speech

samples spanning various linguistic tasks and vocal exercises. To uphold ethical standards, the

dataset underwent anonymization, removing any personally identifiable information. Pre-

processing steps involved extracting acoustic features and normalizing speech signals to

alleviate variations in recording conditions. Additionally, efforts were undertaken to address

potential biases and confounding factors in the dataset. The dataset's inclusivity of detailed

demographic information and clinical assessments facilitates a thorough analysis of speech

characteristics associated with Parkinson's disease. Its availability and adherence to ethical

guidelines make it a valuable resource for advancing our comprehension of speech-related

markers in the context of Parkinson's disease.

2.5 Summary

Parkinson's disease (PD) is a neurodegenerative condition marked by both motor and

non-motor symptoms, emphasizing the critical importance of early and precise diagnosis for

effective management. According to the literature review, current conventional approaches are unable to detect Parkinson’s disease during the early stages of its progression. Recent investigations have explored the capabilities of deep learning models to automate the classification of Parkinson's disease, utilizing various data modalities such as speech signals. Therefore, an intelligent system that can detect Parkinson’s disease through early signs or symptoms is vital for enabling earlier treatment to alleviate the condition. The deep learning models reviewed include the Convolutional Neural Network (CNN), Transformer, VGG19, and Deep Neural Network (DNN). Firstly, CNN has

demonstrated its effectiveness in handling speech signals across various applications, such as

speech recognition. These networks excel at capturing local patterns and temporal

dependencies within audio spectrograms. Through the utilization of convolutional layers, CNN

can autonomously learn and extract pertinent features from diverse frequency bands in speech

signals, making them well-suited for tasks where the hierarchical representation of audio

features is pivotal. Next, transformer models have proven their adaptability to tasks related to

speech. The self-attention mechanism inherent in Transformers facilitates the capture of long-

range dependencies in speech signals, thereby enhancing the modeling of context and

relationships among different segments of the audio sequence. Moreover, VGG19 had been

successfully applied to the task of detection of Parkinson’s disease using MRI scan images. In

medical imaging, particularly with MRI scans, VGG19's deep layer structure proves

advantageous in extracting hierarchical features from the intricate details present in medical

images. VGG19 has demonstrated its ability to capture and interpret complex patterns and

structures within medical imagery. The strength of VGG19 lies in its capacity to discern

intricate patterns and subtle details within MRI scans, contributing to the accurate detection of

anomalies or specific medical conditions. However, it's crucial to consider the domain gap

between the original training data of VGG19 and the medical imaging data, necessitating

careful fine-tuning and adaptation strategies for optimal performance. Next, DNN exhibited

versatility in handling both time-domain and frequency-domain representations of speech

signals, making them adaptable to tasks such as speech recognition, speaker identification, and

emotion analysis. DNN architectures, particularly recurrent and convolutional variants, possess

the capability to capture temporal dependencies and spectral features, enabling a more

comprehensive representation of speech signals.

In conclusion, each of these deep learning architectures - CNN, Transformer, VGG19,

and DNN - offers distinct advantages in processing speech signals. Their efficacy is contingent

upon the specific task at hand, be it speech recognition, emotion analysis, or speaker

identification. Researchers often select or adapt these models based on the inherent characteristics of the speech data and the specific requirements of the target applications.

The reviewed articles, the deep learning models they use to detect Parkinson’s disease from speech signals, and their reported accuracy are listed in Table 2.1.

Table 2.1: Summary of deep learning models used for the detection of Parkinson’s disease using speech signals.

| Authors (Year) | Deep Learning Model | Dataset | Selected speech features | Accuracy |
|---|---|---|---|---|
| (Nijhawan et al., 2023) | Transformer-based model: XGBoost of GBDT. | UCI Machine Learning Repository dataset collected at the Dept. of Neurology in the Faculty of Medicine, Istanbul University. It has 188 patients with Parkinson’s disease and 64 healthy controls. | Dysphonia measures | It outperforms the current SOTA GBDT-based solution by at least 1% Area Under the Receiver Operating Characteristic Curve (AUC) score. The average ROC-AUC for the proposed network is 0.8574. |
| (Goyal, Khandnor, & Aseri, 2021) | CNN | Dataset D1 was from the “Mobile Device Voice Recordings at King’s College London” with 16 patients with Parkinson’s disease and 21 healthy controls. Dataset D2 was collected from 20 healthy controls in India, but only 12 were used for further processing due to noise. | Resonance-based features (TQWT: Tunable Q-factor wavelet transform) | Accuracy using the combination features: 100%. Validation accuracy: 99.37%. |
| (Zahid et al., 2020) | AlexNet | PC-GITA Spanish language dataset, consisting of spectrograms of vowels, monologues, and read text. It contains speech recordings of 50 Parkinson’s disease patients and 50 healthy controls. | Phonation (includes jitter, shimmer, first and second derivatives of F0, amplitude perturbation quotient, pitch perturbation quotient, energy), prosody (duration, F0 contour, energy contour), and articulation (MFCCs). | Vowel “a” shows an accuracy of 73.7%, vowel “e” 75%, vowel “i” 81.4%, and vowel “o” 72.6%. The read-text dataset shows an accuracy of 91%. The monologue dataset shows an accuracy of 86.36%. For words, /apto/ achieved an accuracy of 77.2% while /delta/ showed an accuracy of 73.7%. |
| (Changqin Quan et al., 2022) | 2D CNN and 1D CNN. | Database-1 was collected at the GYENNO SCIENCE Parkinson’s Disease Research Center with 15 healthy controls and 30 Parkinson’s disease subjects. Database-2 was from PC-GITA and consisted of speech samples in Spanish from 100 people with 50 healthy controls and 50 Parkinson’s subjects. | Phonation, articulation, and prosody. | For database 1, an accuracy of 81.6% on vowel “a” and 75.3% on reading the sentence in Chinese. For database 2, an accuracy of 92% on reading simple and complex sentences in Spanish. |
| (Rahman et al., 2023) | Deep Neural Network (DNN) | UCI ML Parkinson dataset. The dataset consists of voice measurements from 31 individuals, 23 of whom are Parkinson’s patients. | Average, maximum, and minimum vocal fundamental frequency, variation in amplitude, and the ratio of noise to tonal components in the voice. | Accuracy of 95%. |
| (Fang et al., 2021) | 6-layer CNN | Self-collected Chinese speech corpus with speech samples recorded from 34 patients with Parkinson’s disease and 34 healthy controls via different phones and DVs. | 6-layer CNN and LSTM: uses MFCC input. | Accuracy of 6-layer CNN is 93.8%. |
| (Narendra et al., 2021) | End-to-end approach (CNN + MLP) | PC-GITA database. | Baseline and glottal features (articulation, phonation, and prosody features). | The accuracy of baseline + glottal (IAIF) is 67%, while the accuracy of baseline + glottal (QCP) is 67.93%. |
| (Pragadeeswaran & Kannimuthu, 2024) | AIPB + QCNN | UCI Parkinson speech dataset. | Monophonic speech, short sentences, and dynamic speech features. | AIPB + QCNN reaches a detection accuracy of 98.5%, a sensitivity of 98.6%, and a specificity of 98.5%. |
| (Celik & Başaran, 2023) | SkipConNet + RF based on CNN and random forest (RF). | The first dataset (PDO_Dataset) was created in collaboration with the University of Colorado National Center for Voice and Speech and the University of Oxford. This dataset consists of 195 speech recordings from 31 individuals (8 healthy controls and 23 Parkinson’s patients). The other dataset (PD_Dataset) was collected at the Istanbul University Cerrahpasa Faculty of Medicine, Department of Neurology. It consists of 252 individuals with 64 healthy controls and 188 with Parkinson’s disease. | PDO_Dataset: 23 features are extracted from each sound recording (no mention of exact speech features). PD_Dataset: features are extracted with 6 different signal processing techniques, namely baseline features, time-frequency features, MFCCs, wavelet transform-based features, vocal fold features, and TQWT features. | SkipConNet + RF reaches an accuracy of 99.11% on PDO_Dataset and 98.30% on PD_Dataset. |
| (Bhatt et al., 2023) | SLT and DNN. | PC-GITA dataset and ItalianPVS dataset. | Phonation, articulation, and prosody features. | VGG-16 with SLT achieves an accuracy of 92%, while it achieves 96% accuracy on the ItalianPVS dataset. |
| (C. Quan et al., 2021) | Bidirectional long short-term memory (LSTM) model to capture time-series dynamic features of a speech signal. | The dataset was collected from 45 subjects (15 healthy controls and 30 Parkinson’s patients) who are volunteers at the GYENNO SCIENCE Parkinson Disease Research Center. | Phonation (jitter, temporal perturbation of the fundamental frequency, shimmer, amplitude perturbation quotient, pitch perturbation quotient), and articulation features (vocal formants, vowel space area, vocal pentagon area, formant centralization ratio). | Using articulation features, the bidirectional LSTM reaches an accuracy of 77.36% for monophonic speech /a/ input and 84.29% for short sentence speech input. |
| (Escobar-Grisales et al., 2023) | 1D CNN and 2D CNN. | Data is collected from 165 Colombian Spanish native speakers, 80 of whom suffer from Parkinson’s disease. | Speech features: The participant was asked to describe a regular day in his/her life for approximately 90 days. | The speech modality yielded an accuracy of 88% and outperformed all language representations, including the multi-modal approach. |
| (Costantini et al., 2023) | CNN. | Data is collected from 266 healthy controls and 160 Parkinson’s patients (72 newly diagnosed subjects and 88 subjects with medium-to-advanced impairment). | Mel spectrograms. | The mean accuracy of CNN is 69.75% and 53% for the mid-advanced level. The mean accuracy of CNN for multiclass tasks is 61%. |
| (Khaskhoussy & Ayed, 2023) | CNN + MLP, and CNN + SVM. | Data is collected by Sakar et al. It consists of 1208 voice recordings in WAV format. | Phonetic-prosodic features. | The proposed systems outperform existing approaches, achieving an accuracy of 100%. |
| (Narasimha Rao & Meher, 2024) | ORG-RGRU | Dataset 1: It is collected from the Kaggle website “https://www.kaggle.com/datasets/dipayanbiswas/parkinsons-disease-speech-signal-features”. Dataset 2: It is represented by “https://github.com/Mak-Sim/Troparion/tree/master/SPA2019”. Dataset 3: It is given in the link as “https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1597”. | MFCC, cepstral and spectral features, principal speech features, and pitch features (zero frequency response filter). | The proposed model shows an accuracy of 95%. |
| (Ma et al., 2021) | A deep dual-side learning ensemble model is developed. | Dataset 1: LSVT_voice_rehabilitation dataset. This dataset consists of 14 subjects with Parkinson’s disease, and each subject had 9 speech samples of different pronunciation tasks. Dataset 2: Sakar dataset that contains 40 subjects with 6 women and 14 men as PD patients. | Dataset 1: Continuous vowel sounds. Dataset 2: 26 Turkish pronunciation tasks including continuous vowels, numbers, words, and short sentences. | The accuracy reaches 98.4% and 99.6% for the two respective datasets. |
| (Guatelli et al., 2023) | CNN and hybrid CNN-ELM (Extreme Learning Machine). | The database is obtained from Giuliano. This dataset consists of 55 people suffering from Parkinson’s disease and 64 healthy controls. | Sustained phonation of the letter “a”. | The accuracy of ELM and CNN is similar and reaches a maximum of 91.30%. |
| (Er et al., 2021) | Pre-trained deep networks and long short-term memory (LSTM). | PC-GITA Spanish dataset. The dataset consists of 50 Parkinson’s patients and 50 healthy controls. The recordings include monologues, vowels, and read text. | Mel spectrogram | The highest classification performance, 98.61%, is obtained from the RESNET-101 + LSTM model with VMD. |
| (Yao et al., 2022) | DCNN. | CIEMPIESS corpus dataset with a total of 16717 sound recordings. | Auditory characteristics, such as spectro-temporal sparsity, frequency masking, time masking, pitch shifting, and time stretching. | The accuracy is up to 95.77%. |
| (Gemci & Ibrikci, 2019) | Feed-forward Neural Network (FFNN) | UCI Machine Learning Repository | Jitter variants, shimmer variants, fundamental frequency, baseline feature, harmonicity, recurrent period density entropy (RPDE), pitch period entropy (PPE), intensity parameters, formant frequencies, glottis quotient (GQ), vocal fold features, mel frequency cepstral coefficients (MFCCs), wavelet transform-based features, and tunable Q-factor wavelet transform (TQWT) features. | The results are classified with 100% accuracy using an 80-20% train-test data partition and 30 epochs. |
| (Huseyn, 2020) | Feed-forward Neural Network (FFNN) | UCI Machine Learning Repository | Jitter variants, shimmer variants, fundamental frequency, baseline feature, harmonicity, recurrent period density entropy (RPDE), pitch period entropy (PPE), intensity parameters, formant frequencies, glottis quotient (GQ), vocal fold features, mel frequency cepstral coefficients (MFCCs), wavelet transform-based features, and tunable Q-factor wavelet transform (TQWT) features. | The results are classified with 100% accuracy using an 80-20% train-test data partition and 30 epochs. |
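Most entries in Table 2.1 report accuracy only, while a few also report sensitivity and specificity. As a rough illustration of how the three measures relate, the sketch below computes them from binary labels. The function name `binary_metrics` and the predictions are hypothetical; only the cohort sizes (188 patients and 64 healthy controls) are borrowed from the table, and the "classifier" that labels every subject as a patient is an assumption made purely to show the effect of class imbalance.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels.

    Labels: 1 = Parkinson's disease, 0 = healthy control.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # patients detected
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # patients missed

    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical predictions: every subject is labeled as a patient.
y_true = [1] * 188 + [0] * 64
y_pred = [1] * 252

metrics = binary_metrics(y_true, y_pred)
```

Under these assumptions the accuracy is 188/252 (about 0.746) and the sensitivity is 1.0, yet the specificity is 0.0. Accuracy figures obtained on imbalanced cohorts are therefore easier to interpret when sensitivity and specificity are reported alongside them.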

The Best Introduction to Deep Learning - A Step by Step Guide. (2023, July 21). Retrieved from
https://www.simplilearn.com/tutorials/deep-learning-tutorial/introduction-to-deep-
learning
Bhatele, K. R. (2020). Classification of Neurodegenerative Diseases Based on VGG 19 Deep Transfer
Learning Architecture: A Deep Learning Approach. Bioscience Biotechnology Research
Communications, 13, 1972-1980. doi:10.21786/bbrc/13.4/51
Bhatt, K., Jayanthi, N., & Kumar, M. (2023). High-resolution superlet transform based techniques for
Parkinson's disease detection using speech signal. Applied Acoustics, 214, 109657.
doi:10.1016/j.apacoust.2023.109657
Celik, G., & Başaran, E. (2023). Proposing a new approach based on convolutional neural networks and
random forest for the diagnosis of Parkinson's disease from speech signals. Applied Acoustics,
211, 109476. doi:10.1016/j.apacoust.2023.109476
Chauhan, N. S. (2022, March 15). Transformer Neural Network in Deep Learning: Explained. Retrieved
from https://www.theaidream.com/post/transformer-neural-network-in-deep-learning-explained
Cordella, F., Paffi, A., & Pallotti, A. (2021, June 23-25). Classification-based screening of
Parkinson’s disease patients through voice signal. Paper presented at the 2021 IEEE
International Symposium on Medical Measurements and Applications (MeMeA).
Costantini, G., Cesarini, V., Di Leo, P., Amato, F., Suppa, A., Asci, F., . . . Saggio, G. (2023). Artificial
Intelligence-Based Voice Assessment of Patients with Parkinson's Disease Off and On
Treatment: Machine vs. Deep-Learning Comparison. In Sensors (Basel, Switzerland) (Vol. 23).
Switzerland: MDPI.
Di Pietro, D. A., Olivares, A., Comini, L., Vezzadini, G., Luisa, A., Petrolati, A., . . . Vitacca, M. (2022).
Voice Alterations, Dysarthria, and Respiratory Derangements in Patients With Parkinson's
Disease. J Speech Lang Hear Res, 65(10), 3749-3757. doi:10.1044/2022_jslhr-21-00539
Er, M. B., Isik, E., & Isik, I. (2021). Parkinson’s detection based on combined CNN and LSTM using
enhanced speech signals with Variational mode decomposition. Biomedical Signal Processing
and Control, 70, 103006. doi:10.1016/j.bspc.2021.103006
Escobar-Grisales, D., Ríos-Urrego, C. D., & Orozco-Arroyave, J. R. (2023). Deep Learning and Artificial
Intelligence Applied to Model Speech and Language in Parkinson’s Disease.
Diagnostics, 13(13). doi:10.3390/diagnostics13132163
Fang, H., Gong, C., Zhang, C., Sui, Y., & Li, L. (2021). Parkinsonian Chinese Speech Analysis towards
Automatic Classification of Parkinson's Disease.
Gemci, F., & Ibrikci, T. (2019). Using Deep Learning Algorithm to Diagnose Parkinson Disease
with High Accuracy.
Goyal, J., Khandnor, P., & Aseri, T. C. (2021). A Hybrid Approach for Parkinson’s Disease diagnosis with
Resonance and Time-Frequency based features from Speech signals. Expert Systems with
Applications, 182, 115283. doi:10.1016/j.eswa.2021.115283
Guatelli, R., Aubin, V., Mora, M., Naranjo-Torres, J., & Mora-Olivari, A. (2023). Detection of Parkinson’s
disease based on spectrograms of voice recordings and Extreme Learning Machine random
weight neural networks. Engineering Applications of Artificial Intelligence, 125, 106700.
doi:10.1016/j.engappai.2023.106700
Gunduz, H. (2019). Deep Learning-Based Parkinson’s Disease Classification Using Vocal Feature Sets.
IEEE Access, 7, 115540-115551. doi:10.1109/ACCESS.2019.2936564
Huseyn, E. (2020). Diagnosing Parkinson's Disease using Deep Learning Algorithms.
Introduction to Deep Neural Networks. (2023, July). Retrieved from
https://www.datacamp.com/tutorial/introduction-to-deep-neural-networks
Kaushik, A. (2023). Understanding the VGG19 Architecture. Retrieved from
https://iq.opengenus.org/vgg19-architecture/

Khaskhoussy, R., & Ayed, Y. B. (2023). Improving Parkinson’s disease recognition through voice
analysis using deep learning. Pattern Recognition Letters, 168, 64-70.
doi:10.1016/j.patrec.2023.03.011
Ma, J., Zhang, Y., Li, Y., Zhou, L., Qin, L., Zeng, Y., . . . Lei, Y. (2021). Deep dual-side learning ensemble
model for Parkinson speech recognition. Biomedical Signal Processing and Control, 69, 102849.
doi:10.1016/j.bspc.2021.102849
Mandal, M. (2023, November 16). Introduction to Convolutional Neural Networks (CNN).
Retrieved from https://www.analyticsvidhya.com/blog/2021/05/convolutional-neural-
networks-cnn/
Narasimha Rao, P. V. L., & Meher, S. (2024). ORG-RGRU: An automated diagnosed model for multiple
diseases by heuristically based optimized deep learning using speech/voice signal. Biomedical
Signal Processing and Control, 88, 105493. doi:10.1016/j.bspc.2023.105493
Narendra, N. P., Schuller, B., & Alku, P. (2021). The Detection of Parkinson's Disease From Speech
Using Voice Source Information. IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 29, 1925-1936. doi:10.1109/TASLP.2021.3078364
Nijhawan, R., Kumar, M., Arya, S., Mendirtta, N., Kumar, S., Towfek, S. K., . . . Abdelhamid, A. A. (2023).
A Novel Artificial-Intelligence-Based Approach for Classification of Parkinson’s Disease
Using Complex and Large Vocal Features. Biomimetics, 8(4). doi:10.3390/biomimetics8040351
Parkinson’s Disease. (2023). Retrieved from https://www.aans.org/en/Patients/Neurosurgical-
Conditions-and-Treatments/Parkinsons-Disease
Pragadeeswaran, S., & Kannimuthu, S. (2024). An Adaptive Intelligent Polar Bear (AIPB) Optimization-
Quantized Contempo Neural Network (QCNN) model for Parkinson’s disease diagnosis using
speech dataset. Biomedical Signal Processing and Control, 87, 105467.
doi:10.1016/j.bspc.2023.105467
Quan, C., Ren, K., & Luo, Z. (2021). A Deep Learning Based Method for Parkinson’s Disease Detection
Using Dynamic Features of Speech. IEEE Access, 9, 10239-10252.
doi:10.1109/ACCESS.2021.3051432
Quan, C., Ren, K., Luo, Z., Chen, Z., & Ling, Y. (2022). End-to-end deep learning approach for Parkinson’s
disease detection from speech signals. Biocybernetics and Biomedical Engineering, 42(2), 556-
574. doi:10.1016/j.bbe.2022.04.002
Rahman, S., Hasan, M., Sarkar, A., & Khan, F. (2023). Classification of Parkinson’s Disease using Speech
Signal with Machine Learning and Deep Learning Approaches. European Journal of Electrical
Engineering and Computer Science, 7, 20-27. doi:10.24018/ejece.2023.7.2.488
Semenova, E. I., Partevian, S. A., Shulskaya, M. V., Rudenok, M. M., Lukashevich, M. V., Baranova, N.
M., . . . Alieva, A. K. (2023). Analysis of ADORA2A, MTA1, PTGDS, PTGS2, NSF, and HNMT Gene
Expression Levels in Peripheral Blood of Patients with Early Stages of Parkinson's Disease.
BioMed Research International, 1-8. doi:10.1155/2023/9412776
Sigcha, L., Borzì, L., Amato, F., Rechichi, I., Ramos-Romero, C., Cárdenas, A., . . . Olmo, G. (2023). Deep
learning and wearable sensors for the diagnosis and monitoring of Parkinson’s disease: A
systematic review. Expert Systems with Applications, 229, 120541.
doi:10.1016/j.eswa.2023.120541
Sivaranjini, S., & Sujatha, C. M. (2019). Deep learning based diagnosis of Parkinson’s disease using
convolutional neural network. Multimedia Tools and Applications, 79, 15467-15479.
Váradi, C. (2020). Clinical Features of Parkinson’s Disease: The Evolution of Critical Symptoms. Biology,
9(5). doi:10.3390/biology9050103
Yao, D., Chi, W., & Khishe, M. (2022). Parkinson’s disease and cleft lip and palate of pathological speech
diagnosis using deep convolutional neural networks evolved by IPWOA. Applied Acoustics, 199,
109003. doi:10.1016/j.apacoust.2022.109003
Zahid, L., Maqsood, M., Durrani, M. Y., Bakhtyar, M., Baber, J., Jamal, H., . . . Song, O. Y. (2020). A
Spectrogram-Based Deep Feature Assisted Computer-Aided Diagnostic System for Parkinson’s
Disease. IEEE Access, 8, 35482-35495. doi:10.1109/ACCESS.2020.2974008
