
Received: 22 May 2020 Revised: 17 June 2020 Accepted: 17 June 2020 Published on: 19 August 2020

DOI: 10.1002/sam.11480

RESEARCH ARTICLE

Multiclass machine learning classification of functional brain images for Parkinson's disease stage prediction

Guan-Hua Huang1, Chih-Hsuan Lin1, Yu-Ren Cai1, Tai-Been Chen2, Shih-Yen Hsu3, Nan-Han Lu2,4,5,6, Huei-Yung Chen7, Yi-Chen Wu7

1 Institute of Statistics, National Chiao Tung University, Hsinchu, Taiwan
2 Department of Medical Imaging and Radiological Sciences, I-Shou University, Kaohsiung, Taiwan
3 Department of Information Engineering, I-Shou University, Kaohsiung, Taiwan
4 Department of Pharmacy, Tajen University, Pingtung, Taiwan
5 Department of Radiology, E-Da Hospital, I-Shou University, Kaohsiung, Taiwan
6 School of Medicine, College of Medicine, I-Shou University, Kaohsiung, Taiwan
7 Department of Nuclear Medicine, E-Da Hospital, I-Shou University, Kaohsiung, Taiwan

Correspondence
Guan-Hua Huang, Institute of Statistics, National Chiao Tung University, 1001 University Road, Hsinchu 30010, Taiwan.
Email: ghuang@stat.nctu.edu.tw

Funding information
Ministry of Science and Technology, Taiwan, Grant/Award Numbers: MOST 105-2118-M-009-004-MY2, MOST 107-2118-M-009-005-MY2

Abstract
We analyzed a data set containing functional brain images from 6 healthy controls and 196 individuals with Parkinson's disease (PD), who were divided into five stages according to illness severity. The goal was to predict patients' PD illness stages by using their functional brain images. We employed the following prediction approaches: multivariate statistical methods (linear discriminant analysis, support vector machine, decision tree, and multilayer perceptron [MLP]), ensemble learning models (random forest [RF] and adaptive boosting), and deep convolutional neural network (CNN). For statistical and ensemble models, various feature extraction approaches (principal component analysis [PCA], multilinear PCA, intensity summary statistics [IStat], and Laws' texture energy measure) were employed to extract features, the synthetic minority over-sampling technique was used to address imbalanced data, and the optimal combination of hyperparameters was found using a grid search. For CNN modeling, we applied an image augmentation technique to increase and balance data sizes over different disease stages. We adopted transfer learning to incorporate pretrained VGG16 weights and architecture into the model fitting, and we also tested a state-of-the-art machine learning model that could automatically generate an optimal neural architecture. We found that IStat consistently outperformed other feature extraction approaches. MLP and RF were the analytic approaches with the highest prediction accuracy rate for multivariate statistical and ensemble learning models, respectively. Overall, the deep CNN model with pretrained VGG16 weights and architecture outperformed other approaches; it captured critical features from imaging, effectively distinguished between normal controls and patients with PD, and achieved the highest classification accuracy.

KEYWORDS
deep neural network, functional brain image, machine learning, supervised classification


1 INTRODUCTION

Parkinson's disease (PD) is a degenerative neurological disorder related to striatal dopamine deficiency, with symptoms such as slow movement, muscle stiffness, and shaking.1 PD has a prevalence of 1–2 per 1000 in the general population, but the rate is up to 2% in people aged over 65 years.2 In Taiwan, the prevalence of PD per 100,000 was 84.8 in 2004 and 147.7 in 2011, representing a 7.9% yearly increase.3 PD illness severity can be classified into five stages, which are based on the level of clinical disability.4,5 The type and severity of PD-related symptoms vary between individuals and according to the different stages of the disease.

Because PD is degenerative, early detection, which can limit expenditure and improve a patient's quality of life,6 is critical. The diagnosis of PD is based on clinical criteria, but misdiagnosis is reportedly as high as 25% of cases, according to anatomic-pathologic studies.7 Functional imaging techniques such as positron-emission tomography (PET), single photon-emission computed tomography (SPECT), and functional magnetic resonance imaging can elucidate the pathophysiology and evolution of PD and aid in the differential diagnosis of the disease. PET and SPECT have enabled noninvasive, in vivo visualization of the progression of striatal neuronal function in patients with PD.8 Unlike for PET, no on-site cyclotron or radiochemistry facilities are required for SPECT imaging because of its longer half-life. SPECT studies also benefit from the industrial production of tracers. The lower cost of radiotracer synthesis enables the investigation of more patients through SPECT than through PET.

In clinical practice, SPECT images are typically evaluated visually or through region-of-interest (ROI) analysis.9 ROI techniques involve outlining or positioning the ROI over the striatum (target region) and the occipital cortex (reference region) and computing a quantitative measure termed the background-subtracted striatal uptake ratio.9 An alternate approach is to perform shape and intensity distribution (surface profile) analysis and use pattern recognition techniques for differentiation.10,11

Numerous studies have attempted to design a computer-aided diagnosis system for PD detection. Most of them have focused on the pattern recognition approach. In general, features can be extracted from voxels of the complete brain12,13 or of the striatum.14–16 Feature extraction from voxels of the complete brain is typically followed by dimensional reduction, such as principal component analysis (PCA)15 or singular value decomposition (SVD).13 Feature extraction from voxels of the striatum is typically followed by feature selection16 or the use of the striatal binding ratio14,17 as features. Classification into different PD disease stages is performed after feature selection. Classifiers can be support vector machine (SVM),12,15,16,18 linear or quadratic discriminant analysis,17,18 naïve Bayes,13 and so on. Deep learning, such as a convolutional neural network (CNN) framework, is another suitable method for predicting PD stages.19

Researchers have developed several methods for classifying PD diagnosis. However, these methods can only classify individuals as having PD or being healthy. Methods that enable the classification of multiple PD stages are rare. Therefore, this study investigated the optimal multiclass classification for predicting patients' PD illness stages by using their functional brain images. The data analyzed in this study were derived from SPECT imaging with 99mTc-TRODAT-1 ligands.7,20 We used data from 6 healthy controls and 196 patients with PD (with identification of disease stage from 1 to 5). We developed an analytic system for the multiclass classification of PD stages. This system includes a series of methods for image preprocessing, feature extraction, imbalanced data adjustment, and three types of classification models: multivariate statistical methods, ensemble learning models, and deep CNN.

2 MATERIALS

Patients' imaging and diagnostic reports, collected between March 2006 and August 2013, were extracted from the picture archiving and communication system at E-Da Hospital, I-Shou University, Taiwan. After we excluded those who were ineligible, 202 patients remained for the analysis. Each patient underwent 99mTc-TRODAT-1 SPECT imaging. The image matrix was 128 × 128, and a total of 64 images were captured at a collection rate of 25 s per image. These SPECT images were stored in the Digital Imaging and Communications in Medicine format.

According to the diagnostic reports, 6 patients had healthy brain function (3 men and 3 women, with a median age of 47.5 years) and 196 patients had PD (80 men and 116 women, with a median age of 69.13 years). The physical symptoms of PD were classified into five different stages using the Hoehn and Yahr Scale (HYS),4 with stages I and V indicating the mildest and most severe illness, respectively. The number of patients in stages I to V was 22, 27, 53, 87, and 7, respectively.

More details on the exclusion criteria, imaging instrument, and experimental design can be found in Hsu et al.21 This clinical study was approved by the Medical Ethics Committee of E-Da Hospital. All patients signed written informed consent before participating.

FIGURE 1 Proposed analytic system for multiclass machine learning classification of functional brain images for Parkinson’s disease
stage prediction

3 METHODS

The goal of this study was to predict patients' PD illness stage by using their SPECT functional brain images. All SPECT images were preprocessed before any analysis to remove noise or correct for errors. We used the following machine learning methods for PD classification: multivariate statistical methods (linear discriminant analysis [LDA], SVM, decision tree [DT], and multilayer perceptron [MLP]), ensemble learning models (random forest [RF] and adaptive boosting [AdaBoost]), and deep CNN. For statistical and ensemble models, various feature extraction approaches (PCA, multilinear PCA, intensity summary statistics [IStat], and Laws' texture energy measure [LTEM]) were employed to extract features. The synthetic minority over-sampling technique (SMOTE) was used to address imbalanced data concerns, and the optimal combination of hyperparameters was obtained using a grid search. For CNN modeling, we applied an image augmentation technique to increase and balance data sizes across different disease stages, adopted the transfer learning concept to incorporate pretrained VGG16 weights and architecture into the model fitting, and used a state-of-the-art machine learning model that can generate an optimal neural architecture automatically (AutoML). The overall structure of our analytic system is illustrated in Figure 1. The Python programming language22 was the analytic tool employed in the implementation of these methods.

We used 5-fold cross-validation for model evaluation: the data were divided into five parts, with four of them serving as the training set and one as the test set, for a total of five different training-test combinations (five cross-validation rounds). Our analytic system, which includes a series of methods for image preprocessing, feature extraction, imbalanced data adjustment, and classification, was applied to the training set. When tuning the model hyperparameters, the training data were further divided into five folds: four folds were fitted with different parameter settings and one fold was used for validation. The optimal parameter setting was determined using the average accuracy of the five validations. The performance of the various approaches was evaluated in terms of training accuracy, test accuracy, and F1 score. Accuracy is defined as the proportion of images that are correctly classified. Training accuracy denotes the average accuracy of the training sets during the five cross-validation rounds. Test accuracy refers to the accuracy on the five test parts. In binary classification, the F1 score is the harmonic mean of the true positive rate over truths (ie, recall) and over positives (ie, precision). For our multiclass classification, we adopted the macro-F1 score, which is the equal-weighted average of the F1 scores of each class. We calculated the macro-F1 score according to the five test parts.
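For illustration, the following minimal sketch (not the study's actual code) outlines this evaluation scheme: an outer 5-fold split for testing, an inner 5-fold grid search for hyperparameter tuning, and test accuracy plus the macro-F1 score as metrics. The feature matrix X, the stage labels y, and the SVC parameter grid are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

# X: (n_images, n_features) array of extracted features; y: stage labels
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_acc, test_f1 = [], []
for train_idx, test_idx in outer_cv.split(X, y):
    # inner 5-fold grid search picks the setting with the best validation accuracy
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                          cv=5, scoring="accuracy")
    search.fit(X[train_idx], y[train_idx])
    y_hat = search.predict(X[test_idx])
    test_acc.append(accuracy_score(y[test_idx], y_hat))
    test_f1.append(f1_score(y[test_idx], y_hat, average="macro"))  # macro-F1
print(np.mean(test_acc), np.mean(test_f1))
```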

F I G U R E 2 Preprocessing procedure for three-dimensional (3D) single photon-emission computed tomography (SPECT) images. Step
1 involved selecting a single slice from the original 3D stereo image that contained the clearest striatum shape as the target data for
subsequent analysis. Step 2 involved trimming the image from 128 × 128 to 50 × 50 to obtain a cropped image that contained a complete brain
image with only a small portion of the black background

3.1 Image preprocessing

For SPECT images, statistical parametric mapping (SPM) software23,24 can be used for preprocessing. In the early stages of the study, we attempted to use SPM to preprocess the collected SPECT images, but the results were of insufficient quality. Our SPECT brain images were collected through general clinical pipelines. Compared with the images obtained in typical research projects, our brain image collection procedure was less standardized; therefore, the image noise induced by human or environmental factors was larger. We speculated that this noise caused the violation of SPM model assumptions, resulting in implementation results that were poorer than expected.

Our preprocessing, therefore, proceeded as follows: from the original three-dimensional (3D) stereo image, we first selected a single slice that contained the clearest striatum shape as the target data for subsequent analysis. PD diagnosis is mainly based on the characterization of the striatum, which only occupies a small part of the brain. Complete brain images may contain too much unnecessary information, leading to an increase in analytic burden. By focusing on the striatum, we reduced our data from the original 3D stereo image (128 × 128 × 64) to a two-dimensional (2D) planar image (128 × 128), as depicted in Figure 2. Because most of the image consisted of a black background, and only the middle section contained the brain image, we further trimmed the image from 128 × 128 to 50 × 50 pixels. The cropped image contains a complete brain image with only a small portion of the black background (Figure 2).
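A minimal sketch of this reduction step, assuming the SPECT volume is a 128 × 128 × 64 NumPy array and that the brain lies near the image center; the slice index and crop window used in the study are not specified, so they appear here as placeholder arguments.

```python
import numpy as np

def select_and_crop(volume, slice_idx, center=(64, 64), size=50):
    """Pick one 128 x 128 slice from the 3D SPECT volume and crop a
    size x size window around the (assumed) brain center."""
    img = volume[:, :, slice_idx]          # slice with the clearest striatum
    r, c = center
    h = size // 2
    return img[r - h:r + h, c - h:c + h]   # 50 x 50 cropped image
```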
3.2 Feature extraction

For multivariate statistical methods and ensemble learning models, we used pixel-based features (ie, every pixel represents a feature) for analysis. Our input image had 2500 (50 × 50) pixels (features). Conventional machine learning models could not be directly applied because the number of features was too large. We thus applied the following approaches to extract vital features for later PD classification analysis.

3.2.1 Principal component analysis

PCA is a commonly used dimensional reduction technique that involves explaining the variance–covariance structure of a set of variables through a few linear combinations of these variables (ie, principal components [PCs]) and then using these PCs to replace the original variables.25 Our 50 × 50 input image was first flattened into a one-dimensional vector with a length of 2500 pixels. We then performed PCA on the training data with 2500 variables and extracted the first k (≤2500) PCs, which were selected using the proportion of total variance explained or the scree plot.25 The PCA function in the sklearn Python library was used to perform the PCA.
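A minimal sketch of this step with scikit-learn; X_train and X_test are placeholder arrays of flattened 50 × 50 images, and the PCA is fitted on the training folds only.

```python
from sklearn.decomposition import PCA

# X_train, X_test: arrays of flattened images, shape (n, 2500)
pca = PCA(n_components=50)                 # 50 PCs (see Section 4.1)
train_feat = pca.fit_transform(X_train)    # fit on the four training folds
test_feat = pca.transform(X_test)          # project the test fold on the same PCs
print(pca.explained_variance_ratio_.cumsum()[-1])   # variance retained
```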

3.2.2 Multilinear PCA

When analyzing data in matrix format (eg, images), PCA first vectorizes these matrix data and then proceeds with a large variance–covariance matrix, which can be inefficient and unstable.26 MPCA is a modification of PCA. It preserves the natural matrix structure of the data in searching for PCs. For our 50 × 50 input images X_i, i = 1, … , n, MPCA will find A_1 (50 × p) and A_2 (50 × q) with p ≤ 50 and q ≤ 50 that minimize ∑_{i=1}^{n} ||(X_i − X̄) − A_1 A_1^T (X_i − X̄) A_2 A_2^T||^2, where X̄ is the average of the input images and ||⋅|| is the matrix Frobenius norm. In the solution, the low-dimensional core tensors A_1^T (X_i − X̄) A_2, i = 1, … , n, are used to replace the original images for further analysis; thus, we reduced the number of features to pq (≤2500). Notably, PCA performs SVD for a 2500 × 2500 variance-covariance matrix, whereas in MPCA, the SVD is performed for only two 50 × 50 matrices, which is much more parsimonious in terms of parameter usage. MPCA was implemented with the Python programming language.
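Because no specific MPCA library is named, the following is a minimal sketch of one common way to compute A_1 and A_2 by alternating eigen-decompositions of the mode-wise scatter matrices; the number of iterations and the initialization are placeholder choices rather than the study's settings.

```python
import numpy as np

def mpca(images, p=7, q=7, n_iter=5):
    """Minimal MPCA sketch: alternate between the row and column scatter
    matrices to estimate A1 (50 x p) and A2 (50 x q), then return the
    flattened p x q core matrices as features."""
    X = np.asarray(images, dtype=float)      # shape (n, 50, 50)
    Xc = X - X.mean(axis=0)                  # center by the mean image
    A2 = np.eye(X.shape[2])[:, :q]           # simple initialization
    for _ in range(n_iter):
        S1 = sum(M @ A2 @ A2.T @ M.T for M in Xc)
        A1 = np.linalg.eigh(S1)[1][:, ::-1][:, :p]    # top-p eigenvectors
        S2 = sum(M.T @ A1 @ A1.T @ M for M in Xc)
        A2 = np.linalg.eigh(S2)[1][:, ::-1][:, :q]    # top-q eigenvectors
    cores = np.stack([A1.T @ M @ A2 for M in Xc])     # (n, p, q)
    return A1, A2, cores.reshape(len(X), -1)          # p*q features per image
```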

3.2.3 Intensity summary statistics

Given the flattened vector of pixel values from an image (x_1, x_2, … , x_2500), we calculated its mean, variance, kurtosis, skewness, maximum, minimum, and Shannon entropy. In addition, to create the histogram representing the distribution of pixel values, we standardized the vector components as y_j = (x_j − min_i x_i) / (max_i x_i − min_i x_i), j = 1, … , 2500, and calculated the proportions of the y_j's that fell in each of the following b bins: [0, 1/b), [1/b, 2/b), … , [(b−2)/b, (b−1)/b), [(b−1)/b, 1]. Our intensity summary statistics consisted of these summary statistics and the histogram's bin proportions, a feature vector of length 7 + b. Python functions were adopted for calculating the summary statistics.
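A minimal sketch of the IStat feature vector; the bin count b and the histogram-based entropy estimate are placeholder choices, since the paper tunes the bin size rather than fixing it and does not specify its entropy estimator.

```python
import numpy as np
from scipy import stats

def istat_features(img, b=10):
    """Seven summary statistics plus b histogram bin proportions of the
    min-max standardized pixel values (feature vector of length 7 + b)."""
    x = np.asarray(img, dtype=float).ravel()        # flatten 50 x 50 -> 2500
    y = (x - x.min()) / (x.max() - x.min())         # y_j in [0, 1]
    props = np.histogram(y, bins=b, range=(0.0, 1.0))[0] / y.size
    summary = [x.mean(), x.var(), stats.kurtosis(x), stats.skew(x),
               x.max(), x.min(), stats.entropy(props)]  # entropy from histogram
    return np.concatenate([summary, props])
```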

3.2.4 Laws' texture energy measure

An image contains information not only on the brightness of the image at each unit pixel but also on the interaction among units. Image texture is a set of metrics that quantify the spatial arrangement of color or intensities in an image.27 Image texture can be defined in two manners: the structured approach considers image texture as a set of primitive texels in some regular or repeated patterns, whereas the statistical approach defines texture as quantitative measures of the arrangement of intensities in a region. Compared with the structured approach, the statistical approach is more general and simpler to compute and is used more often in practice. LTEM features are texture features generated through the statistical approach and consist of nine full images that are created by applying nine types of texture filters to the original image. These nine filtered images measure level-, edge-, spot-, and ripple-related contents of the original image (Figure 3). Here, we applied the aforementioned PCA, MPCA, and IStat to the nine texture images from the LTEM in addition to the original image. Thus, by including texture information, we expanded the feature set generated by each feature extraction by 10 times. A Python implementation of LTEM was used to create texture images.
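A minimal sketch of how the nine Laws' texture images can be produced; the 1D kernels (L5, E5, S5, R5), the outer-product construction, and the averaging of symmetric filter pairs follow the standard LTEM recipe, but the exact Python implementation used in the study is not specified, and the usual local energy (windowed absolute-response) smoothing is omitted here.

```python
import numpy as np
from scipy.ndimage import convolve

# 1D Laws' vectors: Level, Edge, Spot, Ripple
V = {"L5": [1, 4, 6, 4, 1], "E5": [-1, -2, 0, 2, 1],
     "S5": [-1, 0, 2, 0, -1], "R5": [1, -4, 6, -4, 1]}

def laws_texture_images(img):
    """Return nine texture images: six averaged symmetric pairs (eg,
    L5E5/E5L5) plus the symmetric kernels E5E5, S5S5, and R5R5."""
    img = np.asarray(img, dtype=float)
    maps = {a + b: convolve(img, np.outer(V[a], V[b])) for a in V for b in V}
    pairs = [("L5", "E5"), ("L5", "S5"), ("L5", "R5"),
             ("E5", "S5"), ("E5", "R5"), ("S5", "R5")]
    out = [(np.abs(maps[a + b]) + np.abs(maps[b + a])) / 2 for a, b in pairs]
    out += [np.abs(maps[k]) for k in ("E5E5", "S5S5", "R5R5")]
    return out
```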
3.3 Class imbalance adjustment

The distribution of our data over the multiple PD stages (classes) was relatively imbalanced, which could result in a distortion of the classification model, biasing the prediction results toward the majority class at the expense of the minority class. To address the class imbalance matter, we adopted the following approaches.

3.3.1 Synthetic minority over-sampling technique

SMOTE is an over-sampling approach in which the minority class is over-sampled by creating "synthetic" examples rather than over-sampling with replacement.28 We applied SMOTE to the extracted features used by the multivariate statistical and ensemble learning classification analyses. We selected three nearest neighbors for creating synthetic minority samples and ensured that all classes in the training set contained the same number of samples. The SMOTE function in the imblearn Python package was used to perform SMOTE.
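A minimal usage sketch of this step; train_feat and y_train are placeholder names for the extracted training features and stage labels, and the call assumes a recent imblearn release.

```python
from imblearn.over_sampling import SMOTE

# Oversample minority classes with three nearest neighbors until all six
# classes in the training set have the same number of samples.
smote = SMOTE(k_neighbors=3, random_state=0)
feat_bal, y_bal = smote.fit_resample(train_feat, y_train)
```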
3.3.2 Image augmentation

To construct a powerful image classifier using little training data, image augmentation is typically required to boost the performance of deep neural network models. The image augmentation process uses a combination of affine transformations to generate duplicate images that are rotated, shifted, flipped, or zoomed in or out. We augmented our training images by using the ImageDataGenerator function in the Keras Python library: images from each PD class were randomly selected to perform the transformations until the preset sample size for the class was reached. Our augmentation process not only increased the sample size but also attempted to balance the classes in the training data set. Then, both the original and duplicate images were entered into our deep neural network models.29
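A minimal usage sketch of the augmentation step; the transformation ranges, the placeholder array class_images (images of one PD class, shaped (n, 50, 50, 1)), and the number of drawn batches are illustrative, not the study's settings.

```python
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# Affine augmentation: random rotations, shifts, zooms, and flips.
augmenter = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                               height_shift_range=0.1, zoom_range=0.1,
                               horizontal_flip=True)
flow = augmenter.flow(class_images, batch_size=32)
# Draw augmented copies until the preset per-class sample size is reached.
augmented = np.concatenate([next(flow) for _ in range(8)])
```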

F I G U R E 3 Texture feature images generated by Laws’ texture energy measures. The first image on the left is the original image; the
nine pictures on the right contain feature images, from left to right and top to bottom, generated using the following nine texture filters,
respectively: L5E5/E5L5, L5R5/R5L5, E5S5/S5E5, S5S5, R5R5, L5S5/S5L5, E5E5, E5R5/R5E5, and S5R5/R5S5

3.4 Machine learning classification

The machine learning classification used in this study consisted of multivariate statistical methods, ensemble learning models, and the deep CNN.

3.4.1 Multivariate statistical methods

LDA adopts multivariate normal distributions to model input features and assigns objects according to their posterior probabilities of class membership.30 The LinearDiscriminantAnalysis function in the sklearn Python library was used to perform LDA. No hyperparameter required setting.

An SVM produces a separating hyperplane that has the largest distance to the nearest training data point and obtains nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.31 We used the SVC function in the sklearn Python library for SVM. The one-vs-one strategy was adopted for multiclass classification (ie, decision_function_shape = "ovo"). In this study, we tested linear, polynomial, radial basis function, and sigmoid kernels for mapping inputs into high-dimensional feature spaces. Parameter values for the kernels were set to be exponentially growing sequences, following the suggestion in ref. 32. The optimal kernel-parameter combination was selected through a grid search using 5-fold cross-validation, which was implemented through the GridSearchCV function in the sklearn Python library.
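A minimal sketch of the SVM tuning described above; the exponentially growing grids follow the guide in ref. 32, and feat_bal and y_bal denote the SMOTE-balanced training features and labels from the earlier sketch.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"kernel": ["linear"], "C": [2 ** k for k in range(-5, 16, 2)]},
    {"kernel": ["rbf", "poly", "sigmoid"],
     "C": [2 ** k for k in range(-5, 16, 2)],
     "gamma": [2 ** k for k in range(-15, 4, 2)]},
]
svm = GridSearchCV(SVC(decision_function_shape="ovo"), param_grid,
                   cv=5, scoring="accuracy")
svm.fit(feat_bal, y_bal)
print(svm.best_params_)
```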

DT is a nonparametric supervised learning method that partitions the feature space into a set of rectangles and then fits a simple model (such as a constant) in each rectangle.33 We used the DecisionTreeClassifier function in the sklearn Python library for DT. We considered hyperparameters, including different strategies for choosing the split, the maximum depth of the tree, and the number of features to scrutinize when searching for the most suitable split.

The central concept of an MLP (also known as the neural network model) is to extract linear combinations of the inputs as derived features and then model the target as a nonlinear function of these features. We used the MLPClassifier function in the sklearn Python library for MLP. We selected quasi-Newton methods as the solver for weight optimization (ie, solver = "lbfgs") because of our small dataset. We considered hyperparameters, including different hidden layer sizes and activation functions.
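A minimal sketch of the MLP tuning; the candidate hidden layer sizes and activation functions are placeholder grids, not the exact values searched in the study.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

mlp_grid = {"hidden_layer_sizes": [(50,), (100,), (50, 50)],
            "activation": ["relu", "logistic", "tanh"]}
# lbfgs, a quasi-Newton solver, suits the small data set.
mlp = GridSearchCV(MLPClassifier(solver="lbfgs", max_iter=1000),
                   mlp_grid, cv=5, scoring="accuracy")
mlp.fit(feat_bal, y_bal)
```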
the filter and the input and producing a 2D activation map
for a specific type of feature that the filter intends to learn.
3.4.2 Ensemble learning models

Ensemble learning constructs a classification model by combining the strengths of a collection of simpler base models.34 Two families of ensemble methods are typically distinguished. Bagging fits the same classification function many times to bootstrap-sampled versions of the training data and averages the results. The objective is to average many noisy but approximately unbiased models and thus reduce the variance. Boosting sequentially applies the weak (base) classifier to the bootstrap-sampled training data with repeatedly modified sampling weights; the weights are increased for those observations that are misclassified at the previous classification, and the weights are decreased for those correctly classified.

RF is a substantial modification of bagging that constructs a large collection of "de-correlated" trees and then averages them.35 We used the RandomForestClassifier function in the sklearn Python library for RF. We considered hyperparameters, including different numbers of trees in the forest, the maximum depth of the tree, and the number of features to scrutinize when searching for the most suitable split.

AdaBoost, proposed by Freund and Schapire,36 is the first practical boosting algorithm. We used the AdaBoostClassifier function in the sklearn Python library for AdaBoost. We employed SVM and DT as the base classifiers. The optimal hyperparameter for each base classifier was determined first. Then, we selected the boosting hyperparameters for this optimal base classifier, including different maximum numbers of base classifiers at which boosting was terminated and learning rates that shrank the contribution of each classifier.
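A minimal sketch of the two ensemble models; the grids are placeholders, and the AdaBoost base classifier shown is a decision tree with placeholder settings (the study tunes the base classifier first). The base_estimator argument matches the scikit-learn versions current when the study was run; newer releases rename it estimator.

```python
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

rf_grid = {"n_estimators": [100, 300, 500], "max_depth": [None, 5, 10],
           "max_features": ["sqrt", "log2"]}
rf = GridSearchCV(RandomForestClassifier(), rf_grid, cv=5, scoring="accuracy")
rf.fit(feat_bal, y_bal)

# Boost a previously tuned base classifier; tune rounds and learning rate.
ada_grid = {"n_estimators": [50, 100, 200], "learning_rate": [0.1, 0.5, 1.0]}
base_tree = DecisionTreeClassifier(max_depth=5)
ada = GridSearchCV(AdaBoostClassifier(base_estimator=base_tree),
                   ada_grid, cv=5, scoring="accuracy")
ada.fit(feat_bal, y_bal)
```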
3.4.3 Deep convolutional neural network

Deep learning is part of a broader family of machine learning methods based on neural network models. A CNN is a deep learning architecture that constructs a series of hidden layers to progressively extract higher level features from the raw input. Hidden layers in a CNN typically consist of convolutional layers for feature extraction, activation layers (functions) for nonlinear transformation, pooling layers for dimension reduction, and fully connected layers to flatten the matrix to a one-dimensional form. In convolutional layers, a learnable filter (small matrix) is convolved across the width and height of one slice of the input volume, computing the dot product between the entries of the filter and the input and producing a 2D activation map for a specific type of feature that the filter intends to learn. Stacking the activation maps for all filters along the depth dimension of the input volume forms the output volume of the convolution layer.37 The pooling procedure partitions the input image into a set of nonoverlapping rectangles and, for each such subregion, outputs the maximum or the average. The pooling layer progressively reduces the size of the image to avoid rapid retraining, alleviate the computational burden, and hence also control overfitting. Typically, the pooling layer is used after the convolution operation is performed. Activation functions can generate nonlinear mappings from inputs to complex outputs. When applying CNNs for image classification, the rectified linear unit (ReLU) activation function is commonly used after convolutional layers, and the softmax function is used in the final fully connected layer. In contrast to other image classification algorithms that rely on prior knowledge and human effort in feature design, CNNs can automatically extract input features through learned filters.

Transfer learning is a machine learning method that utilizes knowledge learned from one task as the starting point on related ones. Implementing transfer learning in a CNN involves reusing the first several layers of a network trained for a similar task with much more data. By using large amounts of data, hidden layers early in the network learn to recognize a variety of base elements that are typically not specific to the exact task that the network may be used for; therefore, transfer learning enables a CNN to have a broader application. VGG1638 is currently one of the highest-quality CNN architectures. We used the VGG16 function in the keras Python library to adopt the pretrained VGG16 model whose weights were

trained from ImageNet, which is a large and publicly available image database with a total of 14 million images and 22,000 visual categories.39 VGG16 requires input images with three channels (eg, color). In our preprocessing procedure, we had used the original SPECT 3D stereo image to select a single slice that contained the clearest striatum shape as the target data for subsequent analysis. Our input data were grayscale medical images, and we thus constructed the three channels for the VGG16 inputs either with the same selected (2D) image repeated or with the three consecutive images above and below the selected one. We used the weights from the first 18 layers of VGG16 trained on ImageNet and further trained three fully connected layers with sizes 128, 64, and 6 and activations ReLU, ReLU, and softmax, respectively. The RMSprop and Adam optimizers in the keras Python library, at their default hyperparameter values, were used to obtain the weight estimates.
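A minimal Keras sketch of this transfer learning setup; the frozen depth, optimizer settings, number of epochs, and the placeholder arrays x_aug (augmented 50 × 50 × 3 images) and y_aug (one-hot stage labels) are illustrative rather than the study's exact configuration.

```python
from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.optimizers import Adam

base = VGG16(weights="imagenet", include_top=False, input_shape=(50, 50, 3))
for layer in base.layers:                 # freeze the pretrained convolutional layers
    layer.trainable = False
x = Flatten()(base.output)
x = Dense(128, activation="relu")(x)
x = Dense(64, activation="relu")(x)
out = Dense(6, activation="softmax")(x)   # normal + five PD stages
model = Model(base.input, out)
model.compile(optimizer=Adam(), loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_aug, y_aug, epochs=30, batch_size=32)
```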
The VGG16 architecture was fine-tuned on ImageNet data that consist of common images in daily life; therefore, it may not be applicable to our medical imaging data. Neural architecture search (NAS)40 aims to identify the optimal neural network architecture for the given learning task and data set, which is the key component of AutoML. NAS adopts reinforcement learning (RL) to train a recurrent neural network controller to produce a child network by combining a set of building blocks of the network. Then, we can use the child network's prediction accuracy as the reward signal to update the controller and, as a result, produce a new child network with a higher accuracy in the next iteration of RL. Although NAS is powerful, it is computationally expensive and time consuming. The efficient neural architecture search (ENAS)41 accelerates the training process of NAS by forcing all child networks to share weights to avoid training each child network from the beginning to convergence. Auto-Keras42 is a Python library for ENAS. We used the ImageClassifier function in the autokeras library to automatically search for the architecture and hyperparameters of the deep CNN most suitable to our PD illness stage prediction by using SPECT images.
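A minimal usage sketch of the Auto-Keras search; the arguments follow the 1.x API of the autokeras library and may differ in other releases, and max_trials, epochs, and the placeholder arrays are illustrative.

```python
import autokeras as ak

clf = ak.ImageClassifier(max_trials=10, overwrite=True)   # architecture search
clf.fit(x_aug, y_aug_labels, epochs=20)                    # augmented training set
print(clf.evaluate(x_test, y_test_labels))                 # held-out test fold
```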
IStat features incorrectly classified all images represent-
ing healthy brains and most illness stages. This approach
4 R E S U LTS appeared to bias the prediction results toward majority
classes and ignore the two minority classes, which led to
4.1 Results from multivariate higher accuracy but poorer F1 scores.
statistical classification We also applied PCA, MPCA, and IStat to the nine
texture images from LTEM in addition to the original
To determine the number of PCs to retain in fitting the image and thus expanded the feature set generated by
PCA, we calculated the cumulative proportions of the total each feature extraction by 10 times (Table 2). The addi-
variance explained by k PCs, k = 1, 2, 3, … on the tional texture information from LTEM did not appear to
basis of four training folds for each cross-validation round alter the performance ranking of various approaches; how-
(Figure 4). We selected 50 components because more than ever, it improved the training accuracy and reduced the
97% of the total variance was consistently explained by test accuracy and F1 score, implying that these textures

TABLE 1 Results of Parkinson's disease stage prediction using multivariate statistical methods

Model  Feature extraction  Number of features  Training accuracy  Test accuracy  Test F1 score
LDA    PCA                 50                  0.883              0.450          0.314
LDA    MPCA                49                  0.884              0.351          0.269
LDA    IStat               50                  0.911              0.500          0.320
SVM    PCA                 50                  0.999              0.421          0.279
SVM    MPCA                49                  0.997              0.426          0.289
SVM    IStat               50                  0.991              0.520          0.370
DT     PCA                 50                  1.000              0.347          0.221
DT     MPCA                49                  0.995              0.262          0.180
DT     IStat               50                  1.000              0.470          0.365
MLP    PCA                 50                  1.000              0.450          0.355
MLP    MPCA                49                  1.000              0.450          0.308
MLP    IStat               50                  1.000              0.520          0.298

Abbreviations: DT, decision tree; IStat, intensity summary statistics; LDA, linear discriminant analysis; MLP, multilayer perceptron; MPCA, multilinear principal component analysis; PCA, principal component analysis; SVM, support vector machine.

TABLE 2 Results of Parkinson's disease stage prediction using multivariate statistical methods with additional features from Laws' texture energy measure (LTEM) images

Model  Feature extraction  Number of features  Training accuracy  Test accuracy  Test F1 score
LDA    LTEM + PCA          500 (50 × 10)       0.985              0.347          0.261
LDA    LTEM + MPCA         490 (49 × 10)       0.988              0.297          0.255
LDA    LTEM + IStat        500 (50 × 10)       0.963              0.475          0.294
SVM    LTEM + PCA          500 (50 × 10)       1.000              0.431          0.238
SVM    LTEM + MPCA         490 (49 × 10)       1.000              0.421          0.187
SVM    LTEM + IStat        500 (50 × 10)       1.000              0.406          0.146
DT     LTEM + PCA          500 (50 × 10)       0.999              0.317          0.272
DT     LTEM + MPCA         490 (49 × 10)       0.991              0.262          0.192
DT     LTEM + IStat        500 (50 × 10)       0.997              0.460          0.250
MLP    LTEM + PCA          500 (50 × 10)       1.000              0.416          0.283
MLP    LTEM + MPCA         490 (49 × 10)       1.000              0.431          0.298
MLP    LTEM + IStat        500 (50 × 10)       1.000              0.470          0.267

Abbreviations: DT, decision tree; IStat, intensity summary statistics; LDA, linear discriminant analysis; LTEM, Laws' texture energy measure; MLP, multilayer perceptron; MPCA, multilinear principal component analysis; PCA, principal component analysis; SVM, support vector machine.

4.2 Results from ensemble learning classification

The results obtained from applying the feature extraction methods on the original image and on the 10 texture images (including the original one) from LTEM are presented in Tables 3 and 4, respectively. IStat outperformed PCA and MPCA regardless of the ensemble learning classification used. When adopting a specific feature extraction method, RF resulted in higher test accuracy and F1 score than did AdaBoost. When RF adopted features from IStat, we obtained the highest test accuracy (54.5%) and F1 score (38.5%). Additional texture information from LTEM did not appear to improve the test accuracy and F1 score of the various approaches.

TABLE 3 Results of Parkinson's disease stage prediction using ensemble learning models

Model           Feature extraction  Number of features  Training accuracy  Test accuracy  Test F1 score
RF              PCA                 50                  1.000              0.470          0.282
RF              MPCA                49                  1.000              0.406          0.235
RF              IStat               50                  1.000              0.545          0.385
AdaBoost + DT   PCA                 50                  1.000              0.272          0.152
AdaBoost + DT   MPCA                49                  1.000              0.347          0.229
AdaBoost + DT   IStat               50                  1.000              0.446          0.315
AdaBoost + SVM  PCA                 50                  0.617              0.322          0.234
AdaBoost + SVM  MPCA                49                  0.578              0.268          0.182
AdaBoost + SVM  IStat               50                  0.823              0.515          0.350

Abbreviations: AdaBoost, adaptive boosting; DT, decision tree; IStat, intensity summary statistics; MPCA, multilinear principal component analysis; PCA, principal component analysis; RF, random forest; SVM, support vector machine.

TABLE 4 Results of Parkinson's disease stage prediction using ensemble learning models with additional features from Laws' texture energy measure (LTEM) images

Model           Feature extraction  Number of features  Training accuracy  Test accuracy  Test F1 score
RF              LTEM + PCA          500 (50 × 10)       1.000              0.406          0.202
RF              LTEM + MPCA         490 (49 × 10)       1.000              0.455          0.233
RF              LTEM + IStat        500 (50 × 10)       1.000              0.564          0.341
AdaBoost + DT   LTEM + PCA          500 (50 × 10)       1.000              0.327          0.193
AdaBoost + DT   LTEM + MPCA         490 (49 × 10)       1.000              0.307          0.201
AdaBoost + DT   LTEM + IStat        500 (50 × 10)       1.000              0.421          0.301
AdaBoost + SVM  LTEM + PCA          500 (50 × 10)       0.762              0.381          0.176
AdaBoost + SVM  LTEM + MPCA         490 (49 × 10)       0.789              0.401          0.113
AdaBoost + SVM  LTEM + IStat        500 (50 × 10)       0.766              0.371          0.153

Abbreviations: AdaBoost, adaptive boosting; DT, decision tree; IStat, intensity summary statistics; MPCA, multilinear principal component analysis; PCA, principal component analysis; RF, random forest; SVM, support vector machine.

4.3 Comparison of feature extraction approaches

To compare the performance of PCA and MPCA on our SPECT image data, in each cross-validation round, PCA or MPCA was applied on the four training folds to produce PC loadings or image bases (A_1, A_2), which could then be used to reconstruct images in the training and test folds. Figures 5 and 6 illustrate the reconstruction of one random training image and one random test image, respectively. For the training image (Figure 5), the component-wise sum of absolute differences between the original and reconstructed images (the "Diff" value in the figure) for PCA was much smaller than that for its MPCA counterpart (ie, k ≈ p × q). The absolute differences for both PCA and MPCA decreased as the number of features increased. For the test image (Figure 6), MPCA could progressively improve the reconstruction through the addition of more features, but PCA appeared not to make much difference after 100 PCs. These results exemplify the tendency of PCA to overfit. Notably, PCA experienced the "large p and small n" problem, where the sample size for the training set (161) was much smaller than the length of the input vector (2500). This problem had a smaller impact on MPCA because it is much more parsimonious in parameter usage.

IStat consistently outperformed the other feature extraction approaches regardless of the multivariate statistical or ensemble learning classification used for prediction (Tables 1 and 3). We observed no clear difference in prediction accuracy and F1 score between PCA and MPCA. Additional texture information from LTEM did not appear to improve the test accuracy of the various approaches (Tables 2 and 4).

F I G U R E 5 Reconstructions of one randomly selected training image by using principal component analysis (PCA) and multilinear
PCA (MPCA). The first image on the left is the original image; the first and third layers of the 12 images on the right are the results of PCA
reconstruction, and the second and fourth layers are the results of MPCA reconstruction. The Diff value in the figure is the component-wise sum of absolute differences between the original and reconstructed images. PCA_nc denotes the number of principal components used for
PCA, and MPCA_p,q refers to the number of image bases used for MPCA

We further investigated the effect of the number of features retained in the three feature extraction methods. To facilitate the comparison, here we only used LDA for classification. Figure 7 illustrates the training and test accuracies as the number of features increased for the different extraction methods. All three extraction methods performed fairly well in the training set, with the increased feature numbers improving the training accuracy. For the test set, IStat consistently exhibited the highest test accuracy for a given number of features. The predictions in the test set, in general, were initially more accurate but then progressively worsened with the increased number of features. Notably, we observed a considerable increase in IStat's test accuracy when the number of features was 70 or 75. With an appropriately selected bin size for the histogram, IStat can generate features that benefit prediction accuracy.

F I G U R E 6 Reconstructions of one randomly selected test image by using principal component analysis (PCA) and multilinear PCA
(MPCA). The first image on the left is the original image; the first and third layers of the 12 images on the right are the results of PCA
reconstruction, and the second and fourth layers are the results of MPCA reconstruction. The Diff value in the figure is the component-wise sum of absolute differences between the original and reconstructed images. PCA_nc refers to the number of principal components used for PCA,
and MPCA_p, q denotes the numbers of image bases used for MPCA

4.4 Results from the deep convolutional neural network

FIGURE 7 Effect of the number of features retained in the three feature extraction methods: principal component analysis (PCA), multilinear PCA (MPCA), and intensity summary statistics (IStat). We only used linear discriminant analysis (LDA) for classification.
The left plot illustrates the training accuracies as the number of features increases for different extraction methods; the right plot reveals the
test accuracies

TABLE 5 Results of Parkinson's disease stage prediction using a deep convolutional neural network

Model                 The 3 channels for input data    Training accuracy  Test accuracy  Test F1 score
Deep CNN—VGG16        Repeat the image three times     0.922              0.649          0.576
Deep CNN—VGG16        Use three consecutive images     0.912              0.653          0.606
Deep CNN—Auto-Keras   Repeat the image three times     0.700              0.431          0.357

We used Python's VGG16 function to fit the VGG16 model with pretrained weights from ImageNet. VGG16 requires three-channel images as its input. We thus repeated our grayscale images three times to obtain a three-channel input. The expanded three-channel images in the training set were augmented to ensure that each PD illness label contained 250 images. They were then input into VGG16 for modeling. The VGG16 deep CNN model achieved a 64.9% test accuracy and a 57.6% test F1 score (Table 5), representing an improvement of approximately 20% and 50%, respectively, compared with the results from the multivariate statistical and ensemble learning classifications.

We also used the three consecutive images above and below the selected slice in our preprocessing as VGG16's input. We did observe an improvement in test accuracy and F1 score; nevertheless, the two types of inputs yielded very similar prediction results (Table 5). We speculate that the difference between the three consecutive images was not obvious; thus, the model construction was not greatly affected by the input data changes.

When autokeras in Python was used to automatically search for the architecture and hyperparameters of the deep CNN, the three channels of the input data were formed by the same grayscale image, and the augmented training set had 250 images for each PD illness class. The performance of the model automatically generated by autokeras was not only inferior to that of the VGG16 model but also poorer than that of some multivariate statistical and ensemble learning models (Table 5). The five deep CNN architectures generated by autokeras for the five cross-validation rounds differed substantially; they contained 50, 32, 87, 93, and 83 blocks, respectively. Our data appeared to be insufficient for autokeras to train a stable and robust model, which ultimately resulted in a relatively poor prediction ability.

To examine how the sample size and the number of augmented images affected the performance of the VGG16 deep CNN model, we executed the model on all of the original samples, a random selection of two thirds of them, or a random selection of half of them. The random sampling was stratified by PD illness stage, except for the normal and fifth stages, where the numbers of images (six and seven, respectively) were limited; therefore, all images for those two groups were selected for analysis. For each sample type, the repeated three-channel images of the training set were augmented to create balanced, 150, 250, 500, or 700 images for each illness stage. Here, "balanced" indicates that all stages had the same number of images as that of the largest stage. The training and test accuracies from the various combinations of sample and augmented sizes are revealed in Figure 8. Within each sample type, as the number of augmented images increased, the training accuracy increased, whereas the test accuracy first improved and then became poorer. However, when adopting the same number of augmented images, the increased sample size appeared to reduce the training accuracy and increase the test accuracy.

F I G U R E 8 Effects of the sample size and number of augmented images on the performance of the VGG16 deep convolutional neural
network (CNN) model. The left plot illustrates the training accuracies from various combinations of sample sizes and augmented image
numbers; the right plot reveals the test accuracies

5 DISCUSSION

In the classification of high-dimensional data (eg, imaging, microarray, or genome sequencing), feature extraction is typically performed first to achieve dimension reduction; then, machine learning classification is applied to predict the data class labels. This study compared the performance of various combinations of feature extraction and machine learning methods in classifying medical imaging data. We considered dimension reduction, handcrafted, and deep learning feature extraction approaches. We employed the following machine learning classifications: multivariate statistical methods, ensemble learning models, and the deep convolutional neural network. For multivariate statistical methods and ensemble learning models, we observed the highest accuracy when MLP and RF, respectively, adopted features from IStat. Overall, the deep CNN model with pretrained VGG16 weights and architecture outperformed the other approaches and achieved a greater than 20% and 50% improvement in test accuracy and F1 score, respectively, compared with the multivariate statistical and ensemble learning classifications.

The deep learning model tends to require considerably more data to achieve its full capacity than conventional machine learning algorithms. When analyzing "not-so-big data," the conventional wisdom favors carefully constructed multivariate statistical or ensemble learning models over complicated deep learning approaches. When MLP and RF adopted features from IStat, we observed average test accuracies of 49.5% and 55.5%, respectively, whereas the AutoML deep CNN model only reached a test accuracy of 43.1%. These findings were consistent with our expectations. However, transfer learning that used weights trained from large and publicly available data on our tasks compensated for the shortage of data in applying deep learning models. Our VGG16 deep CNN model with pretrained weights from ImageNet achieved a test accuracy of 64.9%. Medical image data are rare and their labeling is expensive; therefore, having a small to medium amount of data is not unusual. The transfer learning approach can thus be applied to enhance the deep CNN model's capacity in analyzing these small- or medium-sized data. Moreover, different criteria may be employed for the diagnosis of certain diseases. The diagnosis of PD can serve as an example: in addition to the HYS used in the current study, the Unified Parkinson's Disease Rating Scale43 is another commonly used rating scale. When several SPECT image datasets with class labels that were generated through different diagnostic criteria are available, transfer learning can use the knowledge obtained from one data set as the starting point for another set and thus combine these data sets to form a better classifier.

Our analysis revealed that the number of augmented images and the sample size can affect the performance of the VGG16 deep CNN model. Image augmentation not only increased the amount of data but also introduced some noise to enhance the robustness of the model. The addition of more augmented images can thus improve both training and test accuracies.

However, excessively expanding the training images is equivalent to repeatedly training on much the same data and thus brings less prediction improvement. When the same number of augmented images is used, the larger data can generate an analytic set with rich variation, whereas the smaller data have many similar images with small transforms; as a result, classifiers for large data are superior in predicting the unknown future, whereas models constructed for less variable data have a better goodness of fit. Applying transfer learning in the VGG16 deep CNN model is more appropriate when the source task (eg, ImageNet object classification) is sufficiently related to the target task (eg, functional brain images for PD stage prediction). Our results indicated that the increased sample size appeared to reduce the training accuracy, which indirectly supports the aforementioned assertion because less variable target data are more similar to the source data than are data with richer variation. We thus offer the following suggestions for applying deep CNN models with transfer learning. First, the retained number of augmented images can be determined using cross-validation, as described in Section 4. Second, the larger the target data are, the higher the predictive ability of the model is. Third, to obtain the largest benefit from transfer learning, the source data must be vastly larger than and sufficiently similar to the target data.

The handcrafted feature extraction approach IStat consistently outperformed the dimensional reduction approaches of PCA and MPCA. PCA and MPCA solely use first- and second-order moments (ie, mean and variance–covariance) to generate summary features for classification, whereas IStat includes extra higher order moments that appear necessary for predicting class labels. The deep CNN model constructs convolutional and activation layers to form nonlinear combinations of input variables and can thus learn flexible and useful features for class prediction.

To include as much information as possible in the analysis, we examined the effects of including extra texture information from LTEM. These efforts have the potential to create additional features that are useful in class prediction, but they did not appear to improve the performance of the classification models. One explanation might be that our data were of insufficient size to observe some extreme values of the created features that have larger effects on the prediction of class labels. Enlarging the sample size might increase the likelihood of observing these extreme values and thus help verify their effects on prediction. Conventional approaches are constructed around the mean and focus on common event prediction, whereas extreme values are on the tail of the distribution, with a rare occurrence. Therefore, as is common in extreme value analysis44 or in genetic studies for testing the contribution of rare variants to disease,45,46 new approaches are required that are based on collapsing extreme values in different features into a single variable (eg, block maxima), followed by adopting the generalized extreme value distribution for analysis.

ACKNOWLEDGMENTS
This research was partially supported by grants from the Ministry of Science and Technology, Taiwan (MOST 105-2118-M-009-004-MY2 and MOST 107-2118-M-009-005-MY2). We are grateful to the National Center for High-performance Computing, Taiwan, for computer time and facilities. This manuscript was edited by Wallace Academic Editing.

ORCID
Guan-Hua Huang https://orcid.org/0000-0002-1802-3855

REFERENCES
1. A. H. Schapira et al., Perspectives on recent advances in the understanding and treatment of Parkinson's disease, Eur. J. Neurol. 16(10) (2009), 1090–1099.
2. L. M. de Lau and M. M. Breteler, Epidemiology of Parkinson's disease, Lancet Neurol. 5(6) (2006), 525–535.
3. W. M. Liu et al., Time trends in the prevalence and incidence of Parkinson's disease in Taiwan: A nationwide, population-based study, J. Formosan Med. Assoc. 115(7) (2016), 531–538.
4. M. M. Hoehn and M. D. Yahr, Parkinsonism: onset, progression and mortality, Neurology 17(5) (1967), 427–442.
5. Parkinson's Foundation: Stages of Parkinson's, Parkinson's Foundation, Miami, FL, 2020, available at https://www.parkinson.org/Understanding-Parkinsons/What-is-Parkinsons/Stages-of-Parkinsons
6. F. L. Pagan, Improving outcomes through early diagnosis of Parkinson's disease, Am. J. Managed Care 18(7) Suppl (2012), S176–S182.
7. L. Wang et al., SPECT molecular imaging in Parkinson's disease, J. Biomed. Biotechnol. 2012 (2012), 412486.
8. J. Booij et al., Imaging of the dopaminergic neurotransmission system using single-photon emission tomography and positron emission tomography in patients with parkinsonism, Eur. J. Nuclear Med. 26(2) (1999), 171–182.
9. C. Scherfler and M. Nocker, Dopamine transporter SPECT: How to remove subjectivity? Mov. Disord. 24(S2) (2009), S721–S724.
10. R. Prashanth et al., High-accuracy classification of Parkinson's disease through shape analysis and surface fitting in 123I-Ioflupane SPECT imaging, IEEE J. Biomed. Health Inf. 21(3) (2017), 794–802.
11. R. T. Staff et al., Shape analysis of 123I-N-ω-fluoropropyl-2β-carbomethoxy-3β-(4-iodophenyl) nortropane single-photon emission computed tomography images in the assessment of patients with Parkinsonian syndromes, Nuclear Med. Commun. 30(3) (2009), 194–201.
12. I. A. Illan et al., Automatic assistance to Parkinson's disease diagnosis in DaTSCAN SPECT imaging, Med. Phys. 39(10) (2012), 5971–5980.
13. D. J. Towey, P. G. Bain, and K. S. Nijran, Automatic classification of 123I-FP-CIT (DaTSCAN) SPECT images, Nuclear Med. Commun. 32(8) (2011), 699–707.

14. R. Prashanth et al., Automatic classification and prediction models for early Parkinson's disease diagnosis from SPECT imaging, Expert Syst. Appl. 41(7) (2014), 3333–3342.
15. A. Rojas et al., Application of empirical mode decomposition (EMD) on DaTSCAN SPECT images to explore Parkinson disease, Expert Syst. Appl. 40(7) (2013), 2756–2766.
16. F. Segovia et al., Improved parkinsonism diagnosis using a partial least squares based approach, Med. Phys. 39(7) (2012), 4395–4403.
17. N. A. Bhalchandra et al., Early detection of Parkinson's disease through shape based features from 123I-Ioflupane SPECT imaging, in IEEE 12th International Symposium on Biomedical Imaging (ISBI), New York, NY, IEEE, 2015, 963–966. doi: 10.1109/ISBI.2015.7164031.
18. W. Caesarendra et al., A pattern recognition method for stage classification of Parkinson's disease utilizing voice features, in IEEE Conference on Biomedical Engineering and Sciences (IECBES), Kuala Lumpur, 2014, 87–92.
19. H. Choi et al., Refining diagnosis of Parkinson's disease with deep learning-based interpretation of dopamine transporter imaging, NeuroImage Clin. 16 (2017), 586–594.
20. W. S. Huang et al., Evaluation of early-stage Parkinson's disease with 99mTc-TRODAT-1 imaging, J. Nuclear Med. 42(9) (2001), 1303–1308.
21. S. Y. Hsu et al., Feasible classified models for Parkinson disease from 99mTc-TRODAT-1 SPECT imaging, Sensors 19(7) (2019), 1740.
22. Python, available at https://www.python.org/
23. K. J. Friston et al., Statistical parametric maps in functional imaging: A general linear approach, Human Brain Map. 2(4) (1994), 189–210.
24. SPM—Statistical Parametric Mapping, available at https://www.fil.ion.ucl.ac.uk/spm/
25. R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6th ed., Prentice Hall, Upper Saddle River, NJ, 2007, 430–459.
26. T. L. Chen et al., An introduction to multilinear principal component analysis, J. Chin. Stat. Assoc. 52(1) (2014), 24–43.
27. L. G. Shapiro and G. C. Stockman, Computer Vision, Prentice Hall, Upper Saddle River, NJ, 2001, 235–245.
28. N. V. Chawla et al., SMOTE: Synthetic minority over-sampling technique, J. Artif. Intell. Res. 16(1) (2002), 321–357.
29. L. Perez and J. Wang, The effectiveness of data augmentation in image classification using deep learning, arXiv 1712.04621, (2017).
30. R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6th ed., Prentice Hall, Upper Saddle River, NJ, 2007, Chapter 11.
31. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, New York, NY, 2009, Chapter 12.
32. C. W. Hsu, C. C. Chang, and C. J. Lin, A practical guide to support vector classification, Technical Report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 2016.
33. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, New York, NY, 2009, 389–409.
34. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, New York, NY, 2009, Chapter 16.
35. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, New York, NY, 2009, Chapter 15.
36. Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55(1) (1997), 119–139.
37. Wikipedia: Deep learning, available at https://en.wikipedia.org/wiki/Deep_learning
38. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv (2014), 1409.1556.
39. ImageNet, available at http://image-net.org/
40. B. Zoph and Q. V. Le, Neural architecture search with reinforcement learning, arXiv (2017), 1611.01578v2.
41. H. Pham et al., Efficient neural architecture search via parameter sharing, arXiv 1802.03268v2, (2018).
42. H. Jin, Q. Song, and X. Hu, Auto-Keras: An efficient neural architecture search system, arXiv preprint 1806.10282v3, (2019).
43. S. Fahn, R. L. Elton, and UPDRS Program Members, Unified Parkinson's Disease Rating Scale, in Recent Developments in Parkinson's Disease, Vol 2, S. Fahn et al., Eds., Macmillan Healthcare Information, Florham Park, NJ, 1987, 153–163.
44. S. Coles, An Introduction to Statistical Modeling of Extreme Values, Springer-Verlag, London, 2001, Chapter 3.
45. S. Morgenthaler and W. G. Thilly, A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST), Mutat. Res. 615(1–2) (2007), 28–56.
46. M. C. Wu et al., Rare-variant association testing for sequencing data with the sequence kernel association test, Am. J. Human Genetics 89(1) (2011), 82–93.

How to cite this article: Huang G-H, Lin C-H, Cai Y-R, et al. Multiclass machine learning classification of functional brain images for Parkinson's disease stage prediction. Stat Anal Data Min: The ASA Data Sci Journal. 2020;13:508–523. https://doi.org/10.1002/sam.11480
