DOI: 10.1002/sam.11480
RESEARCH ARTICLE

1 Institute of Statistics, National Chiao Tung University, Hsinchu, Taiwan
2 Department of Medical Imaging and Radiological Sciences, I-Shou University, Kaohsiung, Taiwan
3 Department of Information Engineering, I-Shou University, Kaohsiung, Taiwan
4 Department of Pharmacy, Tajen University, Pingtung, Taiwan
5 Department of Radiology, E-Da Hospital, I-Shou University, Kaohsiung, Taiwan
6 School of Medicine, College of Medicine, I-Shou University, Kaohsiung, Taiwan
7 Department of Nuclear Medicine, E-Da Hospital, I-Shou University, Kaohsiung, Taiwan

Correspondence
Guan-Hua Huang, Institute of Statistics, National Chiao Tung University, 1001 University Road, Hsinchu 30010, Taiwan.
Email: ghuang@stat.nctu.edu.tw

Funding information
Ministry of Science and Technology, Taiwan, Grant/Award Numbers: MOST 105-2118-M-009-004-MY2, MOST 107-2118-M-009-005-MY2

Abstract
We analyzed a data set containing functional brain images from 6 healthy controls and 196 individuals with Parkinson’s disease (PD), who were divided into five stages according to illness severity. The goal was to predict patients’ PD illness stages by using their functional brain images. We employed the following prediction approaches: multivariate statistical methods (linear discriminant analysis, support vector machine, decision tree, and multilayer perceptron [MLP]), ensemble learning models (random forest [RF] and adaptive boosting), and deep convolutional neural network (CNN). For statistical and ensemble models, various feature extraction approaches (principal component analysis [PCA], multilinear PCA, intensity summary statistics [IStat], and Laws’ texture energy measure) were employed to extract features, the synthetic minority over-sampling technique was used to address imbalanced data, and the optimal combination of hyperparameters was found using a grid search. For CNN modeling, we applied an image augmentation technique to increase and balance data sizes over different disease stages. We adopted transfer learning to incorporate pretrained VGG16 weights and architecture into the model fitting, and we also tested a state-of-the-art machine learning model that could automatically generate an optimal neural architecture. We found that IStat consistently outperformed other feature extraction approaches. MLP and RF were the analytic approaches with the highest prediction accuracy rate for multivariate statistical and ensemble learning models, respectively. Overall, the deep CNN model with pretrained VGG16 weights and architecture outperformed other approaches; it captured critical features from imaging, effectively distinguished between normal controls and patients with PD, and achieved the highest classification accuracy.
KEYWORDS
deep neural network, functional brain image, machine learning, supervised classification
Stat Anal Data Min: The ASA Data Sci Journal. 2020;13:508–523. wileyonlinelibrary.com/sam © 2020 Wiley Periodicals LLC 508
HUANG et al. 509
FIGURE 1 Proposed analytic system for multiclass machine learning classification of functional brain images for Parkinson’s disease
stage prediction
FIGURE 2 Preprocessing procedure for three-dimensional (3D) single photon-emission computed tomography (SPECT) images. Step
1 involved selecting a single slice from the original 3D stereo image that contained the clearest striatum shape as the target data for
subsequent analysis. Step 2 involved trimming the image from 128 × 128 to 50 × 50 to obtain a cropped image that contained a complete brain
image with only a small portion of the black background
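The two preprocessing steps in Figure 2 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `select_slice_and_crop` is hypothetical, the slice index is passed in by hand (the article does not say how the clearest striatum slice was picked), and the 50 × 50 window is assumed to be centered.

```python
import numpy as np

def select_slice_and_crop(volume, slice_index, out_size=50):
    """Take one slice from a 3D volume and center-crop it to out_size x out_size."""
    image = volume[:, :, slice_index]            # Step 1: (128, 128, 64) -> (128, 128)
    top = (image.shape[0] - out_size) // 2       # assumed: the crop window is centered
    left = (image.shape[1] - out_size) // 2
    return image[top:top + out_size, left:left + out_size]   # Step 2: -> (50, 50)

# Synthetic stand-in for one SPECT volume; slice 32 is an arbitrary choice.
volume = np.random.default_rng(0).random((128, 128, 64))
cropped = select_slice_and_crop(volume, slice_index=32)
print(cropped.shape)
```

With real data, the slice index and the crop offsets would have to be chosen so that the striatum stays inside the window.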
and over positives (ie, precision). For our multiclass classification, we adopted the macro-F1 score, which is the equal-weighted average of the F1 scores of each class. We calculated the macro-F1 score according to the five test parts.

For SPECT images, statistical parametric mapping (SPM) software [23,24] can be used for preprocessing. In the early stages of the study, we attempted to use SPM to preprocess the collected SPECT images, but the results were of insufficient quality. Our SPECT brain images were collected through general clinical pipelines. Compared with the images obtained in typical research projects, our brain image collection procedure was less standardized; therefore, the image noise induced by human or environmental factors was larger. We speculated that these noises caused the violation of SPM model assumptions, resulting in implementation results that were poorer than expected.

Our preprocessing, therefore, proceeded as follows: from the original three-dimensional (3D) stereo image, we first selected a single slice that contained the clearest striatum shape as the target data for subsequent analysis. PD diagnosis is mainly based on the characterization of the striatum, which only occupies a small part of the brain. Complete brain images may contain too much unnecessary information, leading to an increase in analytic burden. By focusing on the striatum, we reduced our data from the original 3D stereo image (128 × 128 × 64) to a two-dimensional (2D) planar image (128 × 128), as depicted in Figure 2. Because most of the image consisted of a black background, and only the middle section contained the brain image, we further trimmed the image from 128 × 128 to 50 × 50 pixels. The cropped image contains a complete brain image with only a small portion of the black background (Figure 2).

For multivariate statistical methods and ensemble learning models, we used pixel-based features (ie, every pixel represents a feature) for analysis. Our input image had 2500 (50 × 50) pixels (features). Conventional machine learning models could not be directly applied because the number of features was too large. We thus applied the following approaches to extract vital features for later PD classification analysis.

3.2.1 Principal component analysis

PCA is a commonly used dimensional reduction technique that involves explaining the variance–covariance structure of a set of variables through a few linear combinations of these variables (ie, principal components [PCs]) and then using these PCs to replace the original variables [25]. Our 50 × 50 input image was first flattened into a one-dimensional vector with a length of 2500 pixels. We then performed PCA on the training data with 2500 variables and extracted the first k (≤2500) PCs, which were selected using the proportion of total variance explained or the scree plot [25]. The PCA function in the sklearn Python library was used to perform the PCA.
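The PCA step of Section 3.2.1 might look roughly as follows with `sklearn.decomposition.PCA`. The images here are synthetic stand-ins, and the 90% variance threshold is an assumption chosen for illustration; the article states only that k was chosen from the proportion of variance explained or the scree plot.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_images = rng.random((161, 50, 50))   # synthetic stand-ins for training images
test_images = rng.random((41, 50, 50))     # synthetic stand-ins for test images

# Flatten each 50 x 50 image into a vector of 2500 pixel features.
X_train = train_images.reshape(len(train_images), -1)
X_test = test_images.reshape(len(test_images), -1)

# A float n_components keeps the smallest number of PCs whose cumulative
# explained variance reaches that fraction (0.90 is an assumed threshold).
pca = PCA(n_components=0.90, svd_solver="full")
Z_train = pca.fit_transform(X_train)       # loadings are fitted on training data only
Z_test = pca.transform(X_test)             # test data reuse the training loadings
k = pca.n_components_
print(k, Z_train.shape, Z_test.shape)
```

Fitting on the training folds and only transforming the test fold keeps information from the test images out of the extracted features.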
3.2.2 Multilinear PCA

When analyzing data in matrix format (eg, images), PCA first vectorizes these matrix data and then proceeds with a large variance–covariance matrix, which can be inefficient and unstable [26]. MPCA is a modification of PCA. It preserves the natural matrix structure of the data in searching for PCs. For our 50 × 50 input image $X_i$, $i = 1, \ldots, n$, MPCA will find $A_1$ (50 × p) and $A_2$ (50 × q) with p ≤ 50 and q ≤ 50 that minimize

$$\sum_{i=1}^{n} \lVert (X_i - \bar{X}) - A_1 A_1^{T} (X_i - \bar{X}) A_2 A_2^{T} \rVert^{2},$$

where $\bar{X}$ is the average of the input images and $\lVert \cdot \rVert$ is the matrix Frobenius norm. In the solution, the low-dimensional core tensors $A_1^{T} (X_i - \bar{X}) A_2$, $i = 1, \ldots, n$, are used to replace the original images for further analysis; thus, we reduced the number of features to pq (≤2500). Notably, PCA performs SVD for a 2500 × 2500 variance–covariance matrix, whereas in MPCA, the SVD is performed for only two 50 × 50 matrices, which is much more parsimonious in terms of parameter usage. MPCA was implemented with the Python programming language.

compute and is used more often in practice. LTEM are texture features generated through the statistical approach and consist of nine full images that are created by applying nine types of texture filters to the original image. These nine filtered images measure level-, edge-, spot-, and ripple-related contents of the original image (Figure 3). Here, we applied the aforementioned PCA, MPCA, and IStat to the nine texture images from the LTEM in addition to the original image. Thus, by including texture information, we expanded the feature set generated by each feature extraction by 10 times. A Python implementation of LTEM was used to create texture images.

3.3 Class imbalance adjustment

The distribution of our data over multiple PD stages (classes) was relatively imbalanced, which could result in a distortion of the classification model, biasing the prediction results toward the majority class at the expense of the minority class. To address the class imbalance matter, we adopted the following approaches.
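The abstract names the synthetic minority over-sampling technique (SMOTE) as one such approach. Its core idea is to create a synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbors. The NumPy sketch below illustrates only that interpolation step; the function name `smote_like_oversample`, the neighbor count, and the feature dimension are assumptions for illustration, and this is not the implementation used in the study.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)                                   # at most n - 1 neighbors exist
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                        # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]           # k nearest minority rows per row
    base = rng.integers(0, n, size=n_new)               # starting sample for each new row
    mate = neighbors[base, rng.integers(0, k, size=n_new)]
    u = rng.random((n_new, 1))                          # interpolation weight in [0, 1)
    return X_min[base] + u * (X_min[mate] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.random((6, 70))      # six minority samples, 70 hypothetical features
X_synth = smote_like_oversample(X_minority, n_new=44)
print(X_synth.shape)
```

Because each synthetic row lies on the segment between two real minority rows, the new points stay inside the range already covered by that class.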
FIGURE 3 Texture feature images generated by Laws’ texture energy measures. The first image on the left is the original image; the
nine pictures on the right contain feature images, from left to right and top to bottom, generated using the following nine texture filters,
respectively: L5E5/E5L5, L5R5/R5L5, E5S5/S5E5, S5S5, R5R5, L5S5/S5L5, E5E5, E5R5/R5E5, and S5R5/R5S5
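The nine filters named in Figure 3 are built from the standard one-dimensional Laws vectors for level (L5), edge (E5), spot (S5), and ripple (R5). The sketch below shows only the filtering step and the averaging of symmetric pairs such as L5E5/E5L5. It is a simplified illustration: the local mean removal and windowed energy summation of Laws' full procedure are omitted, the helper names are hypothetical, and the article does not specify which variant its Python implementation used.

```python
import numpy as np

VECTORS = {
    "L5": np.array([1, 4, 6, 4, 1], dtype=float),     # level
    "E5": np.array([-1, -2, 0, 2, 1], dtype=float),   # edge
    "S5": np.array([-1, 0, 2, 0, -1], dtype=float),   # spot
    "R5": np.array([1, -4, 6, -4, 1], dtype=float),   # ripple
}
# The nine filter pairs, in the order listed in the Figure 3 caption.
PAIRS = [("L5", "E5"), ("L5", "R5"), ("E5", "S5"), ("S5", "S5"), ("R5", "R5"),
         ("L5", "S5"), ("E5", "E5"), ("E5", "R5"), ("S5", "R5")]

def filter2d(image, kernel):
    """Same-size 2D correlation with zero padding, written in plain NumPy."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out

def laws_texture_images(image):
    """Return nine texture maps; each averages a 5 x 5 filter and its transpose."""
    maps = {}
    for a, b in PAIRS:
        ab = np.abs(filter2d(image, np.outer(VECTORS[a], VECTORS[b])))
        ba = np.abs(filter2d(image, np.outer(VECTORS[b], VECTORS[a])))
        maps[a + b] = (ab + ba) / 2.0
    return maps

image = np.random.default_rng(0).random((50, 50))    # synthetic 50 x 50 image
texture_maps = laws_texture_images(image)
print(sorted(texture_maps))
```

Every filter that contains E5 or S5 sums to zero, so it responds to local changes in intensity rather than to overall brightness.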
balance classes in the training data set. Then, both original and duplicate images were entered into our deep neural network models [29].

3.4 Machine learning classification

The machine learning classification used in this study consisted of multivariate statistical methods, ensemble learning models, and the deep CNN.

3.4.1 Multivariate statistical methods

LDA adopts multivariate normal distributions to model input features and assigns objects according to their posterior probabilities of class membership [30]. The LinearDiscriminantAnalysis function in the sklearn Python library was used to perform LDA. No hyperparameter required setting.

An SVM produces a separating hyperplane that has the largest distance to the nearest training data point and has nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space [31]. We used the SVC function in the sklearn Python library for SVM. The one-vs-one strategy was adopted for multiclass classification (ie, decision_function_shape = “ovo”). In this study, we tested linear, polynomial, radial basis function, and sigmoid kernels for mapping inputs into high-dimensional feature spaces. Parameter values for kernels were set to be exponentially growing sequences, following the suggestion in ref. 32. The optimal kernel-parameter combination was selected through a grid search using 5-fold cross-validation, which was implemented through the GridSearchCV function in the sklearn Python library.
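The SVM tuning described above could be set up roughly as follows. The data are synthetic, and the particular powers of two in the grid are assumptions for illustration; the article says only that the parameter sequences grew exponentially.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = np.repeat(np.arange(5), 20)                    # five classes, 20 samples each
X = rng.normal(size=(100, 20)) + y[:, None]        # synthetic features with class shifts

param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [2.0 ** e for e in (-2, 1, 4)],           # exponentially growing (assumed values)
    "gamma": [2.0 ** e for e in (-9, -6, -3)],     # exponentially growing (assumed values)
}
search = GridSearchCV(
    SVC(decision_function_shape="ovo"),            # setting quoted in the text
    param_grid,
    cv=5,                                          # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`gamma` has no effect on the linear kernel, so some grid points are redundant; a list of per-kernel grids would avoid the repeated fits.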
DT is a nonparametric supervised learning method that partitions the feature space into a set of rectangles and then fits a simple model (such as a constant) in each rectangle [33]. We used the DecisionTreeClassifier function in the sklearn Python library for DT. We considered hyperparameters, including different strategies for choosing the split, maximum depth of the tree, and number of features to scrutinize, when searching for the most suitable split.

The central concept of an MLP (also known as the neural network model) is to extract linear combinations of the inputs as derived features and then model the target as a nonlinear function of these features. We used the MLPClassifier function in the sklearn Python library for MLP. We selected quasi-Newton methods as the solver for weight optimization (ie, solver = “lbfgs”) because of our small dataset. We considered hyperparameters, including different hidden layer sizes and activation functions.

3.4.2 Ensemble learning models

Ensemble learning constructs a classification model by combining the strengths of a collection of simpler base models [34]. Two families of ensemble methods are typically distinguished. Bagging fits the same classification function many times to bootstrap-sampled versions of the training data and averages the results. The objective is to average many noisy but approximately unbiased models and thus reduce the variance. Boosting sequentially applies the weak (base) classifier to the bootstrap-sampled training data with repeatedly modified sampling weights; the weights are increased for those observations that are misclassified at the previous classification and the weights are decreased for those correctly classified.

RF is a substantial modification of bagging that constructs a large collection of “de-correlated” trees and then averages them [35]. We used the RandomForestClassifier function in the sklearn Python library for RF. We considered hyperparameters, including different number of trees in the forest, maximum depth of the tree, and number of features to scrutinize, when searching for the most suitable split.

AdaBoost, proposed by Freund and Schapire [36], is the first practical boosting algorithm. We used the AdaBoostClassifier function in the sklearn Python library for AdaBoost. We employed SVM and DT as the base classifiers. The optimal hyperparameter for each base classifier was determined first. Then, we selected the boosting hyperparameters for this optimal base classifier, including different maximum numbers of base classifiers at which boosting was terminated and learning rates that shrank the contribution of each classifier.

3.4.3 Deep convolutional neural network

Deep learning is part of a broader family of machine learning methods based on neural network models. A CNN is a deep learning architecture that constructs a series of hidden layers to progressively extract higher level features from the raw input. Hidden layers in CNN typically consist of convolutional layers for feature extraction, activation layers (functions) for nonlinear transformation, pooling layers for dimension reduction, and fully connected layers to flatten the matrix to a one-dimensional form. In convolutional layers, a learnable filter (small matrix) is convolved across the width and height of one slice of the input volume, computing the dot product between the entries of the filter and the input and producing a 2D activation map for a specific type of feature that the filter intends to learn. Stacking the activation maps for all filters along the depth dimension of the input volume forms the output volume of the convolution layer [37]. The pooling procedure partitions the input image into a set of nonoverlapping rectangles and, for each such subregion, outputs the maximum or the average. The pooling layer progressively reduces the size of the image to avoid rapid retraining, alleviate computational burden and hence also control overfitting. Typically, the pooling layer is used after the convolution operation is performed. Activation functions can generate nonlinear mapping from inputs to complex outputs. When applying CNNs for image classification, the rectified linear unit (ReLU) activation functions are commonly used after convolutional layers, and the softmax function is used in the final fully connected layer. In contrast to other image classification algorithms that rely on prior knowledge and human effort in feature design, CNNs can automatically extract input features through learned filters.

Transfer learning is a machine learning method that utilizes knowledge learned from one task as the starting points on related ones. Implementing transfer learning in the CNN involves reusing the first several layers of the network for a similar task with much more data. By using large amounts of data, hidden layers earlier in the network learn to recognize a variety of base elements that are typically not specific to the exact task that the network may be used for; therefore, transfer learning enables a CNN to have a broader application. VGG16 [38] is currently one of the highest-quality CNN architectures. We used the VGG16 function in the keras Python library to adopt the pretrained VGG16 model whose weights were
Abbreviations: DT, decision tree; IStat, intensity summary statistics; LDA, linear discriminant analysis; MLP,
multilayer perceptron; MPCA, multilinear principal component analysis; PCA, principal component analysis;
SVM, support vector machine.
Abbreviations: DT, decision tree; IStat, intensity summary statistics; LDA, linear discriminant analysis; LTEM,
Laws’ texture energy measure; MLP, multilayer perceptron; MPCA, multilinear principal component analysis;
PCA, principal component analysis; SVM, support vector machine.
from LTEM did not generate additional useful features for predictions.

4.2 Results from ensemble learning classification

The results obtained from applying feature extraction methods on the original image and on the 10 texture images (including the original one) from LTEM are presented in Tables 3 and 4, respectively. IStat outperformed PCA and MPCA regardless of the ensemble learning classification used. When adopting a specific feature extraction method, RF resulted in higher test accuracy and F1 score than did AdaBoost. When RF adopted features from IStat, we obtained the highest test accuracy (54.5%) and F1 score (38.5%). Additional texture information from LTEM did not appear to improve the test accuracy and F1 score of various approaches.
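The F1 score reported alongside test accuracy is the macro-F1, defined earlier as the equal-weighted average of the per-class F1 scores. A minimal pure-Python sketch of that calculation is given below on a made-up three-class example; setting F1 to 0 for a class that is never predicted is an assumed convention, not something the article states.

```python
def macro_f1(y_true, y_pred, labels):
    """Equal-weighted average of the per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Assumed convention: F1 is 0 when precision and recall are both 0.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Made-up labels for three classes; per-class F1 scores are 0.5, 0.8, and 2/3.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, labels=[0, 1, 2])
print(round(score, 4))
```

Because every class carries equal weight, a poorly predicted small class (such as the normal or fifth-stage group) lowers the macro-F1 much more than it lowers overall accuracy.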
Abbreviations: AdaBoost, adaptive boosting; DT, decision tree; IStat, intensity summary statistics; MPCA, multilinear
principal component analysis; PCA, principal component analysis; RF, random forest; SVM, support vector machine.
TABLE 4 Results of Parkinson’s disease stage prediction using ensemble learning models with additional features from Laws’ texture energy measure (LTEM) images

Model            Feature extraction   Number of features   Training accuracy   Test accuracy   Test F1 score
RF               LTEM + PCA           500 (50 × 10)        1.000               0.406           0.202
RF               LTEM + MPCA          490 (49 × 10)        1.000               0.455           0.233
RF               LTEM + IStat         500 (50 × 10)        1.000               0.564           0.341
AdaBoost + DT    LTEM + PCA           500 (50 × 10)        1.000               0.327           0.193
AdaBoost + DT    LTEM + MPCA          490 (49 × 10)        1.000               0.307           0.201
AdaBoost + DT    LTEM + IStat         500 (50 × 10)        1.000               0.421           0.301
AdaBoost + SVM   LTEM + PCA           500 (50 × 10)        0.762               0.381           0.176
AdaBoost + SVM   LTEM + MPCA          490 (49 × 10)        0.789               0.401           0.113
AdaBoost + SVM   LTEM + IStat         500 (50 × 10)        0.766               0.371           0.153

Abbreviations: AdaBoost, adaptive boosting; DT, decision tree; IStat, intensity summary statistics; MPCA, multilinear principal component analysis; PCA, principal component analysis; RF, random forest; SVM, support vector machine.
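The MPCA features in the tables rest on the criterion of Section 3.2.2, which seeks orthonormal bases A1 and A2 for the rows and columns of the centered images. One common way to minimize that criterion is to alternate between two eigen-decompositions of 50 × 50 matrices. The NumPy sketch below illustrates this on synthetic images; the function names, the iteration count, and the choice p = q = 7 are assumptions, and the authors' own Python implementation is not described in the text.

```python
import numpy as np

def top_eigvecs(S, r):
    """Eigenvectors of a symmetric matrix for its r largest eigenvalues."""
    _, vecs = np.linalg.eigh(S)           # eigh returns eigenvalues in ascending order
    return vecs[:, ::-1][:, :r]

def mpca_2d(images, p, q, n_iter=10):
    """Alternating estimate of A1 (rows) and A2 (columns) for a stack of images."""
    mean = images.mean(axis=0)
    Xc = images - mean                                    # centered images, (n, 50, 50)
    A2 = np.eye(images.shape[2])[:, :q]                   # simple starting value
    for _ in range(n_iter):
        P = Xc @ A2                                       # (n, 50, q)
        A1 = top_eigvecs(np.einsum("nij,nkj->ik", P, P), p)
        Q = np.transpose(Xc, (0, 2, 1)) @ A1              # (n, 50, p)
        A2 = top_eigvecs(np.einsum("nij,nkj->ik", Q, Q), q)
    cores = A1.T @ Xc @ A2                                # core tensors, (n, p, q)
    return mean, A1, A2, cores

def reconstruct(mean, A1, A2, cores):
    """Map core tensors back to image space."""
    return mean + A1 @ cores @ A2.T

rng = np.random.default_rng(0)
images = rng.random((30, 50, 50))                         # synthetic stand-ins
mean, A1, A2, cores = mpca_2d(images, p=7, q=7)
recon = reconstruct(mean, A1, A2, cores)
diff = np.abs(images - recon).sum(axis=(1, 2))            # per-image absolute difference
print(cores.shape, diff[:3])
```

Each image is summarized by p × q = 49 numbers here, and the per-image sum of absolute differences is the same kind of quantity as the "Diff" value shown in the reconstruction figures.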
4.3 Comparison of feature extraction approaches

To compare the performance of PCA and MPCA on our SPECT image data, in each cross-validation round, PCA or MPCA was applied on four training folds to produce PC loadings or image bases ($A_1$, $A_2$), which could then be used to reconstruct images in the training and test folds. Figures 5 and 6 illustrate the reconstruction of one random training image and one random test image, respectively. For the training image (Figure 5), the component-wide sum of absolute difference between the original and reconstructed images (the “Diff” value in the figure) for PCA was much smaller than that for its MPCA counterpart (ie, k ≈ p × q). The absolute differences for both PCA and MPCA decreased as the number of features increased. For the test image (Figure 6), MPCA could progressively improve the reconstruction through the addition of more features, but PCA appeared not to make much difference after 100 PCs. These results exemplify the tendency of overfitting in PCA. Notably, PCA experienced the “large p and small n” problem, where the sample size for the training set (161) was much smaller than the length of the input vector (2500). This problem had a smaller impact on MPCA because it is much more parsimonious in parameter usage.

IStat consistently outperformed other feature extraction approaches regardless of the multivariate statistical or ensemble learning classification used for prediction (Tables 1 and 3). We observed no clear difference in prediction accuracy and F1 score between PCA and MPCA. Additional texture information from LTEM did not appear to
FIGURE 5 Reconstructions of one randomly selected training image by using principal component analysis (PCA) and multilinear
PCA (MPCA). The first image on the left is the original image; the first and third layers of the 12 images on the right are the results of PCA
reconstruction, and the second and fourth layers are the results of MPCA reconstruction. The Diff value in the figure is the component-wide
sum of absolute difference between the original and reconstructed images. PCA_nc denotes the number of principal components used for
PCA, and MPCA_p,q refers to the number of image bases used for MPCA
improve the test accuracy of various approaches (Tables 2 and 4).

We further investigated the effect of the number of features retained in the three feature extraction methods. To facilitate the comparison, here we only used LDA for classification. Figure 7 illustrates the training and test accuracies as the number of features increased for the different extraction methods. All three extraction methods performed fairly well in the training set, with the increased feature numbers improving the training accuracy. For the test set, IStat consistently exhibited the highest test accuracy for a given number of features. The predictions in the test set, in general, were initially more accurate but then progressively worsened with the increased number of
FIGURE 6 Reconstructions of one randomly selected test image by using principal component analysis (PCA) and multilinear PCA
(MPCA). The first image on the left is the original image; the first and third layers of the 12 images on the right are the results of PCA
reconstruction, and the second and fourth layers are the results of MPCA reconstruction. The Diff value in the figure is the component-wide
sum of absolute difference between original and reconstructed images. PCA_nc refers to the number of principal components used for PCA,
and MPCA_p, q denotes the numbers of image bases used for MPCA
features. Notably, we observed a considerable increase in IStat’s test accuracy when the number of features was 70 or 75. With an appropriately selected bin size for the histogram, IStat can generate features that benefit prediction accuracy.

4.4 Results from the deep convolutional neural network

We used Python’s VGG16 function to fit the VGG16 model with pretrained weights from ImageNet. VGG16 requires
FIGURE 7 Effect of the number of features retained in the three feature extraction methods—principal component analysis (PCA)
and multilinear PCA (MPCA), and intensity summary statistics (IStat). We only used linear discriminant analysis (LDA) for classification.
The left plot illustrates the training accuracies as the number of features increases for different extraction methods; the right plot reveals the
test accuracies
TABLE 5 Results of Parkinson’s disease stage prediction using a deep convolutional neural network

Model                   The 3 channels for input data   Training accuracy   Test accuracy   Test F1 score
Deep CNN - VGG16        Repeat the image three times    0.922               0.649           0.576
Deep CNN - VGG16        Use three consecutive images    0.912               0.653           0.606
Deep CNN - Auto-Keras   Repeat the image three times    0.700               0.431           0.357
three-channel images as its input. We thus repeated our grayscale images three times to obtain a three-channel input. The expanded three-channel images in the training set were augmented to ensure that each PD illness label contained 250 images. They were then input into VGG16 for modeling. The VGG16 deep CNN model achieved a 64.9% test accuracy and a 57.6% test F1 score (Table 5), representing an improvement of approximately 20% and 50%, respectively, compared with the results from multivariate statistical and ensemble learning classifications.

We also used the three consecutive images above and below the selected slice in our preprocessing as VGG16’s input. We did observe an improvement in test accuracy and F1 score; nevertheless, the two types of inputs yielded very similar prediction results (Table 5). We speculate that the difference between the three consecutive images was not obvious; thus, the model construction was not greatly affected by the input data changes.

When autokeras in Python was used to automatically search for the architecture and hyperparameters of the deep CNN, the three channels of the input data were formed by the same grayscale image, and the augmented training set had 250 images for each PD illness class. The performance of the model automatically generated by autokeras was not only inferior to that of the VGG16 model but also poorer than that of some multivariate statistical and ensemble learning models (Table 5). The five deep CNN architectures generated by autokeras for the five cross-validation rounds differed substantially; they contained 50, 32, 87, 93, and 83 blocks, respectively. Our data appeared to be insufficient for autokeras to train a stable and robust model, which ultimately resulted in a relatively poor prediction ability.

To examine how the sample size and the number of augmented images affected the performance of the VGG16 deep CNN model, we executed the model on all of the original samples, a random selection of two thirds of them, or a random selection of half of them. The random sampling was stratified by PD illness stages, except for the normal and fifth stages, where the number of images (six and seven, respectively) was limited; therefore, all images for those two groups were selected for analysis. For each sample type, the repeated three-channel images of the training set were augmented to create balanced, 150, 250, 500, or 700 images for each illness stage. Here, “balanced” indicates that all stages had the same number of images as that of the largest stage. The training and test accuracies from various combinations of sample and augmented sizes are
FIGURE 8 Effects of the sample size and number of augmented images on the performance of the VGG16 deep convolutional neural
network (CNN) model. The left plot illustrates the training accuracies from various combinations of sample sizes and augmented image
numbers; the right plot reveals the test accuracies
revealed in Figure 8. Within each sample type, as the number of augmented images increased, the training accuracy increased, whereas the test accuracy first improved and then became poorer. However, when adopting the same number of augmented images, the increased sample size appeared to reduce the training accuracy and increase the test accuracy.

5 DISCUSSION

In the classification of high-dimensional data (eg, imaging, microarray, or genome sequencing), feature extraction is typically first performed to achieve dimension reduction; then, machine learning classification is applied to predict the data class labels. This study compared the performance of various combinations of feature extraction and machine learning methods in classifying medical imaging data. We considered dimension reduction, handcrafted, and deep learning feature extraction approaches. We employed the following machine learning classifications: multivariate statistical methods, ensemble learning models, and the deep convolutional neural network. For multivariate statistical methods and ensemble learning models, we observed the highest accuracy when MLP and RF, respectively, adopted features from IStat. Overall, the deep CNN model with pretrained VGG16 weights and architecture outperformed other approaches and achieved a greater than 20% and 50% improvement in test accuracy and F1 score, respectively, compared with the multivariate statistical and ensemble learning classifications.

The deep learning model tends to require considerably more data to achieve its full capacity than conventional machine learning algorithms. When analyzing “not-so-big data,” the conventional wisdom favors carefully constructed multivariate statistical or ensemble learning models over complicated deep learning approaches. When MLP and RF adopted features from IStat, we observed average test accuracies of 49.5% and 55.5%, respectively, whereas the AutoML deep CNN model only reached a test accuracy of 43.1%. These findings were consistent with our expectations. However, transfer learning that used weights trained from large and publicly available data on our tasks compensated for the shortage of data in applying deep learning models. Our VGG16 deep CNN model with pretrained weights from ImageNet achieved a test accuracy of 64.9%. Medical image data are rare and their labeling is expensive; therefore, having a small to medium amount of data is not unusual. The transfer learning approach can thus be applied to enhance the deep CNN model’s capacity in analyzing these small- or medium-sized data. Moreover, different criteria may be employed for certain diseases’ diagnosis. The diagnosis of PD can serve as an example: in addition to the HYS used in the current study, the Unified Parkinson’s Disease Rating Scale [43] is another commonly used rating scale. When several SPECT image datasets with class labels that were generated through different diagnostic criteria are available, transfer learning can use the knowledge obtained from one data set as the starting points for another set and thus combine these data sets to form a better classifier.

Our analysis revealed that the number of augmented images and the sample size can affect the performance of the VGG16 deep CNN model. Image augmentation not only increased the amount of data but also introduced some noise to enhance the robustness of the model. Addition of more augmented images can thus improve both training and test accuracies. However, excessively expanding training images is equivalent to repeating the training on the same data, leading to more measurement errors
and less prediction improvement. When the same number of augmented images is used, the larger data can generate an analytic set with rich variation, whereas the smaller data have many similar images with small transforms; as a result, classifiers for larger data are superior in predicting unknown future data, whereas models constructed for less variable data have a better goodness of fit. Applying transfer learning in the VGG16 deep CNN model is more appropriate when the source task (eg, ImageNet object classification) is sufficiently related to the target task (eg, functional brain images for PD stage prediction). Our results indicated that the increased sample size appeared to reduce the training accuracy, which indirectly supports the aforementioned assertion because less variable target data are more similar to the source data than are data with richer variation. We thus offer the following suggestions for applying deep CNN models with transfer learning. First, the retained number of augmented images can be determined using cross-validation, as described in Section 4. Second, the larger the target data are, the higher the predictive ability of the model is. Third, to obtain the largest benefit from transfer learning, the source data must be vastly larger than and sufficiently similar to the target data.

The handcrafted feature extraction approach IStat consistently outperformed the dimensional reduction approaches of PCA and MPCA. PCA and MPCA solely use first- and second-order moments (ie, mean and variance–covariance) to generate summary features for classification, whereas IStat includes extra higher order moments that appear necessary for predicting class labels. The deep CNN model constructs convolutional and activation layers to form non-linear combinations of input variables and can thus learn flexible and useful features for class prediction.

To include as much information as possible in the analysis, we examined the effects of including extra texture information from LTEM. These efforts have the potential to create additional features that are useful in class prediction, but they did not appear to improve the performance of the classification models. One explanation might

by adopting the generalized extreme value distribution for analysis.

ACKNOWLEDGMENTS
This research was partially supported by grants from the Ministry of Science and Technology, Taiwan (MOST 105-2118-M-009-004-MY2, and MOST 107-2118-M-009-005-MY2). We are grateful to the National Center for High-performance Computing, Taiwan for computer time and facilities. This manuscript was edited by Wallace Academic Editing.

ORCID
Guan-Hua Huang https://orcid.org/0000-0002-1802-3855

REFERENCES
1. A. H. Schapira et al., Perspectives on recent advances in the understanding and treatment of Parkinson’s disease, Eur. J. Neurol. 16(10) (2009), 1090–1099.
2. L. M. de Lau and M. M. Breteler, Epidemiology of Parkinson’s disease, Lancet Neurol. 5(6) (2006), 525–535.
3. W. M. Liu et al., Time trends in the prevalence and incidence of Parkinson’s disease in Taiwan: A nationwide, population-based study, J. Formosan Med. Assoc. 115(7) (2016), 531–538.
4. M. M. Hoehn and M. D. Yahr, Parkinsonism: onset, progression and mortality, Neurology 17(5) (1967), 427–442.
5. Parkinson’s Foundation: Stages of Parkinson’s, (2020), Miami, FL, Parkinson’s Foundation: available at https://www.parkinson.org/Understanding-Parkinsons/What-is-Parkinsons/Stages-of-Parkinsons
6. F. L. Pagan, Improving outcomes through early diagnosis of Parkinson’s disease, Am. J. Managed Care 18(7) Suppl (2012), S176–S182.
7. L. Wang et al., SPECT molecular imaging in Parkinson’s disease, J. Biomed. Biotechnol. 2012 (2012), 412486.
8. J. Booij et al., Imaging of the dopaminergic neurotransmission system using single-photon emission tomography and positron emission tomography in patients with parkinsonism, Eur. J. Nuclear Med. 26(2) (1999), 171–182.
9. C. Scherfler and M. Nocker, Dopamine transporter SPECT: How to remove subjectivity? Mov. Disord. 24(S2) (2009), S721–S724.
10. R. Prashanth et al., High-accuracy classification of Parkin-
be that our data were of insufficient size to observe some son’s disease through shape analysis and surface fitting in
extreme values of created features that have larger effects 123I-Ioflupane SPECT imaging, IEEE J. Biomed. Health Inf.
on the prediction of class labels. Enlarging the sample size 21(3) (2017), 794–802.
might increase the likelihood of observing these extreme 11. R. T. Staff et al., Shape analysis of 123i-n-ω-fluoropropyl-2-
values and thus help verify their effects on prediction. Con- β-carbomethoxy-3β-(4-iodophenyl) nortropane single-photon
ventional approaches are constructed around the mean emission computed tomography images in the assessment of
and focus on common event prediction, whereas extreme patients with Parkinsonian syndromes, Nuclear Med. Commun.
30(3) (2009), 194–201.
values are on the tail of the distribution, with a rare occur-
12. I. A. Illan et al., Automatic assistance to Parkinson’s disease diag-
rence. Therefore, as is common in extreme value analysis44 nosis in DaTSCAN SPECT imaging, Med. Phys. 39(10) (2012),
or in genetic studies for testing the contribution of rare 5971–5980.
variants to the disease,45,46 new approaches are required 13. D. J. Towey, P. G. Bain, and K. S. Nijran, Automatic classifica-
that are based on collapsing extreme values in different fea- tion of 123I-FP-CIT (DaTSCAN) SPECT images, Nuclear Med.
tures into a single variable (eg, block maxima), followed Commun. 32(8) (2011), 699–707.
14. R. Prashanth et al., Automatic classification and prediction models for early Parkinson’s disease diagnosis from SPECT imaging, Expert Syst. Appl. 41(7) (2014), 3333–3342.
15. A. Rojas et al., Application of empirical mode decomposition (EMD) on DaTSCAN SPECT images to explore Parkinson disease, Expert Syst. Appl. 40(7) (2013), 2756–2766.
16. F. Segovia et al., Improved parkinsonism diagnosis using a partial least squares based approach, Med. Phys. 39(7) (2012), 4395–4403.
17. N. A. Bhalchandra et al., Early detection of Parkinson’s disease through shape based features from 123I-Ioflupane SPECT imaging, in IEEE 12th International Symposium on Biomedical Imaging (ISBI), New York, NY, IEEE, New York, NY, Vol 2015, 2015, 963–966. doi: 10.1109/ISBI.2015.7164031.
18. W. Caesarendra et al., A pattern recognition method for stage classification of Parkinson’s disease utilizing voice features, in IEEE Conference on Biomedical Engineering and Sciences (IECBES), Kuala Lumpur, 2014, pp. 87–92.
19. H. Choi et al., Refining diagnosis of Parkinson’s disease with deep learning-based interpretation of dopamine transporter imaging, NeuroImage Clin. 16 (2017), 586–594.
20. W. S. Huang et al., Evaluation of early-stage Parkinson’s disease with 99mTc-TRODAT-1 imaging, J. Nuclear Med. 42(9) (2001), 1303–1308.
21. S. Y. Hsu et al., Feasible classified models for Parkinson disease from 99mTc-TRODAT-1 SPECT Imaging, Sensors 19(7) (2019), 1740.
22. Python, available at https://www.python.org/
23. K. J. Friston et al., Statistical parametric maps in functional imaging: A general linear approach, Human Brain Map. 2(4) (1994), 189–210.
24. SPM—Statistical Parametric Mapping, available at https://www.fil.ion.ucl.ac.uk/spm/
25. R. A. Johnson and D. W. Wichern, Applied multivariate statistical analysis, 6th ed., Prentice Hall, Upper Saddle River, NJ, 2007, 430–459.
26. T. L. Chen et al., An introduction to multilinear principal component analysis, J. Chin. Stat. Assoc. 52(1) (2014), 24–43.
27. L. G. Shapiro and G. C. Stockman, Computer vision, Prentice Hall, Upper Saddle River, New Jersey, 2001, 235–245.
28. N. V. Chawla et al., SMOTE: Synthetic minority over-sampling technique, J. Artif. Intell. Res. 16(1) (2002), 321–357.
29. L. Perez and J. Wang, The effectiveness of data augmentation in image classification using deep learning, arXiv 1712.04621 (2017).
30. R. A. Johnson and D. W. Wichern, Applied multivariate statistical analysis, 6th ed., Prentice Hall, Upper Saddle River, NJ, 2007, Chapter 11.
31. T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning, 2nd ed., Springer, New York, NY, 2009, Chapter 12.
32. C. W. Hsu, C. C. Chang, and C. J. Lin, A practical guide to support vector classification, Technical Report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 2016.
33. T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning, 2nd ed., Springer, New York, NY, 2009, 389–409.
34. T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning, 2nd ed., Springer, New York, NY, 2009, Chapter 16.
35. T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning, 2nd ed., Springer, New York, NY, 2009, Chapter 15.
36. Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55(1) (1997), 119–139.
37. Wikipedia: Deep learning, available at https://en.wikipedia.org/wiki/Deep_learning
38. K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv 1409.1556 (2014).
39. ImageNet, available at http://image-net.org/
40. B. Zoph and Q. V. Le, Neural architecture search with reinforcement learning, arXiv 1611.01578v2 (2017).
41. H. Pham et al., Efficient neural architecture search via parameter sharing, arXiv 1802.03268v2 (2018).
42. H. Jin, Q. Song, and X. Hu, Auto-Keras: an efficient neural architecture search system, arXiv 1806.10282v3 (2019).
43. S. Fahn, R. L. Elton, and UPDRS Program Members, Unified Parkinson’s Disease Rating Scale, in Recent developments in Parkinson’s Disease, Vol 2, S. Fahn et al., Eds., Macmillan Healthcare Information, Florham Park, NJ, 1987, 153–163.
44. S. Coles, An introduction to statistical modeling of extreme values, Springer-Verlag, London, 2001, Chapter 3.
45. S. Morgenthaler and W. G. Thilly, A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST), Mutat. Res. 615(1–2) (2007), 28–56.
46. M. C. Wu et al., Rare-variant association testing for sequencing data with the sequence kernel association test, Am. J. Human Genetics 89(1) (2011), 82–93.

How to cite this article: Huang G-H, Lin C-H, Cai Y-R, et al. Multiclass machine learning classification of functional brain images for Parkinson’s disease stage prediction. Stat Anal Data Min: The ASA Data Sci Journal. 2020;13:508–523. https://doi.org/10.1002/sam.11480