
Alzheimer’s Disease Brain MRI Classification: Challenges and Insights

Yi Ren Fung∗, Ziqiang Guan∗, Ritesh Kumar, Joie Yeahuay Wu and Madalina Fiterau
College of Information and Computer Sciences
University of Massachusetts Amherst
{yfung, zguan, riteshkumar, yeahuaywu, mfiterau}@cs.umass.edu
arXiv:1906.04231v1 [eess.IV] 10 Jun 2019

Abstract

In recent years, many papers have reported state-of-the-art performance on Alzheimer's Disease classification with MRI scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset using convolutional neural networks. However, we discover that when we split the data into training and testing sets at the subject level, we are not able to obtain similar performance, bringing the validity of many of the previous studies into question. Furthermore, we point out that previous works use different subsets of the ADNI data, making comparison across similar works tricky. In this study, we present the results of three splitting methods, discuss the motivations behind their validity, and report our results using all of the available subjects.

∗ The authors contributed equally.

1 Introduction

Alzheimer's Disease is a progressive neurodegenerative disease characterized by cognitive decline and memory loss. It is one of the leading causes of death in the aging population, affecting millions of people around the world. Automatic diagnosis and early detection can help identify high-risk cohorts for better treatment planning and improved quality of life [Association and others, 2017].

Initial works on brain MRI classification rely on domain-specific medical knowledge to compute volumetric segmentations of the brain in order to analyze key components that are strongly correlated with the disease [Gerardin et al., 2009; Plant et al., 2010; Bhagwat et al., 2018]. With recent advances in deep learning and an increasing number of MRI scans being made public by initiatives such as the Alzheimer's Disease Neuroimaging Initiative (adni.loni.usc.edu) and the Open Access Series of Imaging Studies (www.oasis-brains.org), various convolutional neural network (CNN) architectures, including 2D and 3D variants of ResNet, DenseNet, InceptionNet, and VGG models, have been proposed to holistically learn feature representations of the brain [Jain et al., 2019; Khvostikov et al., 2018; Korolev et al., 2017; Wang et al., 2018]. A patient's diagnosis in the dataset is typically categorized as Alzheimer's disease (AD), mild cognitive impairment (MCI), or cognitively normal (CN).

While many previous related works report high accuracy in three-class classification of brain MRI scans from the ADNI dataset, we point out a major issue that puts the validity of their results into question. A number of papers [Payan and Montana, 2015; Farooq et al., 2017; Wu et al., 2018; Jain et al., 2019] perform classification with training and testing splits done randomly across the brain MRIs, under the implicit assumption that the MRIs are independently and identically distributed. As a result, a given patient's scans from different visits could exist across both the training and testing sets.

We point out that transitions in disease stage occur at a very low frequency in the ADNI dataset, which means that a model can overfit to the brain structure corresponding to an individual patient and not learn the key differences in feature representation among the different disease stages. The model can simply regurgitate patient-level information learned during training and rely on that to obtain good classification results on the test set. We refer to generating the splits by sampling at random from the pool of images, ignoring the patient IDs, as splitting by MRI randomly.

Our main contribution in this study is exploring alternative ways to split the data into training and testing sets such that the evaluation gives valid insights into the performance of the proposed methods.

First, we look at splitting the data by patient, where all of the available visits of a patient are assigned to either the training or the testing set, with no patient's data split across the two sets. This method is motivated by the use case of automatic diagnosis from MRI scans, where the assumption is that no prior or subsequent patient information is given.

Then, we look at splitting the data by visit history, meaning that the first n − 1 visits of a patient's data are used for training, and the nth visit is used for testing. This method is motivated by the use case wherein we incorporate MRI scans into personalized trajectory prediction.

Finally, we compare the accuracy of training and testing on sets split randomly by MRI with the accuracies we obtain using the two aforementioned ways of splitting the training and testing sets. We run experiments with both popular 2D models pre-trained on ImageNet and a 22-layer ResNet-based 3D model we built. Unlike many other studies that perform
experiments on a subset of the ADNI dataset, our experiments include all of the currently available data from ADNI. Our results suggest that there is still much room for improvement in classifying MRI scans for Alzheimer's Disease detection.

2 Related Work

Before the current wave of interest in deep learning, the field of neuroimaging relied on domain knowledge to compute predefined sets of handcrafted metrics on brain MRI scans that reveal details of Alzheimer's disease progression, such as the amount of tissue shrinkage that has occurred. A common method is to run MRI scans through brain segmentation pipelines, such as FreeSurfer [Fischl, 2012], to compute values measuring the anatomical structure of the brain, and enter those values into a statistical or machine learning model to classify the state of the brain [Gerardin et al., 2009; Plant et al., 2010; Bhagwat et al., 2018].

With the advent of big data and deep learning, convolutional neural network models have become increasingly popular for the task of image classification. While a variety of architectures and training methods have been proposed, the advantages and disadvantages of these design choices are not clear.

[Hon and Khan, 2017] used the pre-trained 2D VGG model [Simonyan and Zisserman, 2014] and the Inception V4 model [Szegedy et al., 2017] to train on a set of 6,400 slices from the axial view of 200 patients' MRI scans from the OASIS dataset to perform two-class (Normal vs. Alzheimer's) classification, where 32 slices are extracted from each patient's MRI scans based on the entropy of the image.

[Jain et al., 2019] followed a similar approach using a 2D VGG model [Simonyan and Zisserman, 2014] on the ADNI dataset to train on a set of 4,800 brain images augmented from 150 subjects' scans. For each subject's scan, they selected 32 slices based on the entropy of the slices to compose the dataset. They then shuffled the data and split it with a 4:1 training-to-testing ratio to perform the more complicated task of three-class classification into Normal, Mild Cognitive Impairment, and Alzheimer's.

Other methods used 3D models, which are more computationally expensive but have more learning capacity for the three-dimensional MRI data.

[Payan and Montana, 2015] pretrained a sparse autoencoder on randomly generated 5 × 5 × 5 patches and used the weights as part of a 3D CNN to perform classification on the ADNI dataset, splitting randomly by MRI. [Hosseini-Asl et al., 2016] employed a similar approach of using a sparse autoencoder to pretrain on scans, but instead of randomly selecting patches from the training dataset, they used scans from the CADDementia dataset [Bron et al., 2015], and performed ten-fold cross-validation on the ADNI dataset.

[Khvostikov et al., 2018] extracted regions of interest around the lobes of the hippocampus using the Automated Anatomical Labeling (AAL) atlas, augmented their data by applying random shifts of up to two pixels and random Gaussian blur with σ up to 1.2, and classified using a Siamese network on each of the regions of interest.

More recently, [Wang et al., 2018] trained a 3D DenseNet with ensemble voting on the ADNI dataset to achieve the state-of-the-art accuracy on three-class classification. They split by patients but treated MRIs of the same patient taken over three years apart as different patients, giving away some information from the training to the testing process.

Table 1 summarizes information on the data and evaluation approaches used by a few recent papers.

Unlike most of the aforementioned papers, which report high performance but do not explicitly mention their training and testing data split methodology, [Bäckström et al., 2018] pointed out the problem of potentially giving away information from the training set into the testing set when splitting randomly by MRI scans, and showed that splitting MRI data by patient led to worse model performance. However, they only report two-class classification (Normal vs. Alzheimer's) in their results. They also used only a subset less than half the size of the MRIs available in ADNI, even though they reported empirical studies showing that the same model performing well on a small dataset can experience significantly reduced performance on a large dataset.

Our work differs in that we use as much of the data as is available from ADNI and provide results for three-class classification (CN vs. MCI vs. AD). In addition, we introduce a method of splitting the data by visit history, motivated by a real-world use case that is new to the application of Alzheimer's disease brain MRI classification, and we compare the effects of the different splitting approaches on model performance.

3 Methodology and Results

In this section, we briefly discuss the dataset we worked with, the preprocessing steps we took to produce our results, the model architectures we chose and the training hyperparameters we used, as well as the results we obtained from the various methods of splitting the dataset.

3.1 Data Acquisition and Preprocessing

We use MRI data from ADNI because it is the largest publicly available dataset on Alzheimer's Disease, consisting of patients who experience mental decline getting their brains scanned and cognition assessed in longitudinal visits over the span of several years.

We collected all of the scans that underwent Multiplanar Reconstruction (MPR), Gradwarp, and B1 Correction pre-processing from the dataset, discarding replicated scans of the brain for the same patient during the same visit. We performed statistical parametric mapping (SPM) [Ashburner and Friston, 2005] using the T1-volume pipeline in the Clinica software platform (www.clinica.run), developed by the ARAMIS lab (www.aramislab.fr), as an additional pre-processing step to spatially normalize the images.

In total, 2,731 MRIs from 657 patients are selected for our experiments. For the split by visit history, we used 2,074 scans for training and validation, and 657 scans for testing, where each scan in the testing set is from the last visit of the patient. For splitting by MRI, we used the same number of scans for all sets as the split by visit history. And finally, for splitting at the patient level, we had 2,074 scans from 484 patients in the training and validation sets, and 657 scans from 173 patients in the testing set.
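The scan-selection step described above can be sketched in plain Python. The record fields (`patient`, `visit`, `preproc`) and the ID strings are illustrative stand-ins, not the actual ADNI metadata schema:

```python
# Hypothetical scan records; field names and IDs are illustrative only.
scans = [
    {"patient": "011_S_0002", "visit": "bl",  "preproc": "MPR; GradWarp; B1 Correction"},
    {"patient": "011_S_0002", "visit": "bl",  "preproc": "MPR; GradWarp; B1 Correction"},  # replicate
    {"patient": "011_S_0002", "visit": "m06", "preproc": "MPR; GradWarp; B1 Correction"},
    {"patient": "022_S_0004", "visit": "bl",  "preproc": "MPR"},  # lacks required steps
]

selected = {}
for scan in scans:
    # Keep only fully pre-processed scans, and at most one per (patient, visit),
    # discarding replicated scans from the same visit.
    if "B1 Correction" in scan["preproc"]:
        selected.setdefault((scan["patient"], scan["visit"]), scan)

print(len(selected))  # 2 scans survive: one per unique (patient, visit)
```

The `setdefault` call keeps the first scan seen for each (patient, visit) key, which mirrors the "discard replicated scans" rule without needing any external library.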
Study                         # MRIs  # Patients  Data Splitting Approach                          Accuracy
[Jain et al., 2019]           150     150         8:2 train/test split, by augmented MRI slices    95.7% (2D)
[Farooq et al., 2017]         355     149         3:1 train/test split, by augmented MRI slices    98.8% (2D)
[Payan and Montana, 2015]     100     n/a         8:1:1 train/val/test split, by patches from MRI  89.5% (3D)
[Hosseini-Asl et al., 2016]   210     210         10-fold cross-validation, by patient             94.8% (3D)
[Khvostikov et al., 2018]     214     214         5:1 train/test split, by patient                 96.5% (3D)
[Wang et al., 2018]           833     624         7:3 train/test split, by patient & MRI           97.2% (3D)

Table 1: Performance and methodology of some of the state-of-the-art studies on three-class Alzheimer's Disease classification. The different splitting schemes and subsets of the ADNI dataset used in evaluation make it hard to interpret the results meaningfully.
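The difference between the splitting schemes compared in Table 1 can be illustrated with a toy sketch. The patient and scan counts here are made up for illustration; this is not our actual split code:

```python
import random

# Toy scan list: (patient_id, visit_index); 10 patients with 4 scans each.
scans = [(p, v) for p in range(10) for v in range(4)]
rng = random.Random(0)

# (1) Split by MRI randomly: a patient's scans can land in both sets.
shuffled = scans[:]
rng.shuffle(shuffled)
train_mri, test_mri = shuffled[:30], shuffled[30:]
# Patients appearing on both sides of the split leak identity information.
# Here this is guaranteed: 10 test scans cannot be covered by whole
# patients when every patient contributes exactly 4 scans.
leaky = {p for p, _ in train_mri} & {p for p, _ in test_mri}

# (2) Split by patient: all of a patient's scans stay on one side.
patients = list(range(10))
rng.shuffle(patients)
train_ids, test_ids = set(patients[:8]), set(patients[8:])
train_pat = [s for s in scans if s[0] in train_ids]
test_pat = [s for s in scans if s[0] in test_ids]

print(len(leaky) > 0)        # True: patient overlap under random-by-MRI
print(train_ids & test_ids)  # set(): no overlap by construction
```

The by-patient variant is what removes the leakage; a library routine such as scikit-learn's `GroupShuffleSplit` with patient IDs as groups would achieve the same effect.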

For consistency, we used the same number of scans in all subsets for all three splits. Our code repository is publicly available¹, and it includes the patient IDs and visits that we evaluated.

¹ https://github.com/Information-Fusion-Lab-Umass/alzheimers-cnn-study

3.2 Model Architecture and Training

We compared the performance of 2D and 3D CNN architectures. For our 2D CNN architecture, we used a ResNet18 [He et al., 2016] pretrained on ImageNet, allowing the model to better extract low-level features from images.

For our 3D CNN architecture, we followed a residual network-based design as well. We used the "bottleneck" configuration, where the inner convolutional layer of each residual block contains half the number of filters, and we used the "full pre-activation" layout for the residual blocks. See Figure 2 for more details.

For the 2D networks, we used a learning rate of 0.0001 and an L2 regularization constant of 0.01. For the 3D network, we used a learning rate of 0.001 and a regularization constant of 0.0001. Both networks were trained for 36 epochs, with early stopping.

3.3 Splitting Methodology and Results

Our main goal is to investigate how differently the model performs under the following three scenarios: (1) random training and testing split across the brain MRIs, (2) training and testing split by patient ID, and (3) training and testing split based on visit history across the patients.

Model  Split                    Train Acc.    Test Acc.
2D     Split by MRI randomly    99.2 ± 0.7%   83.7 ± 1.1%
2D     Split by visit history   99.0 ± 0.5%   81.2 ± 0.5%
2D     Split by patients        98.8 ± 0.6%   51.7 ± 1.2%
3D     Split by MRI randomly    95.8 ± 2.3%   84.4 ± 0.6%
3D     Split by visit history   95.8 ± 2.2%   82.9 ± 0.3%
3D     Split by patients        86.3 ± 9.5%   52.4 ± 1.8%

Table 2: Our classification results for the CNN models under the different train/test splitting schemes, averaged over five runs.

In Table 2, we present the three-way classification results of the models described in Section 3.2. In summary, when the training and testing sets were split by MRI scans randomly, the 2D and 3D models attained accuracy close to 84%. When the training and test sets were split by visits, in which only the latest visit of each patient was set aside for testing, accuracy was slightly lower. However, accuracy dropped to around 50% when the training and test sets were split by patient.

We note that the other state-of-the-art 2D CNN architectures we tried (DenseNet, InceptionNet, VGGNet) performed similarly to ResNet, and the choice of view for the 2D slice (coronal, axial, sagittal) did not lead to significant differences in testing accuracy as long as the slices were chosen close to the center of the brain. We report results on the 88th slice of the coronal view for our 2D model. Our 3D model performs slightly better but is still limited in performance, suffering from the same problem of over-fitting to information on individual patients instead of learning what generally differentiates brains in the different stages of Alzheimer's Disease.

                  Ground Truth
                  AD    MCI   CN
Prediction  AD    96    59    17
            MCI   62    153   90
            CN    17    67    96

Table 3: Confusion matrix of the 3D CNN experiment on the test set, with data split by patient.

3.4 Analysis

Analysis of the dataset shows that the frequency of disease stage transitions between any two consecutive visits is low in the ADNI dataset. There are only 152 transitions in the entire dataset, which contains 2,731 scans from 657 patients.

We believe that, due to the relatively few transition points in the dataset, the models are still able to achieve accuracy in the 80% range in the splitting-by-visit experiments by repeating the diagnostic label of the previous visits. Of the 152 total transitions we found across the whole dataset, only 52 happened between a patient's (n − 1)th and nth visits. This suggests that the model is able to encode the structure of a patient's brain from the training set, in turn aiding its performance on the testing set.

Furthermore, we set up additional experiments where we trained on visits t1 ... tn−1 from all patients, but only tested on the 52 patients that had a transition from the (n − 1)th to the nth visit. The classification accuracy in this experiment dropped to around 54%, which suggests that the network was repeating memorized information about the patient's brain structure rather than learning to be discriminative among the different stages of Alzheimer's Disease exhibited by a particular MRI scan.
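As a sanity check, the by-patient test accuracy reported in Table 2 can be recovered directly from the confusion matrix in Table 3:

```python
# Rows: predicted class; columns: ground truth (AD, MCI, CN), from Table 3.
confusion = [
    [96, 59, 17],    # predicted AD
    [62, 153, 90],   # predicted MCI
    [17, 67, 96],    # predicted CN
]

total = sum(sum(row) for row in confusion)        # 657 test scans
correct = sum(confusion[i][i] for i in range(3))  # diagonal: 345
accuracy = correct / total
print(total, correct, round(accuracy, 3))  # 657 345 0.525
```

The 657-scan total matches the size of the by-patient test set, and the 52.5% accuracy is consistent with the 52.4 ± 1.8% reported for the 3D model in Table 2.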
Figure 1 (panels: CN, MCI, AD): Comparison of the spatially normalized MRI scans of 4 subjects in each of the CN, MCI, and AD categories. To the human eye, distinguishing the differences across the disease stages is a difficult task.

Figure 2 (diagram: stacked 3x3 convolution layers and residual blocks — each built from batch norm, ReLU, and 3x3/1x1 convolutions — followed by a linear layer): The architectures of our 3D CNN model and residual blocks. With the exception of the first and last convolution layers, which have a stride of 1, all other layers have a stride of 2 for downsampling. The first convolution layer takes in a 1-channel image and outputs a 32-channel output.

4 Discussion

4.1 Technical Challenges

We summarize the main challenges of working with the ADNI dataset as follows:

1. Lack of transitions in a patient's health status between consecutive visits. There are only 152 transitions in total in the entire dataset of 2,731 images collected from patient visits. It is easy for the model to overfit and memorize the state of a patient at each visit instead of generalizing the key distinctions between the different stages of Alzheimer's Disease.

2. Coarse-grained data labels. The data labels are coarse-grained in nature, so our classifier may become confused when trying to learn from cases where a patient's cognitive state is borderline, such as being between MCI and AD. The confusion matrix in Table 3 demonstrates this.

3. Difficulty in visually distinguishing a brain across the different Alzheimer's stages. Human brains are distinct by nature, and the varying quality of MRI collections from different clinical settings adds to the noise level of the data. In Figure 1, we plot the brains of subjects in the CN, MCI, and AD categories, and show that the difference in anatomical structure from CN to MCI to AD is very subtle.

4. No clear baseline. Many studies evaluated the performance of their models on different subsets of the ADNI dataset, making fair comparison a tricky task. In addition, studies that use a separate testing set do not report the subjects or scans used in their testing set, further complicating the comparison process.

We hope to keep these challenges in mind when designing future experiments and ultimately to design models that can reliably classify a brain MRI with its true stage of Alzheimer's Disease progression, robust to visit number, the lack of patient transitions, and minor fluctuations in scan quality.

4.2 Insights and Future Work

Many studies in Alzheimer's disease brain MRI classification do not take into account how the data should be properly split, putting into question the ability of the proposed models to generalize to unseen data. We fill this gap by providing a detailed analysis of model performance across splitting schemes. Additionally, to our knowledge, none of the previous studies use all of the MRIs available in the ADNI dataset, and they do not present a clear explanation for this decision. To address this issue, we perform our experiments on all available data while also reporting the subjects used in the training and test splits of all our experiments for reproducibility.

In the future, we would like to explore utilizing the covariate data collected from patients to aid image feature extraction. Most of the studies we have come across do not use any covariate information collected from patients. The covariates, such as patient demographics and cognitive test scores, may be helpful for the classification task since they correlate with the disease stage of the patient. One scenario could be a multitask learning setup, where the model predicts the Mini-Mental State Examination (MMSE) and Alzheimer's Disease Assessment Scale (ADAS) cognitive scores in addition to the labels. We think this may help in training the model because the cognitive test scores can provide a finer-grained signal, making the prediction more robust.
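The multitask idea sketched above would combine a classification loss with regression terms on the cognitive scores. The following is a minimal illustration of such an objective; the weighting `lam`, the score values, and the single-example form are illustrative assumptions, not a tuned or proposed training setup:

```python
import math

def multitask_loss(class_probs, true_class, mmse_pred, mmse_true,
                   adas_pred, adas_true, lam=0.1):
    """Cross-entropy on the diagnosis plus weighted squared error on the
    two cognitive scores. lam balances the auxiliary regression tasks;
    its value here is an illustrative choice, not a tuned setting."""
    ce = -math.log(class_probs[true_class])
    mse = (mmse_pred - mmse_true) ** 2 + (adas_pred - adas_true) ** 2
    return ce + lam * mse

# A model that is right on the class AND close on the scores incurs a
# smaller loss than one that is right on the class but far off on scores.
good = multitask_loss([0.1, 0.2, 0.7], 2, 29.0, 30.0, 10.0, 9.0)
worse = multitask_loss([0.1, 0.2, 0.7], 2, 20.0, 30.0, 30.0, 9.0)
print(good < worse)  # True
```

The auxiliary terms thus push the shared features toward disease severity rather than patient identity, which is the finer-grained signal the paragraph above refers to.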
References

[Ashburner and Friston, 2005] John Ashburner and Karl J Friston. Unified segmentation. NeuroImage, 26(3):839–851, 2005.

[Association and others, 2017] Alzheimer's Association et al. 2017 Alzheimer's disease facts and figures. Alzheimer's & Dementia, 13(4):325–373, 2017.

[Bäckström et al., 2018] Karl Bäckström, Mahmood Nazari, Irene Yu-Hua Gu, and Asgeir Store Jakola. An efficient 3D deep convolutional network for Alzheimer's disease diagnosis using MR images. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 149–153. IEEE, 2018.

[Bhagwat et al., 2018] Nikhil Bhagwat, Joseph D Viviano, Aristotle N Voineskos, M Mallar Chakravarty, Alzheimer's Disease Neuroimaging Initiative, et al. Modeling and prediction of clinical symptom trajectories in Alzheimer's disease using longitudinal data. PLoS Computational Biology, 14(9):e1006376, 2018.

[Bron et al., 2015] Esther E Bron, Marion Smits, Wiesje M Van Der Flier, Hugo Vrenken, Frederik Barkhof, Philip Scheltens, Janne M Papma, Rebecca ME Steketee, Carolina Méndez Orellana, Rozanna Meijboom, et al. Standardized evaluation of algorithms for computer-aided diagnosis of dementia based on structural MRI: the CADDementia challenge. NeuroImage, 111:562–579, 2015.

[Farooq et al., 2017] Ammarah Farooq, Syed Muhammad Anwar, Muhammad Awais, and Saad Rehman. A deep CNN based multi-class classification of Alzheimer's disease using MRI. In 2017 IEEE International Conference on Imaging Systems and Techniques (IST), pages 1–6. IEEE, 2017.

[Fischl, 2012] Bruce Fischl. FreeSurfer. NeuroImage, 62(2):774–781, 2012.

[Gerardin et al., 2009] Emilie Gerardin, Gaël Chételat, Marie Chupin, Rémi Cuingnet, Béatrice Desgranges, Ho-Sung Kim, Marc Niethammer, Bruno Dubois, Stéphane Lehéricy, Line Garnero, et al. Multidimensional classification of hippocampal shape features discriminates Alzheimer's disease and mild cognitive impairment from normal aging. NeuroImage, 47(4):1476–1486, 2009.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[Hon and Khan, 2017] Marcia Hon and Naimul Mefraz Khan. Towards Alzheimer's disease classification through transfer learning. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1166–1169. IEEE, 2017.

[Hosseini-Asl et al., 2016] Ehsan Hosseini-Asl, Georgy Gimel'farb, and Ayman El-Baz. Alzheimer's disease diagnostics by a deeply supervised adaptable 3D convolutional network. arXiv preprint arXiv:1607.00556, 2016.

[Jain et al., 2019] Rachna Jain, Nikita Jain, Akshay Aggarwal, and D Jude Hemanth. Convolutional neural network based Alzheimer's disease classification from magnetic resonance brain images. Cognitive Systems Research, 2019.

[Khvostikov et al., 2018] Alexander Khvostikov, Karim Aderghal, Jenny Benois-Pineau, Andrey Krylov, and Gwenaelle Catheline. 3D CNN-based classification using sMRI and MD-DTI images for Alzheimer disease studies. arXiv preprint arXiv:1801.05968, 2018.

[Korolev et al., 2017] Sergey Korolev, Amir Safiullin, Mikhail Belyaev, and Yulia Dodonova. Residual and plain convolutional neural networks for 3D brain MRI classification. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pages 835–838. IEEE, 2017.

[Payan and Montana, 2015] Adrien Payan and Giovanni Montana. Predicting Alzheimer's disease: a neuroimaging study with 3D convolutional neural networks. arXiv preprint arXiv:1502.02506, 2015.

[Plant et al., 2010] Claudia Plant, Stefan J Teipel, Annahita Oswald, Christian Böhm, Thomas Meindl, Janaina Mourao-Miranda, Arun W Bokde, Harald Hampel, and Michael Ewers. Automated detection of brain atrophy patterns based on MRI for the prediction of Alzheimer's disease. NeuroImage, 50(1):162–174, 2010.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Szegedy et al., 2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[Wang et al., 2018] Shuqiang Wang, Hongfei Wang, Yanyan Shen, and Xiangyu Wang. Automatic recognition of mild cognitive impairment and Alzheimer's disease using ensemble based 3D densely connected convolutional networks. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 517–523. IEEE, 2018.

[Wu et al., 2018] Congling Wu, Shengwen Guo, Yanjia Hong, Benheng Xiao, Yupeng Wu, Qin Zhang, Alzheimer's Disease Neuroimaging Initiative, et al. Discrimination and conversion prediction of mild cognitive impairment using convolutional neural networks. Quantitative Imaging in Medicine and Surgery, 8(10):992, 2018.
