
Human Brain Mapping 38:2875–2896 (2017)

Your Algorithm Might Think the Hippocampus Grows in Alzheimer’s Disease: Caveats of Longitudinal Automated Hippocampal Volumetry

Tejas Sankar,1* Min Tae M. Park,2,3 Tasha Jawa,4 Raihaan Patel,2,8 Nikhil Bhagwat,2,5,8,9 Aristotle N. Voineskos,5,6 Andres M. Lozano,4 M. Mallar Chakravarty,2,7,8* and the Alzheimer’s Disease Neuroimaging Initiative

1 Division of Neurosurgery, Department of Surgery, University of Alberta, Alberta, Canada
2 Cerebral Imaging Centre, Douglas Mental Health University Institute, Montreal, Quebec, Canada
3 Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
4 Division of Neurosurgery, University of Toronto, Toronto, Ontario, Canada
5 Kimel Family Translational Imaging Genetics Research Laboratory, Campbell Family Mental Health Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
6 Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada
7 Department of Psychiatry, McGill University, Montreal, Quebec, Canada
8 Department of Biological and Biomedical Engineering, McGill University, Montreal, Quebec, Canada
9 Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, Ontario, Canada


Abstract: Hippocampal atrophy rate—measured using automated techniques applied to structural MRI
scans—is considered a sensitive marker of disease progression in Alzheimer’s disease, frequently used
as an outcome measure in clinical trials. Using publicly accessible data from the Alzheimer’s Disease
Neuroimaging Initiative (ADNI), we examined 1-year hippocampal atrophy rates generated by each of
five automated or semiautomated hippocampal segmentation algorithms in patients with Alzheimer’s
disease, subjects with mild cognitive impairment, or elderly controls. We analyzed MRI data from 398
and 62 subjects available at baseline and at 1 year at MRI field strengths of 1.5 T and 3 T, respectively.
We observed a high rate of hippocampal segmentation failures across all algorithms and diagnostic

Additional Supporting Information may be found in the online version of this article.

Contract grant sponsor: NIH; Contract grant number: P50 AG05681, P01 AG03991, R01 AG021910, P50 MH071616, U24 RR021382, R01 MH56584, U01 AG024904; Contract grant sponsor: DOD ADNI (Department of Defense); Contract grant number: W81XWH-12-2-0012.

Tejas Sankar and Min Tae M. Park contributed equally to the manuscript.

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

*Correspondence to: Tejas Sankar; Division of Neurosurgery, Department of Surgery, University of Alberta, 2D Surgery, WMC Health Sciences Centre, 8440-112 Street, Edmonton, AB T6G 2B7, Canada. E-mail: tsankar@ualberta.ca and M. Mallar Chakravarty; Douglas Mental Health University Institute, 6875 LaSalle Boulevard, Montreal, QC H4H 1R3, Canada. E-mail: mallar@cobralab.ca

Received for publication 20 November 2015; Revised 31 January 2017; Accepted 27 February 2017.
DOI: 10.1002/hbm.23559
Published online 15 March 2017 in Wiley Online Library (wileyonlinelibrary.com).

© 2017 Wiley Periodicals, Inc.

categories, with only 50.8% of subjects at 1.5 T and 58.1% of subjects at 3 T passing stringent segmentation quality control. We also found that all algorithms identified several subjects (between 2.94% and 48.68%) across all diagnostic categories showing increases in hippocampal volume over 1 year. For any given algorithm, hippocampal “growth” could not entirely be explained by excluding patients with flawed hippocampal segmentations, scan–rescan variability, or MRI field strength. Furthermore, different algorithms did not uniformly identify the same subjects as hippocampal “growers,” and showed very poor concordance in estimates of magnitude of hippocampal volume change over time (intraclass correlation coefficient 0.319 at 1.5 T and 0.149 at 3 T). This precluded a meaningful analysis of whether hippocampal “growth” represents a true biological phenomenon. Taken together, our findings suggest that longitudinal hippocampal volume change should be interpreted with considerable caution as a biomarker. Hum Brain Mapp 38:2875–2896, 2017. © 2017 Wiley Periodicals, Inc.

Key words: Alzheimer’s disease; MRI; atrophy; hippocampus; volumetry


INTRODUCTION

Alzheimer’s Disease (AD) is a neurodegenerative disorder characterized by the accumulation of amyloid-beta and tau proteins in the brain, ultimately leading to progressive synaptic, neuronal, and axonal loss which profoundly affects memory and cognitive function [Small & Duff, 2008]. The neurodegeneration that ultimately leads to dementia in AD begins in the medial temporal lobe, before going on to involve neocortical regions [Braak & Braak, 1991; Braak, de Vos, Jansen, Bratzke, & Braak, 1998], and manifests as atrophy on structural magnetic resonance imaging (MRI) [Lerch et al., 2008; Sabuncu et al., 2011, 2012]. Consequently, quantitative MRI volumetry of the hippocampus has long been proposed as a sensitive and objective biomarker in AD. Smaller hippocampal volume has consistently been associated with a diagnosis of AD [Coupe et al., 2011b; Mouiha & Duchesne, 2011; Sabuncu et al., 2011], while many studies report that the rate of hippocampal atrophy over time is accelerated in patients with AD compared to subjects with mild cognitive impairment (MCI) or healthy controls [Apostolova et al., 2012; Mouiha & Duchesne, 2011; Sabuncu et al., 2011]. Furthermore, it has been suggested that hippocampal atrophy may more sensitively and precisely track disease progression in AD than clinical measures of cognitive function [Apostolova et al., 2010; Cavedo et al., 2014; Mielke et al., 2012; McLaren et al., 2012; Sanchez-Benavides et al., 2014]. As a result, a decrease in the expected rate of hippocampal atrophy in AD patients, or a hippocampal atrophy rate more closely approximating that observed in normal aging, may be considered surrogate evidence of a disease-modifying effect in trials of novel AD therapies. On this basis, hippocampal atrophy has been included as an outcome measure in several clinical trials of candidate disease-modifying drugs in AD [Frisoni et al., 2010].

Hippocampal volume and atrophy rate are typically computed from T1-weighted volumetric MRI data. Several manual segmentation protocols, in which the outline of the hippocampus is traced by an experienced observer, have been developed for this purpose [Pruessner et al., 2000; Boccardi et al., 2011b; Watson et al., 1997]. To date, it has been difficult to determine the superiority of any one protocol over another [Boccardi et al., 2011a]. A harmonized consensus protocol has recently been developed, but it is yet to be widely adopted [Boccardi et al., 2013a, 2013b]. Despite being considered the neuroanatomical “gold standard” for structural image analysis, manual segmentation has several pitfalls. Primarily, manual segmentation is extraordinarily labor-intensive and time-consuming, and is, as such, impractical for computing hippocampal volume in large patient samples. Furthermore, the adoption of any protocol requires careful assessment of inter-rater and intra-rater reliability to ensure the concordance of hippocampal volumes between different observers, and the stability of volume estimates from any single observer over time [Pruessner et al., 2000]. In response to these drawbacks, several semi-automated and fully automated hippocampal segmentation algorithms using libraries of manually segmented hippocampi as training data have been developed and are now freely accessible through popular neuroimaging software packages [Fischl et al., 2002; Patenaude et al., 2011; Pipitone et al., 2014]. These algorithms permit hippocampal volumes and atrophy rates to be computed in large imaging datasets at the click of a mouse button or through the use of a few simple commands, while minimizing any potential impact of human bias [Mouiha & Duchesne, 2011; Pipitone et al., 2014; Sabuncu et al., 2011].

As different segmentation algorithms use slightly different a priori anatomical information to identify the boundaries of the hippocampus, there are, predictably, slight differences in absolute hippocampal volume and absolute rate of hippocampal atrophy generated by each individual algorithm for a given set of subjects [Barnes et al., 2009; Mulder et al., 2014]. Such absolute differences can be tolerated provided that hippocampal volumes and atrophy rates are at least concordant across algorithms, that is, those subjects whose hippocampi are larger at baseline, or in whom hippocampal atrophy is occurring more rapidly over time, ought to be consistently identified by each individual algorithm. In the absence of a single gold-standard
algorithm, this type of inter-algorithm concordance would greatly strengthen our confidence in the use of hippocampal volume as a biomarker to assess the impact of therapies directed at AD, or perhaps even to track disease progression at the individual subject level in clinical practice. To date, there has been a relative paucity of studies addressing the concordance between hippocampal atrophy rates generated across various automated segmentation algorithms [Mulder et al., 2014]. The need for these types of studies is ever more pressing because of easier access to hippocampal segmentation algorithms and high-performance computing hardware by clinicians and scientists who may lack expertise with quantitative neuroimaging. The need is heightened by the ongoing development of new segmentation algorithms, and the development of turnkey commercial services which take unprocessed MRI data from patients and generate hippocampal volumes and atrophy rates in a “black box” manner.

The work in this article is inspired by an observation made by our group during our study of hippocampal volumes derived from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, http://www.adni-info.org), a large-scale, multisite, longitudinal study of the natural history of AD, in part designed to determine the utility of various imaging biomarkers in detecting and tracking AD [Mueller et al., 2005; Holland et al., 2009; Weiner et al., 2012]. While brain imaging data from ADNI are made freely available to researchers, the ADNI database also contains a list of longitudinal hippocampal volume estimates, generated by different semi-automated and automated methods, for most subjects. We observed, unexpectedly, that ADNI-reported hippocampal volumes increased over time in a number of subjects across the AD, MCI, and healthy control groups. This observation runs counter to the expected progressive atrophy of the hippocampus in AD and MCI, and to a lesser extent, in normal aging [Mouiha and Duchesne, 2011; Sabuncu et al., 2011]. We therefore undertook a study to determine the incidence of hippocampal volume increases over time in the ADNI cohort, by each of five different semi-automated or automated hippocampal segmentation algorithms. Finding a surprisingly large proportion of patients with hippocampal enlargement, we then examined whether patients demonstrating volume increases are identified independently and consistently across these algorithms. Based on these observations, we examined whether these volume increases could be explained by well-known sources of variability in hippocampal segmentation, such as scan–rescan reliability, or factors influencing MRI acquisition, such as field strength. Given the use of hippocampal volume and atrophy rates as biomarkers for disease progression and transition from NC to MCI to AD [Amaral et al., 2016; Wolz et al., 2010a; 2010b], the use of atrophy rates may also provide diagnostic classification accuracy. We further tested if this was true across the algorithms examined in this study. Our findings raise some concerns about the reliability of current segmentation algorithms in assessing hippocampal atrophy rate, and as such serve as a cautionary note to end users of these algorithms, especially those investigators designing clinical trials in AD, or clinicians who may be tempted to apply their output as a biomarker at the individual patient level to aid in clinical decision-making.

MATERIALS AND METHODS

Data Acquisition

Data used in this study were obtained from the ADNI database (http://adni.loni.usc.edu/). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies, and nonprofit organizations, as a $60 million, 5-year public–private partnership. The primary goal of ADNI has been to test whether longitudinal MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.

The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California San Francisco. ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects but ADNI-1 has been followed by ADNI-GO and ADNI-2. To date, these three protocols have recruited over 1500 adults, ages 55–90, to participate in the research, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow-up duration of each group is specified in the protocols for ADNI-1, ADNI-2, and ADNI-GO. Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2. For up-to-date information, see www.adni-info.org.

From the ADNI1:Complete 1Yr 1.5 T standardized dataset and ADNI1:Complete 1Yr 3 T standardized dataset, we identified 650 and 142 subjects respectively for whom a baseline and a 1-year follow-up MRI scan had been completed. ADNI investigators acquired MRI data using 1.5 T or 3.0 T scanners (General Electric Healthcare, Philips Medical Systems or Siemens Medical Solutions) at multiple sites as described in Jack et al. [2008]. Uniformly preprocessed images available through the ADNI database were considered suitable for analyses after quality control. Preprocessing steps include gradient nonlinearity and intensity inhomogeneity correction, and phantom-based distortion correction [Wyman et al., 2013]. Representative 1.5 T imaging parameters were TR = 2400 ms, TI = 1000 ms, TE = 3.5 ms, flip angle = 8°, field of view = 240 × 240 mm, a 192 × 192 × 166 matrix (x, y, and z directions) yielding voxel dimensions of
1.25 mm × 1.25 mm × 1.2 mm. Representative 3.0 T imaging parameters were TR = 2300 ms, TI = 850–900 ms, TE = 3.5 ms, flip angle = 8–9°, field of view = 256–260 × 240 mm, a 256 × 256 × 160–170 matrix (x, y, and z directions) yielding voxel dimensions of 1.20 mm × 1.00 mm × 1.00 mm.

Hippocampal Segmentation Algorithms Evaluated

Five different algorithms for segmentation of the hippocampus were chosen for evaluation. Amongst these, three were algorithms from which hippocampal volume data were already available within the ADNI database, namely: FreeSurfer [Fischl et al., 2002, 2004], the Quantitative Anatomical Regional Change (QUARC) algorithm from the University of California, San Diego (from this point forward referred to as UCSD) [Holland et al., 2009, 2012; Holland and Dale, 2011], and Surgical Navigation Technologies (SNT) [Hsu et al., 2002]. Quality control of segmentations generated by these algorithms was already available through the ADNI database and subsequently used in our evaluations. We also chose two other segmentation algorithms not already available in the ADNI database, namely FSL FIRST (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FIRST) [Patenaude et al., 2011], and a newly derived multi-atlas segmentation algorithm from our group, MAGeT Brain [Chakravarty et al., 2013; Pipitone et al., 2014]. FSL FIRST was chosen due to its popularity in the field and in other neuroimaging studies of the hippocampus, for example the recent ENIGMA genome-wide association meta-analysis study [Stein et al., 2012]. MAGeT Brain was chosen due to our familiarity with the methodology and our assessment of its overall quality in comparison to FreeSurfer, FSL FIRST, and SNT methods [Pipitone et al., 2014; Treadway et al., 2015]. Note that the primary goal of our work was not simply to compare algorithmic performance, as has been undertaken in other studies [Mulder et al., 2014; Pipitone et al., 2014], but rather to evaluate their performance and concordance in a longitudinal context. These algorithms are described in more detail below.

FreeSurfer

FreeSurfer is a publicly available image analysis toolbox (https://surfer.nmr.mgh.harvard.edu/) that uses a fully automated Markov random fields approach to identify cortical and subcortical structures based on probabilistic information available from a library of manually segmented images [Fischl et al., 2002, 2004]. Data accessed for this study were generated using the recently derived minimally biased processing stream [Reuter et al., 2012] that relies on within-subject template creation and initializing the processing for each time point with common information from the subject template in order to minimize potential confounds due to over-regularization. Data from ADNI1 (1.5 T) and ADNI2 (3 T) were processed using FreeSurfer 4.4 and 5.1 respectively, using the longitudinal processing stream.

UCSD

The UCSD method was developed for accurate quantification of both global and local structural changes based on longitudinal MRI scans [Holland et al., 2009, 2012; Holland and Dale, 2011]. The methodology relies on the fully affine matching of longitudinal data, followed by an intensity alignment stage that corrects for relative B1-induced intensity distortions. This is followed by a nonlinear registration phase to match longitudinal images using a linear elasticity model which estimates a well-regularized displacement field to align two images [Holland and Dale, 2011]. The displaced transformed image will deform cubic voxels into irregular hexahedra in order to estimate volume changes. This is done in a hierarchical iterative procedure in order to capture subtle volumetric differences between longitudinal data from the same individual. FreeSurfer-based segmentations (as above) were used to create the baseline definitions of the hippocampal segmentations. Final hippocampal volumes were estimated using the UCSD procedure as the output of the nonlinear registration phase.

Surgical navigation technologies (SNT)

Semi-automated hippocampal volumetry was carried out using a commercially available high-dimensional brain mapping tool (Medtronic Surgical Navigation Technologies, Louisville, CO) that has previously been validated and compared to manual tracing of the hippocampus [Hsu et al., 2002]. Measurement of hippocampal volume was achieved by first manually placing 22 control points as local landmarks for the hippocampus on the individual brain MRI data: one landmark at the hippocampal head, one at the tail, and four per image (i.e., at the superior, inferior, medial, and lateral boundaries) on five equally spaced sections perpendicular to the long axis of the hippocampus. Second, fluid image transformation is used to match the individual brains to a template brain [Miller et al., 1997]. The voxels corresponding to the hippocampus are then labeled and counted to obtain volume.

FSL-FIRST

Segmentations were derived using FSL-FIRST, part of the FSL toolkit [Jenkinson et al., 2012] (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/). FIRST is a model-based segmentation tool that uses shape and appearance-based models constructed from manually segmented images [Patenaude et al., 2011]. The manual labels are parameterized as surface meshes and modeled as a point distribution model. Deformable surfaces are used to automatically parameterize the volumetric labels using constraints to preserve vertex correspondence across the training data. Based on learned models, FIRST searches through linear combinations of shape
modes of variation for the most probable shape instance given the observed intensities in a T1-weighted image. FIRST was implemented using FSL version 4.1.9. FIRST segmentations were performed using the run_first_all script according to the FIRST user guide (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FIRST/UserGuide).

MAGeT Brain

The MAGeT Brain framework [Chakravarty et al., 2013; Pipitone et al., 2014] is a modified multiatlas framework that uses a minimal number of well-defined atlases as inputs. In our implementation we used five high-resolution atlases of the hippocampus [Winterburn et al., 2013] as input. MAGeT Brain uses a subset of the data to be segmented to automatically generate a template library so that the segmentation process can be bootstrapped using a subset of the data. Template library images were chosen based on a representative sampling of subject diagnosis, age, and sex to model neuroanatomical variability within the ADNI cohort. Seven subjects from each diagnostic group (AD, MCI, healthy control) spanning the age range within each group were chosen. The template library is generated through automated nonlinear registration between each of the atlases and each of the subjects in the template library, yielding five candidate segmentations for each hippocampus. The template library now acts like regular input segmentations in a regular multiatlas procedure. Each of the subjects is then matched to each of the subjects in the template library to yield 105 candidate segmentations. The final segmentation is created by selecting the most frequently occurring label at each voxel location [Collins and Pruessner, 2010]. All images were converted into the MINC file format (http://www.bic.mni.mcgill.ca/ServicesSoftware/MINC) and all nonlinear registrations were performed using symmetric normalization and a cross-correlation objective function as implemented in the ANTs toolbox [Avants et al., 2008].

Volumetric data generated from the FreeSurfer, UCSD, and SNT methods were obtained from the ADNI database (adni.loni.usc.edu) between March 2012 and December 2012. FSL-FIRST and MAGeT Brain were locally processed for 1.5 T and 3 T data. FreeSurfer (longitudinal, version 5.1.0) was used to process 3 T images as the data were not available from the ADNI database.

Identification of Subjects Showing Hippocampal “Growth”

We compared baseline to 12-month hippocampal volumes generated by each algorithm to identify those subjects showing hippocampal “growth” over time. Simply, we defined hippocampal growth as occurring in a given subject if that subject demonstrated a larger hippocampal volume at the 12-month time point compared to baseline. Hippocampal “growers” were classified as unilateral (i.e., increase over 12 months in volume of either the right or left hippocampus), bilateral (i.e., increase over 12 months in volume of both right and left hippocampi), or average (i.e., increase over 12 months in mean volume of both hippocampi).

Quality Control

Quality control (QC) assessments were performed to identify and eliminate subjects with hippocampal segmentation failures which could impact our hippocampal growth analysis. Briefly, QC information for segmentations from the FreeSurfer and UCSD algorithms was obtained directly from the ADNI database, while QC was assessed for FSL and MAGeT using an in-house method. QC could not be assessed for the SNT algorithm. A detailed description of QC methods follows below.

FreeSurfer

QC information for individual hippocampal segmentations was obtained from the ADNI database. According to ADNI, QC assessment was carried out as outlined in the ADNI protocols (http://adni.bitbucket.org/docs/UCSFFRESFR/UCSFFreeSurferMethodsSummary.pdf). QC was assessed in both regional and global registration and segmentation accuracy. For the purposes of this study, we focused on the ADNI-reported “Overall QC” metric, which is the most rigorous QC measure, meant to reflect the quality of cortical and subcortical segmentation across the entire brain volume. Specifically, we considered a given hippocampal segmentation in a given subject as acceptable only if that subject achieved an Overall QC metric of “PASS.”

UCSD

QC information in the form of a simple two-point scale was obtained from the ADNI database; an ADNI-reported rating of “0” denoted segmentation failure, while a rating of “1” indicated a passing segmentation.

SNT

Formal QC was not reported in ADNI, and was not performed considering that the SNT algorithm is semi-automated, depending in part on the selection of anatomical landmarks on the hippocampus by an expert observer. As such, it is thought to be resistant to gross segmentation error, and is frequently considered a “gold-standard” reference [Leung et al., 2010; Wolz et al., 2010a].

FSL-FIRST, MAGeT Brain

QC was performed by one of the authors (MTMP). Fifteen representative slices encompassing left and right hippocampal segmentations were used to evaluate QC. Specifically, if either hippocampus was underestimated or overestimated by at least 10 voxels in three or more slices then the segmentation did not pass. The MAGeT Brain segmentations used here have previously been analyzed in a validation study of
this algorithm, and have been found to show good concor- Mean scan–rescan reliability data (expressed as a percent-
dance with manual segmentations derived using the Pruess- age in Supporting Information, Table I) for each algorithm
ner protocol [Pruessner et al., 2000] as well as SNT- were then used to define a conservative threshold for hippo-
generated volumes [Pipitone et al., 2014]. campal growth: “candidate” growers were those subjects in
QC assessment as described above was used to generate ADNI who, for a given segmentation algorithm, demonstrat-
three subject groups based on increasingly stringent QC ed an increase in hippocampal volume between baseline and
levels: 1-year greater than mean scan–rescan reliability error for that
algorithm.
1. First-pass QC: Subjects passing first-pass QC were
simply those with complete volumetric information Derived Scan-Rescan Reliability Estimates
for all algorithms at both time points. First-pass QC did
Second, we used data obtained from the Open Access
not involve any specific assessment of segmentation
Series of Imaging Studies (OASIS; http://www.oasis-
quality.
brains.org/) from the Consortium for Reliability and
2. Methodwise QC: Subjects passing methodwise QC
Reproducibility (CoRR; website: http://fcon_1000.projects.
were those who, for any given algorithm, passed QC by
nitrc.org/indi/CoRR/html/) [Zuo et al., 2014]. OASIS (1.5
that particular algorithm at both time points. Obviously,
T) and CoRR (3 T) datasets were used to independently
this resulted in different numbers of subjects surviving
assess the scan–rescan reliability of the FSL, FreeSurfer and
QC for each algorithm.
MAGeT Brain segmentation algorithms, as we had ready
3. Across-method QC: Among subjects in #1, only those
access to these algorithms in our lab. A set of 20 (12 female/
passing QC by all algorithms at both time points. 8 male, ages 19–34, all right handed) subjects from the
OASIS dataset who underwent a repeat scan within 90 days
were selected, yielding a set of 40 total images. A set of 84
Evaluation of Scan–Rescan Reliability (46 female/38 male, ages 18–62, handedness not available)
subjects from the CoRR dataset who underwent a repeat
Scan–rescan reliability can be an important source of error scan within 24 days were selected, yielding a set of 168
influencing measurement accuracy of longitudinal volumet- images. Specifically, images from the HNU_1, IPCAS_1 and
ric changes in brain structure, and has been assessed in stud- XHCUMS sites were selected from the CoRR dataset.
ies which report the output of automatic segmentation OASIS and CoRR data were processed independently with
algorithms applied to multiple MRI scans of the same FSL, Freesurfer and MAGeT Brain segmentation algorithms
subject at short intervals [Morey et al., 2010]. Resulting as described above for the ADNI dataset. FSL First (version
segmentations and volumetric differences between repeat 5.0.6) was run using the run_first_all command in accor-
scans are then typically used to quantify error limits within dance with http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FIRST/
the segmentation algorithm. UserGuide. The FreeSurfer longitudinal pipeline (version 4.4
We addressed scan–rescan reliability in two ways. for 1.5 T, version 5.1 for 3 T) was run for each scan–rescan
First, we determined literature-reported mean scan–rescan pair according to http://freesurfer.net/fswiki/Longitudinal-
reliability data for each algorithm: Processing. MAGeT Brain segmentation was run in accor-
dance with https://github.com/CobraLab/MAGeTbrain.
FreeSurfer, FSL-FIRST, and UCSD For 1.5 T data, 19 of the 20 baseline images were selected as
templates. For 3 T data, 21 of the 84 baseline images were
Scan–rescan data were estimated from a previous study selected in a demographically representative manner.
that compared FreeSurfer and FSL segmentations [Morey The resulting scan–rescan CoRR-OASIS reliability distri-
et al., 2010]. Data were not available specifically for the butions for each algorithm were used as null distributions
UCSD method, though we expected higher sensitivity in against which ADNI 1-year atrophy rate distributions
detecting subtle longitudinal changes for the UCSD were compared. Subjects from ADNI with atrophy rate
method compared to FreeSurfer [Holland & Dale, 2011]. outside the right-tail 90th percentile of the scan–rescan dis-
We therefore used the same threshold for FreeSurfer and tribution were considered as an alternate definition for
UCSD as a conservative measure of reliability. identifying candidate “growers.” We chose a more liberal
Absolute volumetric differences between raters were 90th percentile threshold based on the assumption that the
quantified from a previous study comparing SNT with scan–rescan distributions we computed likely undersam-
manual segmentation [Hsu et al., 2002]. ple the true population variance in scan–rescan values. By
generating a null distribution of test-retest reliability error
MAGeT Brain we considered that single individuals that fell to the edges
of this distribution would be the most likely candidates to
Images from five subjects scanned within a week locally demonstrate plausible biological growth over the course of
were processed with the MAGeT Brain pipeline, and absolute 1 year. Moreover, if these individuals were most likely to
volumetric differences between time points were quantified. exhibit hippocampal growth over 1 year, we hypothesized


TABLE I. Selected demographic and clinical features of subjects by diagnostic category and field strength

                    AD            MCI           Normal        P value

1.5 T
  N                 90            179           129           -
  Mean age (range)  74.7 (56–89)  74.9 (55–89)  76.2 (62–90)  0.16
  #Females          41            60            60            0.050
  MMSE (SD)         23.4 (1.90)   27.1 (1.75)   29.1 (1.03)   <0.001
3 T
  N                 15            24            23            -
  Mean age (range)  72.1 (57–88)  74.2 (56–88)  75.6 (70–85)  0.35
  #Females          9             9             15            0.39
  MMSE (SD)         23.1 (2.12)   26.92 (2.00)  29.43 (0.73)  <0.001

P values are reported for one-way ANOVA for age and MMSE, and for chi-square test (1.5 T) or Fisher’s exact test (3 T) for proportion
of females. MMSE was significantly different between groups.

that this phenomenon should be captured through the different algorithms tested in this manuscript.

In this particular case, it may be more beneficial to identify high rates of atrophy using whole-brain or temporal lobe measures (such as a medial temporal atrophy score). However, at the present time, there are limited measures for lateral ventricle and whole-brain volumes available via ADNI. Consequently, deriving such scores would limit us and potentially bias downstream analysis to a specific method.

Statistical Analysis

General linear models were used to compare baseline hippocampal volume and hippocampal atrophy rate between segmentation algorithms and diagnostic groups, with age and sex used as covariates. We verified the normality of the atrophy rate distributions using a Kolmogorov–Smirnov test for normality as these rates are derived using the ratio of two normally distributed variables. The χ² test (or Fisher’s exact test for small n) was used to assess differences in the proportion of QC failures between algorithms, diagnostic categories, and timepoints, as well as differences in the proportion of candidate hippocampal “growers” by each method. Correction for multiple comparisons was performed with the Marascuilo procedure for multiple proportions. To determine if any subset of individuals (e.g., AD with smallest hippocampi) were driving QC failures, we also split each diagnostic group into quartiles based on total hippocampal volume. For each method, we determined if the group representation in the QC failures was different to the QC passes. To assess concordance between various algorithms in baseline hippocampal volume and hippocampal atrophy rate, the intraclass correlation coefficient (ICC) was used in a two-way, mixed-model for a single measure using a consistency agreement definition.

Diagnostic Classification Based on Atrophy Rates

We performed pairwise classification between three diagnostic categories (AD, MCI, and NC) using hippocampal atrophy rates derived from the five different segmentation pipelines described above. To have a consistent set of subjects we used hippocampal volumes that survived the most stringent level of quality control. We trained three different machine-learning models: logistic regression with Lasso, support vector machine, and random forest (RF). We only used total bilateral hippocampal atrophy estimated at 1 year as input.

The model performance was evaluated using a 10-fold nested cross-validation procedure for each input choice. Folds were stratified according to the proportion of each diagnostic category. Data were preprocessed by centering at the mean and feature-wise scaling to unit variance (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html). Briefly, cross-validation was performed as follows. First, we divided the input data into 10 subsets. Then each model was trained using 9 of these 10 subsets and performance was measured on the held-out test subset. The process was repeated 10 times. Hyperparameters of the model (number of trees, Lasso penalty, etc.) were determined through a grid search using an inner cross-validation loop created by further dividing the training subset in each round. All models were implemented using the scikit-learn toolbox (http://scikit-learn.org/stable/index.html). The binary classification performance was measured using receiver operating characteristic (ROC) curves and area under the curve (AUC) values.

RESULTS

Demographic and Clinical Data

Demographic and clinical features of subjects in all three diagnostic categories are listed in Table I.

Baseline Hippocampal Volume and Hippocampal Atrophy Rates are Consistent With Literature-Reported Data

Figure 1 shows the mean baseline hippocampal volume and mean hippocampal atrophy rates categorized by


Figure 1.
Hippocampal volumetric data at 1.5 T. (A) Mean left, right, and average baseline hippocampal vol-
ume by segmentation algorithm. (B) Mean left, right, and average 12-month hippocampal atrophy
rate by segmentation algorithm.
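As a minimal sketch of how a 12-month atrophy rate of the kind plotted in Figure 1B could be derived from two volume measurements, the Python snippet below computes percentage volume change relative to baseline. The function names, the sign convention (a positive atrophy rate means volume loss), and the idea of flagging any increase as a candidate "grower" are illustrative assumptions; they are not taken from any of the five segmentation pipelines.

```python
def percent_volume_change(baseline_mm3: float, followup_mm3: float) -> float:
    """Percentage change in volume relative to baseline (positive = growth)."""
    if baseline_mm3 <= 0:
        raise ValueError("baseline volume must be positive")
    return 100.0 * (followup_mm3 - baseline_mm3) / baseline_mm3


def atrophy_rate(baseline_mm3: float, followup_mm3: float) -> float:
    """Assumed convention: atrophy rate is the negated percentage change."""
    return -percent_volume_change(baseline_mm3, followup_mm3)


def is_candidate_grower(baseline_mm3: float, followup_mm3: float) -> bool:
    """Flag any increase in volume, before any reliability thresholding."""
    return followup_mm3 > baseline_mm3
```

Under these assumptions, a hypothetical hippocampus measured at 3000 mm³ at baseline and 2880 mm³ at follow-up has an atrophy rate of 4%, while one measured at 3030 mm³ at follow-up would be flagged as a candidate "grower."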

segmentation algorithm at 1.5 T (3 T data are shown in Supporting Information, Fig. 1) at a first-pass level of QC. We observed that atrophy rates across methods and field strengths were normally distributed with the exception of UCSD at 1.5 T (D = 0.11, P = 0.01). As a result, we chose to continue to use the analyses detailed in the Statistical Methods section of the Methods (Supporting Information, Table II). Mean 12-month hippocampal atrophy rates for subjects with AD ranged from 1.6% to 6.0%, well within the expected range of literature-reported values [Barnes et al., 2009]. There was no significant difference in hippocampal atrophy rate across algorithms.

Automated Hippocampal Segmentation Algorithms Result in a Sizeable Number of Hippocampal Segmentation Failures

We found evidence of hippocampal segmentation failures by every fully automated algorithm (FreeSurfer, UCSD, FSL, and MAGeT) in every diagnostic category


Figure 2.
Graphical summary of the proportion of hippocampal segmentation failures at 1.5 T by algorithm,
diagnostic category, and time point: (A) baseline and (B) 1-year follow-up.
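The Statistical Analysis section states that proportions of QC failures, such as those summarized in Figure 2, were compared with a χ² test. The sketch below shows one way such a comparison across three groups could be computed in plain Python. The failure counts are hypothetical, the function names are our own, and the closed-form P value is valid only for exactly 2 degrees of freedom (three groups by pass/fail), which is an assumption about the test design rather than a documented detail.

```python
import math


def chi_square_2xk(failures, totals):
    """Pearson chi-square statistic for failure counts across k groups."""
    if len(failures) != len(totals):
        raise ValueError("failures and totals must have the same length")
    n = sum(totals)
    n_fail = sum(failures)
    stat = 0.0
    for fail, total in zip(failures, totals):
        # Each group contributes a 'fail' cell and a 'pass' cell.
        for observed, margin in ((fail, n_fail), (total - fail, n - n_fail)):
            expected = total * margin / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat):
    """Upper-tail P value of a chi-square variable with exactly 2 df."""
    return math.exp(-stat / 2.0)


# Hypothetical counts: QC failures out of 100 scans in each of three groups.
stat = chi_square_2xk(failures=[30, 40, 20], totals=[100, 100, 100])
p = p_value_df2(stat)
```

For comparisons with more than three groups the degrees of freedom change, and a general χ² survival function (for example from a statistics library) would be needed instead of `p_value_df2`.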

(i.e., AD, MCI, or NC). Segmentation failures were observed regardless of scan field strength: the proportion of segmentation failures for each algorithm and diagnostic category at 1.5 T is summarized by Figure 2, while the proportion of failures at 3 T is shown in Supporting Information, Figure 2 (raw data on segmentation failures are found in Supporting Information, Tables III and IV). The substantial number of segmentation failures, naturally, caused several subjects to fail more stringent QC assessment. At 1.5 T, across all diagnostic categories, only approximately half (202 out of 398, 50.8%) of those subjects with scans available at baseline and 12 months passed across-method QC (i.e., had acceptable hippocampal segmentations at both time points by all algorithms). Overall, the proportion of AD patients passing across-method QC was significantly lower than that of the NC subjects (37.8% vs 58.9%, χ² = 9.95, P = 0.007), but not significantly lower than that of MCI subjects (51.4%). At 3 T, 58.1% (36 out of 62) of all subjects passed across-method QC overall; as at 1.5 T, the proportion of AD patients passing across-method QC was significantly lower than that of the NC subjects (40.0% vs 78.2%, χ² = 6.26, P = 0.04) but not that of the MCI patients (50.0%).

We considered the possibility that segmentation algorithms might be less accurate when applied to MRI scans of increasingly atrophic brains. If this were true, scans at 12 months would be more likely to be classified as QC failures than those at baseline, as atrophy would be expected to progress over time. To test this hypothesis, we compared the proportion of QC failures by each algorithm at baseline and at 12 months separately. At 1.5 T, the proportion of subjects with a failed hippocampal segmentation was no different at 12 months compared to baseline for all algorithms (P > 0.05 by χ² test) except UCSD (χ² = 123.31, P < 0.001). Similarly, at 3 T, only the UCSD algorithm had a lower proportion of QC failures at baseline (P < 0.001, Fisher’s exact test). No specific diagnostic category was seen to have an increased proportion of failures in any method at 1.5 T or 3 T (Supporting Information, Table V). Interestingly, and uniquely, there were no segmentation failures at baseline by the UCSD algorithm, in any clinical group or at either field strength.

Automated Hippocampal Segmentation Algorithms Demonstrate That Hippocampal “Growth” Occurs in a Substantial Proportion of Subjects

At 1.5 T, at the first-pass QC level, a substantial proportion of subjects in each of the AD, MCI, and NC groups demonstrated increases in hippocampal volume over 12 months (i.e., were candidate “growers”) across all algorithms (Table II and Fig. 3). The percentage of candidate “growers” for each method, diagnosis, and QC level at 3 T is listed in Supporting Information, Tables VI and VII. Even among those subjects who passed stringent, across-method QC, hippocampal “growth” was still observed in

TABLE II. Percentage of subjects at 1.5 T showing hippocampal “growth” for all algorithms at all QC levels

QC level          Diagnosis  Measure    FS    UCSD  SNT   FSL   MAGeT

First-pass QC     AD         Left       13.3  6.7   14.4  25.6  22.2
                             Right      12.2  6.7   11.1  24.4  25.6
                             Bilateral  5.6   3.3   4.4   11.1  8.9
                             Average    13.3  5.6   6.7   21.1  21.1
                  MCI        Left       22.4  17.9  31.3  22.9  30.2
                             Right      22.4  20.1  22.9  24.6  26.3
                             Bilateral  10.1  8.9   12.3  11.2  12.9
                             Average    16.8  14.5  25.1  18.4  24.0
                  Normal     Left       37.2  17.8  42.6  38.0  29.5
                             Right      33.3  20.2  41.9  40.3  37.2
                             Bilateral  15.5  11.6  29.5  19.4  15.5
                             Average    36.4  19.4  43.4  38.0  31.8
Methodwise QC     AD         Left       11.5  4     -     21.0  23.6
                             Right      9.6   5.3   -     15.8  23.6
                             Bilateral  5.8   1.3   -     5.3   9.7
                             Average    11.5  4     -     14.5  22.2
                  MCI        Left       21.6  14.3  -     22.0  29.8
                             Right      21.6  16.3  -     23.9  26.2
                             Bilateral  9.6   5.4   -     10.7  12.5
                             Average    15.2  9.5   -     17.0  23.8
                  Normal     Left       38.0  16.4  -     37.5  28.9
                             Right      35.0  18.2  -     40.0  38.8
                             Bilateral  15.0  10.0  -     17.5  16.5
                             Average    37.0  17.3  -     37.5  33.1
Across-method QC  AD         Left       8.8   2.9   14.7  23.5  20.6
                             Right      8.8   5.9   11.8  23.5  29.4
                             Bilateral  2.9   0     2.9   5.9   5.9
                             Average    8.8   3.0   2.9   17.7  23.5
                  MCI        Left       17.4  15.2  38.0  20.7  29.4
                             Right      19.6  18.5  19.6  19.6  26.1
                             Bilateral  8.7   5.4   14.1  7.6   14.1
                             Average    15.2  7.6   27.2  15.2  20.7
                  Normal     Left       39.5  18.4  43.4  35.5  38.2
                             Right      30.3  18.4  46.0  35.5  39.5
                             Bilateral  15.8  10.5  34.2  11.8  19.5
                             Average    38.2  18.4  48.7  35.5  38.2

Note that methodwise QC could not be performed for SNT; see text for details.

Figure 3.
Venn diagrams demonstrating overlap of subjects identified as hippocampal “growers” at 1.5 T
across all 5 segmentation algorithms, divided by (a) first pass QC and (b) across-method QC.
Within (a) and (b), clockwise from upper left, individual Venn diagrams represent overlap for left,
right, bilateral, and average hippocampal growth respectively.
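The overlap shown in the Venn diagrams of Figure 3 amounts to intersecting, across algorithms, the sets of subjects flagged as candidate "growers." The following Python sketch illustrates that bookkeeping. The subject identifiers and the dictionary layout are hypothetical and serve only to show the set operations; they are not ADNI data.

```python
def concordant_growers(growers_by_algorithm):
    """Subjects flagged as candidate growers by every algorithm."""
    sets = [set(ids) for ids in growers_by_algorithm.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


def grower_counts(growers_by_algorithm):
    """Number of candidate growers found by each algorithm on its own."""
    return {name: len(set(ids)) for name, ids in growers_by_algorithm.items()}


# Hypothetical subject identifiers flagged by each algorithm.
example = {
    "FS": ["s01", "s03", "s07"],
    "UCSD": ["s03"],
    "SNT": ["s02", "s03", "s05", "s07"],
    "FSL": ["s03", "s04", "s07"],
    "MAGeT": ["s03", "s06", "s07", "s08"],
}
shared = concordant_growers(example)
counts = grower_counts(example)
```

In this toy example each algorithm flags between one and four subjects, yet only a single subject is flagged by all five, which mirrors the pattern of small intersections and larger per-algorithm counts described in the figure.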

some subjects in every diagnostic group. In AD patients, the proportion of candidate “growers” in average hippocampal volume ranged from 2.94% (UCSD algorithm) to 23.53% (MAGeT algorithm). In MCI patients, the proportion of mean hippocampal candidate “growers” varied from 7.61% (UCSD) to 27.17% (SNT algorithm). In NC, between 18.42% (UCSD) and nearly half (48.68%, SNT) of subjects were mean hippocampal candidate “growers,” and the SNT, FSL, Freesurfer, and MAGeT algorithms all found that >35% of NC subjects demonstrated growth in average hippocampal volume. Overall, the proportion of average hippocampal candidate “growers” was different between algorithms (omnibus χ² = 26.78, P < 0.0001). Post-hoc testing showed that the proportion of average hippocampal candidate “growers” was lower for the UCSD method compared to any other algorithm (P < 0.035 for all pairwise comparisons). Owing to smaller sample sizes, the results at 3 T were more variable (Supporting Information, Fig. 3 and Tables VI and VII). Growth rates as a percentage at 1.5 T are given in Supporting Information, Table VIII.

TABLE III. Percentage of subjects (number of “growers”/total # of subjects within DX group) at 1.5 T showing average hippocampal growth after thresholding for scan–rescan reliability, divided by segmentation algorithm

QC level  Diagnosis  FS   UCSD  SNT   FSL   MAGeT

(A)       AD         4.4  0     2.2   10.0  6.7
          MCI        5.0  2.2   11.2  6.2   11.2
          Normal     7.8  1.6   21.7  7.8   7.8
(B)       AD         1.9  0     2.2   1.3   4.2
          MCI        4.0  1.4   11.2  5.0   10.1
          Normal     7.0  0.9   21.7  5.0   8.3
(C)       AD         2.9  0     0     0     2.9
          MCI        2.2  1.1   13.0  4.4   7.6
          Normal     6.6  0     22.4  2.6   9.2
(D)       AD         2.9  N/A   N/A   0     0
          MCI        0    N/A   N/A   3.3   1.1
          Normal     5.3  N/A   N/A   2.6   1.3

(A) First pass QC, (B) Methodwise QC, (C) Across-method QC, (D) Across-method QC and using 90th percentile thresholds generated based on CoRR-OASIS scan–rescan distributions. N/A = CoRR-OASIS scan–rescan distributions could not be generated for UCSD and SNT methods; see text for details.

Hippocampal Growth in Some Subjects Cannot Entirely be Explained by Scan–Rescan Reliability

We used literature-reported mean scan–rescan variability values for each segmentation algorithm as a threshold to identify individuals who are plausible candidates for hippocampal growth, that is, only subjects showing increases in hippocampal volume above these thresholds were considered candidate “growers.” Results of this analysis at 1.5 T are summarized in Table III, divided according to QC level. Notably, we found that adjusting for scan–rescan variability still did not completely exclude “candidate” hippocampal growers by each method, even among those subjects who passed stringent across-method QC. As an example, by the SNT algorithm at 1.5 T, fully 13% of MCI patients and 22% of NC subjects demonstrated average hippocampal growth greater than the scan–rescan variability threshold at the highest QC level.

Comparisons between the scan–rescan distributions obtained from FSL, Freesurfer, and MAGeT Brain (using the CoRR-OASIS dataset), and the 1-year ADNI atrophy rate distributions are shown in Figures 4 and 5. By all three algorithms, there were still subjects who exhibited hippocampal growth at 1 year that exceeded the 90th percentile of the right tail of the scan–rescan distribution. This observation held true even at the most stringent QC level.

Subjects Showing Hippocampal Growth are not Concordant Across Segmentation Algorithms

Despite unexpectedly large numbers of hippocampal growers across diagnostic groups and segmentation algorithms, very few individual candidate “growers” were identified consistently across multiple algorithms (Fig. 3 and Supporting Information, Fig. 3). At 1.5 T, among subjects passing first-pass QC, a total of only 7 subjects (0 AD, 3 MCI, and 4 NC) demonstrated mean hippocampal growth across all methods. By contrast, each individual algorithm on its own found from as few as 76 (UCSD) to as many as 180 (MAGeT) candidate hippocampal “growers.” Similarly, considering only those subjects who passed stringent across-method QC, a total of only 3 (0 AD, 1 MCI, and 2 NC) subjects demonstrated mean hippocampal growth by all algorithms. Yet, between 22 and 63 subjects demonstrated hippocampal growth when any given algorithm was considered on its own at the across-method QC level.

Segmentation Algorithms are Poorly Concordant in Their Estimates of Hippocampal Volume Change Over Time

To further characterize the degree of concordance, or lack thereof, in hippocampal atrophy rates across algorithms, we used the intraclass correlation coefficient (ICC) as a summary statistic. We computed ICC with a two-way, mixed-model for a single measure using a consistency agreement definition. As shown in Table IV, ICC values suggested good inter-algorithm reliability (i.e., ICC > 0.7) for baseline hippocampal volume measurements, irrespective of field strength and QC level. However, ICC values for % hippocampal volume change over 1 year were much lower (ICC range 0.05–0.319), and well below the threshold for acceptable inter-algorithm reliability. Neither increased field strength nor more stringent QC appreciably improved ICC for % hippocampal volume change. In summary, the five segmentation algorithms we assessed were very poorly


Figure 4.
Comparison of CoRR-OASIS test–retest distributions to ADNI 1-year hippocampal atrophy
rates, at (a) first pass and (b) across-method QC. Accompanying right-sided insets show mean
absolute scan–rescan values as determined from the CoRR-OASIS dataset.

concordant in their estimates of 1-year hippocampal atrophy rate across all diagnostic categories, all field strengths, and all levels of scan QC.

Diagnostic Classification Varies Based on Atrophy Rates and Method Chosen

Subjectwise classification across algorithms and classification methods is not concordant (Fig. 6). All methods perform well for the NC versus AD case (mean ROC AUC range: 0.61–0.91). UCSD performs the best for this classification (mean ROC AUC range: 0.84–0.91). FSL, FS, and SNT show similar performance across classifiers (mean ROC AUC range > 0.7) except for random forests with FSL and SNT (mean ROC AUC = 0.58 and 0.67, respectively). MAGeT performs the worst in this comparison (mean ROC AUC < 0.66 across classifiers). All segmentation methods drop substantially in accuracy when classifying NC versus MCI (mean ROC AUC range: 0.47–0.73) with the majority of ROC values being <0.70. There is no clearly superior algorithm in this comparison; however, SNT performs least accurately (mean ROC AUC range: 0.47–0.54). When classifying MCI versus AD, a similar pattern is observed. Classification accuracies range from 0.36 to 0.77 with the majority of values <0.70. MAGeT achieves the lowest accuracy (mean ROC AUC = 0.36) with logistic regression, but performs

Figure 5.
ADNI 1-year hippocampal atrophy distributions thresholded at the 90th percentile of the CoRR-
OASIS absolute test–retest distributions, for both positive and negative atrophy rates. Within
each graph, top right percentage indicates the 90th percentile threshold used, while percentages
on either side of the distributions indicate the percentage of subjects exhibiting potential growth
or atrophy after thresholding. (a) First pass and (b) across-method QC.
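The thresholding described for Figure 5 can be sketched as follows: take the 90th percentile of the absolute scan–rescan differences as a null threshold, then keep only subjects whose 1-year volume increase exceeds it. The percentile definition (linear interpolation between order statistics), the function names, and all numbers below are assumptions made for illustration; the percentile method actually used is not stated in the text.

```python
def percentile(values, q):
    """Percentile by linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def candidate_growers(pct_change_by_subject, scan_rescan_pct_diff, q=90.0):
    """Return (threshold, subjects whose % volume increase exceeds it)."""
    threshold = percentile([abs(d) for d in scan_rescan_pct_diff], q)
    growers = {
        subject
        for subject, change in pct_change_by_subject.items()
        if change > threshold  # positive change = volume increase
    }
    return threshold, growers


# Hypothetical scan-rescan % differences (null) and 1-year % changes.
null_diffs = [0.5, -1.0, 1.5, -2.0, 2.5, 3.0, -0.2, 0.8, 1.2, 4.0, 0.1]
changes = {"a": 3.5, "b": 1.0, "c": -4.2, "d": 3.0}
threshold, growers = candidate_growers(changes, null_diffs)
```

A strict inequality is used here, so a subject sitting exactly on the threshold is not counted; whether ties should count is a design choice the text does not specify.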

TABLE IV. Inter-rater reliability between hippocampal segmentation algorithms for hippocampal volume and
atrophy rate as measured using the intraclass correlation coefficient (ICC)

Measure                                 QC             Number of measurements  ICC    P value

1.5 T
  Baseline hippocampal volume           First pass     398                     0.764  <0.0001
                                        Across-method  202                     0.842  <0.0001
  % change in hippocampal volume,       First pass     398                     0.050  0.373
  1 year                                Across-method  202                     0.319  <0.0001
3 T
  Baseline hippocampal volume           First pass     62                      0.800  <0.0001
                                        Across-method  36                      0.828  <0.0001
  % change in hippocampal volume,       First pass     62                      0.281  <0.0001
  1 year                                Across-method  36                      0.149  <0.005

QC = quality control level; ICC = two-way mixed model, single measures, intraclass correlation coefficient using a consistency agreement definition.
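For readers who want to see what the ICC definition in the note above computes, the sketch below implements the standard two-way mixed, single-measure, consistency form, often written ICC(3,1) = (MS_R − MS_E) / (MS_R + (k − 1) MS_E), where MS_R is the between-subject mean square, MS_E the residual mean square, and k the number of algorithms. The function name and the small input matrices are hypothetical, and this is not the software used to produce Table IV.

```python
def icc_consistency(ratings):
    """ICC(3,1): rows are subjects, columns are algorithms (raters)."""
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular matrix with n >= 2 and k >= 2")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Algorithms that differ only by a constant offset agree perfectly
# under the consistency definition.
offset_only = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
# Two algorithms that partly disagree on subject ordering.
partial = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
```

Because the consistency definition removes systematic offsets between algorithms, a high ICC for baseline volume does not require that the algorithms report the same absolute volumes, only that they rank and space subjects similarly.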

within the bounds of the other classification techniques when used with support vector machines or random forests. FreeSurfer achieves the best accuracy with random forest (mean ROC AUC = 0.77).

DISCUSSION

Motivated by our preliminary observation that within-subject hippocampal volume increased over 1 year in some patients, including those diagnosed with AD, in the ADNI database, we aimed to determine the full extent of this phenomenon, and the degree to which it could be explained by sources of variability in automated hippocampal segmentation. Our assessment of five different semi-automated or automated hippocampal segmentation algorithms, including three (SNT, Freesurfer, and UCSD) which were used to generate publicly reported hippocampal volume data in ADNI, showed that a considerable proportion of ADNI patients demonstrated volumetric hippocampal growth over 1 year by all algorithms spanning AD, MCI, and NC groups. We also closely analyzed hippocampal segmentation quality for each algorithm and examined concordance between algorithms. We found a considerable incidence of hippocampal segmentation failures for each method across all diagnostic groups. Interestingly, even when stringent QC measures were used to exclude erroneous segmentations, each algorithm still identified a number of candidate hippocampal “growers.” Adjustment for scan–rescan reliability error further reduced the number of candidate hippocampal growers, but did not eliminate them entirely. Ultimately, we found that hippocampal growth was inconsistently identified in the same subjects by all algorithms, and that 1-year hippocampal atrophy rate estimates, as opposed to cross-sectional, single timepoint estimates of baseline hippocampal volume, were poorly concordant across algorithms. In addition, we demonstrate that there is also no concordance between algorithms at the level of the single subject.

While there have been several previous studies in the literature addressing issues of error, variability, and reliability in automated hippocampal segmentation [Morey et al., 2010; Mulder et al., 2014; Mouiha & Duchesne, 2011; Pipitone et al., 2014], our study is unique for three reasons: (1) we frame the issue in the context of hippocampal “growth” over time; (2) we are primarily concerned with inter-algorithm concordance in hippocampal atrophy rate, which is a popular longitudinal biomarker for studies of putative disease-modifying therapies in AD; and (3) we assessed five separate segmentation algorithms, including, but not limited to, those algorithms used to generate hippocampal volume data in ADNI.

To our knowledge, there have to date been no studies specifically examining the phenomenon of hippocampal growth over time, either in the setting of AD or in data obtained from ADNI. A closer look at the literature, however, does hint at its occurrence without explicit mention. In one example, Mouiha and Duchesne [2011] examined hippocampal atrophy rates in the ADNI dataset using SNT and Freesurfer, and found that for the SNT algorithm, the mean monthly hippocampal atrophy rate between 6 and 12 months after baseline was greater than zero for both the left and right hippocampi in HC subjects. Put differently, the hippocampi of healthy subjects, segmented by SNT, at a group-wide level, demonstrated a mean increase in volume over a 6-month interval. Notably, and in keeping with the pattern of nonconcordance we identified, Freesurfer did not find evidence of groupwide hippocampal growth when applied to the same group of HC subjects. However, unlike in our study, these authors did not report the results of QC on any segmented images, and only considered scans acquired at 1.5 T.


Figure 6.
Receiver operating curves and area under the curve measures for ADNI-1 data passing stringent, across-method QC for classification of (A) AD vs NC; (B) MCI vs NC; and (C) MCI vs AD. LR_L1 = logistic regression with lasso; SVC = support vector machine classification; RFC = random forest classification.

Scan–rescan reliability error is perhaps the most straightforward explanation to account for hippocampal growth we encountered in this study. Repeated MRI scanning of the same subject, even under identical scanner and acquisition parameters (as in ADNI), does not necessarily generate identical images, because of small changes in the position of the subject’s head within the MRI coil, as well as instabilities in the magnetic field even on stringently monitored scanners [Morey et al., 2010]. As a result, hippocampal segmentations may be different between successive scans of the same patient even when no actual volume changes have occurred. Note that, by contrast, the output of automated hippocampal segmentation algorithms is thought to be almost perfectly reproducible when applied to the same scan at different times, whether one uses different computing platforms, or runs algorithms in parallel on the same platform [Gronenschild et al., 2012]. Thresholding by scan–rescan error, however, still did not completely eliminate hippocampal growers. There are several potential reasons for this finding. First, it is possible that we underestimated the degree of scan–rescan variability. For subjects in ADNI, actual scan–rescan variability for any given algorithm may well be greater than literature-reported values or our own internally generated variability distributions, as these values were obtained using different scanners than in ADNI, and from different populations free of older subjects or AD patients (the exception is Hsu et al. [2002]). Second, while the effect of the duration between repeated scans on scan–rescan variability in hippocampal volume is poorly understood [Holland et al., 2009; Holland & Dale, 2011], it would seem intuitive that a longer interval between scans, such as the 12-month interval in this study, might increase the likelihood of larger inter-scan variability; thus, the thresholds we identified from the literature may be underestimates. Finally, it is possible that scan–rescan variability is simply larger than average for those specific individual subjects whom we identified as “candidate” hippocampal growers.

One intriguing alternative to the scan–rescan reliability explanation is the possibility that hippocampal growth as we have observed it represents a true biological phenomenon. The hippocampus has a notable capacity for plasticity and regeneration, and several quantitative human MRI studies have suggested that it may enlarge in response to physical and mental training paradigms, or during recovery from neurological disease states (see Fotuhi et al. [2012] for a review). To our knowledge, there is only one published instance of therapy-related hippocampal enlargement in AD, reported in a subset of patients with mild AD treated using deep brain stimulation of the fornix [Sankar et al., 2015]. Though these preliminary findings have yet to be replicated, it is conceivable that in a certain subset of AD or MCI subjects hippocampal volume may increase over time as a compensatory response to Alzheimer’s pathology or aging. Also, the accumulation of amyloid pathology within the hippocampi could theoretically produce a spurious increase in hippocampal volume in selected patients (analogous to data in some rodent models of AD [Lau et al., 2008]). Obviously, these possibilities are purely speculative, and we certainly do not have any convincing evidence from


our data to support their existence. In a set of post-hoc anal- measures and the acceptable range of values in research
yses not shown, we could not identify any characteristic and clinical trial settings. We previously reported the
clinical traits or cerebrospinal fluid (CSF) features unique to effects of proportional biases on hippocampal volume esti-
hippocampal growers for any given method, including mation in Pipitone et al. [2014], where we observed that
those who showed growth over and above scan–rescan vari- FSL and FreeSurfer, in particular, tended to overestimate
ability thresholds. Perhaps more importantly, in the absence the size of larger hippocampi while underestimating the
of any real ground truth in hippocampal volume change for size of smaller hippocampi. We noted the opposite, and
each patient, and because agreement between segmentation more conservative bias, in MAGeT and SNT algorithms.
algorithms in identifying hippocampal growers was so Thus, the next step, which is outside the scope of this arti-
poor, it is a challenge to know in which patients one ought cle, would be to elucidate if there is a homologous propor-
to even begin looking for evidence of biological hippocam- tional bias related to atrophy rate. Importantly, researchers
pal growth. may want to consider the use of different algorithms for
A particularly significant finding of our study is that the different purposes based on the data we have presented
automated hippocampal segmentation algorithms we here and depending on the overall endpoints for their
assessed are in good agreement regarding baseline hippo- studies or clinical trials.
campal volume, but in rather poor agreement regarding the Overall, our rigorous and comprehensive approach to QC
magnitude and direction of hippocampal volume changes resulted in the rejection of a large number of hippocampal
over time, for subjects in the ADNI dataset. Interestingly, segmentations across algorithms, and merits some comment.
poor concordance in hippocampal atrophy rate did not For the Freesurfer and UCSD algorithms, we made the deci-
appear to be remedied by increasing field strength (which sion to use QC data reported in ADNI, since these data are
should increase signal-to-noise and tissue contrast in the publicly accessible. Since many studies consider SNT seg-
hippocampal region) nor by applying more stringent QC mentations to be a “gold standard”—given that they require
measures. Taken together, these data suggest more generally observer input and correction during their computation—we
that a given automated segmentation algorithm may have a did not consider QC for SNT [Leung et al., 2010; Wolz et al.,
particular bias toward exaggerating or underestimating vol- 2010a]. For those algorithms lacking QC information in ADNI
ume change relative to other algorithms for a given subject (i.e., FSL and MAGeT), we performed a detailed visual
with a unique hippocampal morphology. This may most inspection of segmentation quality in each subject. Given the
convincingly be explained by a combination of two key fac- absence of a standard consensus method to assess the quality
tors. First, different segmentation algorithms use different a of hippocampal segmentations, we erred on the side of a strict
priori anatomical definitions of the hippocampus; as an approach, limiting as much as possible the potential influence
example, some algorithms exclude surrounding white mat- of spurious segmentations on the measured incidence of hip-
ter such as the alveus or fimbria [Pipitone et al., 2014], while others incorporate these into the
hippocampal volume [Collins and Pruessner, 2010; Pruessner et al., 2002]. This fact alone would not
necessarily worsen interalgorithm concordance if hippocampal atrophy in AD, MCI, and aging were
evenly distributed across the hippocampus. The second key factor, however, is that hippocampal
atrophy may proceed at different rates within different regions of the hippocampus, as evidenced by
recent work using hippocampal subfields as outcome measures [Apostolova et al., 2012; La Joie
et al., 2010; Mueller et al., 2010]. Accordingly, various algorithms may be differentially sensitive
to (and their atrophy rate estimates differentially influenced by) the subject-specific spatial
distribution of atrophy within the hippocampus, leading to significant discrepancies in the
calculated magnitude of longitudinal hippocampal volume change.
Apart from possibly being driven by differences in neuroanatomical definitions, the lack of
concordance between atrophy rates may have significant implications in studies of aging,
pathological aging, and in clinical trials using hippocampal volume as a primary outcome measure.
Given the confounds over not only the specificity of hippocampal volumes at the individual subject
level in a longitudinal setting but also the concordance between atrophy rates as well, the choice
of algorithm will have a clear effect on outcome

pocampal growers. As an example, for the FreeSurfer algorithm, we classified a particular subject as
a hippocampal segmentation failure if that subject showed evidence of segmentation inaccuracies
anywhere in the brain across a multitude of cortical or subcortical regions (i.e., if that subject
failed “overall” QC). Our rationale was that evidence of poor tissue classification or poor
delineation of anatomy anywhere in the brain might introduce doubt about the accuracy of
hippocampal segmentations as well. No doubt this contributed to the high failure rate we found for
the FreeSurfer algorithm. Note that we are neither advocating any particular approach to assessing
hippocampal segmentation quality, nor are we suggesting that the ideal approach needs to be as
stringent as the one we employed. That being said, our QC data are consistent with recent reports in
the literature suggesting that the influence of failed segmentations may have the effect of
significantly reducing the reproducibility of hippocampal volume change estimates in longitudinal
studies [Mulder et al., 2014]. Indeed, our findings underscore the need to re-evaluate the validity
of the oft-used argument that it is too costly to perform rigorous QC of hippocampal segmentations
in large datasets. At the very least, our results would argue that in studies using ADNI-reported
hippocampal values, the (easily accessed) accompanying QC data ought to be considered as well.
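As a rough illustration of the QC rule described above, the following Python sketch excludes any
subject whose segmentation failed QC before flagging candidate hippocampal growers. It is a minimal
sketch under stated assumptions, not the pipeline used in this study: the `Scan` record, the
percent-change definition (relative to baseline), and the `threshold_pct` parameter are all
hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scan:
    """Hypothetical per-subject record for one algorithm (illustrative only)."""
    subject_id: str
    baseline_mm3: float   # hippocampal volume at baseline
    followup_mm3: float   # hippocampal volume at 1 year
    passed_qc: bool       # True only if the segmentation passed QC


def percent_change(scan: Scan) -> float:
    """Volume change as a percentage of baseline (assumed definition).

    Negative values indicate atrophy; positive values indicate apparent growth.
    """
    return 100.0 * (scan.followup_mm3 - scan.baseline_mm3) / scan.baseline_mm3


def candidate_growers(scans: List[Scan], threshold_pct: float = 0.0) -> List[str]:
    """Return IDs of QC-passing subjects whose volume rose by more than threshold_pct."""
    return [
        s.subject_id
        for s in scans
        if s.passed_qc and percent_change(s) > threshold_pct
    ]


# Made-up example values, not study data.
scans = [
    Scan("s01", 3000.0, 2900.0, True),   # atrophy
    Scan("s02", 3000.0, 3090.0, True),   # +3% apparent growth
    Scan("s03", 3000.0, 3300.0, False),  # apparent growth, but failed QC
]
growers = candidate_growers(scans, threshold_pct=2.0)
```

In this toy example only the second subject would be reported as a candidate grower; the third is
dropped by the QC filter regardless of its apparent volume increase. A nonzero `threshold_pct` stands
in for a scan–rescan reliability threshold, which is itself an assumption of the sketch.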


One open question based on the findings presented in this article is the ability of these
algorithms to classify individuals by their diagnosis. While this could be an important undertaking
in the context of the hippocampal volume segmentation, such analyses are a topic of hot debate in
the literature [Coupe et al., 2011a; Davatzikos et al., 2011; Eskildsen et al., 2013; Wang et al.,
2010]. Such a problem would require several design choices, which themselves will be controversial,
and each of which could itself warrant detailed investigation. For example, should the input data
that is used consist of baseline volumes only, atrophy rates, or volumes from both time points?
Should data that passes QC on a per-method basis be used or should the dataset consist of only those
subjects that have passed QC across several methods? The impact of image processing techniques on
multi-variate/machine learning applications is an open question in the literature that requires
further investigation, but is, ultimately, outside the scope of this article.

Nevertheless, based on our segmentation results, we also performed classification of the different
diagnostic categories using three commonly used classifiers (logistic regression with lasso, random
forests, and support vector machines). Other groups have used classification techniques to better
identify diagnostic groups. For example, Wolz et al. [2010b] achieve a correct classification rate
of 82% between AD and NC and they also demonstrate they can identify MCI patients who progress to AD
(with an accuracy of 64%) based only on 1-year atrophy rates. Although we did not perform this
latter comparison, it could be further explored using the framework in this manuscript. However,
given the large number of comparisons, segmentation algorithms, and classifiers already explored in
this manuscript, it would have been difficult to treat the latter question (MCI-to-AD conversion)
appropriately while maintaining equivalent datasets across methods (given the significant issues
with quality control that we observed). It is important to note that there are other studies that
examine the utility of hippocampal volume as a biomarker based on the relationship between
hippocampal atrophy and established biomarkers related to AD pathophysiology. These include the
observations that longitudinal atrophy rates measured using surface-based [Morra et al., 2008] and
volumetric techniques [Schuff et al., 2009] have a strong relationship with known AD-specific risk
factors (such as being positive for apolipoprotein e4 isoform and low education status) and CSF
biomarkers.

Another open question that remains is how to deal with the lack of concordance between
methodologies. While many groups have started to move to standardized datasets and practices in
evaluating hippocampal and hippocampal subfield volumes, further investigation may be necessary
[Boccardi et al., 2013a; Schoemaker et al., 2016; Yushkevich et al., 2015]. Heterogeneity in
scientific methodology is acceptable, so long as methodologies have been stringently validated (as
is the case with algorithms that have been evaluated here). Furthermore, our understanding of group
effects certainly benefits from many groups asking similar questions but answering them with
different tools. While these algorithms show accurate results at the group level (significant
cross-sectional and longitudinal differences), it is clear that their precision (i.e., treatment of
single subjects, atrophy rates) varies substantially. This is important and requires further
replication and research with the use of open-source software and shared datasets.

It may also be worth considering the differences between the design of the different methods. In
this study, two of the methods that were used were explicitly designed for the analyses of
longitudinal datasets. Previous work by Jovicich et al. [2014] examined the reliability of the
FreeSurfer cross-sectional and longitudinal processing streams in a multisite setting and
demonstrated far more variability in the cross-sectional scheme across sites for hippocampal
volumes. These results are reflected in our results as well: after stringent QC of all methods is
performed, we observe a slightly larger number of candidate “growers” in methods not optimized for
longitudinal data. The UCSD method appears to perform better in this regard and shows a small number
of growers across QC methods. Nonetheless, it is somewhat puzzling why in some instances FreeSurfer
demonstrates a substantial number of candidate growers (>7% in some cases; see Table III). It is
possible that this can be attributed to older imaging (1.5 T) and image processing pipelines (older
FreeSurfer versions) that were used in the processing of the data. It remains to be seen if newer
versions of the FreeSurfer pipeline may behave differently with these data.

One limitation of our study is that we did not compare the incidence of hippocampal growth in ADNI
by manual tracing to that observed with automated segmentation algorithms. Manual tracing by trained
observers is considered by some to be more accurate than automated methods in determining either
hippocampal volume or atrophy rate [Barnes et al., 2008; Boccardi et al., 2011a, 2013a; Nestor
et al., 2012; Pipitone et al., 2014; Pruessner et al., 2000], and segmentations generated manually
are usually considered the gold standard against which automated methods are validated [Mulder
et al., 2014]. In addition, manual tracing is less susceptible to gross segmentation errors [Morey
et al., 2010]. At first glance, therefore, manual tracing would appear better suited to predicting
the true incidence of hippocampal growers (if they actually exist), provided manual segmentations
could actually be completed on all subjects in ADNI. Unfortunately, doing so would most likely be
prohibitively time-consuming. Even if it could be done, it is not necessarily true that the results
obtained would be more valid than for automated segmentation. First, unlike automated segmentation
algorithms, the reproducibility of any hippocampal manual tracing protocol is influenced by
intra-rater reliability, as observer interpretation of identical imaging data may drift over time
and due to human factors such as fatigue [Morey et al., 2010]. Second, inter-rater reliability for
the same protocol may vary even between expert observers [Chupin et al., 2009; Pipitone et al.,
2014]. Third, in the


absence of widespread adoption of a harmonized protocol, there are at least as many manual
protocols as there are automated segmentation algorithms [Boccardi et al., 2013a, 2013b].
Accordingly, depending on the specific hippocampal boundaries used in each protocol, the
subject-specific spatial distribution of atrophy within the hippocampus may influence
inter-protocol concordance in hippocampal atrophy rates in a manner similar to that observed with
automated algorithms.

A second limitation concerns the results reported by thresholding for true growth using
literature-based values. These values represent a mean error on the reliability of the measurement
of hippocampal volume; consequently, any threshold defined by this method is liberal as it is
possible that scan–rescan reliability for any given subject may exceed the group mean. Indeed, we
would caution against overinterpreting results that exceed mean scan–rescan reliability thresholds
in cases where information about a single subject is desired. Nonetheless, our results based on this
type of thresholding still demonstrate a lack of concordance between many commonly used algorithms
in the literature. A final limitation pertains to our analysis of atrophy rates. While it is
possible that atrophy rates may not theoretically conform to a normal distribution (as atrophy rate
is the ratio of normally distributed variables), this is not what we observe in our data (except for
UCSD data at 1.5 T); we therefore chose to use Gaussian statistics for analysis of atrophy rates.

Taken together, our findings raise some concerns in the use of hippocampal volume as a biomarker in
studies of AD and other neurodegenerative disorders. First, QC may have a major impact regardless of
segmentation algorithm, field strength, diagnosis, and hippocampal boundary definitions. The high
rate of QC failures in this study brings into question whether current and prior “black-box”
approaches to hippocampal volumetry applied in certain studies and clinical trials, where patient
MRIs are analyzed using either commercial or publicly available software without any standardized QC
measures reported, can be entirely trusted to yield meaningful results. Second, our data indicate
that relative longitudinal hippocampal volume change, calculated in an automated manner, may be a
suboptimal biomarker to track disease progression or response to therapy. This appears to hold true
even for algorithms that have been optimized for longitudinal volumetric processing, such as
FreeSurfer and the UCSD method; we found these algorithms to be poorly concordant with one another
and with alternate algorithms. Our data further suggest that it may be ill-advised to trust
normative or pathological rates of hippocampal volume change derived from only a single algorithm;
conversely, it is possible that the use of several different algorithms could, at worst, produce
altogether divergent results.

CONCLUSION

In an assessment of five different automated or semiautomated segmentation algorithms designed to
measure hippocampal volume from structural MRI data, we found an unexpectedly high incidence of
subjects in the ADNI database who demonstrated hippocampal growth over time. For no individual
algorithm could this counterintuitive finding be entirely explained by gross segmentation errors,
scan–rescan variability, or field strength of MRI acquisition. Furthermore, algorithms did not
consistently identify the same subjects as hippocampal growers, and more generally our analysis
revealed poor concordance between algorithms in their estimates of the magnitude and direction of
hippocampal volume change over time, which precluded a meaningful analysis of whether hippocampal
growth could be a true biological phenomenon. These findings suggest that, in patients with either
AD or MCI, or age-matched elderly controls, longitudinal hippocampal volume change should be
interpreted with considerable caution as a biomarker at the individual subject level, and may have
important limitations at a group-wide level as well.

ACKNOWLEDGMENTS

T.S. is supported by a Canadian Institutes of Health Research (CIHR) fellowship award. A.N.V. is
funded by the Canadian Institutes of Health Research (CIHR), National Alliance for Research on
Schizophrenia and Depression (NARSAD), Ontario Mental Health Foundation (OMHF), and CAMH Foundation
(Koerner New Scientist Program and Paul Garfinkel New Investigator Catalyst Fund). A.M.L. is a
Canada Research Chair in Neuroscience and is supported by the R.R. Tasker Chair in Functional
Neurosurgery. A.M.L. is a consultant to Medtronic, St Jude, and Boston Scientific. A.M.L. serves on
the scientific advisory board of Ceregene, Codman, Neurophage, Aleva, and Alcyone Life Sciences.
A.M.L. is a co-founder of Functional Neuromodulation Inc. and holds intellectual property in the
field of Deep Brain Stimulation. M.M.C. is supported by funding from the W. Garfield Weston
Foundation, Michael J. Fox Foundation, Alzheimer’s Society, Natural Sciences and Engineering
Research Council of Canada, and Canadian Institutes of Health Research. MR image processing
computations were performed on the GPC supercomputer at the SciNet HPC Consortium. SciNet is funded
by the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of
Ontario; Ontario Research Fund - Research Excellence; and the University of Toronto.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging
Initiative (ADNI) and DOD ADNI. ADNI is funded by the National Institute on Aging, the National
Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the
following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech;
BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan
Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated
company Genentech, Inc.;


Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.;
Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.;
Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals
Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition
Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical
sites in Canada. Private sector contributions are facilitated by the Foundation for the National
Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute
for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research
Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for
Neuro Imaging at the University of Southern California. Data collection for the OASIS dataset was
performed using support obtained from NIH grants.

REFERENCES

Amaral RS, Park MT, Devenyi GA, Lynn V, Pipitone J, Winterburn J, Chavez S, Schira M, Lobaugh NJ,
  Voineskos AN, Pruessner JC, Chakravarty MM; Alzheimer’s Disease NeuroImaging Initiative (2016):
  Manual segmentation of the fornix, fimbria, and alveus on high-resolution 3T MRI: Application via
  fully-automated mapping of the human memory circuit white and grey matter in healthy and
  pathological aging. Neuroimage epub. doi: 10.1016/j.neuroimage.2016.10.027.
Apostolova LG, Morra JH, Green AE, Hwang KS, Avedissian C, Woo E, Cummings JL, Toga AW, Jack CR Jr,
  Weiner MW, Thompson PM (2010): Automated 3D mapping of baseline and 12-month associations between
  three verbal memory measures and hippocampal atrophy in 490 ADNI subjects. Neuroimage 51:488–499.
Apostolova LG, Green AE, Babakchanian S, Hwang KS, Chou YY, Toga AW, Thompson PM (2012):
  Hippocampal atrophy and ventricular enlargement in normal aging, mild cognitive impairment (MCI),
  and Alzheimer Disease. Alzheimer Dis Assoc Disord 26:17–27.
Avants BB, Epstein CL, Grossman M, Gee JC (2008): Symmetric diffeomorphic image registration with
  cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain. Med Image
  Anal 12:26–41.
Barnes J, Foster J, Boyes RG, Pepple T, Moore EK, Schott JM, Scahill RI, Fox NC (2008): A comparison
  of methods for the automated calculation of volumes and atrophy rates in the hippocampus.
  Neuroimage 40:1655–1671.
Barnes J, Ourselin S, Fox NC (2009): Clinical application of measurement of hippocampal atrophy in
  degenerative dementias. Hippocampus 19:510–516.
Boccardi M, Bocchetta M, Apostolova LG, Preboske G, Robitaille N, Pasqualetti P, Collins LD,
  Duchesne S, Jack CR Jr, Frisoni GB (2013a): Establishing magnetic resonance images orientation for
  the EADC-ADNI manual hippocampal segmentation protocol. J Neuroimaging. doi:10.1111/jon.12065
Boccardi M, Bocchetta M, Ganzola R, Robitaille N, Redolfi A, Duchesne S, Jack CR Jr, Frisoni GB
  (2013b): Operationalizing protocol differences for EADC-ADNI manual hippocampal segmentation.
  Alzheimers Dement doi:10.1016/j.jalz.2013.03.001
Boccardi M, Bocchetta M, Ganzola R, Robitaille N, Redolfi A, Bartzokis G, Camicioli R,
  Csernansky J, De Leon M, DeToledo-Morrell M, Killiany M, Lehericy S, Pantel J, Pruessner J,
  Soininen H, Watson C, Duchesne S, Jack C, Frisoni G (2011a): Estimating the impact of differences
  among protocols for manual hippocampal segmentation on Alzheimer’s disease-related atrophy:
  Preparatory phase for a harmonized protocol. Alzheimers Dement 7:S205–S206.
Boccardi M, Ganzola R, Bocchetta M, Pievani M, Redolfi A, Bartzokis G, Camicioli R, Csernansky JG,
  de Leon MJ, deToledo-Morrell L, Killiany RJ, Lehericy S, Pantel J, Pruessner JC, Soininen H,
  Watson C, Duchesne S, Jack CR Jr, deToledo-Morrell L (2011b): Survey of protocols for the manual
  segmentation of the hippocampus: Preparatory steps towards a joint EADC-ADNI harmonized protocol.
  J Alzheimers Dis 26:61–75.
Braak H, Braak E (1991): Neuropathological stageing of Alzheimer-related changes. Acta Neuropathol
  82:239–259.
Braak H, de Vos RA, Jansen EN, Bratzke H, Braak E (1998): Neuropathological hallmarks of
  Alzheimer’s and Parkinson’s diseases. Prog Brain Res 117:267–285.
Cavedo E, Pievani M, Boccardi M, Galluzzi S, Bocchetta M, Bonetti M, Thompson PM, Frisoni GB (2014):
  Medial temporal atrophy in early and late-onset Alzheimer’s disease. Neurobiol Aging
  doi:10.1016/j.neurobiolaging.2014.03.009
Chakravarty MM, Steadman P, van Eede MC, Calcott RD, Gu V, Shaw P, Raznahan A, Collins DL, Lerch JP
  (2013): Performing label-fusion-based segmentation using multiple automatically generated
  templates. Hum Brain Mapp 34:2635–2654.
Chupin M, Gerardin E, Cuingnet R, Boutet C, Lemieux L, Lehericy S, Benali H, Garnero L, Colliot O
  (2009): Fully automatic hippocampus segmentation and classification in Alzheimer’s disease and
  mild cognitive impairment applied on data from ADNI. Hippocampus 19:579–587.
Collins DL, Pruessner JC (2010): Towards accurate, automatic segmentation of the hippocampus and
  amygdala from MRI by augmenting ANIMAL with a template library and label fusion. Neuroimage
  52:1355–1366.
Coupe P, Eskildsen SF, Manjon JV, Fonov V, Collins DL (2011a): Simultaneous segmentation and grading
  of hippocampus for patient classification with Alzheimer’s disease. Med Image Comput Comput Assist
  Interv 14:149–157.
Coupe P, Eskildsen SF, Manjon JV, Fonov V, Collins DL (2011b): Simultaneous segmentation and grading
  of anatomical structures for patient’s classification: Application to Alzheimer’s disease.
  Neuroimage doi:10.1016/j.neuroimage.2011.10.080
Coupe P, Fonov VS, Bernard C, Zandifar A, Eskildsen SF, Helmer C, Manjón JV, Amieva H, Dartigues JF,
  Allard M, Catheline G, Collins DL (2015): Detection of Alzheimer’s disease signature in MR images
  seven years before conversion to dementia: Toward an early individual prognosis. Hum Brain Mapp
  doi:10.1002/hbm.22926
Davatzikos C, Bhatt P, Shaw LM, Batmanghelich KN, Trojanowski JQ (2011): Prediction of MCI to AD
  conversion, via MRI, CSF biomarkers, and pattern classification. Neurobiol Aging
  32:2322.e19–2327.
Eskildsen SF, Coupe P, Garcia-Lorenzo D, Fonov V, Pruessner JC, Collins DL (2013): Prediction of
  Alzheimer’s disease in subjects with mild cognitive impairment from the ADNI cohort using patterns
  of cortical thinning. Neuroimage 65:511–521.


Fischl B, van der Kouwe A, Destrieux C, Halgren E, Segonne F, Salat DH, Busa E, Seidman LJ,
  Goldstein J, Kennedy D, Caviness V, Makris N, Rosen B, Dale AM (2004): Automatically parcellating
  the human cerebral cortex. Cereb Cortex 14:11–22.
Fischl B, Salat DH, Busa E, Albert M, Dieterich M, Haselgrove C, van der Kouwe A, Killiany R,
  Kennedy D, Klaveness S, Montillo A, Makris N, Rosen B, Dale AM (2002): Whole brain segmentation:
  Automated labeling of neuroanatomical structures in the human brain. Neuron 33:341–355.
Fotuhi M, Do D, Jack C (2012): Modifiable factors that alter the size of the hippocampus with
  ageing. Nat Rev Neurol 8:189–202.
Frisoni GB, Fox NC, Jack CRJ, Scheltens P, Thompson PM (2010): The clinical use of structural MRI in
  Alzheimer disease. Nat Rev Neurol 6:67–77.
Gronenschild EH, Habets P, Jacobs HI, Mengelers R, Rozendaal N, van Os J, Marcelis M (2012): The
  effects of FreeSurfer version, workstation type, and Macintosh operating system version on
  anatomical volume and cortical thickness measurements. PLoS One 7:e38234.
Holland D, Brewer JB, Hagler DJ, Fennema-Notestine C, Dale AM (2009): Subregional neuroanatomical
  change as a biomarker for Alzheimer’s disease. Proc Natl Acad Sci USA 106:20954–20959.
Holland D, Dale AM (2011): Nonlinear registration of longitudinal images and measurement of change
  in regions of interest. Med Image Anal 15:489–497.
Holland D, McEvoy LK, Dale AM (2012): Unbiased comparison of sample size estimates from longitudinal
  structural measures in ADNI. Hum Brain Mapp 33:2586–2602.
Hsu YY, Schuff N, Du AT, Mark K, Zhu X, Hardin D, Weiner MW (2002): Comparison of automated and
  manual MRI volumetry of hippocampus in normal aging and dementia. J Magn Reson Imaging
  16:305–310.
Jack CR Jr, Bernstein MA, Fox NC, Thompson P, Alexander G, Harvey D, Borowski B, Britson PJ,
  Whitwell JL, Ward C, Dale AM, Felmlee JP, Gunter JL, Hill DL, Killiany R, Schuff N, Fox-Bosetti S,
  Lin C, Studholme C, DeCarli CS, Krueger G, Ward HA, Metzger GJ, Scott KT, Mallozzi R, Blezek D,
  Levy J, Debbins JP, Fleisher AS, Albert M, Green R, Bartzokis G, Glover G, Mugler J, Weiner MW
  (2008): The Alzheimer’s Disease Neuroimaging Initiative (ADNI): MRI methods. J Magn Reson Imaging
  27:685–691.
Jenkinson M, Beckmann CF, Behrens TE, Woolrich MW, Smith SM (2012): FSL. Neuroimage 62:782–790.
Jovicich J, Marizzoni M, Bosch B, Bartres-Faz D, Arnold J, Benninghoff J, Wiltfang J,
  Roccatagliata L, Picco A, Nobili F, Blin O, Bombois S, Lopes R, Bordet R, Chanoine V, Ranjeva JP,
  Didic M, Gros-Dagnac H, Payoux P, Zoccatelli G, Alessandrini F, Beltramello A, Bargalló N,
  Ferretti A, Caulo M, Aiello M, Ragucci M, Soricelli A, Salvadori N, Tarducci R, Floridi P,
  Tsolaki M, Constantinidis M, Drevelegas A, Rossini PM, Marra C, Otto J, Reiss-Zimmermann M,
  Hoffmann KT, Galluzzi S, Frisoni GB, PharmaCog C (2014): Multisite longitudinal reliability of
  tract-based spatial statistics in diffusion tensor imaging of healthy elderly subjects.
  Neuroimage 101:390–403.
La Joie R, Fouquet M, Mezenge F, Landeau B, Villain N, Mevel K, Pelerin A, Eustache F,
  Desgranges B, Chetelat G (2010): Differential effect of age on hippocampal subfields assessed
  using a new high-resolution 3T MR sequence. Neuroimage 53:506–514.
Lau JC, Lerch JP, Sled JG, Henkelman RM, Evans AC, Bedell BJ (2008): Longitudinal neuroanatomical
  changes determined by deformation-based morphometry in a mouse model of Alzheimer’s disease.
  Neuroimage 42:19–27.
Lerch JP, Pruessner J, Zijdenbos AP, Collins DL, Teipel SJ, Hampel H, Evans AC (2008): Automated
  cortical thickness measurements from MRI can accurately separate Alzheimer’s patients from normal
  elderly controls. Neurobiol Aging 29:23–30.
Leung KK, Barnes J, Ridgway GR, Bartlett JW, Clarkson MJ, Macdonald K, Schuff N, Fox NC, Ourselin S
  (2010): Automated cross-sectional and longitudinal hippocampal volume measurement in mild
  cognitive impairment and Alzheimer’s disease. Neuroimage 51:1345–1359.
McLaren DG, Sreenivasan A, Diamond EL, Mitchell MB, Van Dijk KR, Deluca AN, O’Brien JL, Rentz DM,
  Sperling RA, Atri A (2012): Tracking cognitive change over 24 weeks with longitudinal functional
  magnetic resonance imaging in Alzheimer’s disease. Neurodegener Dis 9:176–186.
Mielke MM, Okonkwo OC, Oishi K, Mori S, Tighe S, Miller MI, Ceritoglu C, Brown T, Albert M,
  Lyketsos CG (2012): Fornix integrity and hippocampal volume predict memory decline and progression
  to Alzheimer’s disease. Alzheimers Dement 8:105–113.
Miller M, Banerjee A, Christensen G, Joshi S, Khaneja N, Grenander U, Matejic L (1997): Statistical
  methods in computational anatomy. Stat Methods Med Res 6:267–299.
Morey RA, Selgrade ES, Wagner HR, Huettel SA, Wang L, McCarthy G (2010): Scan-rescan reliability of
  subcortical brain volumes derived from automated segmentation. Hum Brain Mapp 31:1751–1762.
Morra JH, Tu Z, Apostolova LG, Avedissian C, Madsen SK, Parikshak N, Hua X, Toga AW, Jack CR Jr,
  Weiner MW, Thompson PM (2008): Alzheimer’s disease neuroimaging initiative. NeuroImage 43:59–68.
Mouiha A, Duchesne S (2011): Hippocampal atrophy rates in Alzheimer’s disease: Automated
  segmentation variability analysis. Neurosci Lett Retrieved from
  http://www.sciencedirect.com/science/article/pii/S0304394011002564
Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack CR, Jagust W, Trojanowski JQ, Toga AW, Beckett L
  (2005): Ways toward an early diagnosis in Alzheimer’s disease: The Alzheimer’s Disease
  Neuroimaging Initiative (ADNI). Alzheimers Dement 1:55–66.
Mueller SG, Schuff N, Yaffe K, Madison C, Miller B, Weiner MW (2010): Hippocampal atrophy patterns
  in mild cognitive impairment and Alzheimer’s disease. Hum Brain Mapp 31:1339–1347.
Mulder ER, de Jong RA, Knol DL, van Schijndel RA, Cover KS, Visser PJ, Barkhof F, Vrenken H (2014):
  Hippocampal volume change measurement: Quantitative assessment of the reproducibility of expert
  manual outlining and the automated methods FreeSurfer and FIRST. Neuroimage 92:169–181.
Nestor SM, Gibson E, Gao FQ, Kiss A, Black SE (2012): A direct morphometric comparison of five
  labeling protocols for multi-atlas driven automatic segmentation of the hippocampus in
  Alzheimer’s disease. Neuroimage 66C:50–70.
Patenaude B, Smith SM, Kennedy DN, Jenkinson M (2011): A Bayesian model of shape and appearance for
  subcortical brain segmentation. Neuroimage 56:907–922.
Pipitone J, Park MT, Winterburn J, Lett TA, Lerch JP, Pruessner JC, Lepage M, Voineskos AN,
  Chakravarty MM (2014): Multi-atlas segmentation of the whole hippocampus and subfields using
  multiple automatically generated templates. Neuroimage 101:494–512.


Pruessner JC, Kohler S, Crane J, Pruessner M, Lord C, Byrne A, Kabani N, Collins DL, Evans AC
  (2002): Volumetry of temporopolar, perirhinal, entorhinal and parahippocampal cortex from
  high-resolution MR images: Considering the variability of the collateral sulcus. Cereb Cortex
  12:1342–1353.
Pruessner JC, Li LM, Serles W, Pruessner M, Collins DL, Kabani N, Lupien S, Evans AC (2000):
  Volumetry of hippocampus and amygdala with high-resolution MRI and three-dimensional analysis
  software: Minimizing the discrepancies between laboratories. Cereb Cortex 10:433–442.
Reuter M, Schmansky NJ, Rosas HD, Fischl B (2012): Within-subject template estimation for unbiased
  longitudinal image analysis. Neuroimage 61:1402–1418.
Sabuncu MR, Buckner RL, Smoller JW, Lee PH, Fischl B, Sperling RA (2012): The association between a
  polygenic Alzheimer score and cortical thickness in clinically normal subjects. Cereb Cortex
  22:2653–2661.
Sabuncu MR, Desikan RS, Sepulcre J, Yeo BT, Liu H, Schmansky NJ, Reuter M, Weiner MW, Buckner RL,
  Sperling RA, Fischl B (2011): The dynamics of cortical and hippocampal atrophy in Alzheimer
  disease. Arch Neurol 68:1040–1048.
Sanchez-Benavides G, Pena-Casanova J, Casals-Coll M, Gramunt N, Molinuevo JL, Gomez-Anson B,
  Aguilar M, Robles A, Antúnez C, Martínez-Parra C, Frank-García A, Fernández-Martínez M, Blesa R
  (2014): Cognitive and neuroimaging profiles in mild cognitive impairment and Alzheimer’s disease:
  Data from the Spanish Multicenter Normative Studies (NEURONORMA project). J Alzheimers Dis
  doi:10.3233/JAD-132186
Sankar T, Chakravarty MM, Bescos A, Lara M, Obuchi T, Laxton AW, McAndrews MP, Tang-Wai DF,
  Workman CI, Smith GS, Lozano AM (2015): Deep Brain Stimulation Influences Brain Structure in
  Alzheimer’s Disease. Brain Stimulation 8:645–654.
Schoemaker D, Buss C, Head K, Sandman CA, Davis EP, Chakravarty MM, Gauthier S, Pruessner JC (2016):
  Hippocampus and amygdala volumes from magnetic resonance images in children: Assessing accuracy of
  FreeSurfer and FSL against manual segmentation. Neuroimage 129:1–14.
Schuff N, Woerner N, Boreta L, Kornfield T, Shaw LM, Trojanowski JQ, Thompson PM, Jack CR Jr,
  Weiner MW; Alzheimer’s Disease NeuroImaging Initiative (2009): MRI of hippocampal volume loss in
  early Alzheimer’s disease in relation to ApoE genotype and biomarkers. 132(Pt 4):1067–1077.
Small SA, Duff K (2008): Linking Abeta and tau in late-onset Alzheimer’s disease: A dual pathway
  hypothesis. Neuron 60:534–542.
Stein JL, Medland SE, Vasquez AA, Hibar DP, Senstad RE, Winkler AM, Toro R, Appel K, Bartecek R,
  Bergmann Ø, Bernard M, Brown AA, Cannon DM, Chakravarty MM, Christoforou A, Domin M, Grimm O,
  Hollinshead M, Holmes AJ, Homuth G, Hottenga JJ, Langan C, Lopez LM, Hansell NK, Hwang KS, Kim S,
  Laje G, Lee PH, Liu X, Loth E, Lourdusamy A, Mattingsdal M, Mohnke S, Maniega SM, Nho K,
  Nugent AC, O’Brien C, Papmeyer M, Pütz B, Ramasamy A, Rasmussen J, Rijpkema M, Risacher SL,
  Roddey JC, Rose EJ, Ryten M, Shen L, Sprooten E, Strengman E, Teumer A, Trabzuni D, Turner J,
  van Eijk K, van Erp TG, van Tol MJ, Wittfeld K, Wolf C, Woudstra S, Aleman A, Alhusaini S,
  Almasy L, Binder EB, Brohawn DG, Cantor RM, Carless MA, Corvin A, Czisch M, Curran JE, Davies G,
  de Almeida MA, Delanty N, Depondt C, Duggirala R, Dyer TD, Erk S, Fagerness J, Fox PT, Freimer NB,
  Gill M, Göring HH, Hagler DJ, Hoehn D, Holsboer F, Hoogman M, Hosten N, Jahanshad N, Johnson MP,
  Kasperaviciute D, Kent JW Jr, Kochunov P, Lancaster JL, Lawrie SM, Liewald DC, Mandl R, Matarin M,
  Mattheisen M, Meisenzahl E, Melle I, Moses EK, Mühleisen TW, Nauck M, Nöthen MM, Olvera RL,
  Pandolfo M, Pike GB, Puls R, Reinvang I, Rentería ME, Rietschel M, Roffman JL, Royle NA,
  Rujescu D, Savitz J, Schnack HG, Schnell K, Seiferth N, Smith C, Steen VM, Valdés Hernández MC,
  Van den Heuvel M, van der Wee NJ, van Haren NE, Veltman JA, Völzke H, Walker R, Westlye LT,
  Whelan CD, Agartz I, Boomsma DI, Cavalleri GL, Dale AM, Djurovic S, Drevets WC, Hagoort P, Hall J,
  Heinz A, Jack CR Jr, Foroud TM, Le Hellard S, Macciardi F, Montgomery GW, Poline JB, Porteous DJ,
  Sisodiya SM, Starr JM, Sussmann J, Toga AW, Veltman DJ, Walter H, Weiner MW; Alzheimer’s Disease
  Neuroimaging Initiative; EPIGEN Consortium; IMAGEN Consortium; Saguenay Youth Study Group; Bis JC,
  Ikram MA, Smith AV, Gudnason V, Tzourio C, Vernooij MW, Launer LJ, DeCarli C, Seshadri S; Cohorts
  for Heart and Aging Research in Genomic Epidemiology Consortium; Andreassen OA, Apostolova LG,
  Bastin ME, Blangero J, Brunner HG, Buckner RL, Cichon S, Coppola G, de Zubicaray GI, Deary IJ,
  Donohoe G, de Geus EJ, Espeseth T, Fernández G, Glahn DC, Grabe HJ, Hardy J, Hulshoff Pol HE,
  Jenkinson M, Kahn RS, McDonald C, McIntosh AM, McMahon FJ, McMahon KL, Meyer-Lindenberg A,
  Morris DW, Müller-Myhsok B, Nichols TE, Ophoff RA, Paus T, Pausova Z, Penninx BW, Potkin SG,
  Sämann PG, Saykin AJ, Schumann G, Smoller JW, Wardlaw JM, Weale ME, Martin NG, Franke B,
  Wright MJ, Thompson PM (2012): Identification of common variants associated with human
  hippocampal and intracranial volumes. Nat Genet 44:552–561.
Treadway MT, Waskom ML, Dillon DG, Holmes AJ, Park MT, Chakravarty MM, Dutra SJ, Polli FE,
  Iosifescu DV, Fava M, Gabrieli JD, Pizzagalli DA (2015): Illness progression, recent stress, and
  morphometry of the hippocampal subfields and medial prefrontal cortex in major depression.
  Biological Psychiatry 77:265–294.
Wang Y, Fan Y, Bhatt P, Davatzikos C (2010): High-dimensional pattern regression using machine
  learning: From medical images to continuous clinical variables. Neuroimage 50:1519–1535.
Watson C, Cendes F, Fuerst D, Dubeau F, Williamson B, Evans A, Andermann F (1997): Specificity of
  volumetric magnetic resonance imaging in detecting hippocampal sclerosis. Arch Neurol 54:67–73.
Weiner MW, Veitch DP, Aisen PS, Beckett LA, Cairns NJ, Green RC, Harvey D, Jack CR, Jagust W, Liu E,
  Morris JC, Petersen RC, Saykin AJ, Schmidt ME, Shaw L, Siuciak JA, Soares H, Toga AW,
  Trojanowski JQ (2012): The Alzheimer’s Disease Neuroimaging Initiative: A review of papers
  published since its inception. Alzheimers Dement 8:S1–68.
Winterburn JL, Pruessner JC, Chavez S, Schira MM, Lobaugh NJ, Voineskos AN, Chakravarty MM (2013): A
  novel in vivo atlas of human hippocampal subfields using high-resolution 3 T magnetic resonance
  imaging. Neuroimage 74:254–265.
Wolz R, Aljabar P, Hajnal JV, Hammers A, Rueckert D (2010a): LEAP: Learning embeddings for atlas
  propagation. Neuroimage 49:1316–1325.
Wolz R, Heckemann RA, Aljabar P, Hajnal JV, Hammers A, Lotjonen J, Rueckert D (2010b): Alzheimer’s
  Disease Neuroimaging Initiative. Neuroimage 49:2352–2365.
Wyman BT, Harvey DJ, Crawford K, Bernstein MA, Carmichael O, Cole PE, Crane PK, DeCarli C, Fox NC,
  Gunter JL, Hill D, Killiany RJ, Pachai C, Schwarz AJ, Schuff N, Senjem ML,

r 2895 r
r Sankar et al. r

Suhy J, Thompson PM, Weiner M, Jack CRJ (2013): Standardi- Zuo XN, Anderson JS, Bellec P, Birn RM, Biswal BB, Blautzik J,
zation of analysis sets for reporting results from ADNI MRI Breitner JC, Buckner RL, Calhoun VD, Castellanos FX, Chen
data. Alzheimers Dement 9:332–337. A, Chen B, Chen J, Chen X, Colcombe SJ, Courtney W,
Yushkevich PA, Amaral RS, Augustinack JC, Bender AR, Craddock RC, Di Martino A, Dong HM, Fu X, Gong Q,
Bernstein JD, Boccardi M, Bocchetta M, Burggren AC, Carr Gorgolewski KJ, Han Y, He Y, He Y, Ho E, Holmes A, Hou
VA, Chakravarty MM, Chetelat G, Daugherty AM, Davachi L, XH, Huckins J, Jiang T, Jiang Y, Kelley W, Kelly C, King M,
Ding SL, Ekstrom A, Geerlings MI, Hassan A, Huang Y, LaConte SM, Lainhart JE, Lei X, Li HJ, Li K, Li K, Lin Q,
Iglesias JE, La Joie R, Kerchner GA, LaRocque KF, Libby LA, Liu D, Liu J, Liu X, Liu Y, Lu G, Lu J, Luna B, Luo J,
Malykhin N, Mueller SG, Olsen RK, Palombo DJ, Parekh MB, Lurie D, Mao Y, Margulies DS, Mayer AR, Meindl T,
Pluta JB, Preston AR, Pruessner JC, Ranganath C, Raz N, Meyerand ME, Nan W, Nielsen JA, O’Connor D, Paulsen D,
Schlichting ML, Schoemaker D, Singh S, Stark CE, Suthana N, Prabhakaran V, Qi Z, Qiu J, Shao C, Shehzad Z, Tang W,
Tompary A, Turowski MM, Van Leemput K, Wagner AD, Villringer A, Wang H, Wang K, Wei D, Wei GX, Weng XC,
Wang L, Winterburn JL, Wisse LE, Yassa MA, Zeineh MM Wu X, Xu T, Yang N, Yang Z, Zang YF, Zhang L, Zhang Q,
(2015): Quantitative comparison of 21 protocols for labeling Zhang Z, Zhang Z, Zhao K, Zhen Z, Zhou Y, Zhu XT,
hippocampal subfields and parahippocampal subregions in in Milham MP (2014): An open science resource for establishing
vivo MRI: Towards a harmonized segmentation protocol. reliability and reproducibility in functional connectomics. Sci
Neuroimage 111:526–541. Data 1:140049.
