Machine
Learning for
Brain Disorders
NEUROMETHODS
Series Editor
Wolfgang Walz
University of Saskatchewan
Saskatoon, SK, Canada
Edited by
Olivier Colliot
CNRS, Paris, France
This Humana imprint is published by the registered company Springer Science+Business Media, LLC part of Springer
Nature.
The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface to the Series
Experimental life sciences have two basic foundations: concepts and tools. The Neuro-
methods series focuses on the tools and techniques unique to the investigation of the
nervous system and excitable cells. It will not, however, shortchange the concept side of
things as care has been taken to integrate these tools within the context of the concepts and
questions under investigation. In this way, the series is unique in that it not only collects
protocols but also includes theoretical background information and critiques which led to
the methods and their development. Thus it gives the reader a better understanding of the
origin of the techniques and their potential future development. The Neuromethods pub-
lishing program strikes a balance between recent and exciting developments like those
concerning new animal models of disease, imaging, in vivo methods, and more established
techniques, including, for example, immunocytochemistry and electrophysiological tech-
nologies. New trainees in neurosciences still need a sound footing in these older methods in
order to apply a critical approach to their results.
Under the guidance of its founders, Alan Boulton and Glen Baker, the Neuromethods
series has been a success since its first volume published through Humana Press in 1985. The
series continues to flourish through many changes over the years. It is now published under
the umbrella of Springer Protocols. While methods involving brain research have changed a
lot since the series started, the publishing environment and technology have changed even
more radically. Neuromethods has the distinct layout and style of the Springer Protocols
program, designed specifically for readability and ease of reference in a laboratory setting.
The careful application of methods is potentially the most important step in the process
of scientific inquiry. In the past, new methodologies led the way in developing new dis-
ciplines in the biological and medical sciences. For example, Physiology emerged out of
Anatomy in the nineteenth century by harnessing new methods based on the newly discov-
ered phenomenon of electricity. Nowadays, the relationships between disciplines and meth-
ods are more complex. Methods are now widely shared between disciplines and research
areas. New developments in electronic publishing make it possible for scientists who
encounter new methods to quickly find sources of information electronically. The design
of individual volumes and chapters in this series takes this new access technology into
account. Springer Protocols makes it possible to download single protocols separately. In
addition, Springer makes its print-on-demand technology available globally. A print copy
can therefore be acquired quickly and for a competitive price anywhere in the world.
Preface
Machine learning (ML) is at the core of the tremendous progress in artificial intelligence in
the past decade. ML offers exciting promises for medicine. In particular, research on ML for
brain disorders is a very active field. Neurological and psychiatric disorders are particularly
complex and can be characterized using various types of data. ML has the potential to exploit
such rich and complex data for a wide range of benefits including a better understanding of
disorders, the discovery of new biomarkers, assisting diagnosis, providing prognostic infor-
mation, predicting response to treatment and building more effective clinical trials.
Machine learning for brain disorders is an interdisciplinary field, involving concepts
from different disciplines such as mathematics, statistics and computer science on the one
hand and neurology, psychiatry, neuroscience, pathology and medical imaging on the other
hand. It can thus be difficult to grasp for students and researchers who are new to this area.
The aim of this book is to provide an up-to-date and comprehensive guide to both
methodological and applicative aspects of ML for brain disorders. This book aims to be
useful to students and researchers with various backgrounds: engineers, computer scientists,
neurologists, psychiatrists, radiologists, neuroscientists, etc.
Part I presents the fundamentals of ML. The book starts with a non-technical introduc-
tion to the main concepts underlying ML (Chapter 1). The main classic ML techniques are
then presented in Chapter 2. Although most of them are not recent, these techniques are
still useful for various tasks. Chapters 3–6 are devoted to deep learning, a family of
techniques which have achieved impressive results in the past decade. Chapter 3 describes
the basics of deep learning, starting with simple artificial neural networks and then covering
convolutional neural networks (CNN) which are a standard family of approaches that are
mainly (but not only) used for imaging data. Those architectures are feed-forward, meaning
that information flows only in one direction. In contrast, recurrent neural networks
(RNN), presented in Chapter 4, involve loops. They are particularly adapted to sequential
data, including longitudinal data (repeated measurements over time), time series and text.
Chapter 5 is dedicated to generative models: models that can generate new data. A large part
is devoted to generative adversarial networks (GANs), but other approaches such as diffu-
sion models are also described. Finally, Chapter 6 presents transformers, a recent approach
which is now the state-of-the-art for natural language processing and has achieved impres-
sive results for other applications such as imaging.
Part II is devoted to the main types of data used to characterize brain disorders. These
include clinical assessments (Chapter 7), neuroimaging (including magnetic resonance
imaging—MRI, positron emission tomography—PET, computed tomography—CT,
single-photon emission computed tomography—SPECT, Chapter 8), electro- and magne-
toencephalography (EEG/MEG, Chapter 9), genetic and omics data (including genotyp-
ing, transcriptomics, proteomics, metabolomics, Chapter 10), electronic health records
(EHR, Chapter 11), mobile devices, connected objects and sensor data (Chapter 12). The
emphasis is put on practical aspects rather than on an in-depth description of the underlying data
acquisition techniques (which can be complex, for instance in the case of neuroimaging or
omics data). The corresponding chapters describe what information these data provide,
how they should be handled and processed and which features can be extracted from
such data.
Part III covers the core methodologies of ML for brain disorders. Each chapter is
devoted to a specific medical task that can be addressed with ML, presenting the main
state-of-the-art techniques. Chapter 13 deals with image segmentation, a crucial task for
extracting information from images. Image segmentation techniques allow delineating
anatomical structures and lesions (e.g. tumours, white matter lesions), which can in turn
provide biomarkers (e.g. the volume of the structure/lesion or other more sophisticated
derived measures). Image registration is presented in Chapter 14. It is also a fundamental
image analysis task which allows aligning images from different modalities or different
patients and which is a prerequisite for many other ML methods. Chapter 15 describes
methods for computer-aided diagnosis and prediction. These include methods to automati-
cally classify patients (for instance to assist diagnosis) as well as to predict their future state.
Chapter 16 presents ML methods to discover disease subtypes. Indeed, brain disorders are
heterogeneous and patients with a given diagnosis may have different symptoms, a different
underlying pathophysiology and a different evolution. Such heterogeneity is a major barrier
to the development of new treatments. ML has the potential to help discover more
homogeneous disease subtypes. Modelling disease progression is the focus of Chapter 17.
The chapter describes a wide range of techniques that make it possible, in a data-driven manner, to build models of disease progression; this includes finding the order in which different
biomarkers become abnormal, estimating trajectories of change and uncovering different
evolution profiles within a given population. Chapter 18 is devoted to computational
pathology which is the automated analysis of histological data (which may come from
biopsies or post-mortem samples). Tremendous progress has been made in this area in recent years. Chapter 19 describes methods for integrating multimodal data including
medical imaging, clinical data and genetics (or other omics data). Indeed, characterizing the
complexity of brain disorders requires integrating multiple types of data, but such integration raises computational challenges.
Part IV is dedicated to validation and datasets. These are fundamental issues that are
sometimes overlooked by ML researchers. It is indeed crucial that ML models for medicine
are thoroughly and rigorously validated. Chapter 20 covers model validation. It introduces
the main performance metrics for classification and regression tasks, describes how to
estimate these metrics in an unbiased manner and how to obtain confidence intervals.
Chapter 21 deals with reproducibility, the ability to reproduce results and findings. It is
widely recognized that many fields of science, including ML for medicine, are undergoing a
reproducibility crisis. The chapter describes the main types of reproducibility, what they
require and why they are important. The topic of Chapter 22 is interpretability of ML
methods. In particular, it reviews the main approaches to gain insight into how “black-box”
models take their decisions and describes their application to brain imaging data. Chapter 23
provides a regulatory science perspective on performance assessment of ML algorithms. It is
indeed crucial to understand this perspective because regulation is critical to translate safe
and effective technologies to the clinic. Finally, Chapter 24 provides an overview of the main
existing datasets accessible to researchers. It can help scientists identify which datasets are
most suited to a particular research question and provides hints on how to use them.
Part V presents applications of ML to various neurological and psychiatric disorders.
Each chapter is devoted to a specific disorder or family of disorders. It presents some
information about the disorder that should, in particular, be useful to researchers who
do not have a medical background. It then describes some important applications of ML to
this disorder as well as future challenges. The following disorders are covered: Alzheimer’s
disease and related dementia (including vascular dementia, frontotemporal dementia and
dementia with Lewy bodies) in Chapter 25, Parkinson’s disease and related disorders
(including multiple system atrophy, progressive supranuclear palsy and dementia with
Lewy bodies) in Chapter 26, epilepsy in Chapter 27, multiple sclerosis in Chapter 28,
cerebrovascular disorders (including stroke, microbleeds, vascular malformations, aneur-
ysms and small vessel disease) in Chapter 29, brain tumours in Chapter 30, neurodevelop-
mental disorders (including autism spectrum and attention deficit with hyperactivity
disorders) in Chapter 31 and psychiatric disorders (including depression, schizophrenia
and bipolar disorder) in Chapter 32.
We hope that this book will serve as a reference for researchers and graduate students
who are new to this field of research as well as constitute a useful resource for all scientists
working in this exciting scientific area.
I would like to express my profound gratitude to all authors for their contributions to the
book. It is thanks to you that this book has become a reality. I am also extremely grateful to
all present and past members of the ARAMIS team. Research is a collective endeavour. It was
a privilege (and a pleasure!) to work with you throughout all these years. I learn every day
thanks to you all. More generally, I would like to acknowledge all colleagues with whom I
had the chance to work. I warmly thank the reviewers who have kindly reviewed chapters:
Maria Gloria Bueno García, Sarah Cohen-Boulakia, Renaud David, Guillaume Dumas,
Anton Iftimovici, Hicham Janati, Jochen Klucken, Margy McCullough-Hicks, Paolo Mis-
sier, Till Nicke, Sebastian Raschka, Denis Schwartz as well as those who preferred to stay
anonymous. Your comments were very useful in improving the book. Finally, I would like to
thank my family members for their love and support.
I acknowledge the following funding sources: the French government under manage-
ment of the Centre National de la Recherche Scientifique, the French government under
management of Agence Nationale de la Recherche as part of the “Investissements d’avenir”
program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and reference ANR-10-
IAIHU-06 (Agence Nationale de la Recherche-10-IA Institut Hospitalo-Universitaire-6),
the European Union H2020 program (project EuroPOND, grant number 666992), the
Paris Brain Institute under the Big Brain Theory Program (project PredictICD, project
IMAGIN-DEAL in MS), Inria under the Inria Project Lab Program (project Neuromar-
kers), the Abeona Foundation (project Brain@Scale) and the Fondation Vaincre Alzheimer
(grant number FR-18006CB). This book was made Open Access thanks to the support of
the French government under management of Agence Nationale de la Recherche as part of
the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA
Institute).
Abbreviations
DSM-5 5th edition of the Diagnostic and Statistical Manual of Mental Disorders
DT Decision Tree
DTI Diffusion Tensor Imaging
DWI Diffusion-Weighted Imaging
DXA Dual-energy X-ray Absorptiometry
DZ Dizygotic
EBM Event-Based Model
EDSS Expanded Disability Status Scale
EEA European Economic Area
EEG Electroencephalography
EFA Exploratory Factor Analysis
EHR Electronic Health Record
ELBO Evidence Lower Bound
ELL Exponential Logarithmic Loss
EM Expectation-Maximization
ENIGMA Enhancing NeuroImaging Genetics Through Meta-Analysis
eQTL expression Quantitative Trait Loci
ET Enhancing Tumour
EU European Union
FA Fractional Anisotropy
FAIR Findable, Accessible, Interoperable, Reusable
FASTA Fast-All format
FCD Focal Cortical Dysplasia
FCN Fully Connected Network
FCNN Fully Convolutional Neural Network
FDA United States Food and Drug Administration
FDG-PET [18F]-Fluorodeoxyglucose Positron Emission Tomography
FDR False Discovery Rate
FFPE Formalin-Fixed Paraffin-Embedded
FID Fréchet Inception Distance
FLAIR Fluid-Attenuated Inversion Recovery
FLOP Floating Point Operations
fMRI functional Magnetic Resonance Imaging
FN False Negative
FOG Freezing of Gait
FP False Positive
FROC Free-response ROC
FTLD Fronto-temporal Lobar Degeneration
G2PSR Genome-to-Phenome Sparse Regression
GAN Generative Adversarial Network
GBM Glioblastoma
GD Generalized Dice loss
GDPR General Data Protection Regulation
GENFI Genetic FTD Initiative
GEO Gene Expression Omnibus
GFF General Feature Format
GIS Geographic Information Systems
GM Gray Matter
GMM Gaussian Mixture Model
GO Gene Ontology
GPPM Gaussian Process Progression Model
GPS Global Positioning System
GPU Graphics Processing Unit
Grad-CAM Gradient-weighted Class Activation Mapping
GRE Gradient-Recalled Echo
GRU Gated Recurrent Unit
GTEx Genotype-Tissue Expression
GWAS Genome-Wide Association Study
H&E Hematoxylin and Eosin
HAR Human Activity Recognition
HC Healthy Controls
HCP-YA Human Connectome Project Young Adult
HD Hausdorff Distance
HGG High-Grade Glioma
HIPAA Health Insurance Portability and Accountability Act
HS Hippocampal Sclerosis
HUPO Human Proteome Organization
HYDRA HeterogeneitY through DiscRiminative Analysis
i.i.d. Independent and Identically Distributed
IA Intracranial Aneurysm
ICA Independent Component Analysis
ICD International Classification of Diseases
iCDF Inverse Cumulative Distribution Function
ICF International Classification of Functioning, Disability and Health
ID Intellectual Disability
IDH Isocitrate Dehydrogenase
IEEE Institute of Electrical and Electronics Engineers
IHC Immunohistochemistry
IoU Intersection over Union also called JI
IPMI Information Processing in Medical Imaging conference
IQ Intelligence Quotient
iRANO Immune-related Response Assessment in Neuro-Oncology
iRBD Idiopathic Rapid eye movement sleep Behaviour Disorder
ISBI International Symposium on Biomedical Imaging
ISLES The Ischemic Stroke Lesion Segmentation
JI Jaccard index also called IoU
JS/JSD Jensen-Shannon Divergence
KDE Kernel Density Estimate
KDE-EBM Kernel Density Estimation EBM
KEGG Kyoto Encyclopedia of Genes and Genomes
KL/KLD Kullback-Leibler Divergence
kNN k-Nearest Neighbours
LASSO Least Absolute Shrinkage and Selection Operator
LATE Limbic-predominant Age-related TDP-43 Encephalopathy
LDA Linear Discriminant Analysis
LDDMM Large Deformation Diffeomorphic Metric Mapping
LGG Low-Grade Glioma
LIME Local Interpretable Model-agnostic Explanations
PART II DATA
PART V DISORDERS
25 Machine Learning for Alzheimer’s Disease and Related Dementia . . . . . . . . . . . . 807
Marc Modat, David M. Cash, Liane Dos Santos Canas,
Martina Bocchetta, and Sébastien Ourselin
26 Machine Learning for Parkinson’s Disease and Related Disorders . . . . . . . . . . . . . 847
Johann Faouzi, Olivier Colliot, and Jean-Christophe Corvol
27 Machine Learning in Neuroimaging of Epilepsy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
Hyo Min Lee, Ravnoor Singh Gill, Neda Bernasconi,
and Andrea Bernasconi
28 Machine Learning in Multiple Sclerosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
Bas Jasperse and Frederik Barkhof
29 Machine Learning for Cerebrovascular Disorders . . . . . . . . . . . . . . . . . . . . . . . . . . . 921
Yannan Yu and David Yen-Ting Chen
30 The Role of Artificial Intelligence in Neuro-oncology Imaging . . . . . . . . . . . . . . . 963
Jennifer Soun, Lu-Aung Yosuke Masudathaya,
Arabdha Biswas, and Daniel S. Chow
31 Machine Learning for Neurodevelopmental Disorders. . . . . . . . . . . . . . . . . . . . . . . 977
Clara Moreau, Christine Deruelle, and Guillaume Auzias
32 Machine Learning and Brain Imaging for Psychiatric Disorders:
New Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1009
Ivan Brossollet, Quentin Gallet, Pauline Favre, and Josselin Houenou
Contributors
ANA L. AGUILA • University College London, Centre for Medical Image Computing,
COMBINE Lab, London, UK
ANDRE ALTMANN • University College London, Centre for Medical Image Computing,
COMBINE Lab, London, UK
GUILLAUME AUZIAS • Aix-Marseille Université, CNRS, Institut de Neurosciences de la
Timone, Marseille, France
IRENE BALELLI • Université Côte d’Azur, Inria Sophia Antipolis, Epione Research Group,
Nice, France
IMON BANERJEE • Mayo Clinic, Phoenix, AZ, USA; Arizona State University, School of
Computing, Informatics, and Decision Systems Engineering, Tempe, AZ, USA
FREDERIK BARKHOF • Department of Radiology and Nuclear Medicine, Amsterdam
University Medical Center, Amsterdam, The Netherlands; Queen Square Institute of
Neurology and Centre for Medical Image Computing, University College, London, UK
ANDREA BERNASCONI • McGill University, Montreal Neurological Institute and Hospital,
Montreal, QC, Canada
NEDA BERNASCONI • McGill University, Montreal Neurological Institute and Hospital,
Montreal, QC, Canada
ARABDHA BISWAS • Department of Radiological Sciences, University of California, Irvine,
Irvine, CA, USA
MARTINA BOCCHETTA • UCL Queen Square Institute of Neurology, Dementia Research
Centre, London, UK; Centre for Cognitive and Clinical Neuroscience, Division of
Psychology, Department of Life Sciences, College of Health, Medicine and Life Sciences,
Brunel University, London, UK
DANIEL BOS • Department of Radiology and Nuclear Medicine, Erasmus MC, Rotterdam,
The Netherlands; Department of Epidemiology, Erasmus MC, Rotterdam, The Netherlands
SIMONA BOTTANI • Sorbonne Université, Institut du Cerveau—Paris Brain Institute—ICM,
CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié Salpêtrière, Paris, France
ESTHER E. BRON • Biomedical Imaging Group Rotterdam, Department of Radiology and
Nuclear Medicine, Erasmus MC, Rotterdam, The Netherlands
IVAN BROSSOLLET • Neurospin, UNIACT Lab, PsyBrain Team, CEA Saclay, Gif-sur-Yvette,
France
NINON BURGOS • Sorbonne Université, Institut du Cerveau—Paris Brain Institute—ICM,
CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié Salpêtrière, Paris, France
ERIC BURGUIÈRE • Sorbonne Université, Institut du Cerveau—Paris Brain Institute—ICM,
CNRS, Inserm, AP-HP, Hôpital de la Pitié-Salpêtrière, Paris, France
FEDERICA CACCIAMANI • University of Bordeaux, Inserm, UMR Bordeaux Population
Health, PHARes Team, Bordeaux, France
ETIENNE CAMENEN • Sorbonne Université, Institut du Cerveau—Paris Brain Institute—
ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié Salpêtrière, Paris, France
DAVID M. CASH • UCL Queen Square Institute of Neurology, Dementia Research Centre,
London, UK; UK Dementia Research Institute at UCL, London, UK
DAVID YEN-TING CHEN • Department of Medical Imaging, Taipei Medical University –
Shuang Ho Hospital, Zhonghe District, New Taipei City, Taiwan
PAULINE FAVRE • Neurospin, UNIACT Lab, PsyBrain Team, CEA Saclay, Gif-sur-Yvette,
France; INSERM U955, Translational Neuropsychiatry Team, Faculté de Santé,
Université Paris Est Créteil, Créteil, France
DAVIDE FERRARI • Department of Population Health Sciences, King’s College London,
London, United Kingdom
MULUSEW FIKERE • Institute for Molecular Bioscience, The University of Queensland, St
Lucia, QLD, Australia
QUENTIN GALLET • Neurospin, UNIACT Lab, PsyBrain Team, CEA Saclay, Gif-sur-Yvette,
France; INSERM U955, Translational Neuropsychiatry Team, Faculté de Santé,
Université Paris Est Créteil, Créteil, France
MÉLINA GALLOPIN • Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS,
Université Paris-Saclay, Gif-sur-Yvette, France
JAMES C. GEE • University of Pennsylvania, Department of Radiology, Philadelphia, PA, USA
RAVNOOR SINGH GILL • McGill University, Montreal Neurological Institute and Hospital,
Montreal, QC, Canada
JULIANA GONZALEZ-ASTUDILLO • Sorbonne Université, Institut du Cerveau—Paris Brain
Institute—ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié Salpêtrière, Paris,
France
GABRIEL HADDON-HILL • Department of Population Health Sciences, King’s College London,
London, United Kingdom
JOSHUA HARVEY • University of Exeter Medical School, RILD Building, RD&E Hospital
Wonford, Exeter, UK
RAVI HASSANALY • Sorbonne Université, Institut du Cerveau—Paris Brain Institute—ICM,
CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié Salpêtrière, Paris, France
JOSSELIN HOUENOU • Neurospin, UNIACT Lab, PsyBrain Team, CEA Saclay, Gif-sur-
Yvette, France; INSERM U955, Translational Neuropsychiatry Team, Faculté de Santé,
Université Paris Est Créteil, Créteil, France; APHP, Mondor Univ. Hospitals, DMU
Impact, Psychiatry Department, Créteil, France
DEWEI HU • Department of Electrical and Computer Engineering, Vanderbilt University,
Nashville, TN, USA
GYUJOON HWANG • Center for Biomedical Image Computing and Analytics, Perelman
School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
BAS JASPERSE • Department of Radiology and Nuclear Medicine, Amsterdam University
Medical Center, Amsterdam, The Netherlands
ROHIT JENA • University of Pennsylvania, Department of Radiology, Philadelphia, PA, USA
GABRIEL JIMÉNEZ • Sorbonne Université, Institut du Cerveau—Paris Brain Institute—ICM,
CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié Salpêtrière, Paris, France
VICKY KALOGEITON • LIX, CNRS, Ecole Polytechnique, IP Paris, Paris, France
SAI SANDEEP KANTAREDDY • Arizona State University, School of Computing, Informatics, and
Decision Systems Engineering, Tempe, AZ, USA
IRFAHAN KASSAM • Lee Kong Chian School of Medicine, Nanyang Technological University,
Singapore, Singapore
ANAHITA FATHI KAZEROONI • Institute for Mental Health and Centre for Human Brain
Health, School of Psychology, University of Birmingham, Birmingham, UK
STEFAN KLEIN • Biomedical Imaging Group Rotterdam, Department of Radiology and
Nuclear Medicine, Erasmus MC, Rotterdam, The Netherlands
THIBAULT POINSIGNON • Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS,
Université Paris-Saclay, Gif-sur-Yvette, France
PIERRE POULAIN • Université Paris Cité, CNRS, Institut Jacques Monod, Paris, France
DANIEL RACOCEANU • Sorbonne Université, Institut du Cerveau—Paris Brain Institute—
ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié Salpêtrière, Paris, France
THIBAULT ROLLAND • Sorbonne Université, Institut du Cerveau—Paris Brain Institute—
ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié Salpêtrière, Paris, France
BERKMAN SAHINER • Division of Imaging, Diagnostics, and Software Reliability, Office of
Science and Engineering Laboratories, Center for Devices and Radiological Health, US
Food and Drug Administration, Silver Spring, MD, USA
THIAGO SANTOS • Emory University, Department of Computer Science, Atlanta, GA, USA
JULIA SIDORENKO • Institute for Molecular Bioscience, The University of Queensland, St
Lucia, QLD, Australia
MARION SMITS • Department of Radiology and Nuclear Medicine, Erasmus MC, Rotterdam,
The Netherlands
JENNIFER SOUN • Department of Radiological Sciences, University of California, Irvine,
Irvine, CA, USA
LACHLAN STRIKE • Queensland Brain Institute, the University of Queensland, St Lucia,
QLD, Australia
AMARA TARIQ • Mayo Clinic, Phoenix, AZ, USA
ELINA THIBEAU-SUTRE • Department of Applied Mathematics, Technical Medical Centre,
University of Twente, Enschede, The Netherlands; Sorbonne Université, Institut du
Cerveau – Paris Brain Institute – ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la
Pitié-Salpêtrière, Paris, France
NICHOLAS J. TUSTISON • University of Virginia, Department of Radiology and Medical
Imaging, Charlottesville, VA, USA
MARIA VAKALOPOULOU • Université Paris-Saclay, CentraleSupélec, Mathématiques et
Informatique pour la Complexité et les Systémes, Gif-sur-Yvette, France
ERDEM VAROL • Department of Statistics, Center for Theoretical Neuroscience, Zuckerman
Institute, Columbia University, New York, NY, USA
GAEL VAROQUAUX • Soda, Inria, Saclay, France
VIKRAM VENKATRAGHAVAN • Alzheimer Center Amsterdam, Neurology, Vrije Universiteit,
Amsterdam, The Netherlands; Amsterdam Neuroscience, Neurodegeneration, Amsterdam,
The Netherlands
SEBASTIAN R. VAN DER VOORT • Biomedical Imaging Group Rotterdam, Department of
Radiology and Nuclear Medicine, Erasmus MC, Rotterdam, The Netherlands
WENJUAN WANG • Department of Population Health Sciences, King’s College London,
London, United Kingdom
JUNHAO WEN • Center for Biomedical Image Computing and Analytics, Perelman School of
Medicine, University of Pennsylvania, Philadelphia, PA, USA
MARKUS WENZEL • Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany
MARGIE WRIGHT • Queensland Brain Institute, the University of Queensland, St Lucia,
QLD, Australia; Centre for Advanced Imaging, The University of Queensland, St Lucia,
QLD, Australia
ZHIJIAN YANG • Center for Biomedical Image Computing and Analytics, Perelman School of
Medicine, University of Pennsylvania, Philadelphia, PA, USA
YANNAN YU • Department of Radiology, University of California San Francisco, San
Francisco, CA, USA
Part I
Abstract
This chapter provides an introduction to machine learning for a non-technical readership. Machine learning
is an approach to artificial intelligence. The chapter thus starts with a brief history of artificial intelligence in
order to put machine learning into this broader scientific context. We then describe the main general
concepts of machine learning. Readers with a background in computer science may skip this chapter.
Key words Machine learning, Artificial intelligence, Supervised learning, Unsupervised learning
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_1,
© The Author(s) 2023
2 A Bit of History
Fig. 2 (a) Biological neuron. The synapses form the input of the neuron. Their signals are combined, and if the
result exceeds a given threshold, the neuron is activated and produces an output signal which is sent through
the axon. (b) The perceptron: an artificial neuron which is inspired by biology. It is composed of the set of
inputs (which correspond to the information entering the synapses) xi, which are linearly combined with
weights wi and then go through a non-linear function g to produce an output y. Image in panel (a) is courtesy of
Thibault Rolland
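The computation in panel (b) can be sketched in a few lines of Python. This is a minimal illustrative sketch, assuming g is a simple threshold (Heaviside) activation; the weights below are invented for illustration (they implement a logical AND) and are not part of the figure:

```python
import numpy as np

def perceptron(x, w, b):
    """Artificial neuron: inputs x_i are linearly combined with
    weights w_i, then passed through a non-linear function g
    (here a simple threshold) to produce the output y."""
    z = np.dot(w, x) + b       # linear combination of the inputs
    return 1 if z > 0 else 0   # g: activate only above the threshold

# Illustrative weights implementing a logical AND of two binary inputs
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron(np.array([1, 1]), w, b))  # 1: both inputs active
print(perceptron(np.array([1, 0]), w, b))  # 0: threshold not exceeded
```

In practice the weights are not set by hand as here but learned from examples, which is the subject of the rest of the chapter.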
Fig. 4 Two families of AI. The symbolic approach operates on symbols through logical rules. The connectionist family, as presented here, encompasses not only artificial neural networks but machine learning approaches more generally
Fig. 5 A multilayer perceptron model (here with only one hidden layer, but there
can be many more)
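A forward pass through such a network can be sketched as follows; the layer sizes, the tanh activation, and the random weights are illustrative assumptions, not choices made in the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a multilayer perceptron with one hidden layer:
    each layer applies a linear map, and the hidden layer is followed
    by a non-linearity (here tanh)."""
    h = np.tanh(W1 @ x + b1)   # hidden layer activations
    return W2 @ h + b2         # output layer (kept linear here)

# Illustrative dimensions: 3 inputs, 4 hidden units, 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
y = mlp_forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2)
print(y.shape)  # (1,)
```

Adding more hidden layers amounts to chaining more linear-map-plus-non-linearity steps between the input and the output.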
Fig. 6 Artificial intelligence, machine learning, and deep learning are not
synonymous. Deep learning is a type of machine learning which involves
neural networks with a large number of hidden layers. Machine learning is one
approach to artificial intelligence, but other approaches exist
Box 1 (continued)
Example: learning to detect tumors from medical images
• Task T: detect tumors from medical images
• Performance measure P: proportion of tumors correctly
identified
• Experience E: examining a dataset of medical images where
the presence of tumors has been annotated
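The performance measure P of this example can be computed directly; the image identifiers below are hypothetical, for illustration only:

```python
def proportion_correctly_identified(annotated, detected):
    """Performance measure P: fraction of annotated tumors that the
    system also detected (i.e., its sensitivity)."""
    true_positives = len(annotated & detected)
    return true_positives / len(annotated)

# Hypothetical IDs of images annotated as containing a tumor,
# and IDs flagged by the system (one miss, one false alarm)
annotated = {"img01", "img02", "img05", "img07"}
detected = {"img01", "img05", "img07", "img09"}
print(proportion_correctly_identified(annotated, detected))  # 0.75
```

Note that this measure ignores false alarms (here "img09"); a full evaluation would also track them, as discussed in the chapter on model validation.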
3.1 Types of Learning
One usually considers three main types of learning: supervised learning, unsupervised learning, and reinforcement learning (Box 2). In both supervised and unsupervised learning, the experience E is actually the inspection of a set of examples, which we will refer to as training examples or training set.
Box 2 (continued)
• Reinforcement learning. Learns by iteratively performing
actions to maximize some reward
– Classical approach used for learning to play games (chess,
go, etc.) or in the domain of robotics
– Currently few applications in the domain of brain diseases
3.1.2 Unsupervised Learning
In unsupervised learning, the examples are not labeled. The two most common tasks in unsupervised learning are clustering and
dimensionality reduction (Fig. 8). Clustering aims at discovering
groups within the training set, but these groups are not known a
priori. The objective is to find groups such that members of the
same group are similar, while members of different groups are
dissimilar. For example, one can aim to discover disease subtypes
which are not known a priori. Classical clustering methods include k-means and spectral clustering. Dimensionality
reduction aims at finding a space of variables (of lower dimension
than the input space) that best explain the variability of the
training data, given a larger set of input variables. This produces a
new set of variables that, in general, are not among the input
variables but are combinations of them. Examples of such methods
include principal component analysis, Laplacian eigenmaps, or
variational autoencoders.
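As a minimal, NumPy-only sketch of these two tasks — k-means clustering and principal component analysis on synthetic data (the data, dimensions, and initialization scheme below are invented for illustration, not full implementations of these methods):

```python
import numpy as np

rng = np.random.default_rng(42)

def kmeans(X, k, n_iter=50):
    """Minimal k-means: alternately assign each point to its nearest
    centroid, then recompute each centroid as the mean of its group."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

def pca(X, n_components):
    """Minimal PCA via SVD: project centered data onto the directions
    of largest variance."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Two well-separated synthetic groups in a 5-dimensional space
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(5, 1, (50, 5))])
labels, _ = kmeans(X, k=2)     # discover the two groups
Z = pca(X, n_components=2)     # reduce to 2 dimensions
print(Z.shape)  # (100, 2)
```

In both cases no labels are provided: the groups and the low-dimensional variables are discovered from the structure of the data alone.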
Fig. 7 Two of the main supervised learning tasks: classification and regression.
The upper panel presents a classification task which aims at linearly separating
the orange and the blue class. Each sample is described by two variables. The
lower panel presents a linear regression task in which the aim is to predict the
body mass index from the age of a person. Figure courtesy of Johann Faouzi
3.1.3 Reinforcement Learning
In reinforcement learning, the machine will take a series of actions
in order to maximize a reward. This can, for example, be the case of
a machine learning to play chess, which will play games against itself
in order to maximize the number of victories. These methods are
widely used for learning to play games or in the domain of robotics.
So far, they have had few applications to brain diseases and will not
be covered in the rest of this book.
3.2 Overview of the Learning Process
In this section, we aim at formalizing the main concepts underlying
most supervised learning methods. Some of these concepts, with
modifications, also extend to unsupervised cases.
The task that we will consider will be to provide an output,
denoted as y, from an input given to the computer, denoted as x. At
this moment, the nature of x does not matter. It can, for example,
be any possible photograph as in the example presented in Fig. 9.
It could also be a single number, a series of numbers, a text, etc. For
now, the nature of y can also be varied. Typically, in the case of
regression, it can be a number. In the case of classification, it
corresponds to a label (for instance, the label “cat” in our example).
For now, you do not need to bother about how these data (images,
labels, etc.) are represented in a computer. For those without a
background in computer science, this will be briefly covered in
Subheading 3.3.
Learning will aim at finding a function f that can transform
x into y, that is, such that y = f(x). For now, f can be of any type.
Fig. 9 Main concepts underlying supervised learning, here in the case of classification. The aim is to be able to
recognize the content of a photograph (the input x) which amounts to assigning it a label (the output y). In other
words, we would like to have a function f that transforms x into y. In order to find the function f, we will make
use of a training set (x(1), y(1)), . . ., (x(n), y(n)) (which in our case is a set of photographs which have been
labeled). All images come from https://commons.wikimedia.org/ and have no usage restriction
To do so, we will make use of a training set, which consists of
n pairs of inputs and outputs (x^(1), y^(1)), . . ., (x^(n), y^(n)).
We are now going to search for the function f that makes the
minimum error over the n samples of the training set. In other
words, we are looking for the function which minimizes the average
error over the training set.
Let us call this average error the cost function:

J(f) = (1/n) ∑_{i=1}^{n} ℓ(y^(i), f(x^(i)))

where ℓ is a loss function that measures the error between the true
output y^(i) and the prediction f(x^(i)).
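As a minimal sketch (the data, models, and names below are illustrative, not from the chapter), the cost function is just the average of a loss over the training samples:

```python
import numpy as np

def cost(f, X, y, loss):
    """Average loss of model f over the n training samples."""
    return np.mean([loss(y_i, f(x_i)) for x_i, y_i in zip(X, y)])

# Least squares loss, used later in this chapter
squared_loss = lambda y, y_hat: (y - y_hat) ** 2

X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
f = lambda x: 2.0 * x  # a perfect model for these data: J(f) = 0
g = lambda x: x        # a worse model: J(g) = (1 + 4 + 9) / 3
```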
Table 1
Example where the input is a series of numbers. Here each patient is
characterized by several variables
3.3 Inputs and Features
In the previous section, we made no assumption on the nature of
the input x. It could be an image, a number, a text, etc.
The simplest form of input that one can consider is when x is a
single number. Examples include age, clinical scores, etc. However,
for most problems, characterization of a patient cannot be done
with a single number but requires a large set of measurements
(Table 1). In such a case, the input can be a series of numbers
x1, . . ., xp which can be arranged into a vector:

x = (x1, . . ., xp)⊤
Fig. 10 In a computer, an image is represented as an array of numbers. Each number corresponds to the gray
level of a given pixel. Here the example is a slice of an anatomical MRI which has been severely undersampled
so that the different pixels are clearly visible. Note that an anatomical MRI is actually a 3D image and would
thus be represented by a 3D array rather than by a 2D array. Image courtesy of Ninon Burgos
3.4 Illustration in a Simple Case
We will now illustrate step by step the above concepts in a very
simple case: univariate linear regression. Univariate means that the
input is a single number as in the example shown in Fig. 7. Linear
means that the model f will be a simple line. The input is a number
x and the output is a number y. The loss will be the least
squares loss: ℓ(y, f(x)) = (y − f(x))². The model f will be a linear
function of x that is f(x) = w1x + w0 and corresponds to the equa-
tion of a line, w1 being the slope of the line and w0 the intercept. To
further simplify things, we will consider the case where there is no
intercept, i.e., the line passes through the origin. Different values of
w1 correspond to different lines (and thus to different functions f )
and to different values of the cost function J( f ), which can be in
our case rewritten as J(w1) since f only depends on the parameter w1
(Fig. 11). The best model is the one for which J(w1) is minimal.
How can we find w1 such that J(w1) is minimal? We are going to
use the derivative of J, dJ/dw1. A minimum of J(w1) is necessarily such
that dJ/dw1 = 0 (in our specific case, the converse is also true). In our
case, it is possible to directly solve dJ/dw1 = 0. This will nevertheless
not be the case in general. Very often, it will not be possible to solve
this analytically. We will thus resort to an iterative algorithm. One
classical iterative method is gradient descent. In the general case,
f depends not on only one parameter w1 but on a set of parameters
(w1, . . ., wp) which can be assembled into a vector w. Thus, instead
of working with the derivative dJ/dw1, we will work with the gradient
∇wJ. The gradient is a vector that indicates the direction that one
should follow to climb along J. We will thus follow the opposite of
the gradient, hence the name gradient descent. This process is
illustrated in Fig. 12, together with the corresponding algorithm.
Fig. 11 We illustrate the concepts of supervised learning on a very simple case: univariate linear regression
with no intercept. Training samples correspond to the black circles. The different models f(x) = w1x
correspond to the different lines. Each model (and thus each value of the parameter w1) corresponds to a
value of the cost J(w1). The best model (the blue line) is the one which minimizes J(w1); here it corresponds to
the line with a slope w1 = 1
repeat
    w1 ← w1 − η (dJ/dw1)
until convergence
Fig. 12 Upper panel: Illustration of the concept of gradient descent in a simple case where the model f is
defined using only one parameter w1. The value of w1 is iteratively updated by following the opposite of the
gradient. Lower panel: Gradient descent algorithm where η is the learning rate, i.e., the speed at which w1 will
be updated
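The algorithm above can be sketched in a few lines. The training data and learning rate below are illustrative; the closed-form minimizer is included only to check the result:

```python
import numpy as np

# Training data roughly on the line y = 1 * x (no intercept)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

# Cost J(w1) = (1/n) * sum((y - w1*x)^2), so dJ/dw1 = (2/n) * sum((w1*x - y) * x)
w1 = 0.0    # initial slope
eta = 0.01  # learning rate
for _ in range(1000):
    grad = 2 * np.mean((w1 * x - y) * x)
    w1 -= eta * grad

# For this simple cost, dJ/dw1 = 0 can also be solved directly:
# w1* = sum(x * y) / sum(x^2)
w1_exact = np.sum(x * y) / np.sum(x * x)
```

After enough iterations the iterate converges to the analytical minimizer, which is exactly the situation described in the text: the analytical solution exists here, but gradient descent works even when it does not.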
4 Conclusion
Acknowledgements
The author would like to thank Johann Faouzi for his insightful
comments. This work was supported by the French government
under management of Agence Nationale de la Recherche as part of
the “Investissements d’avenir” program, reference ANR-19-P3IA-
0001 (PRAIRIE 3IA Institute) and reference ANR-10-IAIHU-06
(Institut Hospitalo-Universitaire ICM).
References
1. Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM J Res Dev 3(3):210–229
2. Russell S, Norvig P (2002) Artificial intelligence: a modern approach. Pearson, London
3. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5(4):115–133
4. Wiener N (1948) Cybernetics or control and communication in the animal and the machine. MIT Press, Cambridge
5. Hebb DO (1949) The organization of behavior. Wiley, New York
6. Turing AM (1950) Computing machinery and intelligence. Mind 59(236):433–460
7. McCarthy J, Minsky ML, Rochester N, Shannon CE (1955) A proposal for the Dartmouth summer research project on artificial intelligence. Research Report. http://raysolomonoff.com/dartmouth/boxa/dart564props.pdf
8. Newell A, Simon H (1956) The logic theory machine–a complex information processing system. IRE Trans Inf Theory 2(3):61–79
9. Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386
10. Buchanan BG, Shortliffe EH (1984) Rule-based expert systems: the MYCIN experiments of the Stanford Heuristic Programming Project. Addison-Wesley, Boston
11. McCarthy J (1960) Recursive functions of symbolic expressions and their computation by machine, part I. Commun ACM 3(4):184–195
12. Cardon D, Cointet JP, Mazières A, Libbrecht E (2018) Neurons spike back. Réseaux 5:173–220. https://neurovenge.antonomase.fr/RevengeNeurons_Reseaux.pdf
13. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
14. Le Cun Y (1985) Une procédure d'apprentissage pour réseau à seuil assymétrique. Cognitiva 85:599–604
15. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
16. Matan O, Baird HS, Bromley J, Burges CJC, Denker JS, Jackel LD, Le Cun Y, Pednault EPD, Satterfield WD, Stenard CE et al (1992) Reading handwritten digits: a zip code recognition system. Computer 25(7):59–63
17. Legendre AM (1806) Nouvelles méthodes pour la détermination des orbites des comètes. Firmin Didot
18. Pearson K (1901) On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11):559–572
19. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7(2):179–188
20. Loh WY (2014) Fifty years of classification and regression trees. Int Stat Rev 82(3):329–348
21. Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
22. Vapnik V (1999) The nature of statistical learning theory. Springer, Berlin
23. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory, pp 144–152
24. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
25. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, vol 25, pp 1097–1105
26. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
27. Abadi M, Agarwal A et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/, software available from tensorflow.org
28. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems, vol 32, pp 8026–8037
29. Chollet F et al (2015) Keras. https://github.com/fchollet/keras
30. Shortliffe E (1976) Computer-based medical consultations: MYCIN. Elsevier, Amsterdam
31. Mitchell T (1997) Machine learning. McGraw Hill, New York
32. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
33. Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
34. Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 2
Classic Machine Learning Methods
Abstract
In this chapter, we present the main classic machine learning methods. A large part of the chapter is devoted
to supervised learning techniques for classification and regression, including nearest neighbor methods,
linear and logistic regressions, support vector machines, and tree-based algorithms. We also describe the
problem of overfitting as well as strategies to overcome it. We finally provide a brief overview of unsuper-
vised learning methods, namely, for clustering and dimensionality reduction. The chapter does not cover
neural networks and deep learning as these will be presented in Chaps. 3, 4, 5, and 6.
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_2,
© The Author(s) 2023
26 Johann Faouzi and Olivier Colliot
2 Notations
X = (x^(1), . . ., x^(n))⊤, i.e., the n × p matrix whose entry (i, j) is
x_j^(i), and y = (y1, . . ., yn)⊤
3 Nearest Neighbor Methods

3.1 Metrics
Many metrics have been defined for various types of input data such
as vectors of real numbers, integers, or booleans. Among these
different types, vectors of real numbers are one of the most common
types of input data, for which the most commonly used metric
is the Euclidean distance, defined as:

∀x, x′ ∈ I,  ‖x − x′‖₂ = √( ∑_{j=1}^{p} (x_j − x′_j)² )
3.2 Neighborhood
The two most common definitions of the neighborhood rely on
either the number of neighbors or the radius around the given
point. Figure 1 illustrates the differences between both definitions.
The k-nearest neighbor method defines the neighborhood of a
given point x as the set of the k closest points to x:

N_k(x) = {x^(i)}_{i=1}^{k},  where the training samples are
reindexed so that d(x, x^(1)) ≤ . . . ≤ d(x, x^(n))

The radius neighbor method defines the neighborhood of a
given point x as the set of points whose dissimilarity to x is smaller
than the given radius, denoted by r:

N_r(x) = {x^(i) ∈ X | d(x, x^(i)) < r}
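Both definitions are available in scikit-learn [24]; the tiny one-dimensional dataset and query point below are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [1.0], [2.0], [5.0], [6.0]])

# k-nearest neighbors: the k closest points, whatever their distance
knn = NearestNeighbors(n_neighbors=2).fit(X)
dist, idx = knn.kneighbors([[1.5]])  # query point x = 1.5

# Radius neighbors: all points closer than r, however many there are
rnn = NearestNeighbors(radius=2.0).fit(X)
dist_r, idx_r = rnn.radius_neighbors([[1.5]])
```

For the query point 1.5, the two nearest neighbors are the samples at 1.0 and 2.0, while the radius of 2.0 additionally captures the sample at 0.0.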
3.3 Weights
The two most common approaches to compute the weights are
to use:
• Uniform weights (all the weights are equal):

∀i,  w_i^(x) = 1 / |N(x)|

• Weights inversely proportional to the dissimilarity:

∀i,  w_i^(x) = (1 / d(x^(i), x)) / ∑_j (1 / d(x^(j), x))
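The two weighting schemes can change the prediction, as the deliberately contrived example below shows (data and query are made up so that the two schemes disagree):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two samples of class 0, one sample of class 1
X = np.array([[0.0], [0.2], [1.0]])
y = np.array([0, 0, 1])

# Uniform weights: each of the k neighbors counts equally
clf_uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)

# Distance weights: closer neighbors count more (weight = 1 / distance)
clf_distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

# Query at 0.9: class 0 has two (far) neighbors, class 1 has one very close one
pred_u = clf_uniform.predict([[0.9]])   # majority vote -> class 0
pred_d = clf_distance.predict([[0.9]])  # weight 1/0.1 = 10 dominates -> class 1
```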
3.4 Neighbor Search
The brute-force method to compute the neighborhood for
n points with p features is to compute the metric for each pair of
inputs, which has a O(n²p) algorithmic complexity (assuming that
evaluating the metric for a pair of inputs has a complexity of O(p),
which is the case for most metrics). However, it is possible to
decrease this algorithmic complexity if the metric is a distance,
that is, if the metric d satisfies the following properties:
1. Non-negativity: ∀a, b, d(a, b) ≥ 0
2. Identity: ∀a, b, d(a, b) = 0 if and only if a = b
3. Symmetry: ∀a, b, d(a, b) = d(b, a)
4. Triangle inequality: ∀a, b, c, d(a, c) ≤ d(a, b) + d(b, c)
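In practice, libraries expose both strategies. A sketch with scikit-learn [24] (the dataset is synthetic and illustrative): tree-based structures such as k-d trees exploit the distance properties to prune comparisons, and return exactly the same neighbors as the brute-force search:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))

# Brute force: compares every pair of points, O(n^2 * p)
nn_brute = NearestNeighbors(n_neighbors=5, algorithm="brute").fit(X)

# k-d tree: uses the triangle inequality to skip most comparisons
nn_tree = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)

# Both return the same neighbors
_, idx_brute = nn_brute.kneighbors(X[:10])
_, idx_tree = nn_tree.kneighbors(X[:10])
```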
4 Linear Regression
Fig. 3 Ordinary least squares regression. The coefficients (i.e., the intercept and
the slope with a single predictor) are estimated by minimizing the sum of the
squared errors
⇒ X⊤X w⋆ = X⊤y
⇒ w⋆ = (X⊤X)⁻¹ X⊤y
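This closed-form solution can be checked numerically; the synthetic data below are illustrative (a column of ones is prepended so that the intercept is part of w):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
X = np.hstack([np.ones((n, 1)), X])  # column of ones for the intercept
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=n)  # nearly noiseless outputs

# Closed-form ordinary least squares: solve X^T X w = X^T y
# (np.linalg.solve is preferred over explicitly inverting X^T X)
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```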
5 Logistic Regression
Intuitively, linear regression consists in finding the line that best fits
the data: the true output should be as close to the line as possible.
For binary classification, one wants the line to separate both classes
as well as possible: the samples from one class should all be in one
subspace, and the samples from the other class should all be in the
other subspace, with the inputs being as far as possible from
the line.
Mathematically, for binary classification tasks, a linear model is
defined by a hyperplane splitting the input space into two subspaces
such that each subspace is characteristic of one class. For instance, a
line splits a plane into two subspaces in the two-dimensional case,
while a plane splits a three-dimensional space into two subspaces. A
hyperplane is defined by a vector w = (w0, w1, . . ., wp), and f(x) =
x⊤w corresponds to the signed distance between the input x and the
hyperplane w: in one subspace, the distance with any input is always
positive, whereas in the other subspace, the distance with any input
is always negative. Figure 4 illustrates the decision function in the
two-dimensional case where both classes are linearly separable.
The sign of the signed distance corresponds to the decision
function of a linear binary classification model:
ŷ = sign(f(x)) = +1 if f(x) > 0, and −1 if f(x) < 0
Classic Machine Learning Methods 33
We can see that the w coefficients that maximize the likelihood are
also the coefficients that minimize the sum of the logistic loss values,
with the logistic loss being defined as:
ℓ_logistic(y, f(x)) = log(1 + exp(−y f(x))) / log(2)
Unlike for linear regression, there is no closed formula for this
minimization. One thus needs to use an optimization method
such as gradient descent which was presented in Subheading 3 of
Chap. 1. In practice, more sophisticated approaches such as quasi-
Newton methods and variants of stochastic gradient descent are
often used. The main concepts underlying logistic regression can be
found in Box 3.
Box 3 (continued)

f(x) = w0 + ∑_{j=1}^{p} w_j x_j

P(y = +1 | x = x) = σ(f(x)) = 1 / (1 + exp(−f(x)))
• Estimation: likelihood maximization.
• Regularization: can be penalized to avoid overfitting (ℓ₂
penalty), to perform feature selection (ℓ₁ penalty), or both
(elastic-net penalty).
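These options map directly onto scikit-learn's `LogisticRegression` [24]; the synthetic dataset and parameter values below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # labels from a linear rule

# Likelihood maximization with an l2 penalty (the scikit-learn default)
clf_l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# l1 penalty for feature selection (requires a solver that supports it)
clf_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# Predicted probabilities P(y | x) come from the sigmoid of f(x)
proba = clf_l2.predict_proba(X[:1])
```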
[Figure: training and test error as a function of model complexity.
An underfitting model (high bias, low variance) has high training
and test errors; an overfitting model (low bias, high variance) has a
low training error but a high test error]
7 Penalized Models
This is, for instance, the case of the linear and logistic regression
methods presented in the previous sections.
7.1 Penalties
The main idea is to introduce a penalty term Pen(w) that will
constrain the parameters w to have some desired properties. The
most common penalties are the ℓ₂ penalty, the ℓ₁ penalty, and the
elastic-net penalty.

7.1.1 ℓ₂ Penalty
The ℓ₂ penalty is the squared ℓ₂ norm of the parameters:

ℓ₂(w) = ‖w‖₂² = ∑_{j=1}^{p} w_j²
7.1.2 ℓ₁ Penalty
The ℓ₂ penalty forces the values of the parameters not to be too
large, but does not incentivize making small values tend to zero.
Indeed, the square of a small value is even smaller. When the
number of features is large, or when interpretability is important,
it can be useful to make the model select the most important
features. The corresponding metric is the ℓ₀ "norm" (which is not
a proper norm in the mathematical sense), defined as the number of
nonzero elements:

ℓ₀(w) = ‖w‖₀ = ∑_{j=1}^{p} 1_{w_j ≠ 0}
7.1.3 Elastic-Net Penalty
Both the ℓ₂ and ℓ₁ penalties have their upsides and downsides. In
order to try to obtain the best of both, one can add both
penalties in the objective function. The combination of both penalties
is known as the elastic-net penalty:

EN(w, α) = α‖w‖₁ + (1 − α)‖w‖₂²

where α ∈ [0, 1] is a hyperparameter representing the proportion of
the ℓ₁ penalty compared to the ℓ₂ penalty.
which reads as "find the optimal parameters that minimize the cost
function J among all the parameters w that satisfy Pen(w) < c" for a
positive real number c. Figure 7 illustrates the optimal solution of a
simple linear regression task with different constraints.
Fig. 6 Unit balls of the ℓ₀, ℓ₁, and ℓ₂ norms. For each norm, the set of points in
ℝ² whose norm is equal to 1 is plotted. The ℓ₁ norm is the best convex
approximation to the ℓ₀ norm. Note that the lines for the ℓ₀ norm extend to
−∞ and +∞ but are cut for plotting reasons
min_w ‖y − Xw‖₂² + λ‖w‖₂²
min_w ‖y − Xw‖₂² + λ‖w‖₁
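The effect of the three penalties can be sketched with scikit-learn [24]; the sparse ground truth and the penalty strengths below are illustrative (note that scikit-learn calls the regularization parameter `alpha` rather than λ):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]  # only 3 informative features out of 10
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # l2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty: sets some coefficients to 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of both

n_zero_ridge = int(np.sum(ridge.coef_ == 0))  # ridge keeps all coefficients nonzero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))  # lasso zeroes out noise features
```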
8 Support Vector Machine

8.1 Original Formulation
The original support vector machine was invented in 1963 and was
a linear binary classification method [10]. Figure 9 illustrates the
main concept of its original version. When both classes are linearly
Fig. 9 Support vector machine classifier with linearly separable classes. When
two classes are linearly separable, there exists an infinite number of hyperplanes
separating them (left). The decision function of the support vector machine
classifier is the hyperplane that maximizes the margin, that is, the distance
between the hyperplane and the closest points to the hyperplane (right). Support
vectors are highlighted with a black circle surrounding them
Logistic loss: ℓ_logistic(y, f(x)) = log(1 + exp(−y f(x))) / log(2)
Hinge loss: ℓ_hinge(y, f(x)) = max(0, 1 − y f(x))
Fig. 11 Binary classification losses. The logistic loss is always positive, even
when the point is accurately classified with high confidence (i.e., when
yf(x) ≫ 0), whereas the hinge loss is equal to zero when the point is accurately
classified with good confidence (i.e., when yf(x) ≥ 1)
8.2 General Formulation with Kernels
The SVM was later updated to non-linear decision functions with
the use of kernels [12].
In order to have a non-linear decision function, one could map
the input space I into another space (often called the feature space),
denoted by G, using a function denoted by ϕ:

ϕ : I → G
x ↦ ϕ(x)
The decision function would still be linear (with a dot product), but
in the feature space:
f ðxÞ = ϕðxÞ⊤ w
Unfortunately, solving the corresponding minimization problem is
not trivial:
min_w ∑_{i=1}^{n} max(0, 1 − y^(i) ϕ(x^(i))⊤w) + (1/(2C)) ‖w‖₂²    (3)
where α solves:
min_α ∑_{i=1}^{n} max(0, 1 − y^(i) [Kα]_i) + (1/(2C)) α⊤Kα
46 Johann Faouzi and Olivier Colliot
• The linear kernel: K(x, x′) = x⊤x′
• The polynomial kernel: K(x, x′) = (γ x⊤x′ + c₀)^d
Fig. 12 Impact of the kernel on the decision function of a support vector machine
classifier. A non-linear kernel allows for a non-linear decision function
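The impact of the kernel can be sketched with scikit-learn's `SVC` [24]; the circular synthetic dataset below is illustrative and deliberately not linearly separable:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)  # circular boundary

# Linear kernel K(x, x') = x^T x' -> linear decision function
svc_linear = SVC(kernel="linear", C=1.0).fit(X, y)

# RBF kernel -> non-linear decision function in the input space
svc_rbf = SVC(kernel="rbf", C=1.0).fit(X, y)

acc_linear = svc_linear.score(X, y)  # a line cannot separate a circle
acc_rbf = svc_rbf.score(X, y)        # the RBF kernel can
```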
Mean squared error (MSE): ℓ_MSE(y, ŷ) = (y − ŷ)²
Mean absolute error (MAE): ℓ_MAE(y, ŷ) = |y − ŷ|
ε-insensitive loss: ℓ_ε-insensitive(y, ŷ) = max(0, |y − ŷ| − ε)
Fig. 13 Regression losses. The squared error loss takes very small values for
small errors and very large values for large errors, whereas the absolute error
loss takes small values for small errors and large values for large errors. Both
losses take small but nonzero values when the error is small. On the contrary,
the ε-insensitive loss is null when the error is small and is otherwise equal to the
absolute error loss minus ε. When computed over several samples, the squared
and absolute error losses are often referred to as mean squared error (MSE) and
mean absolute error (MAE), respectively
9 Multiclass Classification
∀k ∈ {1, . . ., q},  P(y = C_k | x = x) = exp(x⊤w_k) / ∑_{j=1}^{q} exp(x⊤w_j)

−log(L(w_1, . . ., w_q))
= −∑_{i=1}^{n} ∑_{k=1}^{q} 1_{y^(i)=C_k} log P(y = C_k | x = x^(i))
= −∑_{i=1}^{n} ∑_{k=1}^{q} 1_{y^(i)=C_k} log [ exp(x^(i)⊤w_k) / ∑_{j=1}^{q} exp(x^(i)⊤w_j) ]
= ∑_{i=1}^{n} ℓ_crossentropy(y^(i), softmax(x^(i)⊤w_1, . . ., x^(i)⊤w_q))
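A sketch of this multinomial (softmax) formulation with scikit-learn [24]; the three-class synthetic dataset below is illustrative, and in recent scikit-learn versions the default solver fits the multinomial formulation for multiclass problems:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Three classes, each drawn around a different center
centers = np.array([[0, 0], [4, 0], [0, 4]])
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

# One weight vector w_k per class; probabilities via the softmax
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:1])  # P(y = C_k | x) for each of the 3 classes
```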
9.3 One-vs-One
Another strategy is to fit a binary classifier for each pair of classes:
this strategy is known as one-vs-one. The advantage of this strategy is
that the classes in each binary classification task are "pure", in the
sense that different classes are never merged into a single class.
However, the number of binary classifiers that needs to be trained
is larger for the one-vs-one strategy (q(q − 1)/2) than for the one-
vs-rest strategy (q). Nonetheless, for the one-vs-one strategy, the
number of training samples in each binary classification task is
smaller than the total number of samples, which makes training
each binary classifier usually faster. Another drawback is that this
strategy is less interpretable compared to the one-vs-rest strategy, as
the predicted class corresponds to the class obtaining the most
votes.¹
9.4 Error-Correcting Output Code
A substantially different strategy, inspired by the theory of error-
correcting codes, consists in merging a subset of classes into one
class and the other subset into the other class, for each binary
classification task. This data is often called the code book and can
be represented as a matrix whose rows correspond to the classes and
whose columns correspond to the binary classification tasks. The
matrix consists only of −1 and +1 values that represent the
corresponding label for each class and for each binary task.² For
¹ The confidences are actually taken into account, but only in the event of a tie.
² The values are 0 and 1 when the classifier does not return scores but only probabilities.
any input, each binary classifier returns the score (or probability)
associated with the positive class. The predicted class for this input
is the class whose corresponding vector is the most similar to the
vector of scores, with similarity being assessed with the Euclidean
distance (the lower, the more similar). There exist advanced strate-
gies to define the code book, but it has been argued that a random
code book usually gives results as good as a sophisticated one [16].
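The three multiclass strategies are available as meta-estimators in scikit-learn [24]; the example below uses the Iris dataset (q = 3 classes) and an illustrative `code_size`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import (
    OneVsOneClassifier,
    OneVsRestClassifier,
    OutputCodeClassifier,
)

X, y = load_iris(return_X_y=True)
q = len(np.unique(y))  # q = 3 classes

# One-vs-rest: q binary classifiers
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-one: q * (q - 1) / 2 binary classifiers
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Error-correcting output codes with a random code book
ecoc = OutputCodeClassifier(
    LogisticRegression(max_iter=1000), code_size=2, random_state=0
).fit(X, y)

n_ovr = len(ovr.estimators_)
n_ovo = len(ovo.estimators_)
```

Note that with q = 3 the two counts coincide (3 = 3·2/2); the gap between q and q(q − 1)/2 only grows for larger numbers of classes.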
log P(y = C_k | x = x)
= log p_{x|y=C_k}(x) + log P(y = C_k) − log p_x(x)
= −(1/2)(x − μ_k)⊤Σ_k⁻¹(x − μ_k) − (1/2) log|Σ_k| + log P(y = C_k)
  − (p/2) log(2π) − log p_x(x)
= −(1/2) x⊤Σ_k⁻¹x + x⊤Σ_k⁻¹μ_k − (1/2) μ_k⊤Σ_k⁻¹μ_k − (1/2) log|Σ_k|
  + log P(y = C_k) − (p/2) log(2π) − log p_x(x)    (4)
It is also possible to make further assumptions on the covari-
ance matrices that lead to different models. In this section, we
present the most commonly used ones: naive Bayes, linear discrimi-
nant analysis, and quadratic discriminant analysis. Figure 16 illus-
trates the covariance matrices and the decision functions for these
models in the two-dimensional case.
10.1 Naive Bayes
The naive Bayes model assumes that, conditionally to each class C_k,
the features are independent and have the same variance σ_k²:

∀k, Σ_k = σ_k² I_p

Equation 4 can thus be further simplified:

log P(y = C_k | x = x)
= −(1/(2σ_k²)) x⊤x + (1/σ_k²) x⊤μ_k − (1/(2σ_k²)) μ_k⊤μ_k − p log σ_k
  + log P(y = C_k) − (p/2) log(2π) − log p_x(x)
= x⊤W_k x + x⊤w_k + w_0k + s
where:
• W_k = −(1/(2σ_k²)) I_p is the matrix of the quadratic term for class C_k.
• w_k = (1/σ_k²) μ_k is the vector of the linear term for class C_k.
• w_0k = −(1/(2σ_k²)) μ_k⊤μ_k − p log σ_k + log P(y = C_k) is the intercept for
class C_k.
• s = −(p/2) log(2π) − log p_x(x) is a term that does not depend on
class C_k.
Therefore, naive Bayes is a quadratic model. The probabilities for
input x to belong to each class C_k can then easily be computed:

P(y = C_k | x = x) = exp(x⊤W_k x + x⊤w_k + w_0k) / ∑_{j=1}^{q} exp(x⊤W_j x + x⊤w_j + w_0j)
When one further assumes that the variance is the same for all
classes (σ_k = σ for all k), the quadratic term no longer depends
on the class, and the model becomes log P(y = C_k | x = x) = x⊤w_k + w_0k + s,
where:
• w_k = (1/σ²) μ_k is the vector of the linear term for class C_k.
• w_0k = −(1/(2σ²)) μ_k⊤μ_k + log P(y = C_k) is the intercept for
class C_k.
• s = −(1/(2σ²)) x⊤x − p log σ − (p/2) log(2π) − log p_x(x) is a term that
does not depend on class C_k.
In this case, naive Bayes becomes a linear model.
10.2 Linear Discriminant Analysis
Linear discriminant analysis (LDA) makes the assumption that all
the covariance matrices are identical but otherwise arbitrary:

∀k, Σ_k = Σ
Therefore, Eq. 4 can be further simplified:

log P(y = C_k | x = x)
= −(1/2)(x − μ_k)⊤Σ⁻¹(x − μ_k) − (1/2) log|Σ| + log P(y = C_k)
  − (p/2) log(2π) − log p_x(x)
= −(1/2)[x⊤Σ⁻¹x − x⊤Σ⁻¹μ_k − μ_k⊤Σ⁻¹x + μ_k⊤Σ⁻¹μ_k]
  − (1/2) log|Σ| + log P(y = C_k) − (p/2) log(2π) − log p_x(x)
= x⊤Σ⁻¹μ_k − (1/2) μ_k⊤Σ⁻¹μ_k + log P(y = C_k)
  − (1/2) x⊤Σ⁻¹x − (1/2) log|Σ| − (p/2) log(2π) − log p_x(x)
= x⊤w_k + w_0k + s
where:
• w_k = Σ⁻¹μ_k is the vector of coefficients for class C_k.
• w_0k = −(1/2) μ_k⊤Σ⁻¹μ_k + log P(y = C_k) is the intercept for class C_k.
• s = −(1/2) x⊤Σ⁻¹x − (1/2) log|Σ| − (p/2) log(2π) − log p_x(x) is a term
that does not depend on class C_k.
Therefore, linear discriminant analysis is a linear model. When Σ is
diagonal, linear discriminant analysis is identical to naive Bayes with
identical conditional variances.
The probabilities for input x to belong to each class C_k can then
easily be computed:

P(y = C_k | x = x) = exp(x⊤w_k + w_0k) / ∑_{j=1}^{q} exp(x⊤w_j + w_0j)
10.3 Quadratic Discriminant Analysis
Quadratic discriminant analysis (QDA) makes no further
assumption on the covariance matrices, so Eq. 4 can only be
rearranged as:

log P(y = C_k | x = x)
= −(1/2) x⊤Σ_k⁻¹x + x⊤Σ_k⁻¹μ_k − (1/2) μ_k⊤Σ_k⁻¹μ_k − (1/2) log|Σ_k|
  + log P(y = C_k) − (p/2) log(2π) − log p_x(x)
= x⊤W_k x + x⊤w_k + w_0k + s

where:
• W_k = −(1/2) Σ_k⁻¹ is the matrix of the quadratic term for class C_k.
• w_k = Σ_k⁻¹μ_k is the vector of the linear term for class C_k.
• w_0k = −(1/2) μ_k⊤Σ_k⁻¹μ_k − (1/2) log|Σ_k| + log P(y = C_k) is the intercept
for class C_k.
• s = −(p/2) log(2π) − log p_x(x) is a term that does not depend on
class C_k.

Therefore, quadratic discriminant analysis is a quadratic model.
The probabilities for input x to belong to each class C_k can then
easily be computed:

P(y = C_k | x = x) = exp(x⊤W_k x + x⊤w_k + w_0k) / ∑_{j=1}^{q} exp(x⊤W_j x + x⊤w_j + w_0j)
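The three models can be fit with scikit-learn [24] on synthetic Gaussian data (the dataset below is illustrative; note that scikit-learn's `GaussianNB` estimates one variance per feature and per class, a slightly more general assumption than the single σ_k² above):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
# Two Gaussian classes with different means
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=1.0, size=(100, 2)),
])
y = np.repeat([0, 1], 100)

nb = GaussianNB().fit(X, y)                      # diagonal covariances
lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance -> linear
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # class covariances -> quadratic

proba = lda.predict_proba(X[:1])  # P(y = C_k | x) for each class
```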
11 Tree-Based Methods
11.1 Decision Tree
Binary decisions based on conditional statements are frequently
used in everyday life because they are intuitive and easy to
understand. Figure 17 illustrates a general approach when someone is ill.
Depending on conditional statements (severity of symptoms, abil-
ity to quickly consult a specialist), the decision (consult your gen-
eral practitioner or a specialist, or call for emergency services) is
different. Models with such an architecture are often used in
machine learning and are called decision trees.
A decision tree is an algorithm containing only conditional
statements and can be represented with a tree [17]. This graph
consists of:
• Decision nodes for all the conditional statements
• Branches for the potential outcomes of each decision node
• Leaf nodes for the final decision
Figure 18 illustrates a decision tree and its corresponding decision
function. For a given sample, the final decision is obtained by
following its corresponding path, starting at the root node.
A decision tree recursively partitions the feature space in order
to group samples with the same labels or similar target values. At
each node, the objective is to find the best (feature, threshold) pair
so that both subsets obtained with this split are as pure as possible.
Fig. 18 A decision tree: (left) the rules learned by the decision tree and (right) the
corresponding decision function
Fig. 19 Illustration of Gini index and entropy. The entropy function takes larger
values than the Gini index, especially for p_k < 0.8, and is thus more discrimi-
native against heterogeneous subsets (when most classes only represent a
small proportion of the samples) than the Gini index
Figure 19 illustrates the values of the Gini index and the entropy
for a single class C k and for different proportions of samples pk. One
can see that the entropy function takes larger values than the Gini
index, especially for pk < 0.8. Since the sum of the proportions is
equal to 1, most classes only represent a small proportion of the
samples. Therefore, a simple interpretation is that entropy is more
discriminative against heterogeneous subsets than the Gini index.
Misclassification only takes into account the proportion of the most
common class and tends to be even less discriminative against
heterogeneous subsets than both entropy and Gini index.
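As a concrete illustration, both impurity criteria can be computed directly from the class proportions; this is a minimal plain-Python sketch (the function names are ours, not from any particular library):

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum_k p_k * log2(p_k) over the class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A pure subset has zero impurity; a balanced two-class subset is maximally impure.
print(gini(["a", "a", "b", "b"]))     # → 0.5
print(entropy(["a", "a", "b", "b"]))  # → 1.0
```

Note that for the balanced subset the entropy (1.0) is indeed larger than the Gini index (0.5), as in Fig. 19.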
For regression tasks, the mean error from a reference value
(such as the mean or the median) is often used as the impurity
criterion:
• Mean squared error: $\frac{1}{|S|} \sum_{y \in S} (y - \bar{y})^2$ with $\bar{y} = \frac{1}{|S|} \sum_{y \in S} y$
• Mean absolute error: $\frac{1}{|S|} \sum_{y \in S} |y - \mathrm{median}_S(y)|$
Theoretically, a tree can grow until every leaf node is perfectly
pure. However, such a tree would have a lot of branches and would
be very complex, making it prone to overfitting. Several strategies
are commonly used to limit the size of the tree. One approach
consists in growing the tree with no restriction and then pruning
the tree, that is, replacing subtrees with nodes [17]. Other popular
strategies to limit the complexity of the tree are usually applied
while the tree is grown and include setting:
• A maximum depth for the tree
• A minimum number of samples required to be at an internal
node
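The search for the best (feature, threshold) pair described earlier can be sketched as an exhaustive scan that minimizes the weighted Gini impurity of the two subsets; `best_split` below is a hypothetical helper of ours, not the exact routine of any library:

```python
def best_split(X, y):
    """Exhaustively search the (feature, threshold) pair whose split
    minimizes the weighted Gini impurity of the two resulting subsets."""
    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    n = len(y)
    best = (None, None, float("inf"))
    for j in range(len(X[0])):                  # candidate feature
        for t in sorted({x[j] for x in X}):     # candidate threshold
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue                        # skip degenerate splits
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

X = [[1.0, 0.0], [2.0, 0.0], [8.0, 1.0], [9.0, 1.0]]
y = [-1, -1, +1, +1]
feature, threshold, impurity = best_split(X, y)   # a perfectly pure split exists here
```

A full tree simply applies this search recursively to each subset until a stopping criterion (such as the maximum depth above) is met.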
11.2 Random Forest One limitation of decision trees is their simplicity. Decision trees
tend to use a small fraction of the features in their decision function.
In order to use more features in the decision tree, growing a larger
tree is required, but large trees tend to suffer from overfitting, that
is, having a low bias but a high variance. One solution to decrease the variance without increasing the bias much is to build an ensemble of trees with randomness, hence the name random forest
[18]. An overview of random forests can be found in Box 5.
To obtain trees that are not perfectly correlated (and thus actually different from one another), each tree is built using only a subset of the training samples, obtained with random sampling.
Moreover, for each decision node of each tree, only a subset of
the features are considered to find the best split.
The final prediction is obtained by averaging the predictions of
each tree. For classification tasks, the predicted class is either the
most commonly predicted class (hard-voting) or the one with the
highest mean probability estimate (soft-voting) across the trees.
For regression tasks, the predicted value is usually the mean of the
predicted values across the trees.
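The two ingredients — bootstrap sampling of the training set and hard voting across trees — can be sketched in plain Python (the helper names are ours; feature subsampling at each node would additionally be handled inside the tree-growing routine):

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw n samples with replacement: the training set for one tree."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def hard_vote(predictions):
    """Hard voting: the final class is the most commonly predicted one."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
X, y = [[0], [1], [2], [3]], [0, 0, 1, 1]
Xb, yb = bootstrap_sample(X, y, rng)   # resampled training set for one tree
print(hard_vote([1, 0, 1, 1, 0]))      # → 1
```

Soft voting would instead average the per-class probability estimates of the trees and pick the class with the highest mean probability.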
12 Clustering
[Figure: k-means clustering of two-dimensional data into three clusters, with the centroid of each cluster marked]
$$\mu_j = \frac{1}{|X_j|} \sum_{x^{(i)} \in X_j} x^{(i)}$$
The centroids fully define the set of clusters because each sample is
assigned to the cluster whose centroid is the closest.
The k-means algorithm aims at finding centroids that minimize
the inertia, also known as within-cluster sum-of-squares criterion:
$$\min_{\{\mu_1, \ldots, \mu_k\}} \sum_{j=1}^{k} \sum_{x^{(i)} \in X_j} \| x^{(i)} - \mu_j \|_2^2$$
$$\forall j \in \{1, \ldots, k\}, \quad X_j = \left\{ x^{(i)} \in X \;\middle|\; \| x^{(i)} - \mu_j \|_2^2 = \min_l \| x^{(i)} - \mu_l \|_2^2 \right\}$$
Third, inertia makes the assumption that the clusters are convex
and isotropic. The k-means algorithm may yield poor results when
this assumption does not hold, such as with elongated clusters or
manifolds with irregular shapes.
Fourth, the Euclidean distance tends to be inflated (i.e., the
ratio of the distances of the nearest and farthest neighbors to a
given target is close to 1) in high-dimensional spaces, making
inertia a poor criterion in such spaces [23]. One can alleviate this
issue by running a dimensionality reduction method such as princi-
pal component analysis prior to the k-means algorithm.
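The alternation described above — assign each sample to its closest centroid, then recompute each centroid as the mean of its cluster — can be sketched with NumPy; `kmeans` below is our own minimal version of Lloyd's algorithm, not a library routine:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignments and centroid updates
    to (locally) minimize the inertia."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned samples.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    inertia = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, inertia

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids, labels, inertia = kmeans(X, k=2)   # recovers the two obvious blobs
```

Each iteration can only decrease the inertia, but the result depends on the initialization, which is why practical implementations run the algorithm several times.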
12.2 Gaussian Mixture Model A mixture model makes the assumption that each sample is generated from a mixture of several independent distributions.
Let k be the number of distributions. Each distribution Fj is
characterized by its probability of being picked, denoted by π j, and
its density pj with parameters θj, denoted by pj(; θj). Let Δ = (Δ1,
. . ., Δk) be a vector-valued random variable such that:
$$\sum_{j=1}^{k} \Delta_j = 1 \quad \text{and} \quad \forall j \in \{1, \ldots, k\},\ P(\Delta_j = 1) = 1 - P(\Delta_j = 0) = \pi_j$$
Fig. 22 Gaussian mixture model. For each estimated distribution, the mean vector and the ellipse consisting of all the points within one standard deviation of the mean are plotted
$$\log(L(\theta)) = \sum_{i=1}^{n} \log\left( p_X(x^{(i)}; \theta) \right) = \sum_{i=1}^{n} \log\left( \sum_{j=1}^{k} \pi_j\, p_j(x^{(i)}; \theta_j) \right)$$
Result: mean vectors {μj}, covariance matrices {Σj}, and probabilities {πj}, for j = 1, ..., k
Initialize the mean vectors {μj}, covariance matrices {Σj}, and probabilities {πj};
while not converged do
  E-step: compute the posterior probability γi(j) for each sample x(i) to have been generated from distribution Fj:
  $$\forall i \in \{1, \ldots, n\},\ \forall j \in \{1, \ldots, k\}, \quad \gamma_i(j) = \frac{\pi_j\, p_j(x^{(i)}; \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l\, p_l(x^{(i)}; \mu_l, \Sigma_l)}$$
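The E-step above can be sketched with NumPy for Gaussian components; `gaussian_pdf` and `e_step` are our own helper names:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = len(mu)
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    # Quadratic form diff_i^T inv diff_i for every row i at once
    return norm * np.exp(-0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff))

def e_step(X, pis, mus, Sigmas):
    """Posterior probabilities gamma_i(j): one row per sample, one column
    per mixture component, each row summing to 1."""
    dens = np.column_stack([pi * gaussian_pdf(X, mu, S)
                            for pi, mu, S in zip(pis, mus, Sigmas)])
    return dens / dens.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [10.0, 10.0]])
mus = [np.zeros(2), np.full(2, 10.0)]
Sigmas = [np.eye(2), np.eye(2)]
gamma = e_step(X, [0.5, 0.5], mus, Sigmas)   # each sample claimed by its own component
```

The M-step would then re-estimate πj, μj, and Σj as γ-weighted averages before the next E-step.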
13 Dimensionality Reduction
[Figure 23 axes: the original features explain 52.49% and 47.51% of the variance; after principal component analysis, the first dimension explains 94.55% and the second 5.45%]
Fig. 23 Illustration of principal component analysis. On the upper figure, the training data in the original space
(blue points with black axes) is plotted. Both features explain about the same amount of the total variance,
although one can clearly see a linear pattern. Principal component analysis learns a new Cartesian coordinate
system based on the input data (red axes). On the lower figure, the training data in the new coordinate system
is plotted (green points with red axes). The first dimension explains much more variance than the second
dimension
T =XW
• Each column of W maximizes the explained variance.
68 Johann Faouzi and Olivier Colliot
and then finding the unit vector that explains the maximum vari-
ance from this new data matrix:
$$w_k = \arg\max_{\|w\| = 1} \left\{ \| \hat{X}_k w \| \right\} = \arg\max_{w \in \mathbb{R}^p} \frac{w^\top \hat{X}_k^\top \hat{X}_k w}{w^\top w}$$
One can show that this quantity is maximized by the eigenvector associated with the kth largest eigenvalue of the X⊤X matrix.
Therefore, the matrix W is the matrix whose columns are the
eigenvectors of the X⊤X matrix, sorted by descending order of
their associated eigenvalues.
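This eigendecomposition view of principal component analysis can be sketched with NumPy on centered data (the function name is ours):

```python
import numpy as np

def pca(X, l):
    """PCA via eigendecomposition of X^T X on centered data: the columns
    of W are the eigenvectors sorted by decreasing eigenvalue."""
    Xc = X - X.mean(axis=0)                  # center the data
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(eigvals)[::-1]        # descending eigenvalues
    W = eigvecs[:, order[:l]]                # keep the first l components
    explained = eigvals[order] / eigvals.sum()
    return Xc @ W, W, explained

# Two strongly correlated features: one component captures almost everything.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=100)])
T, W, explained = pca(X, l=1)
```

On this toy data, the first principal component explains nearly all the variance, mirroring the situation depicted in Fig. 23.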
13.1.2 Truncated Decomposition Since each principal component iteratively maximizes the remaining variance, the first principal components explain most of the total variance, while the last ones explain a tiny proportion of the
total variance. Therefore, keeping only a subset of the ordered
principal components usually gives a good representation of the
input data.
Mathematically, given a number of dimensions l, the new rep-
resentation is obtained by truncating the matrix of principal com-
ponents W to only keep the first l columns, resulting in the
submatrix W:,:l:
[Figure 24: projection of the Iris flower dataset onto its first two principal components, which explain 92.46% and 5.31% of the variance; the setosa, versicolor, and virginica classes are shown with different markers]
$$\tilde{T} = X W_{:,:l}$$
Figure 24 illustrates the use of principal component analysis as
dimensionality reduction. The Iris flower dataset consists of 50 sam-
ples for each of 3 iris species (setosa, versicolor, and virginica) for
which 4 features were measured, the length and the width of the
sepals and petals, in centimeters. The projection of each sample on
the first two principal components is shown in this figure.
samples from the same class are close to each other, while projections of samples from different classes are far from each other).
Mathematically, the objective is to find the matrix W ∈p × l
(with l ≤ k - 1) that maximizes the between-class scatter while also
minimizing the within-class scatter:
$$\max_W \ \mathrm{tr}\left( (W^\top S_w W)^{-1} (W^\top S_b W) \right)$$
where S_w denotes the within-class scatter matrix and S_b the between-class scatter matrix.
One can show that the W matrix consists of the first
l eigenvectors of the matrix S w- 1 S b with the corresponding eigen-
values being sorted in descending order. Just as in principal com-
ponent analysis, the corresponding eigenvalues can be used to
determine the contribution of each dimension. However, the criterion for linear discriminant analysis is different from the one of principal component analysis: it is to maximize the separability of the classes instead of maximizing the explained variance.
Figure 25 illustrates the use of linear discriminant analysis as a
dimensionality reduction technique. We use the same Iris flower
dataset as in Fig. 24 illustrating principal component analysis. The
projection of each sample on the learned two-dimensional space is
shown, and one can see that the first (horizontal) axis is more
discriminative of the three classes with linear discriminant analysis
than with principal component analysis.
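Following the eigendecomposition described above, linear discriminant analysis can be sketched with NumPy (the function name `lda` is ours, and the toy two-class data is purely illustrative):

```python
import numpy as np

def lda(X, y, l):
    """LDA projection: the columns of W are the leading eigenvectors
    of S_w^{-1} S_b (within- and between-class scatter matrices)."""
    mean = X.mean(axis=0)
    p = X.shape[1]
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)              # within-class scatter
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * diff @ diff.T              # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]         # descending eigenvalues
    W = eigvecs[:, order[:l]].real
    return X @ W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
Z = lda(X, y, l=1)   # one dimension suffices for two classes (l <= k - 1)
```

On this toy data, the single learned dimension cleanly separates the projections of the two classes.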
14 Kernel Methods
[Figure 25: projection of the Iris flower dataset onto the two-dimensional space learned by linear discriminant analysis; the setosa, versicolor, and virginica classes are shown with different markers, the second dimension contributing only 0.88%]
14.1 Kernel Ridge Regression Kernel ridge regression combines ridge regression with the kernel trick and thus learns a linear function in the space induced by the respective kernel and the training data [2]. For non-linear kernels,
this corresponds to a non-linear function in the original input
space.
The cost function can be simplified using the specific form of the
possible functions:
$$\sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}) \right)^2 + \lambda \| f \|^2$$
$$= \sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=1}^{n} \alpha_j k(x^{(i)}, x^{(j)}) \right)^2 + \lambda \left\| \sum_{i=1}^{n} \alpha_i K(\cdot, x^{(i)}) \right\|^2$$
$$= \sum_{i=1}^{n} \left( y^{(i)} - \alpha^\top K_{:,i} \right)^2 + \lambda\, \alpha^\top K \alpha$$
$$= \| y - K \alpha \|_2^2 + \lambda\, \alpha^\top K \alpha$$
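One can show that setting the gradient of this cost to zero yields the closed-form solution α = (K + λI)^{-1} y. A minimal NumPy sketch with an RBF kernel follows (the function names are ours, and the sine-fitting example is purely illustrative):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=0.1, gamma=1.0):
    """Minimizing ||y - K a||^2 + lam * a^T K a gives a = (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """Prediction is a kernel-weighted combination of the training samples."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

X = np.linspace(-3, 3, 50)[:, None]
y = np.sin(X).ravel()
alpha = kernel_ridge_fit(X, y, lam=1e-3)
y_hat = kernel_ridge_predict(X, alpha, X)   # close fit to the sine curve
```

With a non-linear kernel such as this one, the learned function is non-linear in the original input space even though it is linear in the induced space.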
15 Conclusion
Acknowledgements
The authors would like to thank Hicham Janati for his fruitful
remarks. The authors would like to acknowledge the extensive
documentation of the scikit-learn Python package, in particular its
user guide, for the relevant information and references provided.
We used the NumPy [28], matplotlib [29], and scikit-learn [27]
Python packages to generate all the figures. This work was sup-
ported by the French government under management of Agence
Nationale de la Recherche as part of the “Investissements d’avenir”
program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute)
and reference ANR-10-IAIHU-06 (Agence Nationale de la
Recherche-10-IA Institut Hospitalo-Universitaire-6), and by the
European Union H2020 program (grant number 826421, project
TVB-Cloud).
References
1. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge, MA. http://www.deeplearningbook.org
2. Murphy KP (2012) Machine learning: a probabilistic perspective. The MIT Press, Cambridge, MA
3. Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517
4. Omohundro SM (1989) Five balltree construction algorithms. Tech. rep., International Computer Science Institute
5. Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
6. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer series in statistics. Springer, New York
7. Tikhonov AN, Arsenin VY, John F (1977) Solutions of ill-posed problems. Wiley, Washington, New York
8. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Series B (Methodological) 58(1):267–288
9. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Series B (Statistical Methodology) 67(2):301–320
10. Vapnik VN, Lerner A (1963) Pattern recognition using generalized portrait method. Autom Remote Control 24:774–780
11. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
12. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made. The images or other
third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Chapter 3
Abstract
Deep learning belongs to the broader family of machine learning methods and currently provides state-of-
the-art performance in a variety of fields, including medical applications. Deep learning architectures can be
categorized into different groups depending on their components. However, most of them share similar
modules and mathematical formulations. In this chapter, the basic concepts of deep learning will be
presented to provide a better understanding of these powerful and broadly used algorithms. The analysis
is structured around the main components of deep learning architectures, focusing on convolutional neural
networks and autoencoders.
Key words Perceptrons, Backpropagation, Convolutional neural networks, Deep learning, Medical
imaging
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_3,
© The Author(s) 2023
78 Maria Vakalopoulou et al.
2.1 Perceptrons The perceptron [1] was originally developed for supervised binary
classification problems, and it was inspired by works from neuros-
cientists such as Donald Hebb [17]. It was built around a
non-linear neuron, namely, the McCulloch-Pitts model of a neu-
ron. More formally, we are looking for a function f(x;w, b) such that
f ð:; w, bÞ : x∈p → fþ1, - 1g where w and b are the parameters
of f and the vector x = [x1, . . ., xp]⊤ is the input. The training set is
{(x(i), y(i))}. In particular, the perceptron relies on a linear model for
performing the classification:
$$f(x; w, b) = \begin{cases} +1 & \text{if } w^\top x + b \geq 0 \\ -1 & \text{otherwise} \end{cases} \quad (1)$$
Fig. 1 A simple perceptron model. The input elements are described as neurons
and combined for the final prediction y^. The final prediction is composed of a
weighted sum and an activation function
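Equation 1 can be sketched in a few lines of plain Python; the AND-gate weights below are our own illustrative choice, not from the original text:

```python
def perceptron_predict(x, w, b):
    """Return +1 if the weighted sum w^T x + b is nonnegative, -1 otherwise."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if s >= 0 else -1

# Illustrative weights realizing an AND gate on inputs in {0, 1}
w, b = [1.0, 1.0], -1.5
outputs = [perceptron_predict(x, w, b) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
# outputs == [-1, -1, -1, +1]: the perceptron fires only when both inputs are 1
```

The decision boundary is the hyperplane w⊤x + b = 0, which is why a single perceptron can only solve linearly separable problems.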
2.2.1 A Simple Multilayer Network Without loss of generality, an MLP with one hidden layer can be defined as:
$$z(x) = g(W_1 x), \qquad \hat{y} = f(x; W_1, W_2) = W_2\, z(x), \quad (2)$$
where g: ℝ → ℝ denotes the non-linear function (which can be applied element-wise to a vector), W1 the matrix of coefficients of the first layer, and W2 the matrix of coefficients of the second layer.
Equivalently, one can write:
$$\hat{y}_c = \sum_{j=1}^{d_1} W_{2(c,j)}\, g\!\left( W_{1(j)}^\top x \right), \quad (3)$$
Fig. 2 An example of a simple multilayer perceptron model. The input layer is fed
into a hidden layer (z), which is then combined for the last output layer providing
the final prediction
which the input neurons are fed into a hidden layer whose neurons
are combined for the final prediction.
Many research works have indicated the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions. In the late 1980s, the first proof was published [18] for sigmoid activation functions (see
Subheading 2.3 for the definition) and was generalized to other
functions for feedforward multilayer architectures [19–21]. In par-
ticular, these works prove that any continuous function can be
approximated under mild conditions as closely as desired by a three-layer network. As N → ∞, any continuous function f can be approximated by some neural network f̂, because each component g(W_{(j)}^⊤ x) behaves like a basis function and functions in a
need to be very large, introducing some limitations for these
types of networks, deeper networks, with more than one hidden
layer, can provide good alternatives.
2.2.2 Deep Neural Network The simple MLP networks can be generalized to deeper networks with more than one hidden layer that progressively generate higher-level features from the raw input. Such networks can be written as:
$$\begin{aligned} z_1(x) &= g(W_1 x) \\ &\;\vdots \\ z_k(x) &= g(W_k\, z_{k-1}(x)) \\ &\;\vdots \\ \hat{y} &= f(x; W_1, \ldots, W_K) = z_K(z_{K-1}(\ldots(z_1(x)))) \end{aligned} \quad (4)$$
where K denotes the number of layers for the neural network,
which defines the depth of the network. In Fig. 3, a graphical
representation of the deep multilayer perceptron is presented.
Once again, the input layer is fed into the different hidden layers
of the network in a hierarchical way such that the output of one
layer is the input of the next one. The last layer of the network
corresponds to the output layer, which makes the final prediction of
the model.
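The hierarchical forward pass of Eq. 4 can be sketched with NumPy; the layer sizes and the ReLU choice below are our own illustrative assumptions (and biases are omitted for brevity):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def mlp_forward(x, weights, g=relu):
    """Forward pass: each hidden layer computes z_k = g(W_k z_{k-1});
    the last layer is kept linear to produce the output y_hat."""
    z = x
    for W in weights[:-1]:
        z = g(W @ z)
    return weights[-1] @ z

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]   # input dimension, two hidden layers, output dimension
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
y_hat = mlp_forward(rng.normal(size=4), weights)   # a 2-dimensional prediction
```

The output of one layer is the input of the next, exactly as in the hierarchical description above.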
As for networks with one hidden layer, they are also universal
approximators. However, the approximation theory for deep net-
works is less understood compared with neural networks with one
hidden layer. Overall, deep neural networks excel at representing
the composition of functions.
So far, we have described neural networks as simple chains of
layers, applied in a hierarchical way, with the main considerations
being the depth of the network (the number of layers K) and the
Fig. 3 An example of a deep neural network. The input layer, the kth layer of the deep neural network, and the
output layer are presented in the figure
Fig. 4 Comparison of two different networks with almost the same number of parameters, but different
depths. Figure inspired by Goodfellow et al. [24]
width of each layer k (the number of neurons dk). Overall, there are
no rules for the choice of the K and dk parameters that define the
architecture of the MLP. However, it has been shown empirically
that deeper models perform better. In Fig. 4, an overview of
2 different networks with 3 and 11 hidden layers is presented
with respect to the number of parameters and their accuracy. For
each architecture, the number of parameters varies by changing the
number of neurons dk. One can observe that, empirically, deeper
networks achieve better performance using approximately the same
or a lower number of parameters. Providing theoretical support for these empirical findings is a very active field of research [22, 23].
Neural networks can come in a variety of models and architec-
tures. The choice of the proper architecture and type of neural
network depends on the type of application and the type of data.
2.3 Main Functions A neural network is a composition of different functions also called
modules. Most of the time, these functions are applied in a sequential way. However, in more complicated designs (e.g., deep residual networks), different ways of combining them can be designed. In
the following subsections, we will discuss the most commonly used
functions that are the backbones of most perceptrons and multi-
layer perceptron architectures. One should note, however, that a variety of functions can be proposed and used for different deep learning architectures, with the constraint that they be differentiable almost everywhere. This is mainly due to the way that deep neural networks are trained, and this will be discussed later in the chapter.
2.3.1 Linear Functions One of the most fundamental functions used in deep neural net-
works is the simple linear function. Linear functions produce a
linear combination of all the nodes of one layer of the network,
weighted with the parameters W. The output signal of the linear
function is Wx, which is a polynomial of degree one. While it is easy
to solve linear equations, they have less power to learn complex
functional mappings from data. Moreover, when the number of
samples is much larger than the dimension of the input space, the
probability that the data is linearly separable comes close to zero
(Box 1). This is why they need to be combined with non-linear
functions, also called activation functions (the name activation has
been initially inspired by biology as the neuron will be active or not
depending on the output of the function).
Fig. 5 Overview of different non-linear functions (in green) and their first-order derivative (blue). (a) Hyperbolic
tangent function (tanh), (b) sigmoid, and (c) rectified linear unit (ReLU)
2.3.2 Non-linear Functions One of the most important components of deep neural networks is the non-linear functions, also called activation functions. They
convert the linear input signal of a node into non-linear outputs
to facilitate the learning of high-order polynomials. There are a lot
of different non-linear functions in the literature. In this subsec-
tion, we will discuss the most classical non-linearities.
Hyperbolic Tangent Function (tanh) One of the most standard non-linear functions is the hyperbolic tangent function, aka the tanh function. Tanh is symmetric around
the origin with a range of values varying from - 1 to 1. The biggest
advantage of the tanh function is that it produces a zero-centered
output (Fig. 5a), thereby supporting the backpropagation process
that we will cover in the next section. The tanh function is used
extensively for the training of multilayer neural networks. Formally,
the tanh function, together with its gradient, is defined as:
$$g = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \qquad \frac{\partial g}{\partial x} = 1 - \tanh^2(x). \quad (5)$$
Sigmoid Similar to tanh, the sigmoid is one of the first non-linear functions
that were used to compose deep learning architectures. One of the
main advantages is that it has a range of values varying from 0 to
1 (Fig. 5b) and therefore is especially used for models that aim to
predict a probability as an output. Formally, the sigmoid function,
together with its gradient, is defined as:
$$g = \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{\partial g}{\partial x} = \sigma(x)\left(1 - \sigma(x)\right). \quad (6)$$
Rectified Linear Unit (ReLU) ReLU is considered the default choice of non-linearity.
Some of the main advantages of ReLU include its efficient calcula-
tion and better gradient propagation with fewer vanishing gradient
problems compared to the previous two activation functions
[26]. Formally, the ReLU function, together with its gradient, is
defined as:
$$g = \max(0, x), \qquad \frac{\partial g}{\partial x} = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{if } x > 0 \end{cases}. \quad (7)$$
As indicated in Fig. 5c, ReLU is differentiable everywhere except at zero. However, this is not a very important problem, as the value of the derivative at zero can be arbitrarily chosen to be 0 or 1. In [27], the authors empirically demonstrated that the number
of iterations required to reach 25% training error on the CIFAR-10
dataset for a four-layer convolutional network was six times faster
with ReLU than with tanh neurons. On the other hand, and as
discussed in [28], ReLU-type neural networks which yield a piece-
wise linear classifier function produce almost always high confi-
dence predictions far away from the training data. However, due
to its efficiency and popularity, many variations of ReLU have been
proposed in the literature, such as the leaky ReLU [29] or the
parametric ReLU [30]. These two variations both address the
problem of dying neurons, where some ReLU neurons die for all
inputs and remain inactive no matter what input is supplied. In such
a case, no gradient flows from these neurons, and the training of the
neural network architecture is affected. Leaky ReLU and parametric
ReLU change the g(x) = 0 part, by adding a slope and extending
the range of ReLU.
$$g = x\, \sigma(\beta x), \qquad \frac{\partial g}{\partial x} = \beta g(x) + \sigma(\beta x)\left(1 - \beta g(x)\right), \quad (8)$$
where σ is the sigmoid function and β is either a constant or a
trainable parameter. Swish tends to work better than ReLU on
deeper models, as it has been shown experimentally in [31] in
different domains.
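The activation functions of Eqs. 5–7 and their derivatives can be verified numerically with a small plain-Python sketch (the structure and names are ours); each analytical derivative is checked against a central finite difference:

```python
from math import exp, tanh

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# (function g, analytical derivative dg/dx) for each activation
activations = {
    "tanh": (tanh, lambda x: 1.0 - tanh(x) ** 2),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "relu": (lambda x: max(0.0, x), lambda x: 0.0 if x <= 0 else 1.0),
}

# Check each analytical derivative against a central finite difference,
# avoiding x = 0 where ReLU is not differentiable.
eps = 1e-6
for name, (g, dg) in activations.items():
    for x in (-2.0, -0.5, 0.5, 2.0):
        numeric = (g(x + eps) - g(x - eps)) / (2 * eps)
        assert abs(numeric - dg(x)) < 1e-5, name
```

Such finite-difference checks are also a standard way to debug hand-written gradients before training.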
2.3.3 Loss Functions Besides the activation functions, the loss function (which defines
the cost function) is one of the main elements of neural networks. It
is the function that represents the error for a given prediction. To
that purpose, for a given training sample, it compares the prediction
f(x(i);W) to the ground truth y(i) (here we denote for simplicity as
W all the parameters of the network, combining all the W1, . . ., WK
in the multilayer perceptron shown above). The loss is denoted as
ℓ(y, f(x;W)). The average loss across the n training samples is called
the cost function and is defined as:
$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \ell\!\left( y^{(i)}, f(x^{(i)}; W) \right), \quad (10)$$
where {(x(i), y(i))}i=1..n composes the training set. The aim of the
training will be to find the parameters W such that J(W) is mini-
mized. Note that, in deep learning, one often calls the cost function
the loss function, although, strictly speaking, the loss is for a given
sample, and the cost is averaged across samples. Besides, the objec-
tive function is the overall function to minimize, including the cost
and possible regularization terms. However, in the remainder of
this chapter, in accordance with common usage in deep learning,
we will sometimes use the term loss function instead of cost
function.
Cross-Entropy Loss One of the most basic loss functions for classification problems
corresponds to the cross-entropy between the expected values and
the predicted ones. It leads to the following cost function:
$$J(W) = - \sum_{i=1}^{n} \log P\!\left( y = y^{(i)} \mid x = x^{(i)}; W \right), \quad (11)$$
Mean Squared Error Loss For regression problems, the mean squared error is one of the most
basic cost functions, measuring the average of the squares of the
errors, which is the average squared difference between the pre-
dicted values and the real ones. The mean squared error is
defined as:
$$J(W) = \sum_{i=1}^{n} \left\| y^{(i)} - f(x^{(i)}; W) \right\|^2. \quad (13)$$
1. First-order means here that the first-order derivatives of the cost function are used, as opposed to second-order algorithms that, for instance, use the Hessian.
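Both cost functions can be sketched in plain Python; `mse_cost` and `cross_entropy_cost` are hypothetical helper names of ours, with the mean squared error normalized by n as an average over the samples:

```python
from math import log

def mse_cost(y_true, y_pred):
    """Mean squared error cost: average squared difference over the n samples."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

def cross_entropy_cost(y_true, proba):
    """Cross-entropy cost: -sum_i log P(y = y_i | x_i), where proba[i] holds
    the predicted class probabilities for sample i."""
    return -sum(log(p[y]) for y, p in zip(y_true, proba))

print(mse_cost([1.0, 2.0], [1.0, 1.0]))  # → 0.5
```

Note that the cross-entropy cost penalizes a confident wrong prediction much more heavily than a merely uncertain one, since −log P grows without bound as P approaches 0.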
3.1.1 Stochastic Gradient Descent The efficiency of gradient descent is not sufficient when it comes to machine learning problems with large numbers of training samples.
Indeed, this is the case for neural networks and deep learning which
often rely on hundreds or thousands of training samples. Updating
the parameters W after calculating the gradient using all the
training samples would lead to a tremendous computational com-
plexity of the underlying optimization algorithm [33]. To deal with
this problem, the stochastic gradient descent (SGD) algorithm is a
drastic simplification. Instead of computing ∂J(W)/∂W exactly, each iteration estimates this gradient on the basis of a small set of randomly picked examples, as follows:
$$W_{t+1} \leftarrow W_t - \eta_t\, G(W_t), \quad (15)$$
where
$$G(W_t) = \frac{1}{K} \sum_{k=1}^{K} \frac{\partial J^{(i_k)}(W_t)}{\partial W}, \quad (16)$$
2. Note that, as often in deep learning, the terminology can be confusing. In isolation, the term batch is usually a synonym of mini-batch. On the contrary, batch gradient descent means computing the gradient using all training samples and not only a mini-batch [24].
gradient descent will be O(K) and for gradient descent O(N). The
ideal choice for the batch size is a debated question. First, an upper limit for the batch size is often simply given by the available GPU memory, in particular when the size of the input data is large (e.g.,
3D medical images). Besides, choosing K as a power of 2 often
leads to more efficient computations. Finally, small batch sizes tend
to have a regularizing effect which can be beneficial [24]. In any
case, the ideal batch size usually depends on the application, and it
is not uncommon to try different batch sizes. Finally, one calls an
epoch a complete pass over the whole training set (meaning that
each training sample has been used once). The number of epochs is
the number of full passes over the whole training set. It should not
be confused with the number of iterations which is the number of
mini-batches that have been processed.
Note that various improvements over traditional SGD have
been introduced, leading to more efficient optimization methods.
These state-of-the-art optimization methods are presented in
Subheading 3.4.
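Putting Eqs. 15–16 together with the epoch and mini-batch vocabulary above, here is a minimal NumPy sketch of mini-batch SGD on a linear least-squares problem (all names are ours, and the data is purely illustrative):

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, batch_size=8, n_epochs=200, seed=0):
    """Mini-batch SGD on the mean squared error of a linear model y = Xw.
    One epoch is a full pass over the shuffled data; one iteration uses
    a single mini-batch of size batch_size."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        idx = rng.permutation(len(X))                  # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch only
            grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= lr * grad                             # update of Eq. 15
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
w_true = np.array([3.0, -2.0])
w_hat = sgd_linear_regression(X, X @ w_true)   # converges close to w_true
```

With 64 samples and a batch size of 8, each epoch here corresponds to 8 iterations.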
beginning, has not been used in any way during training, and is
only used to report the performance (see Chap. 20 for details). In
case one cannot have an additional independent test set due to a
lack of data, one should be aware that the performance may be
biased and that this is a limitation of the specific study.
To avoid overfitting and improve the generalization perfor-
mance of the model, usually, the validation set is used to monitor
the loss during the training of the networks. Tracking the training
and validation losses over the number of epochs is essential and
provides important insights into the training process and the
selected hyperparameters (e.g., choice of learning rate). Recent
visualization tools such as TensorBoard3 or Weights & Biases4
make this tracking easy. In the following, we will also mention
some of the most commonly applied optimization techniques that
help with preventing overfitting.
3
https://www.tensorflow.org/tensorboard.
4
https://wandb.ai/site.
[Figure: training and validation loss curves over epochs, with the underfitting and overfitting regimes on either side of the point where the validation loss is minimal]
Fig. 8 Illustration of the concept of early stopping. The model that should be selected corresponds to the
dashed bar which is the point where the validation loss starts increasing. Before this point, the model is
underfitting. After, it is overfitting
$$W_{i,j} \sim \mathrm{Uniform}\left( -\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}} \right), \quad (20)$$
where m is the number of inputs and n the number of outputs of matrix W. Moreover, the biases b are initialized to 0.
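This initialization scheme can be sketched in plain Python; `glorot_uniform` is our own helper name for it:

```python
import random
from math import sqrt

def glorot_uniform(m, n, rng):
    """Draw each W_ij from Uniform(-sqrt(6/(m+n)), sqrt(6/(m+n))) for a
    layer with m inputs and n outputs; biases start at zero."""
    limit = sqrt(6.0 / (m + n))
    W = [[rng.uniform(-limit, limit) for _ in range(m)] for _ in range(n)]
    b = [0.0] * n
    return W, b

rng = random.Random(0)
W, b = glorot_uniform(m=100, n=50, rng=rng)   # here the bound is sqrt(6/150) = 0.2
```

The bound shrinks as layers get wider, which keeps the variance of the activations roughly constant from layer to layer.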
Fig. 9 Examples of data transformations applied in the MNIST dataset. Each of these generated samples is
considered additional training data
3.4 State-of-the-Art Optimizers Over the years, different optimizers have been proposed and widely used, aiming to provide improvements over the classical stochastic
gradient descent. These algorithms are motivated by challenges
that need to be addressed with stochastic gradient descent and are
focusing on the choice of the proper learning rate, its dynamic
change during training, as well as the fact that it is the same for all
the parameter updates [44]. Moreover, a proper choice of opti-
mizer could speed up the convergence to the optimal solution. In
this subsection, we will discuss some of the most commonly used
optimizers nowadays.
3.4.1 Stochastic Gradient Descent with Momentum One of the limitations of stochastic gradient descent is that, since the direction of the gradient that we take is random, it can heavily oscillate, making the training slower and even getting stuck in a saddle point. To deal with this problem, stochastic
gradient descent with momentum [45, 46] keeps a history of the
previous gradients, and it updates the weights taking into account
the previous updates. More formally:
$$\begin{aligned} g_t &\leftarrow \rho\, g_{t-1} + (1 - \rho)\, G(W_t) \\ \Delta W_t &\leftarrow -\eta_t\, g_t \\ W_{t+1} &\leftarrow W_t + \Delta W_t \end{aligned} \quad (21)$$
where gt is the direction of the update of the weights in time-step
t and ρ ∈ [0, 1] is a hyperparameter that controls the contribution
of the previous gradients and current gradient in the current
update. When ρ = 0, it is the same as the classical stochastic gradient
descent. A large value of ρ will mean that the update is strongly
influenced by the previous updates.
The momentum algorithm accumulates an exponentially
decaying moving average of the past gradients and continues to
move in their direction [24]. Momentum increases the speed of
convergence, while it is also helpful to not get stuck in places where
the search space is flat (saddle points with zero gradient), since the
momentum will pursue the search in the same direction as before
the flat region.
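A single update of Eq. 21 can be sketched in plain Python (the function name is ours); the example shows the momentum term carrying the update through a flat region with zero gradient:

```python
def momentum_step(w, g_prev, grad, lr=0.1, rho=0.9):
    """One update of SGD with momentum: the update direction g_t is an
    exponentially decaying average of the past gradients (controlled by rho)."""
    g = rho * g_prev + (1 - rho) * grad
    return w - lr * g, g

# On a flat region (zero gradient), the accumulated momentum keeps w moving.
w, g = 1.0, 0.0
w, g = momentum_step(w, g, grad=1.0)       # first update: w becomes 0.99
w_flat, g = momentum_step(w, g, grad=0.0)  # flat region: the update is still nonzero
```

With ρ = 0 the second call would leave w unchanged, recovering classical SGD.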
3.4.2 AdaGrad To facilitate and further speed up the training process, optimizers with adaptive learning rates per parameter have been proposed. The adaptive gradient (AdaGrad) optimizer [47] is one of them. It updates each individual parameter proportionally to its component (and momentum) in the gradient. More formally:
Deep Learning: Basics and CNN 97
g_t ← G(W_t)
r_t ← r_{t-1} + g_t ⊙ g_t
ΔW_t ← -(η / (δ + √r_t)) ⊙ g_t                         (22)
W_{t+1} ← W_t + ΔW_t
where g_t is the gradient estimate vector at time step t, r_t is the term controlling the per-parameter update, ⊙ denotes the element-wise product, and δ is a small quantity used to avoid division by zero. Note that r_t adds the element-wise product of the gradient with itself to the previous term r_{t-1}, which accumulates the gradients of all previous steps.
This algorithm performs very well for sparse data, since it decreases the learning rate faster for frequent parameters and slower for infrequent ones. However, since the update accumulates the gradients of all previous steps, the step sizes can shrink very quickly, eventually blocking the learning process. This limitation is mitigated by extensions of the AdaGrad algorithm, which we discuss in the next sections.
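A direct NumPy transcription of Eq. 22 also makes the shrinking effective learning rate visible. The toy problem and all settings below are illustrative assumptions of ours:

```python
import numpy as np

def adagrad_step(w, r, grad, lr=0.1, delta=1e-8):
    """One AdaGrad update (Eq. 22): a per-parameter learning rate."""
    r = r + grad * grad                          # accumulate squared gradients
    w = w - lr / (delta + np.sqrt(r)) * grad
    return w, r

w = np.array([5.0, -3.0])                        # two parameters
r = np.zeros_like(w)
for _ in range(100):
    w, r = adagrad_step(w, r, 2 * w)             # gradient of w^2 is 2w

# The effective step size lr / sqrt(r) shrinks as r accumulates, which is
# exactly the "radically diminishing learning rate" issue discussed above.
print(np.all(np.abs(w) < np.array([5.0, 3.0])))  # True
```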
3.4.3 RMSProp

Another algorithm with adaptive learning rates per parameter is root mean squared propagation (RMSProp), proposed by Geoffrey Hinton. Despite its popularity and wide use, this algorithm has never been formally published. RMSProp is an extension of AdaGrad that deals with the problem of radically diminishing learning rates by being less influenced by the first iterations of the algorithm. More formally:
g_t ← G(W_t)
r_t ← ρ r_{t-1} + (1 - ρ) g_t ⊙ g_t
ΔW_t ← -(η / (δ + √r_t)) ⊙ g_t                         (23)
W_{t+1} ← W_t + ΔW_t
where ρ is a hyperparameter that controls the contribution of the previous gradients and the current gradient to the current update. Note that RMSProp estimates the squared gradients in the same way as AdaGrad, but instead of letting that estimate accumulate continually over training, it keeps a moving average of them, integrating a form of momentum. Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks [24].
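Replacing AdaGrad's running sum with a moving average gives Eq. 23. Again a toy sketch with illustrative hyperparameters of our own choosing:

```python
import numpy as np

def rmsprop_step(w, r, grad, lr=0.01, rho=0.9, delta=1e-8):
    """One RMSProp update (Eq. 23)."""
    r = rho * r + (1 - rho) * grad * grad   # moving average, not a running sum
    w = w - lr / (delta + np.sqrt(r)) * grad
    return w, r

w, r = 5.0, 0.0
for _ in range(1000):
    w, r = rmsprop_step(w, r, 2 * w)        # gradient of w^2 is 2w
print(abs(w) < 0.2)  # True: steps of roughly lr in magnitude reach the minimum
```

Because `r` tracks only recent squared gradients, the effective step stays close to `lr` instead of decaying to zero as in AdaGrad.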
3.4.4 Adam

The effectiveness and advantages of the AdaGrad and RMSProp algorithms are combined in the adaptive moment estimation (Adam) optimizer [48]. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. More formally:
98 Maria Vakalopoulou et al.
g_t ← G(W_t)
s_t ← ρ1 s_{t-1} + (1 - ρ1) g_t
r_t ← ρ2 r_{t-1} + (1 - ρ2) g_t ⊙ g_t
ŝ_t ← s_t / (1 - ρ1^t)
r̂_t ← r_t / (1 - ρ2^t)                                 (24)
ΔW_t ← -(λ / (δ + √r̂_t)) ⊙ ŝ_t
W_{t+1} ← W_t + ΔW_t
where s_t is the gradient with momentum (a first-moment estimate), r_t accumulates the squared gradients with momentum as in RMSProp, and ŝ_t and r̂_t are bias-corrected versions of s_t and r_t: because s_t and r_t are initialized at zero, they are biased toward zero during the first time steps, and the correction compensates for this bias. Moreover, δ is a small quantity used to avoid division by zero, while ρ1 and ρ2 are hyperparameters that control the decay rates of the two moving averages; their values are typically close to 1. Empirical
results demonstrate that Adam works well in practice and compares
favorably to other stochastic optimization methods, making it the
go-to optimizer for deep learning problems.
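Eq. 24 combines both ideas. Below is a minimal NumPy sketch; the hyperparameter defaults follow common practice and the toy problem is our own, not from the chapter:

```python
import numpy as np

def adam_step(w, s, r, t, grad, lr=0.05, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update (Eq. 24); t is the 1-based time step."""
    s = rho1 * s + (1 - rho1) * grad              # first-moment estimate
    r = rho2 * r + (1 - rho2) * grad * grad       # second-moment estimate
    s_hat = s / (1 - rho1 ** t)                   # bias corrections
    r_hat = r / (1 - rho2 ** t)
    w = w - lr / (delta + np.sqrt(r_hat)) * s_hat
    return w, s, r

w, s, r = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, s, r = adam_step(w, s, r, t, 2 * w)        # gradient of w^2 is 2w
print(abs(w) < 1.0)  # True: the iterates approach the minimum at 0
```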
3.4.5 Other Optimizers

The development of efficient (in terms of speed and stability) optimizers is still an active research direction. RAdam [49] is a variant of Adam, introducing a term to rectify the variance of the adaptive learning rate. In particular, RAdam leverages a dynamic
rectifier to adjust the adaptive momentum of Adam based on the
variance and effectively provides an automated warm-up custom-
tailored to the current dataset to ensure a solid start to training.
Moreover, LookAhead [50] was inspired by recent advances in the
understanding of loss surfaces of deep neural networks and pro-
vides a breakthrough in robust and stable exploration during the
entirety of the training. Intuitively, the algorithm chooses a search
direction by looking ahead at the sequence of fast weights gener-
ated by another optimizer. These are only some of the optimizers
that exist in the literature, and depending on the problem and the
application, different optimizers could be selected and applied.
4 Convolutional Neural Networks

Compared to the deep fully connected networks that have already been discussed, CNNs excel at processing data with a spatial or grid-like organization (e.g., time series, images, videos) while at the same time decreasing the number of trainable parameters thanks to their weight-sharing properties. The rest of this section first introduces the convolution operation and the motivation behind using it as a building block of neural networks. Then, a number of different variations are presented, together with examples of the most important CNN architectures. Lastly, the importance of the receptive field, a central property of such networks, will be discussed.
4.1 The Convolution Operation

The convolution operation is defined as the integral of the product of two functions (f, g)^5 after one is reversed and shifted over the other. Formally, we write:

h(t) = ∫_{-∞}^{+∞} f(t - τ) g(τ) dτ                    (25)
^5 Note that f and g have no relationship to their previous definitions in the chapter. In particular, f is not the deep learning model.
For discrete signals f[k] and g[k], with k ∈ ℤ, the convolution operation is defined by:

h[k] = Σ_n f[k - n] g[n]                               (27)
Very often, the first signal will be the input of interest (e.g., a
large size image), while the second signal will be of relatively small
size (e.g., a 3 × 3 or 4 × 4 matrix) and will implement a specific
operation. The second signal is then called a kernel. In Fig. 10, a
visualization of the convolution operation is shown in the case of a
2D discrete signal such as an image and a 3 × 3 kernel. In detail, the
convolution kernel is shifted over all locations of the input, and an
element-wise multiplication and a summation are utilized to calcu-
late the convolution output at the corresponding location. Exam-
ples of applications of convolutions to an image are provided in
Fig. 11. Finally, note that, as in multilayer perceptrons, a convolu-
tion will generally be followed by a non-linear activation function,
for instance, a ReLU (see Fig. 12 for an example of activation
applied to a feature map).
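The sliding-window computation described above can be written directly in NumPy. This is an illustrative "valid" 2D convolution; the 7 × 7 image, the kernel values, and the function name are our own example:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution: flip the kernel, then slide and sum (Eqs. 25/27)."""
    kh, kw = kernel.shape
    k = kernel[::-1, ::-1]                       # flip: convolution, not correlation
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * k)
    return out

img = np.zeros((7, 7)); img[:, 3] = 1.0          # an image containing a vertical line
edge = np.array([[1., 0., -1.],
                 [1., 0., -1.],
                 [1., 0., -1.]])                 # vertical edge-detection kernel
print(conv2d(img, edge).shape)  # (5, 5), as in the 7x7 image / 3x3 kernel example
```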
In the following sections of this chapter, any reference to the convolution operation will mostly refer to the 2D discrete case.
The two kernels shown in Fig. 11:

Vertical edge detector:     Horizontal edge detector:
[ 1  0 -1 ]                 [  1  1  1 ]
[ 1  0 -1 ]                 [  0  0  0 ]
[ 1  0 -1 ]                 [ -1 -1 -1 ]
Fig. 11 Two examples of convolutions applied to an image. One of the filters acts as a vertical edge detector and the other as a horizontal edge detector. Of course, in CNNs, the filters are learned, not predefined, so there is no guarantee that, among the learned filters, there will be a vertical or horizontal edge detector, although this is often the case in practice, especially in the first layers of the architecture
4.2 Properties of the Convolution Operation

In the case of a discrete domain, the convolution operation can be performed using a simple matrix multiplication, without the need to shift one signal over the other. This can essentially be achieved by utilizing the Toeplitz matrix transformation. The Toeplitz transformation creates a sparse matrix with repeated elements which, when multiplied with the input signal, produces the convolution result. To illustrate how the convolution operation can be implemented as a matrix multiplication, let's take the example of a 3 × 3 kernel (K) and a 4 × 4 input (I):
    [ k00 k01 k02 ]            [ i00 i01 i02 i03 ]
K = [ k10 k11 k12 ]   and  I = [ i10 i11 i12 i13 ]
    [ k20 k21 k22 ]            [ i20 i21 i22 i23 ]
                               [ i30 i31 i32 i33 ]
Then, the convolution operation can be computed as a matrix multiplication between the Toeplitz-transformed kernel (a sparse matrix whose rows contain the kernel weights shifted to each output location) and the flattened input:

I = [ i00 i01 i02 i03 i10 i11 i12 i13 i20 i21 i22 i23 i30 i31 i32 i33 ]^⊤
The produced output will need to be reshaped as a 2 × 2 matrix
in order to retrieve the convolution output. This matrix multiplica-
tion implementation is quite illuminating on a few of the most
important properties of the convolution operation. These proper-
ties are the main motivation behind using such elements in deep
neural networks.
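This equivalence is easy to verify numerically. The sketch below builds the sparse matrix row by row for the 3 × 3 kernel / 4 × 4 input example, using the sliding-window (cross-correlation) form employed by most CNN libraries; the random data values are placeholders of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((3, 3))    # kernel
I = rng.standard_normal((4, 4))    # input

# Direct sliding-window computation (2x2 output in 'valid' mode).
direct = np.array([[np.sum(I[i:i+3, j:j+3] * K) for j in range(2)]
                   for i in range(2)])

# Same operation as a matrix multiplication: each row of M places the nine
# kernel weights at one output location; the input is flattened to 16 values.
M = np.zeros((4, 16))
for i in range(2):
    for j in range(2):
        row = np.zeros((4, 4))
        row[i:i+3, j:j+3] = K          # kernel weights at this shift
        M[i * 2 + j] = row.ravel()
result = (M @ I.ravel()).reshape(2, 2) # reshape back to the 2x2 output

print(np.allclose(direct, result))  # True
```

Note how each row of `M` repeats the same nine weights, which is precisely the parameter-sharing structure discussed next.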
By transforming the convolution operation into a matrix multiplication, it is evident that it fits the formalism of linear functions, which has already been presented in Subheading 2.3. As such, deep neural networks can be designed to utilize trainable convolution kernels. In practice, multiple convolution kernels are learned at each convolutional block, and several of these trainable convolutional blocks are stacked on top of each other, forming deep CNNs. Typically, the output of a convolution operation is called a feature map, or just features.
Another important aspect of the convolution operation is that it requires far fewer parameters than fully connected MLP-based deep neural networks. As can also be seen from the K matrix, the exact same parameters are shared across all locations. Thus, rather than learning a different set of parameters for the different locations of the input, only one set is learned. This is referred to as parameter sharing or weight sharing and can greatly decrease the amount of memory required to store the network parameters. The process of weight sharing across locations, together with the fact that multiple filters (resulting in multiple feature maps) are computed for a given layer, is illustrated in Fig. 13. The multiple feature maps for a given layer are stored along another dimension (see Fig. 14), thus resulting in a 3D array when the input is a 2D image (and a 4D array when the input is a 3D image).

Fig. 13 For a given layer, several (usually many) filters are learned, each of them being able to detect a specific characteristic in the image, resulting in several feature/filter maps. On the other hand, for a given filter, the weights are shared across all the locations of the image

Fig. 14 The different feature maps for a given layer are arranged along another dimension. The feature maps will thus be a 3D array when the input is a 2D image (and a 4D array when the input is a 3D image)
Convolutional neural networks have proven quite powerful at processing data with spatial structure (e.g., images, videos). This is effectively based on the fact that there is a local connectivity of the kernel elements while, at the same time, the same kernel is applied at different locations of the input. Such processing grants a quite useful property called translation equivariance: when the input is translated, the output feature map is translated in the same way.
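Translation equivariance can be checked numerically: shifting the input shifts the feature map by the same amount. The sketch below uses random data and the cross-correlation form common in CNN libraries; the helper name is our own:

```python
import numpy as np

def correlate2d_valid(x, k):
    """Slide kernel k over x and sum the element-wise products ('valid' mode)."""
    kh, kw = k.shape
    return np.array([[np.sum(x[i:i+kh, j:j+kw] * k)
                      for j in range(x.shape[1] - kw + 1)]
                     for i in range(x.shape[0] - kh + 1)])

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))

y = correlate2d_valid(x, k)
x_shift = np.roll(x, 1, axis=1)          # shift the input one pixel to the right
y_shift = correlate2d_valid(x_shift, k)

# Away from the wrap-around border, the feature map shifts by the same amount.
print(np.allclose(y_shift[:, 1:], y[:, :-1]))  # True
```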
4.3 Functions and Variants

An observant reader might have noticed that the convolution operation can change the dimensionality of the produced output. In the example visualized in Fig. 10, an image of size 7 × 7, when convolved with a kernel of size 3 × 3, produces a feature map of size 5 × 5. Even though dimension changes can be avoided with appropriate padding (see Fig. 15 for an illustration of this process) prior to the convolution operation, in some cases it is actually desirable to reduce the dimensions of the input. Such a decrease can be achieved in a number of ways depending on the task at hand. In this subsection, some of the most typical functions utilized in CNNs will be discussed.
Fig. 15 The padding operation, which involves adding zeros around the image, makes it possible to obtain feature maps of the same size as the original image
Fig. 16 Effect of a pooling operation. Here, a maximum pooling of size 2 × 2 with a stride of 2
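The pooling operation of Fig. 16 reduces to a windowed maximum. A small NumPy sketch; the input values are our own example:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling as in Fig. 16: the maximum over each size x size window."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    return np.array([[x[i*stride:i*stride+size, j*stride:j*stride+size].max()
                      for j in range(ow)]
                     for i in range(oh)])

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 8., 2.],
              [3., 2., 1., 0.]])
print(max_pool(x))  # max over each 2x2 window: [[4, 5], [3, 8]]
```

With `size=2, stride=2` the spatial dimensions are halved, which is the typical way CNNs progressively reduce resolution.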
4.4 Receptive Field Calculation

In the context of deep neural networks and specifically CNNs, the term receptive field denotes the portion of the input that contributes to a specific feature. For example, a CNN that takes an image as input and applies a single convolution operation with a kernel size of 3 × 3 would have a receptive field of 3 × 3. This
means that for each pixel of the first feature map, a 3 × 3 region of
the input would be considered. Now, if another layer were to be
added, with again 3 × 3 size, then the receptive field of the new
feature map with respect to the CNN's input would be 5 × 5. In other words, the portion of the input used to calculate each element of the feature map of the second convolution layer increases.
Calculating the receptive field at different parts of a CNN is
crucial when trying to understand the inner workings of a specific
architecture. For instance, a CNN that is designed to take as an
input an image of size 256 × 256 and that requires information
from all parts of it should have a receptive field close to the size of
the input. The receptive field can be influenced by all the different
convolution parameters and down-/upsampling operations
described in the previous section. A comprehensive presentation
of mathematical derivations for calculating receptive fields for
CNNs is given in [52].
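The recursion underlying these derivations is short enough to sketch for a plain stack of convolution/pooling layers. The layer-list format and helper name are our own conventions, following the style of the derivations in [52]:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv/pool layers, input to output.

    layers -- list of (kernel_size, stride) tuples.
    """
    r, jump = 1, 1            # current receptive field and cumulative stride
    for k, s in layers:
        r += (k - 1) * jump   # each layer widens the field by (k-1) input steps
        jump *= s             # strides compound across layers
    return r

# Two 3x3 convolutions with stride 1: receptive field 5, as in the text.
print(receptive_field([(3, 1), (3, 1)]))  # 5
```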
4.5 Classical Convolutional Neural Network Architectures

Over the last decades, a variety of convolutional neural network architectures have been proposed. In this chapter, we cover only a few classical architectures for classification and regression. Note that classification and regression can usually be performed with the same architecture, simply changing the loss function (e.g., cross-entropy for classification, mean squared error for regression). Architectures for other tasks can be found in other chapters.
Fig. 18 A basic CNN architecture. Classically, it is composed of two main parts. The first one, using
convolution operations, performs feature learning. The features are then flattened and fed into a set of fully
connected layers (i.e., a multilayer perceptron), which performs the classification or the regression task
Fig. 19 A drawing describing a CNN architecture. Classically, it is composed of two main parts. Here
16@3 × 3 × 3 means that 16 features with a 3 × 3 × 3 convolution kernel will be computed. For the pooling
operation, the kernel size is also mentioned (2 × 2). Finally, the stride is systematically indicated
5 Autoencoders
Variational autoencoders are not covered in the present chapter; they are presented, together with other generative models, in Chap. 5.
6 Conclusion
Acknowledgements
References
1. Rosenblatt F (1957) The perceptron, a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory, Buffalo
2. Minsky M, Papert S (1969) Perceptron: an introduction to computational geometry. MIT Press, Cambridge, MA
3. Minsky ML, Papert SA (1988) Perceptrons: expanded edition. MIT Press, Cambridge, MA
4. Linnainmaa S (1976) Taylor expansion of the accumulated rounding error. BIT Numer Math 16(2):146–160
5. Werbos PJ (1982) Applications of advances in nonlinear sensitivity analysis. In: System modeling and optimization. Springer, Berlin, pp 762–770
6. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
7. Le Cun Y (1985) Une procédure d'apprentissage pour réseau à seuil assymétrique. Cognitiva 85:599–604
8. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
9. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
10. Hinton GE (2007) Learning multiple layers of representation. Trends Cogn Sci 11(10):428–434
11. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255
12. Bergstra J, Bastien F, Breuleux O, Lamblin P, Pascanu R, Delalleau O, Desjardins G, Warde-Farley D, Goodfellow I, Bergeron A et al (2011) Theano: deep learning on GPUs with Python. In: NIPS 2011, Big learning workshop, Granada, Spain, vol 3. Citeseer, pp 1–48
13. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on Multimedia, pp 675–678
14. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M et al (2016) TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467
15. Chollet F et al (2015) Keras. https://github.com/fchollet/keras
16. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems, vol 32
17. Hebb DO (1949) The organization of behavior: a psychological theory. Wiley, New York
18. Cybenko G (1989) Approximations by superpositions of a sigmoidal function. Math Control Signals Syst 2:183–192
19. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366
20. Mhaskar HN (1996) Neural networks for optimal approximation of smooth and analytic functions. Neural Comput 8(1):164–177
21. Pinkus A (1999) Approximation theory of the MLP model in neural networks. Acta Numer 8:143–195
22. Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q (2017) Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. Int J Autom Comput 14(5):503–519
23. Rolnick D, Tegmark M (2017) The power of deeper networks for expressing natural functions. arXiv preprint arXiv:1705.05502
24. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge, MA
25. Cover TM (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans Electron Comput 3:326–334
26. Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, JMLR workshop and conference proceedings, pp 315–323
27. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, vol 25
28. Hein M, Andriushchenko M, Bitterwolf J (2019) Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 41–50
29. Maas AL, Hannun AY, Ng AY et al (2013) Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML, Atlanta, Georgia, vol 30, p 3
30. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision, pp 1026–1034
31. Ramachandran P, Zoph B, Le QV (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941
32. Dauphin YN, Pascanu R, Gulcehre C, Cho K, Ganguli S, Bengio Y (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in neural information processing systems, vol 27
33. Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT'2010. Springer, Berlin, pp 177–186
34. Allen-Zhu Z, Li Y, Song Z (2019) A convergence theory for deep learning via over-parameterization. In: International conference on machine learning, PMLR, pp 242–252
35. Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (2018) Automatic differentiation in machine learning: a survey. J Mach Learn Res 18:1–43
36. Prechelt L (1998) Early stopping-but when? In: Neural networks: tricks of the trade. Springer, Berlin, pp 55–69
37. Reed R, Marks II RJ (1999) Neural smithing: supervised learning in feedforward artificial neural networks. MIT Press, Cambridge, MA
38. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR workshop and conference proceedings, pp 249–256
39. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
40. Deng L (2012) The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process Mag 29(6):141–142
41. Pérez-García F, Sparks R, Ourselin S (2021) TorchIO: a Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. Comput Methods Programs Biomed 208:106236
42. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, PMLR, pp 448–456
43. Brock A, De S, Smith SL, Simonyan K (2021) High-performance large-scale image recognition without normalization. In: International conference on machine learning, PMLR, pp 1059–1071
44. Ruder S (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747
45. Polyak BT (1964) Some methods of speeding up the convergence of iteration methods. USSR Comput Math Math Phys 4(5):1–17
46. Qian N (1999) On the momentum term in gradient descent learning algorithms. Neural Netw 12(1):145–151
47. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12(7)
48. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
49. Liu L, Jiang H, He P, Chen W, Liu X, Gao J, Han J (2019) On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265
50. Zhang M, Lucas J, Ba J, Hinton GE (2019) LookAhead optimizer: k steps forward, 1 step back. Adv Neural Inf Process Syst 32
51. Fukushima K, Miyake S (1982) Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In: Competition and cooperation in neural nets. Springer, Berlin, pp 267–285
52. Araujo A, Norris W, Sim J (2019) Computing receptive fields of convolutional neural networks. Distill. https://doi.org/10.23915/distill.00021
53. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
54. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges C, Bottou L, Weinberger K (eds) Advances in neural information processing systems, vol 25. Curran Associates. https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
55. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
56. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
57. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
58. Lu MY, Williamson DF, Chen TY, Chen RJ, Barbieri M, Mahmood F (2021) Data-efficient and weakly supervised computational pathology on whole-slide images. Nat Biomed Eng 5(6):555–570
59. Benkirane H, Vakalopoulou M, Christodoulidis S, Garberis IJ, Michiels S, Cournède PH (2022) Hyper-AdaC: adaptive clustering-based hypergraph representation of whole slide images for survival analysis. In: Machine learning for health, PMLR, pp 405–418
60. Horry MJ, Chakraborty S, Paul M, Ulhaq A, Pradhan B, Saha M, Shukla N (2020) X-ray image based COVID-19 detection using pre-trained deep learning models. Engineering Archive, Menomonie
61. Li JP, Khan S, Alshara MA, Alotaibi RM, Mawuli C et al (2022) DACBT: deep learning approach for classification of brain tumors using MRI data in IoT healthcare environment. Sci Rep 12(1):1–14
62. Nandhini I, Manjula D, Sugumaran V (2022) Multi-class brain disease classification using modified pre-trained convolutional neural networks model with substantial data augmentation. J Med Imaging Health Inform 12(2):168–183
63. Raghu M, Zhang C, Kleinberg J, Bengio S (2019) Transfusion: understanding transfer learning for medical imaging. In: Advances in neural information processing systems, vol 32
64. Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-González J, Routier A, Bottani S, Dormont D, Durrleman S, Burgos N, Colliot O (2020) Convolutional neural networks for classification of Alzheimer's disease: overview and reproducible evaluation. Med Image Anal 63:101694
65. Chen S, Ma K, Zheng Y (2019) Med3D: transfer learning for 3D medical image analysis. arXiv preprint arXiv:1904.00625
66. Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning, PMLR, pp 6105–6114
67. Wang J, Liu Q, Xie H, Yang Z, Zhou H (2021) Boosted EfficientNet: detection of lymph node metastases in breast cancer using convolutional neural networks. Cancers 13(4):661
68. Oloko-Oba M, Viriri S (2021) Ensemble of EfficientNets for the diagnosis of tuberculosis. Comput Intell Neurosci 2021:9790894
69. Ali K, Shaikh ZA, Khan AA, Laghari AA (2021) Multiclass skin cancer classification using EfficientNets: a first step towards preventing skin cancer. Neurosci Inform 2(4):100034
70. Ng A et al (2011) Sparse autoencoder. CS294A Lecture Notes 72(2011):1–19
71. Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning, pp 1096–1103
72. Baur C, Denner S, Wiestler B, Navab N, Albarqouni S (2021) Autoencoders for unsupervised anomaly segmentation in brain MR images: a comparative study. Med Image Anal 69:101952
73. Salah R, Vincent P, Muller X et al (2011) Contractive auto-encoders: explicit invariance during feature extraction. In: Proceedings of the 28th international conference on machine learning, pp 833–840
74. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, Berlin, pp 234–241
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made. The images or other
third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Chapter 4
Abstract
Recurrent neural networks (RNNs) are neural network architectures with a hidden state that use feedback loops to process a sequence of data, which ultimately informs the final output. RNN models can therefore recognize sequential characteristics in the data and help predict the next likely data point in the sequence. Leveraging this power of sequential data processing, RNN use cases tend to be connected to either language models or time-series analysis. Multiple popular RNN architectures have been introduced in the field, ranging from SimpleRNN and LSTM to deep RNN, and applied in different experimental settings. In this chapter, we will present six distinct RNN architectures and highlight the pros and cons of each model. Afterward, we will discuss real-life tips and tricks for training RNN models. Finally, we will present four popular language modeling applications of RNN models (text classification, summarization, machine translation, and image-to-text translation), thereby demonstrating influential research in the field.
Key words Recurrent neural network (RNN), LSTM, GRU, Bidirectional RNN (BRNN), Deep
RNN, Language modeling
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_4,
© The Author(s) 2023
118 Susmita Das et al.
Fig. 1 (Left) Circuit diagram for SimpleRNN with input x being incorporated into hidden state h with a feedback
connection and an output o. (Right) The same SimpleRNN network shown as an unfolded computational graph
with nodes at every time step
where b and c are the biases and U, V, and W are the weight matrices for the input-to-hidden, hidden-to-output, and hidden-to-hidden connections, respectively, and σ is a sigmoid function. The total loss for a sequence of x values and its corresponding y values is obtained by summing the losses over all time steps:

L((x^(1), …, x^(τ)), (y^(1), …, y^(τ))) = Σ_{t=1}^{τ} L^(t)
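The recurrence described above (hidden state h, output o, weight matrices U, V, W) can be sketched as a NumPy forward pass. The dimensions, random weights, and function name are illustrative assumptions of ours:

```python
import numpy as np

def rnn_forward(xs, U, V, W, b, c):
    """SimpleRNN forward pass: h_t = sigmoid(b + W h_{t-1} + U x_t), o_t = c + V h_t."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = np.zeros(W.shape[0])
    outputs = []
    for x in xs:                       # one step per element of the sequence
        h = sigmoid(b + W @ h + U @ x) # the hidden state carries the history
        outputs.append(c + V @ h)
    return np.array(outputs)

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 3, 4, 2, 5
U = rng.standard_normal((d_hid, d_in))    # input-to-hidden
W = rng.standard_normal((d_hid, d_hid))   # hidden-to-hidden
V = rng.standard_normal((d_out, d_hid))   # hidden-to-output
b, c = np.zeros(d_hid), np.zeros(d_out)
os_ = rnn_forward(rng.standard_normal((T, d_in)), U, V, W, b, c)
print(os_.shape)  # (5, 2): one output per time step
```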
2.1.3 SimpleRNN Architecture Variations Based on Inputs and Outputs

Different variations also exist depending on the number of inputs and outputs:

• One-to-one: The traditional RNN has a one-to-one input-to-output mapping at each time step t, as shown in Fig. 3a.
• One-to-many: One-to-many RNN has one input at a time step
for which it generates a sequence of outputs at consecutive time
steps as shown in Fig. 3b. This type of RNN architecture is often
used for image captioning.
RNN Architectures and Research 121
Fig. 2 Types of SimpleRNN architectures based on parameter sharing: (a) SimpleRNN with connections
between hidden units, (b) SimpleRNN with connections from output to hidden units, and (c) SimpleRNN with
connections between hidden units that read the entire sequence and produce a single output
2.1.4 Challenges of Long-Term Dependencies in SimpleRNN

SimpleRNN works well with short-term dependencies, but when it comes to long-term dependencies, it fails to remember the long-term information. This problem arises due to vanishing or exploding gradients [6]: when gradients are propagated over many stages, they tend to vanish most of the time or, sometimes, explode. The difficulty arises from the exponentially smaller weights assigned to long-term interactions compared to short-term ones. Learning long-term dependencies therefore takes a very long time, as the signals from these dependencies tend to be hidden by the small fluctuations arising from the short-term dependencies.
Fig. 3 (a) One-to-one RNN. (b) One-to-many RNN. (c) Many-to-one RNN. (d) Many-to-many RNN. (e) Many-to-
many RNN. x represents the input and o represents the output
2.2 Long Short-Term Memory (LSTM)

To address this long-term dependency problem, gated RNNs were proposed. Long short-term memory (LSTM) is a type of gated RNN that was proposed in 1997 [7]. Owing to its ability to remember long-term dependencies, LSTM has been a successful model in many applications such as speech recognition, machine translation, and image captioning. LSTM has an inner self-loop in addition to the outer recurrence of the RNN. The gradients in the inner loop can flow for a longer duration and are conditioned on the context rather than being fixed. In each cell, the input and output are the same as in an ordinary RNN, but there is a system of gating units to control the flow of information. Figure 4 shows the flow of information in an LSTM with its gating units.
There are three gates in the LSTM: the external input gate, the forget gate, and the output gate. The forget gate $f_i^{(t)}$ for cell $i$ at time $t$ decides which information should be removed from the cell state $s_i^{(t)}$. The gate controls the self-loop by setting the weight between 0 and 1 via a sigmoid function σ. When the value is near 1, the information of the past is retained; when it is near 0, it is discarded.
Fig. 4 Long short-term memory with cell state ct, hidden state ht, input xt, and output ot
$$f_i^{(t)} = \sigma\left(\sum_j U_{i,j}^{f} x_j^{(t)} + \sum_j W_{i,j}^{f} h_j^{(t-1)} + b_i^{f}\right)$$

$$s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)}\,\sigma\left(b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)}\right)$$

$$g_i^{(t)} = \sigma\left(b_i^{g} + \sum_j U_{i,j}^{g} x_j^{(t)} + \sum_j W_{i,j}^{g} h_j^{(t-1)}\right)$$

$$h_i^{(t)} = \tanh\left(s_i^{(t)}\right) q_i^{(t)}$$

$$q_i^{(t)} = \sigma\left(b_i^{o} + \sum_j U_{i,j}^{o} x_j^{(t)} + \sum_j W_{i,j}^{o} h_j^{(t-1)}\right)$$
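One step of an LSTM cell following these equations can be sketched in NumPy as follows. This is a minimal illustration; the weight shapes, the initialization scale, and the `sigmoid` helper are our own choices, not prescribed by the chapter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM step; p holds the parameter matrices U*, W* and biases b*."""
    f = sigmoid(p["Uf"] @ x + p["Wf"] @ h_prev + p["bf"])  # forget gate
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)  # external input gate
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)  # output gate
    # Cell state: gated carry-over of the past plus gated new candidate
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)
    h = np.tanh(s) * q                                     # hidden state
    return h, s

rng = np.random.default_rng(0)
nx, nh = 3, 4
p = {k: rng.standard_normal((nh, nx)) * 0.1 for k in ("Uf", "Ug", "Uo", "U")}
p.update({k: rng.standard_normal((nh, nh)) * 0.1 for k in ("Wf", "Wg", "Wo", "W")})
p.update({k: np.zeros(nh) for k in ("bf", "bg", "bo", "b")})

h, s = np.zeros(nh), np.zeros(nh)
for x in rng.standard_normal((5, nx)):   # run the cell over a 5-step input sequence
    h, s = lstm_step(x, h, s, p)
print(h.shape)  # (4,)
```

Note how the cell state `s` is updated additively: the forget gate scales the old state, and the new contribution is added rather than overwriting it.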
Fig. 5 Gated recurrent neural network (GRU) with input xt and hidden unit ht
2.3 Gated Recurrent Unit (GRU)

In LSTM, the computation time is large because many parameters are involved during back-propagation. To reduce the computation time, the gated recurrent unit (GRU) was proposed in 2014 by Cho et al., with fewer gates than the LSTM [8]. The functionality of the GRU is similar to that of the LSTM but with a modified architecture. A representative diagram of the GRU can be found in Fig. 5. Like LSTM, GRU also addresses the vanishing and exploding gradient problems by capturing long-term dependencies with the help of gating units. There are two gates in GRU, the reset gate and the update gate. The reset gate determines how much of the past information to forget, and the update gate determines how much of the past information to carry forward.
The computations at the reset gate $r_i^{(t)}$, the update gate $u_i^{(t)}$, and the hidden state $h_i^{(t)}$ at time $t$ can be represented as follows:
$$r_i^{(t)} = \sigma\left(b_i^{r} + \sum_j U_{i,j}^{r} x_j^{(t)} + \sum_j W_{i,j}^{r} h_j^{(t)}\right)$$

$$u_i^{(t)} = \sigma\left(b_i^{u} + \sum_j U_{i,j}^{u} x_j^{(t)} + \sum_j W_{i,j}^{u} h_j^{(t)}\right)$$

$$h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + \left(1 - u_i^{(t-1)}\right) \sigma\left(b_i + \sum_j U_{i,j} x_j^{(t-1)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)}\right)$$
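In code, one GRU step can be sketched as follows; a minimal NumPy illustration of the equations above (weight shapes, scaling, and the `sigmoid` helper are our own illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step: reset gate r, update gate u, then the interpolated state."""
    r = sigmoid(p["br"] + p["Ur"] @ x + p["Wr"] @ h_prev)        # reset gate
    u = sigmoid(p["bu"] + p["Uu"] @ x + p["Wu"] @ h_prev)        # update gate
    # Candidate state: the reset gate scales how much past state flows in
    cand = sigmoid(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev))
    # Convex combination: u keeps the old state, (1 - u) admits the candidate
    return u * h_prev + (1.0 - u) * cand

rng = np.random.default_rng(1)
nx, nh = 3, 4
p = {k: rng.standard_normal((nh, nx)) * 0.1 for k in ("Ur", "Uu", "U")}
p.update({k: rng.standard_normal((nh, nh)) * 0.1 for k in ("Wr", "Wu", "W")})
p.update({k: np.zeros(nh) for k in ("br", "bu", "b")})

h = np.zeros(nh)
for x in rng.standard_normal((5, nx)):   # run over a 5-step input sequence
    h = gru_step(x, h, p)
print(h.shape)  # (4,)
```

Compared with the LSTM step, there is no separate cell state and one fewer gate, which is where the reduction in parameters comes from.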
2.3.1 Advantage of LSTM and GRU over SimpleRNN

The LSTM and GRU can handle the vanishing gradient issue of SimpleRNN with the help of gating units. Both have an additive feature: they retain the past by adding the relevant past information to the present state. This additive property makes it possible to remember a specific feature in the input for a longer time. In SimpleRNN, the past information loses its relevance when new input is seen. In LSTM and GRU, an important feature is not overwritten by new information; instead, it is carried along with the new information.
2.3.2 Differences Between LSTM and GRU

There are a few differences between LSTM and GRU in terms of gating mechanism, which in turn result in differences in the content generated. In an LSTM unit, the amount of memory content to be used by other units of the network is regulated by the output gate, whereas in GRU, the full generated content is exposed to other units. Another difference is that the LSTM computes the new memory content without controlling the amount of previous state information flowing in; instead, it controls the new memory content that is added to the network. The GRU, on the other hand, controls the flow of past information when computing the new candidate but does not control the candidate activation.
2.4 Bidirectional RNN (BRNN)

In SimpleRNN, the output of a state at time t depends only on the past information x(1), ..., x(t-1) and the present input x(t). However, for many sequence-to-sequence applications, the present state output depends on the information of the whole sequence. For example, in language translation, the correct interpretation of the current word depends on the past words as well as on the next words. To overcome this limitation of SimpleRNN, the bidirectional RNN (BRNN) was proposed by Schuster and Paliwal in 1997 [9].

Bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end
of the sequence. Figure 6 illustrates a bidirectional RNN, with h(t) the state of the sub-RNN that moves forward through time and g(t) the state of the sub-RNN that moves backward through time.

Fig. 6 Bidirectional RNN with forward sub-RNN having hidden state ht and backward sub-RNN having hidden state gt

The
output of the sub-RNN that moves forward is not connected to the inputs of the sub-RNN that moves backward, and vice versa. The output o(t) depends on both past and future sequence data but is most sensitive to the input values around t.
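The two sub-RNNs can be sketched with a simple tanh recurrence; the output at each step concatenates the forward state h(t) with the backward state g(t). Dimensions and initialization below are illustrative assumptions:

```python
import numpy as np

def run_rnn(xs, W, U):
    """Simple tanh RNN over a sequence; returns the hidden state at each step."""
    h = np.zeros(W.shape[0])
    states = []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
nx, nh, T = 3, 4, 6
xs = rng.standard_normal((T, nx))
Wf, Uf = rng.standard_normal((nh, nh)) * 0.1, rng.standard_normal((nh, nx)) * 0.1
Wb, Ub = rng.standard_normal((nh, nh)) * 0.1, rng.standard_normal((nh, nx)) * 0.1

h_fwd = run_rnn(xs, Wf, Uf)              # forward sub-RNN: start -> end
g_bwd = run_rnn(xs[::-1], Wb, Ub)[::-1]  # backward sub-RNN, re-aligned to time order

# o(t) sees both h(t) (summary of the past) and g(t) (summary of the future)
outputs = [np.concatenate([h, g]) for h, g in zip(h_fwd, g_bwd)]
print(len(outputs), outputs[0].shape)  # 6 (8,)
```

Note that the two recurrences never exchange states; they only meet at the output, mirroring the description above.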
2.5 Deep RNN

Deep models can be more efficient than their shallow counterparts, and under this hypothesis the deep RNN was proposed by Pascanu et al. in 2014 [10]. In a "shallow" RNN, there are generally three blocks for the computation of parameters: the input state, the hidden state, and the output state. Each block is associated with a single weight matrix, corresponding to a shallow transformation that can be represented by a single-layer multilayer perceptron (MLP). In a deep RNN, the state of the RNN can be decomposed into multiple layers. Figure 7 shows a general deep RNN with multiple deep MLPs. Different types of depth in an RNN can be considered separately: input-to-hidden, hidden-to-hidden, and hidden-to-output. The lower layers in the hierarchy can transform the input into an appropriate representation for the higher levels of the hidden state. The hidden-to-hidden transition can be constructed from the previous hidden state and the new input; this introduces additional non-linearity into the architecture, which makes it easier to quickly adapt to changing modes of the input. Introducing a deep MLP in the hidden-to-output transition makes the layer more compact, which helps in summarizing the previous inputs and in predicting the output. Due to the deep MLPs in the RNN architecture, however, learning becomes slow and optimization more difficult.
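One step of hidden-to-hidden depth can be sketched as a stack of recurrent layers, each feeding its new hidden state to the layer above. A minimal NumPy sketch; the layer count and sizes are arbitrary illustrative choices:

```python
import numpy as np

def deep_rnn_step(x, hs, params):
    """One time step of a deep RNN: each layer's new hidden state is the
    input to the layer above (hidden-to-hidden depth)."""
    inp, new_hs = x, []
    for (U, W, b), h_prev in zip(params, hs):
        h = np.tanh(U @ inp + W @ h_prev + b)
        new_hs.append(h)
        inp = h          # feed this layer's state upward
    return new_hs

rng = np.random.default_rng(0)
nx, nh, L = 3, 4, 3      # 3 stacked recurrent layers
params = []
for layer in range(L):
    n_in = nx if layer == 0 else nh
    params.append((rng.standard_normal((nh, n_in)) * 0.1,
                   rng.standard_normal((nh, nh)) * 0.1,
                   np.zeros(nh)))

hs = [np.zeros(nh) for _ in range(L)]
for x in rng.standard_normal((5, nx)):   # run over a 5-step input sequence
    hs = deep_rnn_step(x, hs, params)
print(len(hs), hs[-1].shape)  # 3 (4,)
```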
Fig. 7 Deep RNN with inputs x1, x2, x3, ..., xt and outputs o1, o2, o3, ..., ot
long sequence if the context vector is too small. This problem was solved by Bahdanau et al. (2015) [11] by making the context vector a variable-length sequence with an added attention mechanism.
2.7 Attention Models (Transformers)

Due to the sequential learning mechanism, the context vector generated by the encoder (see Subheading 2.6) is more focused on the later part of the sequence than on the earlier part. An extension of the encoder–decoder model was proposed by Bahdanau et al. [11] for machine translation, where the model generates each word based on the most relevant information in the source sentence and the previously generated words. Unlike the earlier encoder–decoder model, where the whole input sequence is encoded into a single context vector, this extended encoder–decoder model learns to attend to the relevant words in the source sequence, regardless of their position, by encoding the input sequence into a sequence of vectors and choosing among them selectively while decoding each word. This mechanism of paying attention to the relevant information related to each word is known as the attention mechanism.
Although this model solves the problem of fixed-length context vectors, the sequential decoding problem still persists. To decode the sequence in less time by introducing parallelism, self-attention was proposed by Vaswani et al. of the Google Brain team [12]. They introduced the Transformer model, which is based on the self-attention mechanism and was designed to reduce computation time. Self-attention computes a representation of a sequence by relating different positions of the same sequence. The Transformer model has a stack of six identical layers each for encoding and for decoding the sequence, as illustrated in Fig. 8. Each layer of the encoder and decoder has sub-layers comprising multi-head self-attention mechanisms and position-wise fully connected layers. There is a residual connection around each of the two sub-layers, followed by normalization. In addition to these two sub-layers, there is a third sub-layer in the decoder that performs multi-head attention over the output of the encoder stack. In the decoder, the multi-head self-attention is masked to prevent a position from attending to later positions in the sequence. This ensures that the prediction for a position p depends only on the positions before p in the sequence.
The attention function can be described as mapping a query and a set of key-value pairs to an output, where queries, keys, values, and outputs are all vectors. To calculate the output, the dot product of the query with all keys is computed, each result is divided by $\sqrt{d_k}$ (where $d_k$ is the dimension of the keys), and a softmax is applied to obtain the weights on the values. The computation of the attention function can be represented by the following equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$
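This computation, softmax(QK^T / √d_k) V, can be sketched in a few lines of NumPy (the matrix dimensions below are illustrative assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # query-key similarities
    return softmax(scores, axis=-1) @ V       # softmax weights applied to values

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))   # 2 queries of dimension d_k = 8
K = rng.standard_normal((5, 8))   # 5 keys
V = rng.standard_normal((5, 16))  # 5 values of dimension 16
out = attention(Q, K, V)
print(out.shape)  # (2, 16)
```

Because every query attends to every key in one matrix product, all positions are processed in parallel, which is the source of the Transformer's speed advantage over sequential decoding.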
Fig. 8 Transformer with six layers of encoders and six layers of decoders
3.1 Skip Connection

The practice of skipping layers effectively simplifies the network by using fewer directly connected layers in the initial training stages. This speeds up learning by reducing the impact of vanishing gradients, as there are fewer layers to propagate through. As the network learns the feature space during the training phase, it gradually restores the skipped layers. Lin et al. [15] proposed the use of such skip connections, which follows from the idea of incorporating delays in feedforward neural networks from Lang et al. [16]. Conceptually, skip connections are a standard module in deep architectures and are commonly referred to as residual networks, as described by He et al. [17]. They skip layers in the neural network, feeding the output of one layer as the input to later layers. This technique allows gradients to flow through a network directly, without passing through non-linear activation functions, and it has been empirically shown that these additional paths are often beneficial for model convergence [17]. Skip connections can be used through the non-sequential layer in two fundamental ways in neural networks:
• Additive Skip Connections. In this type of design, the data from early layers is transported to deeper layers via matrix addition, so that back-propagation is also done via addition (Fig. 9b). This procedure does not require any additional parameters because the output from the previous layer is simply added to the layer ahead. One of the most common techniques in this type of architecture is to stack residual blocks together and use an identity function to preserve the gradient [18]. The core concept is to use a vector addition so that the gradient can back-propagate unchanged through the skip path.
Fig. 9 Skip connection residual architectures: (a) concatenate output of previous layer and skip connection; (b)
sum of the output of previous layer and skip connection
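An additive skip connection of the kind in Fig. 9b can be sketched as a small residual block. The two-layer design and the ReLU nonlinearity are illustrative choices, not prescribed by the chapter:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Additive skip connection: transform x, then add the unchanged input
    back, giving gradients a direct additive path around the layers."""
    y = relu(W1 @ x)          # first transformation
    y = W2 @ y                # second transformation (same width as x)
    return relu(y + x)        # identity shortcut added element-wise

rng = np.random.default_rng(0)
n = 8
W1, W2 = rng.standard_normal((n, n)) * 0.1, rng.standard_normal((n, n)) * 0.1
x = rng.standard_normal(n)
out = residual_block(x, W1, W2)
print(out.shape)  # (8,)
```

Because the shortcut is a plain addition, the derivative of the output with respect to `x` contains an identity term, which is exactly why gradients survive stacking many such blocks.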
3.2 Leaky Units

One of the major challenges when training RNNs is capturing long-term dependencies and efficiently transferring information from the distant past to the present. An effective method to obtain coarse time scales is to employ leaky units [19], which are hidden units with linear self-connections whose weight is close to one. In a leaky RNN, hidden units can access values from prior states and can be used to obtain temporal representations. The formula $h_t = \alpha\, h_{t-1} + (1 - \alpha)\, \tilde{h}_t$ expresses the state update, where $\tilde{h}_t$ is the newly computed state and a value of α close to 1 lets information from the past persist over long durations.
3.3 Clipping Gradients

Gradient clipping is a technique that tries to overcome the exploding gradient problem in RNN training by constraining gradients to a predetermined threshold; once the exploding gradients are clipped, the optimization can converge toward the minimum point. Gradient clipping can be used in two fundamental ways: element-wise, by limiting each gradient component to a fixed interval, or by rescaling the whole gradient vector whenever its norm exceeds a threshold.
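Gradient values can be clipped element-wise to an interval, or the whole gradient vector can be rescaled when its norm exceeds a threshold. Both can be sketched in a few lines of NumPy (the threshold values are arbitrary):

```python
import numpy as np

def clip_by_value(grad, threshold):
    """Element-wise clipping: each component is limited to [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm):
    """Norm clipping: rescale the whole gradient if its norm exceeds max_norm,
    preserving its direction."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([3.0, -4.0])          # an "exploding" gradient with norm 5
print(clip_by_value(g, 1.0))       # [ 1. -1.]
print(clip_by_norm(g, 1.0))        # [ 0.6 -0.8]
```

Norm clipping is often preferred because it keeps the update direction intact, whereas element-wise clipping can change it.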
4.2.2 Abstractive Text Summarization

Abstractive summarization frameworks expect the RNN to process input text and generate a new sequence of text that is the summary of the input text, effectively using a many-to-many RNN as a text generation model. While it is relatively straightforward for extractive summarizers to achieve basic grammatical correctness, as correct sentences are picked from the document to generate the summary, this has been a major challenge for abstractive summarizers. Grammatical correctness depends on the quality of the text generation module and has improved recently due to developments in contextual text processing and language modeling, as well as the availability of computational power to process large amounts of text.
The handling of rare tokens/words is a major concern for modern abstractive summarizers. For example, proper nouns such as specific names of people and places occur less frequently in the text; however, generated summaries are incomplete and incomprehensible if such tokens are ignored. Nallapati et al. proposed a novel solution composed of GRU-RNN layers with an attention mechanism, including a switching decoder in their abstractive summarizer architecture [28], where the text generator module has a switch that lets it choose between two options: (1) generate a word from the vocabulary or (2) point to one of the words in the input text. Their model is capable of handling rare tokens by pointing to their position in the original text. They also employed the large-vocabulary trick, which limits the vocabulary of the generator module to the tokens of the source text and then adds frequent tokens to the vocabulary set until its size reaches a certain threshold. This trick is useful for limiting the size of the network.
Summaries have latent structural information; i.e., they convey information following certain linguistic structures such as "What-Happened" or "Who-Action-What." Li et al. presented a recurrent generative decoder based on a variational auto-encoder (VAE) [29]. The VAE is a generative model that takes latent variables into account but is not inherently sequential in nature. With historical dependencies in the latent space, it can be transformed into a sequential model whose generative output takes the history of latent variables into account, hence producing a summary that follows latent structures.
4.3 Machine Translation

Neural machine translation (NMT) models are trained to process an input sequence of text and generate an output sequence that is the translation of the input sequence into another language. As mentioned in Subheading 2.6, machine translation is a classic example of the conversion of one sequence to another using an encoder–decoder architecture.
4.5 ChatBot for Mental Health and Autism Spectrum Disorder

ChatBots are automatic conversation tools that have gained vast popularity in e-commerce and as digital personal assistants like Apple's Siri and Amazon's Alexa. ChatBots represent an ideal application for RNN models, as conversations with ChatBots are sequential data. Questions and answers in a conversation should be based on past iterations of questions and answers in that conversation as well as on patterns of sequences learned from other conversations in the dataset.
Recently, ChatBots have found application in screening and
intervention for mental health disorders such as autism spectrum
disorder (ASD). Zhong et al. designed a Chinese-language Chat-
Bot using bidirectional LSTM in sequence-to-sequence framework
which showed great potential for conversation-mediated intervention for children with ASD [35]. They used 400,000 selected sentences from chat histories, in many cases involving children. Rakib et al. developed a similar sequence-to-sequence model based on Bi-LSTM to design a ChatBot that responds empathetically to mentally ill patients [36]. A detailed survey of medical ChatBots
is presented in [37]. This survey includes references to ChatBots
built using NLP techniques, knowledge graphs, as well as modern
RNN for a variety of applications including diagnosis, searching
through medical databases, dialog with patients, etc.
5 Conclusion
References
1. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
2. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci 79(8):2554–2558
3. Schmidhuber J (1993) Netzwerkarchitekturen, Zielfunktionen und Kettenregel (Network architectures, objective functions, and chain rule). Habilitation thesis, Institut für Informatik, Technische Universität München
4. Mozer MC (1995) A focused backpropagation algorithm for temporal pattern recognition. Backpropag Theory Architect Appl 137
5. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge
6. Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertainty Fuzziness Knowledge Based Syst 6(02):107–116
7. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
8. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. Preprint. arXiv:14061078
9. Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681
10. Pascanu R, Gulcehre C, Cho K, Bengio Y (2013) How to construct deep recurrent neural networks. Preprint. arXiv:13126026
11. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. Preprint. arXiv:14090473
12. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
13. Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166. https://doi.org/10.1109/72.279181
14. Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. In: Dasgupta S, McAllester D (eds) Proceedings of the 30th international conference on machine learning, PMLR, Atlanta, vol 28, pp 1310–1318
15. Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Linguist 22(1):39–71
16. Becker S, Hinton G (1992) Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355:161–163. https://doi.org/10.1038/355161a0
17. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/10.1109/CVPR.2016.90
18. Wu H, Zhang J, Zong C (2016) An empirical exploration of skip connections for sequential tagging. Preprint. arXiv:161003167
19. Jaeger H (2002) Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the echo state network approach. GMD-Forschungszentrum Informationstechnik 5
20. Bengio Y, Ducharme R, Vincent P (2001) A neural probabilistic language model. In: Advances in neural information processing systems, pp 932–938
21. Mikolov T, Karafiát M, Burget L et al (2010) Recurrent neural network based language model. In: INTERSPEECH 2010. Citeseer
22. Jain G, Sharma M, Agarwal B (2019) Optimizing semantic LSTM for spam detection. Int J Inform Technol 11(2):239–250
23. Bagnall D (2015) Author identification using multi-headed recurrent neural networks. Preprint. arXiv:150604891
24. Zhou C, Sun C, Liu Z, Lau F (2015) A C-LSTM neural network for text classification. Preprint. arXiv:151108630
25. Wang B (2018) Disconnected recurrent neural networks for text categorization. In: Proceedings of the 56th annual meeting of the Association for Computational Linguistics (volume 1: long papers), pp 2311–2320
26. Nallapati R, Zhai F, Zhou B (2017) SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. In: Thirty-first AAAI conference on artificial intelligence
27. Xu J, Durrett G (2019) Neural extractive text summarization with syntactic compression. Preprint. arXiv:190200863
28. Nallapati R, Zhou B, dos Santos C, Gulcehre Ç, Xiang B (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: Proceedings of the 20th SIGNLL conference on computational natural language learning, pp 280–290
29. Li P, Lam W, Bing L, Wang Z (2017) Deep recurrent generative decoder for abstractive text summarization. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 2091–2100
30. Cho K, van Merrienboer B, Gülçehre Ç, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: The 2014 conference on empirical methods in natural language processing (EMNLP)
31. Luong MT, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. Preprint. arXiv:150804025
32. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128–3137
33. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: International conference on machine learning, PMLR, pp 2048–2057
34. Biten AF, Gomez L, Rusinol M, Karatzas D (2019) Good news, everyone! Context driven entity-aware captioning for news images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12466–12475
35. Zhong H, Li X, Zhang B, Zhang J (2020) A general Chinese chatbot based on deep learning and its application for children with ASD. Int J Mach Learn Comput 10:519–526. https://doi.org/10.18178/ijmlc.2020.10.4.967
36. Rakib AB, Rumky EA, Ashraf AJ, Hillas MM, Rahman MA (2021) Mental healthcare chatbot using sequence-to-sequence learning and BiLSTM. In: Brain informatics. Springer International Publishing, pp 378–387
37. Tjiptomongsoguno ARW, Chen A, Sanyoto HM, Irwansyah E, Kanigoro B (2020) Medical chatbot techniques: a review. In: Silhavy R, Silhavy P, Prokopova Z (eds) Software engineering perspectives in intelligent systems. Springer International Publishing, Cham, pp 346–356
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 5
Abstract
Generative networks are fundamentally different in their aims and methods from CNNs for classification, segmentation, or object detection. They were initially meant not as an image analysis tool but as a way to produce naturally looking images. The adversarial training paradigm has been proposed to stabilize generative methods and has proven to be highly successful, though by no means from the first attempt.
This chapter gives a basic introduction into the motivation for generative adversarial networks (GANs) and traces the path of their success by abstracting the basic task and working mechanism and deriving the difficulty of early practical approaches. Methods for more stable training will be shown, as well as typical signs of poor convergence and their reasons.
Though this chapter focuses on GANs that are meant for image generation and image analysis, the adversarial training paradigm itself is not specific to images and also generalizes to tasks in image analysis. Examples of architectures for image semantic segmentation and abnormality detection will be presented, before contrasting GANs with further generative modeling approaches lately entering the scene. This will allow a contextualized view on the limits but also the benefits of GANs.
Key words Generative models, Generative adversarial networks, GAN, CycleGAN, StyleGAN,
VQGAN, Diffusion models, Deep learning
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_5,
© The Author(s) 2023
140 Markus Wenzel
Fig. 1 The fundamental GAN setup for image generation consisting of a genera-
tor and a discriminator network; here, CNNs
Fig. 2 Google web search-based interest estimate for “generative adversarial networks” since 2014. Relative
scale
Fig. 3 Some of the most-starred shared GAN code repositories on Github, until 2018. Ranking within this
selection in brackets
2 Generative Models
2.1 The Language of Generative Models: Distributions, Density Estimation, and Estimators

Understanding the principles of generative models requires a basic knowledge of distributions. The reason is that, as already hinted at in the previous section, the "fraction of the world" is in fact something that can be thought of as a distribution in a parameter space. If you were to describe a part of the world in a computer-interpretable way, you would define descriptive parameters. To describe persons, you could characterize them by simple measures like age, height, weight, hair and eye color, and many more. You could add blood pressure, heart rate, muscle mass, maximum strength, and more, and even a whole-genome sequencing result might be a parameter. Each of the parameters individually can be collected for the world population, and you will obtain a picture of how this parameter is "distributed" worldwide. In addition, parameters will be in relation with each other, for example, age and maximum strength. Countless such relationships exist, of which the majority are and probably will remain unknown. These interrelationships are called a joint distribution. If you knew the joint distribution, you could "create" a plausible parameter combination of a nonexistent human. Let us formalize these thoughts now.
GANs and Beyond 143
Box 1 (continued)

Conditional Probability P(A|B): The probability of an outcome a ∈ A given the condition described by b ∈ B. The conditional probability is written as p(a|b) for concrete instances or P(A|B) if talking about the entire probability distributions A and B.

Joint Probability P(A, B): The probability of seeing instantiations of A and B together is termed the joint probability. Notably, if expanded, this will lead to a large table of probabilities, joining each possible a ∈ A (e.g., stroke, dementia, Parkinson's disease, etc.) with each possible b ∈ B (casual smoker, frequent smoker, nonsmoker, etc.).

Marginal Probability: The marginal probabilities of A and B (denoted, respectively, P(A) and P(B)) are the probabilities of each possible outcome across (and independent of) all of the possible outcomes of the other distribution. For example, it is the probability of seeing nonsmokers across all neurological diseases or of seeing a specific disease regardless of smoking status. It is said to be the probability of one distribution marginalized over the other probability distributions.
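These definitions can be checked on a tiny invented joint table; the numbers below are purely illustrative, not epidemiological data:

```python
import numpy as np

# Hypothetical joint probability table P(A, B):
# rows = disease status A (stroke, dementia), columns = smoking B (smoker, nonsmoker)
joint = np.array([[0.10, 0.20],
                  [0.15, 0.55]])
assert np.isclose(joint.sum(), 1.0)   # a joint distribution sums to one

p_A = joint.sum(axis=1)   # marginal P(A): sum over all smoking outcomes
p_B = joint.sum(axis=0)   # marginal P(B): sum over all disease outcomes

# Conditional P(A | B = smoker): the smoker column renormalized by its marginal
p_A_given_smoker = joint[:, 0] / p_B[0]

print(p_A)                # [0.3 0.7]
print(p_B)                # [0.25 0.75]
print(p_A_given_smoker)   # [0.4 0.6]
```

The same relations, marginalization by summing and conditioning by renormalizing, carry over unchanged to continuous distributions, with integrals in place of sums.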
2.1.2 Density Estimation

We assume in the following that our observations have been produced by a function or process that is not known to us and that cannot be guessed from an arrangement of the observations. In a practical example, the images from a CT or MRI scanner are produced by such a function. Notably, the concern is less about the intractability of the imaging physics than about the appearance of the human body. The imaging physics might be modeled analytically up to a certain error. But the outer shape and inner structure of the
Fig. 4 Brain MRI (left) and histogram of gray values for one slice of a brain MRI
Fig. 5 Abdominal CT (left) and histogram of gray values for one slice of an abdominal CT
Fig. 6 A Gaussian mixture model (GMM) of four Gaussians was fit to the brain MRI data we have visualized as a
histogram in Fig. 4
2.1.3 Estimators and the Expected Value

Assume we have found suitable mean values and standard deviations for three normal distributions that together approximate the shape of the MRI data density estimate to our satisfaction. Such a combination of normal (Gaussian) distributions is called a Gaussian mixture model (GMM), and sampling from such a GMM is straightforward. We are thus able to sample single pixels in any number, and over time we will sample them such that their density estimate or histogram will look similar to the one we started with.
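Sampling from a GMM proceeds in two steps: first draw which mixture component to use, according to the mixture weights, then draw from that component's Gaussian. A sketch with invented component parameters (not the values actually fitted to the MRI data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-component GMM for single gray values: (weight, mean, std)
components = [(0.5, 100.0, 20.0),
              (0.3, 400.0, 50.0),
              (0.2, 800.0, 80.0)]
weights = np.array([w for w, _, _ in components])

def sample_gmm(n):
    # Step 1: choose a component per sample according to the mixture weights
    idx = rng.choice(len(components), size=n, p=weights)
    # Step 2: draw from the chosen component's Gaussian
    means = np.array([m for _, m, _ in components])[idx]
    stds = np.array([s for _, _, s in components])[idx]
    return rng.normal(means, stds)

values = sample_gmm(10_000)
print(values.shape)  # (10000,)
# Over many samples, the histogram of `values` approaches the mixture density.
```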
However, if we want to generate a brain MRI image using a
sampling process from our closed-form GMM representation of the
distribution, we will notice that a very important notion wasn’t
respected in our approach. We start with one slice of 512 × 512
voxels and therefore randomly draw the required number of voxel
values from the distribution. However, this will not yield an image
that resembles one slice of a brain MRI, but will almost look like
random noise, because we did not model the spatial relation of the
gray values with respect to each other. Since the majority of voxels
of a brain MRI are not independent of each other, drawing one new
voxel from the distribution needs to depend on the spatial locations
and gray values of all voxels drawn before. Neighboring voxels will
have a higher likelihood of similar gray values than voxels far apart
from each other, for example. More crucially, underneath the inter-
dependence lies the image generation process: the image values
observed in a real brain MRI stem from actual tissue—and this is
what defines their interdependence. This means the anatomy of the
brain indirectly reflects itself in the rules describing the dependency
of gray values of one another.
For the modeling process, this implies that we cannot argue
about single-voxel values and their likelihood, but we need to
approach the generative process differently. One idea for a genera-
tive process has been implied in the above description already: pick
a random location of the to-be-generated image and predict the
gray value depending on all existing voxel values. Implemented with the method of mixture models, this results in unfathomably many distributions to be estimated, since for each possible "next voxel" location, every possible combination of already existing voxel values and positions needs to be considered. We will see
in Subheading 5.1 on diffusion models how this general approach
to image generation can still be made to work.
A different sequential approach to image generation has also
been attempted, in which pixels are generated in a defined order,
starting at the top left and scanning the image row by row across
the columns. Again, the knowledge about the already produced
GANs and Beyond 149
pixels is memorized and used to predict the next voxel. This has
been dubbed the PixelRNN (Pixel Recurrent Neural Network), which borrows its general idea from text-processing networks [9].
Lastly, a direct approach to image generation could be formulated by representing or approximating the full joint distribution of all voxels in one tractable distribution and sampling all voxels at once from it. The full joint distribution in this approach remains implicit, and we use a surrogate. This will actually be the approach implemented in GANs, though not in a naïve way.
Running the numbers on what a naïve likelihood-based approach implies, the difficulties of making it work become obvious. Consider an MRI image as the joint distribution of 512 × 512 voxels (one slice of our brain MRI), where we approximated the gray value distribution of one voxel with a GMM with six parameters. This results in a joint distribution with 512 × 512 × 6 = 1,572,864 parameters. Conceptually, this representation therefore spans a 1,572,864-dimensional space, in which every brain MRI slice will be one data point. Referring back to the histograms
of CT and MRI images in the figures above, we have seen continuous lines with densities because we have collected all voxels of an entire medical image, which number in the millions. Still, we only covered one single dimension out of the roughly 1.5 million. Searching for
the density in the 1,572,864-dimensional MRI-slice-space that is
given by all collected brain MRI slices is the difficult task any
generative algorithm has to solve. In this vastly large space, the
brain MRI slices “live” in a very tiny region that is extremely hard to
find. We say the images occupy a low-dimensional manifold within
the high-dimensional space.
Consider the maximum likelihood formulation

$$\hat{\theta} = \arg\max_{\theta} \, \mathbb{E}_{x \sim P_{\text{data}}} \log Q_{\theta}(x \mid \theta) \qquad (1)$$
¹ We will use θ when referring to parameters of models in general but designate parameters of neural networks with w in accordance with the literature.
150 Markus Wenzel
2.1.4 Sampling from Distributions

When a distribution is a model of how observed values occur, then sampling from this distribution is the process of generating random
new values that could have been observed, with a probability similar to the probability of observing that value in reality. There are two basic approaches to sampling from distributions. The first is to generate a random number from the uniform distribution (this is what a random number generator is always doing underneath) and feed it through the inverse cumulative distribution function (iCDF); the CDF itself is the integral of the probability density function (PDF). This only works if the inverse CDF is given in closed form. If it is not, the second approach to sampling can be used, which is called acceptance (or rejection) sampling. With f being the PDF, two random numbers x and y are drawn from the uniform distribution (y scaled to the maximum of f), and x is accepted if f(x) > y, and rejected otherwise.
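The acceptance/rejection scheme just described can be sketched as follows; the Beta(2, 2) target density and its maximum are chosen purely for illustration, and the y draws are scaled to the density's maximum:

```python
import numpy as np

rng = np.random.default_rng(0)

def target_pdf(x):
    # Illustrative density on [0, 1]: Beta(2, 2), f(x) = 6 x (1 - x)
    return 6.0 * x * (1.0 - x)

F_MAX = 1.5  # maximum of the density, used to scale the y draws

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0.0, 1.0)     # candidate value
        y = rng.uniform(0.0, F_MAX)   # uniform "height"
        if target_pdf(x) > y:         # accept if the point lies under the curve
            samples.append(x)
    return np.array(samples)

s = rejection_sample(50_000)
```

The accepted values follow the Beta(2, 2) density; its mean is 0.5.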
Our use case, as we have seen, involves not only high-dimensional (multivariate) distributions but, more importantly, their joint distributions, and these are not given in closed form. In such scenarios, sampling can still be done using Markov chain Monte Carlo (MCMC) sampling, a framework based on rejection sampling with added mechanisms to increase efficiency. While MCMC has favorable theoretical properties, it is still computationally very demanding for complex joint distributions, which leads to serious difficulties in the context of sampling from the distributions we face in the domain of image analysis and generation.
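As a minimal illustration of the MCMC idea (here the plain Metropolis algorithm, one member of the MCMC family, applied to a one-dimensional unnormalized density; a real use case would involve a complex joint distribution):

```python
import numpy as np

rng = np.random.default_rng(1)

def unnormalized_density(x):
    # Standard normal up to a constant; MCMC only needs density ratios,
    # so the normalization constant can remain unknown.
    return np.exp(-0.5 * x * x)

def metropolis(n_samples, step=1.0):
    x = 0.0
    out = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + rng.normal(0.0, step)  # random-walk proposal
        # accept with probability min(1, ratio of densities)
        if rng.uniform() < unnormalized_density(proposal) / unnormalized_density(x):
            x = proposal
        out[i] = x
    return out

chain = metropolis(50_000)
```

After enough steps, the chain's samples are distributed according to the target density (here: zero mean, unit variance), illustrating why MCMC works without a closed-form CDF, and also why it is slow: samples arrive one at a time and are correlated.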
We are therefore at this point facing two problems: we can
hardly hope to be able to estimate the density, and even if we
could, we could practically not sample from it.
3.1 Generative vs. Discriminative Models

To emphasize the difficulty that generative models are facing, compare them to discriminative models. Discriminative models solve tasks like classification, detection, and segmentation, to name some of the most prominent examples. That classification models belong to the class of discriminative models is obvious: discriminating examples is exactly classifying them. Detection models are also discriminative models, though in a broader sense, in that they classify the
Fig. 7 The discriminative task compared to the generative task. Discriminative models only need to find the
separating line between classes, while generative models need to delineate the part of space covering the
classes (figure inspired by: https://developers.google.com/machine-learning/gan/generative)
3.2 Before GANs: Variational Autoencoders

Generative adversarial networks (GANs) were neither the first nor the only attempt at generating realistic-looking images (or any type of output, generally speaking). Apart from GANs, a related neural network-based approach to generative modeling is the variational autoencoder, which will be treated in more detail below. Other generative models take different approaches still.
3.2.1 From AE to VAE

VAEs are an interesting subject to study to highlight the limits that a loss function like the KL divergence may place on a model. We will begin with a short recap of plain autoencoders to introduce the concept of learning a latent representation. We will then modify the autoencoder into a variational formulation, which brings about the switch to a divergence measure as a loss function. From there, we will show how GANs again modified the loss function to succeed in high-quality image generation.
Figure 8 shows the schematic of a plain autoencoder (AE). As
indicated in the sketch, input and output are of potentially very
high dimensionality, like images. In between the encoder and decoder networks lies a "bottleneck" representation of orders of magnitude lower dimensionality (represented, for example, by a convolutional layer with only a few channels or a dense layer with a small number of units), which forces the network to find an encoding that preserves all information required for reconstruction.
A typical loss function to use when training the autoencoder is cross entropy, which is applicable for sigmoid activation functions, or simply the mean squared error (MSE). Either loss essentially forces the AE to learn the identity function between input and output.
Let us introduce the notation for this. Let X be the input image
tensor and X′ the output image tensor. With fw being the encoder
function given as a neural network parameterized by weights and
biases w and gv the decoder function parameterized by v, the loss
hence works to make X = X′ = gv( fw(X)).
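A minimal sketch of this setup, using a purely linear encoder and decoder trained by plain gradient descent so that it stays self-contained (a real autoencoder would use nonlinear neural network layers); the data, dimensions, and learning rate are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3-D points lying (almost) on a 1-D line, so a bottleneck of
# dimension 1 can preserve nearly all information needed to reconstruct.
t = rng.normal(size=(500, 1))
direction = np.array([[1.0, 2.0, -1.0]])
X = t @ direction + 0.01 * rng.normal(size=(500, 3))

W_enc = rng.normal(scale=0.1, size=(3, 1))  # encoder f_w (3 -> 1)
W_dec = rng.normal(scale=0.1, size=(1, 3))  # decoder g_v (1 -> 3)

def mse(A, B):
    return np.mean((A - B) ** 2)

loss_before = mse(X @ W_enc @ W_dec, X)

lr = 0.05
for _ in range(4000):                        # plain gradient descent on MSE
    H = X @ W_enc                            # bottleneck code f_w(X)
    X_rec = H @ W_dec                        # reconstruction X' = g_v(f_w(X))
    grad_out = 2.0 * (X_rec - X) / X.size    # d(MSE)/d(X_rec)
    grad_dec = H.T @ grad_out
    grad_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

loss_after = mse(X @ W_enc @ W_dec, X)
```

After training, the reconstruction error drops to roughly the noise floor: the network has found a one-dimensional code that preserves the information needed to approximate the identity function on this data.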
This, however, does not make the search any easier since we
need to enumerate and sum up all z. Therefore, an approximation is
made through a surrogate distribution, parameterized by another
set of parameters, qv . Weng [15] shows in her explanation of the
VAE the graphical model highlighting how qv is a stand-in for the
unknown, searched-for pw (see Fig. 9).
The reason to introduce this surrogate distribution actually
comes from our wish to train neural networks for the decoding/
encoding functions, and this requires us to back-propagate through
the random variable, z, which of course cannot be done. Instead, if
we have control over the distribution, we can select it such that the
reparameterization trick can be employed. We define qv to be a
multivariate Gaussian distribution with means and a covariance
matrix that can be learned and a stochastic element multiplied to
the covariance matrix for sampling [15, 16]. With this, we can back-
propagate through the sampling process.
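The reparameterization trick can be sketched as follows; the mean and log-variance values stand in for encoder outputs and are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder outputs for one input (assumed values): mean and log-variance
# of a diagonal-Gaussian q_v(z|X) over a 4-dimensional latent space.
mu = np.array([0.5, -1.0, 0.0, 2.0])
log_var = np.array([0.0, -2.0, 1.0, 0.0])

def sample_z(n):
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
    # The randomness is isolated in eps, so gradients can flow through
    # mu and log_var, which are the learnable encoder outputs.
    eps = rng.normal(size=(n, mu.size))
    return mu + np.exp(0.5 * log_var) * eps

z = sample_z(100_000)
```

The samples have exactly the requested means and standard deviations, but the sampling path is now differentiable with respect to `mu` and `log_var`.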
² Though variational autoencoders are in general not necessarily neural networks, in our context we restrict ourselves to this implementation and stick to the notation with parameters w and v, where many publications denote them θ and ϕ.
Fig. 9 The graphical model of the variational autoencoder. In a VAE, the variational decoder is pw(X|z), while
the variational encoder is qv(z|X) (Figure after [15])
= 0.0899
Note that in the example in Box 3, there is both a P(X = xi) and
Q(X = xi) for each i ∈{0, 1, . . ., 8}. This is crucial for KL divergence
to work as a loss function.
3.2.3 Optimizing the KL Divergence

Examine what happens in forward and reverse KL if this condition is not satisfied for some i. If in forward KL P has values everywhere
but Q has not (or extremely small values), the quotient in the log
Fig. 11 The distributions P and Q, scaled to unit density, with added labels
Fig. 12 The distributions P (solid) and Qθ (dashed), in the initial configuration and after minimizing reverse KL
DKL(Qθ|P). This time, in the initial configuration, Qθ has values greater than 0 where P has not (marked with
green shading)
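The asymmetry of forward and reverse KL, and the blow-up when Q lacks support where P has mass, can be illustrated on small discrete distributions (values invented for illustration):

```python
import numpy as np

def kl(p, q):
    # Discrete KL divergence D_KL(p || q); terms with p = 0 contribute 0.
    mask = p > 0
    with np.errstate(divide="ignore"):
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.4, 0.3, 0.2, 0.1])
Q1 = np.array([0.25, 0.25, 0.25, 0.25])  # support everywhere
Q2 = np.array([0.5, 0.5, 0.0, 0.0])      # misses part of P's support

forward = kl(P, Q1)    # finite: Q1 covers the support of P
reverse = kl(Q1, P)    # finite, but a different value: KL is asymmetric
blown_up = kl(P, Q2)   # infinite: P has mass where Q2 has none
```

The forward and reverse directions give different finite values for the same pair of distributions, while the quotient inside the log diverges as soon as Q is zero somewhere P is not.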
3.2.4 The Limits of VAE

In the VAE, reverse KL is used. Our optimization goal is maximizing the likelihood of producing realistic-looking examples, i.e., ones with a high pw(x). Simultaneously, we want to minimize the difference between the true and estimated posterior distributions qv and pw.
This can only be achieved through a reformulation of reverse KL
[15]. After some rearranging of reverse KL, the loss of the varia-
tional autoencoder becomes
$$L_{\text{VAE}}(w, v) = -\log p_w(X) + D_{\text{KL}}(q_v(z \mid X) \,\|\, p_w(z \mid X))$$
$$= -\mathbb{E}_{z \sim q_v(z \mid X)} \log p_w(X \mid z) + D_{\text{KL}}(q_v(z \mid X) \,\|\, p_w(z)) \qquad (4)$$

ŵ and v̂ are the parameters minimizing the loss.
We have seen how mode-seeking reverse KL divergence limits
the generative capacity of variational autoencoders through the
potential underrepresentation of all modes of the original
distribution.
3.3 The Fundamental GAN Approach

At the core of the adversarial training paradigm is the idea of two players competing in a minimax game. In such games, both
players have access to the same variables but have opposing goals, so
that they will manipulate the variables in different directions.
Referring to Fig. 13, we can see the generative part in orange
color, where random numbers are drawn from the latent space and,
one by one, converted into a set of “fake images” by the generator
Fig. 13 Schematic of a GAN network. Generator (orange) creates fake images based on random numbers
drawn from a latent space. These together with a random sample of real images are fed into the discriminator
(blue, right). The discriminator looks at the batch of real/fake images and tries to assign the correct label (“0”
for fake, “1” for real)
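The alternating optimization sketched in the figure can be written down schematically; the discriminator outputs below are invented stand-ins, and the loss expressions follow the standard minimax value function:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    # The discriminator maximizes E[log D(x)] + E[log(1 - D(G(z)))];
    # written as a loss to minimize, we negate it (binary cross entropy
    # with labels "1" for real and "0" for fake).
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-8):
    # The generator pulls in the opposite direction: it wants D(G(z))
    # to be close to 1 (non-saturating form, a common practical choice).
    return -np.mean(np.log(d_fake + eps))

# Illustrative discriminator outputs for one batch (assumed values)
d_real = np.array([0.9, 0.8, 0.95])   # D's scores on real images
d_fake = np.array([0.1, 0.2, 0.05])   # D's scores on generated images
```

A confident discriminator (scores near 1 on real, near 0 on fake) yields a small discriminator loss and a large generator loss, which is exactly the opposing-goal structure of the minimax game.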
3.4 Why Early GANs Were Hard to Train

GANs with this training objective implicitly use the JS divergence as the loss, which can be seen by examining the GAN training objective. Consider the ideal discriminator D for a fixed generator. Its loss is minimal for the optimal discriminator given by [1]

$$\hat{D}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}. \qquad (6)$$
Substituting D̂ into Eq. 5 yields (without proof) the implicit use of the Jensen-Shannon divergence if the above training objective is employed:
$$J(G, \hat{D}) = 2\, D_{\text{JS}}(p_{\text{data}} \,\|\, p_G) - \log 4. \qquad (7)$$
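The relation in Eq. 7 can be checked numerically on small discrete stand-ins for p_data and p_G:

```python
import numpy as np

# Two discrete distributions standing in for p_data and p_G.
p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.4, 0.3, 0.2, 0.1])

d_hat = p_data / (p_data + p_g)  # optimal discriminator, Eq. 6

# GAN value function J(G, D) from Eq. 5, evaluated at the optimum D̂
j = np.sum(p_data * np.log(d_hat)) + np.sum(p_g * np.log(1.0 - d_hat))

# Jensen-Shannon divergence between p_data and p_G
m = 0.5 * (p_data + p_g)
def kl(p, q):
    return np.sum(p * np.log(p / q))
d_js = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)
```

Evaluating both sides confirms Eq. 7: the value function at the optimal discriminator equals 2 D_JS(p_data ‖ p_G) − log 4.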
3.5 Improving GANs

GAN training quickly became notorious for the difficulties it posed to researchers attempting to apply the mechanism to
real-world problems. We have qualitatively attributed a part of these
problems to the inherently difficult task of density estimation and
motivated the intuition that while fewer samples might suffice to
learn a decision boundary in a discriminative task, many more
examples are required to build a powerful generative model.
In the following, some more light shall be shed on the reasons
why GAN training might fail. Typical GAN problems comprise the
following:
Fig. 14 PET images generated from random noise using a DCGAN architecture. Image taken from [21]
(CC-BY4.0)
3.6 Wasserstein GANs

Wasserstein GANs appealed to the deep learning and GAN scene very quickly after Arjovsky et al.'s [22] seminal publication because of a number of traits their inventors claimed they would have.
For one, Wasserstein GANs are based on the theoretical idea that
the change of the loss function to the Wasserstein distance should
lead to improved results. This combined with the reported bench-
mark performance would already justify attention. But Wasserstein
GANs additionally were reported to train much more stably,
because, as opposed to previous GANs, the discriminator would
be trained to convergence in every iteration, instead of demanding
a carefully and heuristically found update schedule for generator
and discriminator. In addition, the loss was reported to correlate directly with the visual quality of generated results, instead of being essentially meaningless in a minimax game.
Wasserstein GANs are therefore worth an in-depth treatment in
the following sections.
3.6.1 The Wasserstein (Earthmover) Distance

The Wasserstein distance figuratively measures how, with an optimal transport plan, mass can be moved from one configuration to
another configuration with minimal work. Think, for example, of
heaps of earth. Figure 15 shows two heaps of earth, P and
Q (discrete probability distributions), both containing the same
amount of earth in total, but in different concrete states x and
y out of all possible states.
Work is defined as the number of shovelfuls of earth times the distance they are moved. In the three rows of the figure, earth is moved (only within one of P or Q, not from one to the other) in order to make the two configurations identical. First, one shovelful of earth is moved one
pile further, which adds one to the Wasserstein distance. Then, two
shovelfuls are moved three piles, adding six to the final Wasserstein
distance of DW = 7.
Note that in an alternative plan, it would have been possible to
move two shovelfuls of earth from p4 to p1 (costing six) and one
from p4 to p3, which is the inverse transport plan of the above,
executed on P, and leading to the same Wasserstein distance. The
Wasserstein distance is in fact a distance, not a divergence, because
it yields the same result regardless of the direction. Also note that
Fig. 15 One square is one shovelful of earth. Transporting the earth shovelful by shovelful from pile to pile accumulates performed work: the Wasserstein (earthmover) distance. The example shows a Wasserstein distance of DW = 7
³ The support, graphically, is the region where the distribution is not equal to zero.
⁴ An extensive treatment of Wasserstein distance and optimal transport in general is given in the 1,000-page treatment of Villani's book [24], which is freely available for download.
⁵ This section owes to the excellent blog post of Vincent Herrmann, at https://vincentherrmann.github.io/blog/wasserstein/. Also recommended is the treatment of the "Wasserstein GAN" paper by Alex Irpan at https://www.alexirpan.com/2017/02/22/wasserstein-gan.html. An introductory treatment of Wasserstein distance is also found in [25, 26].
3.6.2 Implementing WGANs

To implement the distance as a loss function, we rewrite the last result again as
$$D_W(P, Q) = \max_{w \in W} \, \mathbb{E}_{x \sim P}[D_w(x)] - \mathbb{E}_{z \sim Q}[D_w(G_w(z))]. \qquad (9)$$
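A sketch of the per-batch losses that Eq. 9 induces, together with the weight clipping used in the original WGAN to keep the critic in the function family W (the critic outputs below are invented for illustration):

```python
import numpy as np

def critic_loss(critic_real, critic_fake):
    # The critic ascends E[D_w(x)] - E[D_w(G_w(z))]; written as a loss
    # to minimize, we take the negative of Eq. 9's objective.
    return -(np.mean(critic_real) - np.mean(critic_fake))

def generator_loss(critic_fake):
    # The generator tries to increase the critic's score on its fakes.
    return -np.mean(critic_fake)

def clip_weights(weights, c=0.01):
    # Weight clipping keeps D_w in a family of (roughly) Lipschitz-
    # bounded functions, as in the original WGAN formulation.
    return np.clip(weights, -c, c)

# Illustrative critic outputs for one batch (assumed values)
real_scores = np.array([0.8, 0.6, 0.9])
fake_scores = np.array([0.1, 0.2, 0.0])
```

In training, the critic would be updated (and clipped) for several steps per generator step, so that the score difference approximates the Wasserstein distance between the real and generated distributions.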
3.6.3 Example Application: Brain Abnormality Detection Using WGAN

One of the first applications of Wasserstein GANs in a practical use case was presented in the medical domain, specifically in the context of attributing visible changes of a diseased patient with respect to a normal control to locations in the images [27]. The way this
detection problem was cast into a GAN approach (and then solved
with a Wasserstein GAN) was to delineate the regions that make the
images of a diseased patient look "diseased," i.e., to find the residual region that, if subtracted from the diseased-looking image, would
make it look “normal.”
Figure 16 shows the construction of the VA-GAN architecture
with images from a mocked dataset for illustration. For the authors’
results, see their publication and code repository.7
For their implementation, the authors note that neither batch normalization nor layer normalization helped convergence and hypothesize that the difference between real and generated examples may be the reason why batch normalization in particular can have an adverse effect, especially during the early training phase.
Instead, they impose an ℓ1 norm loss component on the U-Net-generated "visual (feature) attribution" (VA) map to ensure that it represents a minimal change to the subject. This prevents the generator from changing the subject into some "average normal" image that it might otherwise learn. They employ an update regime that
trains the critic network for more iterations than the generator, but
doesn’t train it to convergence as proposed in the original WGAN
publications. Apart from these measures, in their code repository,
the authors give several practical hints and heuristics that may
stabilize the training, e.g., using a tanh activation for the generator
or exploring other dropout settings and in general using a large
enough dataset. They also point out that the Wasserstein distance
isn’t suited for model selection since it is too unstable and not
directly correlated to the actual usefulness of the trained model.
⁶ The discriminator network in the context of continuous generator loss functions like the Wasserstein-based loss is called a "critic" network, as it no longer discriminates but yields a metric. For ease of reading, this chapter sticks to the term "discriminator."
⁷ https://github.com/baumgach/vagan-code.
Fig. 16 An image of a diseased patient is run through a U-Net with the goal of yielding a map that, if added to the input image, results in a modified image that fools the discriminator ("critic") network into classifying it as a "normal" control. The map can be interpreted as marking the regions that appear abnormal, giving rise to the name of the architecture: visual attribution GAN (VA-GAN)
3.7 GAN Performance Metrics

One pressing question has so far been postponed, though it implicitly plays a crucial role in the quest for "better" GANs: how to actually measure the success of a GAN, or its performance in terms of result quality?
GANs can be adapted to solve image analysis tasks like segmen-
tation or detection (cf. Subheading 3.6.3). In such cases, the qual-
ity and success can be measured in terms of task-related
performance (Jaccard/Dice coefficient for segmentation, overlap
metrics for detection, etc.).
Performance assessment is less trivial if the GAN is meant to
generate unseen images from random vectors. In such scenarios,
the intuitive criterion is how convincing the generated results are.
But convincing to whom? One could expose human observers to
the real and fake images, ask them to tell them apart, and call a GAN
better than a competing GAN if it fools the observer more consis-
tently.8 Since this is practically infeasible, metrics were sought that
provide a more objective assessment.
⁸ In fact, there is only very little research on the actual performance of GANs in fooling human observers, though guides exist on how to spot "typical" GAN artifacts in generated images. These are older than the latest GAN models, and it can be hypothesized that the lack of such literature is indirect confirmation of the overwhelming capacity of GANs to fool human observers.
The most widely used way to assess GAN image quality is the Fréchet inception distance (FID). This distance is conceptually related to the Wasserstein distance. It has an analytical solution for the distance between Gaussian (normal) distributions. In the multivariate case, the Fréchet distance between two distributions X and Y is given by the squared distance of their means μX (resp. μY) and a term depending on the covariance matrices describing their variances ΣX (resp. ΣY):

$$d(X, Y) = \|\mu_X - \mu_Y\|^2 + \mathrm{Tr}\!\left(\Sigma_X + \Sigma_Y - 2\sqrt{\Sigma_X \Sigma_Y}\right). \qquad (10)$$
In practice, this distance function is used as a score, computed as follows:

• Take two batches of images (real and fake, respectively).

• Run them through a feature extraction or embedding model. For FID, the Inception model is used, pretrained on ImageNet. Retain the embeddings for all examples.

• Fit one multivariate normal distribution each to the embedded real and fake examples.

• Calculate their Fréchet distance according to the analytical formula in Eq. 10.
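The score computation above can be sketched as follows, assuming the embeddings have already been extracted; the trace of the matrix square root is computed via eigenvalues to keep the sketch dependency-free:

```python
import numpy as np

def frechet_distance(emb_real, emb_fake):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_real, emb_fake: arrays of shape (n_samples, n_features), e.g.
    Inception features of real and generated images (assumed given).
    """
    mu_r, mu_f = emb_real.mean(axis=0), emb_fake.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_f = np.cov(emb_fake, rowvar=False)
    # Tr(sqrt(cov_r @ cov_f)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_f (real and nonnegative for covariance
    # matrices, up to numerical noise, hence the abs()).
    ev = np.linalg.eigvals(cov_r @ cov_f)
    tr_sqrt = np.sum(np.sqrt(np.abs(ev)))
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f) - 2.0 * tr_sqrt)
```

Identical embedding sets yield a distance of zero, and shifting one set by a constant offset contributes exactly the squared mean difference, as Eq. 10 prescribes.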
This metric has a number of downsides. Typically, if computed for a larger batch of images, it decreases, although the same model is being evaluated. This bias can be remedied, but FID nevertheless remains the most used metric. Also, if the Inception network cannot capture the features of the data FID is to be used on, it might simply be uninformative. This is obviously a grave concern in the medical domain, where imaging features look very different from those of natural images (although, on the other hand, transfer learning for medical classification problems has proved to work surprisingly well, so that apparently convolutional filters trained on photographs also extract applicable features from medical images). In any case, the selection of the pretrained embedding model introduces a bias into the validation results. Lastly, the assumption of a multivariate normal distribution for the Inception features might not be accurate, and describing them only through their means and covariances is a severe reduction of information. Therefore, a qualitative evaluation is still required.
One obvious additional question arises: If the ultimate metric
to judge the quality of the generator is given by, for example, the
FID, why can’t it be used as the optimization goal instead of
minimizing a discriminator loss? In particular, as the Fréchet dis-
tance is a variant of the Wasserstein distance, an answer to this
question is not obvious. In fact, feature matching as described in
Box 4 exactly uses this type of idea, and likewise, it has been
partially adopted in recent GAN architectures to enhance the sta-
bility of training with a more fine-grained loss component than a
pure categorical cross-entropy loss on the “real/fake” classification
of the discriminator.
4.1 Conditional GAN

GANs cannot be told what to produce, at least that was the case with early implementations. It was obvious, though, that a properly
trained GAN would imprint the semantics of the domain onto its
latent space, which was evidenced by experiments in which the
latent space was traversed and images of certain characteristics
could be produced by sampling accordingly. Also, it was found
that certain dimensions of the latent space can correspond to
certain features of the images, like hair color or glasses, so that
modifying them alone can add or take away such visible traits.
With the improved development of conditional GANs [32]
following a number of GANs that modeled the conditioning
input more explicitly, another approach was introduced that was
based on the U-Net architecture as a generator and a favorable
discriminator network that values local style over a full-image
assessment.
Technically, the formulation of a conditional GAN is straightforward. Recalling the value function (learning objective) of GANs from Eq. 5,

$$J(G, D) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_G}[\log(1 - D(G(z)))],$$

we now want to condition the generation on some additional knowledge or input. Consequently, both the generator G and the discriminator D will receive an additional "conditioning" input, which we call y. This can be a class label but also any other associated information. Very commonly, the additional input will be an image, as, for example, in image translation applications (e.g., transforming from one image modality to another such as, for instance, MRI to CT). The result is the cGAN objective function:

$$J_{\text{cGAN}}(G, D) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_G}[\log(1 - D(G(z \mid y)))] \qquad (11)$$
Fig. 17 A possible architecture for a cGAN. Left: the generator network takes the base image x as input and generates a translated image ŷ. The discriminator receives either this pair of images or a true pair (x, y) (right). The additional generator reconstruction loss (often an ℓ1 loss) is calculated between y and ŷ
⁹ https://github.com/phillipi/pix2pix.
4.2 CycleGAN

While cGANs require paired data for the gold standard and conditioning input, such data is often hard to come by, in particular in medical use cases. Therefore, the development of the CycleGAN set a milestone, as it alleviates this requirement and allows training image-to-image translation networks without paired input samples.
The basic idea in this architecture is to train two mapping
functions between two domains and to execute them in sequence
so that the resulting output is considered to be in the origin domain
again. The output is compared against the original input, and their
ℓ 1 or ℓ2 distance establishes a novel addition to the otherwise usual
adversarial GAN loss. This might conceptually remind one of the
autoencoder objectives: reproduce the input signal after encoding
and decoding; only this time, there is no bottleneck but another
interpretable image space. This can be exploited to stabilize the
training, since the sequential concatenation of image translation
functions, which we will call G and F, can be reversed. Figure 19
shows a schematic of the overall process (left) and one incarnation
of the cycle, here from image domain X to Y and back (middle).
CycleGANs employ several loss terms in training: two adver-
sarial losses J ðG, D Y Þ and J ðF , D X Þ and two cycle consistency
losses, of which one J cyc ðG, F Þ is indicated rightmost in Fig. 19.
Zhu et al. presented the initial publication, with participation of the cGAN author Isola [37]. The cycle consistency losses are ℓ1
losses in their implementation, and the GAN losses are least square
losses instead of negative log likelihood, since more stable training
was observed with this choice.
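The cycle consistency idea can be sketched with toy stand-ins for the translation networks G and F (the functions below are hypothetical placeholders, not trained models):

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    # J_cyc: translate x to the other domain and back (F(G(x))),
    # likewise for y, and penalize the l1 deviation from the originals.
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

# Toy "translation" functions standing in for trained networks:
G = lambda img: img + 1.0  # X -> Y (hypothetical)
F = lambda img: img - 1.0  # Y -> X (its exact inverse)

x = np.zeros((4, 4))  # stand-in image from domain X
y = np.ones((4, 4))   # stand-in image from domain Y
```

When F exactly inverts G the cycle loss is zero; any mismatch between the two mappings shows up directly in the loss, which is what stabilizes training without paired samples.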
Almahairi et al. provided an augmented version [38],
noting that the original implementation suffers from the inability
to generate stochastic results in the target domain Y but rather
learns a one-to-one mapping between X and Y and vice versa. To
Fig. 19 Cycle GAN. Left: image translation functions G and F convert between two domains. Discriminators DX
and DY give adversarial losses in both domains. Middle: for one concrete translation of an image x, the
translation to Y and back to X is depicted. Right: after the translation cycle, the original and back-translated
result are compared in the cycle consistency loss
Fig. 20 Cycle GAN with shape consistency loss (rightmost part of figure). Note that the figure shows only one
direction to ease readability
Cycle consistency losses J_cyc. This is the ℓ1 loss presented by the original CycleGAN authors discussed above.

Shape consistency losses J_shape. The shape consistency loss is a new addition proposed by the authors. A cross-correlation loss takes into account two segmentations, the first being the gold standard segmentation mx for an x ∈ X and the second produced by a segmenter network S that was trained on domain Y and receives the translated image ŷ = G(x).
4.3 StyleGAN and Successor

One of the most powerful image synthesis GANs to date is the successor of StyleGAN, StyleGAN2 [44, 45]. The authors, at the
time of writing researching at Nvidia, deviate from the usual GAN
approach in which an image is generated from a randomly sampled
vector from a latent space. Instead, they use a latent space that is created by a mapping function f, implemented in their architecture as a multilayer perceptron, which maps from a 512-dimensional space Z into a 512-dimensional space W. The
second major change consisted of the so-called adaptive instance
normalization layer, AdaIN, which implements a normalization to
zero-mean and unit variance of each feature map, followed by a
multiplicative factor and an additive bias term. This serves to
Fig. 21 StyleGAN architecture, after [44]. Learnable layers and transformations are shown in green, the AdaIN
function in blue
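The AdaIN operation described above can be sketched for a single feature map; in StyleGAN, scale and bias would be derived from the mapped latent vector, whereas here they are fixed for illustration:

```python
import numpy as np

def adain(feature_map, scale, bias, eps=1e-8):
    # Normalize one feature map (H x W) to zero mean and unit variance,
    # then apply a multiplicative factor and an additive bias (in
    # StyleGAN these are derived from the mapped latent vector w).
    normalized = (feature_map - feature_map.mean()) / (feature_map.std() + eps)
    return scale * normalized + bias

fmap = np.random.default_rng(0).normal(3.0, 2.0, size=(8, 8))
out = adain(fmap, scale=0.5, bias=1.0)
```

Whatever statistics the incoming feature map has, the output's mean and standard deviation are set by the injected scale and bias, which is how the "style" takes control at each resolution.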
details of the last layers. The architecture and results gained widespread attention through a website,¹⁰ which recently was followed up by further similar pages. Results are depicted in Fig. 22.
The crucial finding in StyleGAN was that the mapping function f transforming the latent space vector from Z to W serves to ensure a disentangled (flattened) latent space. Practically, this means that if interpolating points zi between two points z1 and z2 drawn from Z and reconstructing images from these interpolated points zi, semantic objects might appear (in a face-generating StyleGAN, for example, a hat or glasses) that are part of neither the image generated from the first point z1 nor that from the second point z2 between which it has been interpolated. Conversely, if interpolating in W,
this “semantic discontinuity” is no longer the case, as the authors
show with experiments in which they measure the visual change of
resulting images when traversing both latent spaces.
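Latent-space traversal of this kind boils down to linear interpolation between latent vectors (a sketch; in StyleGAN the interpolated points would additionally be pushed through the mapping network f to reach W):

```python
import numpy as np

def interpolate(z1, z2, steps):
    # Linear interpolation between two latent vectors; each row is one
    # intermediate point on the path from z1 to z2.
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - alphas) * z1 + alphas * z2

z1 = np.zeros(512)  # stand-ins for two sampled latent vectors
z2 = np.ones(512)
path = interpolate(z1, z2, steps=5)
```

Feeding each row of `path` to the generator produces the image sequence used in such traversal experiments; the disentanglement question is whether the semantics change smoothly along this path.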
In their follow-up publications, the same authors improve the
performance even further. They stick to the basic architecture but
redesign the generative network pertaining to the AdaIN function.
In addition, they add their metric from [44], meant to quantify the entanglement of the latent space, as a regularizer. The
discriminator network was also enhanced, and the mechanisms of
StyleGAN that implement the progressive growing have been suc-
cessively replaced by more performance-efficient setups. In their
experiments, they show a growth of visual and measured quality
and removal of several artifacts reported for StyleGAN [45].
4.4 Stabilized GAN for Few-Shot Learning

GAN training was very demanding both regarding GPU power, in particular for high-performance architectures like StyleGAN and StyleGAN2, and, as importantly, regarding availability of data. StyleGAN2, for example, has typical training times of about 10 days on an Nvidia
8-GPU Tesla V100. The datasets comprised at least tens of
thousands of images and easily orders of magnitude more. Particu-
larly in the medical domain, such richness of data is typically hard
to find.
¹⁰ https://thispersondoesnotexist.com/.
Fig. 23 The FastGAN generator network. Shortcut connections through feature map weighting layers (called
skip-layer excitation, SLE) transport information from low-resolution feature maps into high-resolution feature
maps. For details regarding the blocks, see text
Fig. 24 The FastGAN self-supervision mechanism of the discriminator network. Self-supervision manifests
through the loss term indicated by the curly bracket between reconstructions from feature maps and
resampled/cropped versions of the original real image, J recon
Fig. 25 FastGAN as implemented by the authors has been used to train a CT slice generative model. Images
are not cherry-picked, but arranged by similar anatomical regions
Fig. 26 The VQGAN+CLIP combination creates images from text inputs, here: “A
child drawing of a dark garden full of animals”
We have already seen how GANs were not the first approach to
image generation but have prevailed since they became computationally feasible and in consequence have been better
understood and improved to accomplish tasks in image analysis
and image generation with great success. In parallel with GANs,
other fundamentally different generative modeling approaches
have also been under continued development, most of which have
precursors from the “before-GAN” era as well. To give a compre-
hensive outlook, we will sketch in this last section the state of the art
of a selection of these approaches.11
11 The research on the so-called flow-based models, e.g., normalizing flows, has been omitted in this chapter,
though acknowledging their emerging relevance also in the context of image generation. Flow-based models are
built from sequences of invertible transformations, so that they learn data distributions explicitly at the expense of
sometimes higher computational costs due to their sequential architecture. When combined, e.g., with a powerful
GAN, they allow innovative applications, for example, to steer the exploration of a GAN’s latent space to achieve
fine-grained control over semantic attributes for conditional image generation. Interested readers are referred to
the literature [11, 13, 50–52].
186 Markus Wenzel
5.1 Diffusion and Score-Based Models

Diffusion models take a completely different approach to distribution
estimation. GANs implicitly represent the target distribution by
learning a surrogate distribution. Likelihood-based models like the
VAE approximate the target distribution explicitly, not requiring
the surrogate. In diffusion models, however, the gradient of the log
probability density function is estimated, instead of looking at the
distribution itself (which would be the unfathomable integral of the
gradient). This quantity is known as the Stein score function, leading
to the notion that diffusion models are one variant of score-based
models [53].
The simple idea behind this class of models is to reverse a
sequential noising process. Consider some image. Then, perform
a large number of steps: in each step, add a small amount of noise
from a known distribution, e.g., the normal distribution. Do this
until the result is indistinguishable from random noise.
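The forward noising process described above can be sketched in a few lines of NumPy; the linear noise schedule and the tiny 8 × 8 "image" are illustrative assumptions, not taken from a specific paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffusion(x0, betas):
    """Sequentially add Gaussian noise to x0 (variance-preserving form).

    Each step draws x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I), so the
    sequence drifts from the original image toward pure noise.
    """
    xs = [x0]
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        xs.append(x)
    return xs

x0 = np.ones((8, 8))                   # illustrative "image"
betas = np.linspace(1e-4, 0.2, 1000)   # assumed linear schedule
xs = forward_diffusion(x0, betas)
# After enough steps the result is roughly zero-mean, unit-variance noise.
```

After the loop, `xs[-1]` carries essentially no trace of `x0`, which is exactly the property the reverse (denoising) process exploits.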
The denoising process is then formulated as a latent variable
model, where T − 1 latents successively progress from a noise image
x_T ∼ N(x_T; 0, I) to the reconstruction that we call x_0 ∼ q(x_0). The
reconstructed image, x_0, is therefore obtained by a learned reverse
process p_θ(x_{0:T}). Note that each step in this chain can be evaluated
in closed form [54]. Several model implementations of this approach exist,
one being the denoising diffusion probabilistic model (DDPM). Here, a
deep neural network learns to perform one denoising step given the
so-far achieved image and a t ∈{1, . . ., T}. Iterative application of
the model to the result of the last iteration will eventually yield a
generated image from noise input.
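The iterative sampling loop can be sketched as follows; `predict_noise` stands in for the trained denoising network and is a hypothetical placeholder here, as are the schedule and image shape:

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    """Generate a sample by iteratively denoising pure noise.

    predict_noise(x, t) is assumed to estimate the noise component of x at
    step t; in a real model this is a trained deep network.
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                 # x_T: pure noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        # Mean of the reverse step given x_t and the noise estimate.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                  # no noise added at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)               # assumed schedule
dummy_model = lambda x, t: np.zeros_like(x)        # placeholder, not a trained net
sample = ddpm_sample(dummy_model, (8, 8), betas, rng)
```

With a trained `predict_noise`, repeated application of this update turns random noise into a sample from the learned image distribution.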
Autoregressive diffusion models (ARDMs) [55] follow yet
another conceptual model, roughly reminiscent of the PixelRNNs we
have briefly mentioned above (see Subheading 3.2). Both share
the approach of conditioning the prediction of the next pixel or pixels
on the already predicted ones. Unlike the PixelRNN, however,
the specific ARDM proposed by the authors does not rely on a
predetermined schedule of pixel updates, so that these models can
be categorized as latent variable models.
Of late, the general topic of score-based methods, among
which diffusion models are one variant, has received more attention in
the research community, fueled by a growing body of publications
that report image synthesis results outperforming GANs [53, 56,
57]. Score function-based and diffusion models superficially share
the concept of sequentially adding/removing noise but
achieve their objective by very different means: where score
function-based approaches are trained by score matching and
their sampling process uses Langevin dynamics [58], diffusion
models are trained using the evidence lower bound (ELBO) and
sample with a decoder, which is commonly a neural network.
Figure 27 visualizes an example of a score function.
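For a density made of two Gaussians, as in Fig. 27, the score has a closed form and can be evaluated directly; the mixture parameters below are illustrative:

```python
import numpy as np

def mixture_score(x, mus=(-2.0, 2.0), sigma=1.0, weights=(0.5, 0.5)):
    """Stein score d/dx log p(x) for a 1-D mixture of two Gaussians."""
    comps = np.array([w * np.exp(-0.5 * ((x - m) / sigma) ** 2)
                      for w, m in zip(weights, mus)])
    p = comps.sum(axis=0)        # unnormalized density (normalization cancels)
    dp = np.array([c * (m - x) / sigma ** 2
                   for c, m in zip(comps, mus)]).sum(axis=0)
    return dp / p                # d/dx log p = p'(x) / p(x)

x = np.linspace(-5, 5, 11)
scores = mixture_score(x)        # arrows point toward the nearest mode
```

Evaluated on a grid, the score is positive left of a mode and negative right of it, which is exactly the "arrow field" that Langevin-type samplers follow uphill toward high-density regions.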
Score function-based (sometimes also score-matching) generative
models have been developed to astounding quality levels, and
the recent works of Yang Song and others provide accessible blog
posts,12 and a comprehensive treatment of the subject in several
publications [53, 58, 59].
Fig. 27 The Stein score function can be conceived of as the gradient of the log probability density function,
here indicated by two Gaussians. The arrows represent the score function
In the work of Ho et al. [54], the stepwise reverse (denoising)
process is the basis of the denoising diffusion probabilistic models
(DDPM). The authors emphasize that a proper selection of the
noise schedule is crucial to fast, yet high-quality, results. They point
out that their work is a combination of diffusion probabilistic
models with score-matching models, in this combination also
generalizing and including the ideas of autoregressive denoising
models. In an extension of Ho et al.'s work [54] by Nichol and Dhariwal
[57], an importance sampling scheme was introduced that lets the
denoising process concentrate on the image elements that are easiest
to predict next. Equipped with this new addition, the authors can show that,
in comparison to GANs, a wider region of the target distribution is
covered by the generative model.
5.2 Transformer-Based Generative Models

The basics of how attention mechanisms and transformer architectures
work will be covered in the subsequent chapter on this
promising technology (Chapter 6). Attention-based models, predominantly
transformers, have been used successfully for some time
in sequential data processing and are now considered the superior
alternative to recurrent networks like long short-term memory
(LSTM) networks. Transformers have, however, only recently
made their way into the image analysis and now also the image
generation world. In this section, we will only highlight some
developments in the area of generative tasks.
12 https://yang-song.github.io/blog/.
Acknowledgements
References
[1] Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Proceedings of the 27th international conference on neural information processing systems - volume 2, NIPS'14. MIT Press, Cambridge, pp 2672–2680
[2] Casella G, Berger RL (2021) Statistical inference. Cengage Learning, Boston
[3] Grinstead C, Snell LJ (2006) Introduction to probability. Swarthmore College, Swarthmore
[4] Severini TA (2005) Elements of distribution theory, vol 17. Cambridge University Press, Cambridge
[5] Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press, Cambridge
[6] Murphy KP (2022) Probabilistic machine learning: an introduction. MIT Press, Cambridge. https://probml.ai
[7] Do CB, Batzoglou S (2008) What is the expectation maximization algorithm? Nat Biotechnol 26(8):897–899. https://doi.org/10.1038/nbt1406
[8] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc Ser B (Methodological) 39:1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
[9] van den Oord A, Kalchbrenner N, Kavukcuoglu K (2016) Pixel recurrent neural networks. ArXiv abs/1601.06759
[10] Magnusson K (2020) Understanding maximum likelihood: an interactive visualization. https://rpsychologist.com/likelihood/
[11] Rezende DJ, Mohamed S (2015) Variational inference with normalizing flows. In: ICML
[12] van den Oord A, Kalchbrenner N, Espeholt L, Kavukcuoglu K, Vinyals O, Graves A (2016) Conditional image generation with PixelCNN decoders. In: NIPS
[13] Dinh L, Sohl-Dickstein J, Bengio S (2017) Density estimation using Real NVP. ArXiv abs/1605.08803
[14] Salakhutdinov R, Hinton G (2009) Deep Boltzmann machines. In: van Dyk D, Welling M (eds) Proceedings of the twelfth international conference on artificial intelligence and statistics. Proceedings of Machine Learning Research, vol 5. PMLR, Hilton Clearwater Beach Resort, Clearwater Beach, Florida, USA, pp 448–455. https://proceedings.mlr.press/v5/salakhutdinov09a.html
[15] Weng L (2018) From autoencoder to Beta-VAE. lilianweng.github.io/lil-log. http://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html
[16] Kingma DP, Welling M (2014) Auto-encoding variational bayes. ArXiv 1312.6114
[17] Creswell A, White T, Dumoulin V, Arulkumaran K, Sengupta B, Bharath AA (2018) Generative adversarial networks: an overview. IEEE Signal Process Mag 35(1):53–65. https://doi.org/10.1109/MSP.2017.2765202
[18] Arjovsky M, Bottou L (2017) Towards principled methods for training generative adversarial networks. ArXiv abs/1701.04862
[19] Theis L, van den Oord A, Bethge M (2016) A note on the evaluation of generative models. CoRR abs/1511.01844
[20] Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. ArXiv abs/1511.06434
[41] Hoffman J, Tzeng E, Park T, Zhu JY, Isola P, Saenko K, Efros AA, Darrell T (2017) CyCADA: cycle-consistent adversarial domain adaptation. ArXiv 1711.03213
[42] Huo Y, Xu Z, Bao S, Assad A, Abramson RG, Landman BA (2018) Adversarial synthesis learning enables segmentation without target modality ground truth. In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pp 1217–1220. https://doi.org/10.1109/ISBI.2018.8363790
[43] Yang D, Xiong T, Xu D, Zhou SK (2020) Segmentation using adversarial image-to-image networks. In: Handbook of medical image computing and computer assisted intervention, pp 165–182. https://doi.org/10.1016/B978-0-12-816176-0.00012-0
[44] Karras T, Laine S, Aila T (2018) A style-based generator architecture for generative adversarial networks. IEEE Trans Pattern Anal Mach Intell 43:4217–4228. https://doi.org/10.1109/TPAMI.2020.2970919
[45] Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T (2020) Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 8107–8116. https://doi.org/10.1109/CVPR42600.2020.00813
[46] Liu B, Zhu Y, Song K, Elgammal A (2021) Towards faster and stabilized GAN training for high-fidelity few-shot image synthesis. In: International conference on learning representations. https://openreview.net/forum?id=1Fqg133qRaI
[47] Esser P, Rombach R, Ommer B (2021) Taming transformers for high-resolution image synthesis. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12868–12878. https://doi.org/10.1109/CVPR46437.2021.01268
[48] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. ArXiv 2103.00020
[49] van den Oord A, Vinyals O, Kavukcuoglu K (2017) Neural discrete representation learning. CoRR abs/1711.00937. http://arxiv.org/abs/1711.00937
[50] Weng L (2018) Flow-based deep generative models. lilianweng.github.io/lil-log. http://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
[51] Kingma DP, Dhariwal P (2018) Glow: generative flow with invertible 1x1 convolutions. ArXiv. https://doi.org/10.48550/ARXIV.1807.03039
[52] Abdal R, Zhu P, Mitra NJ, Wonka P (2021) StyleFlow: attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. ACM Trans Graph 40(3):1–21. https://doi.org/10.1145/3447648
[53] Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B (2021) Score-based generative modeling through stochastic differential equations. In: International conference on learning representations. https://openreview.net/forum?id=PxTIG12RRHS
[54] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. ArXiv 2006.11239
[55] Hoogeboom E, Gritsenko AA, Bastings J, Poole B, van den Berg R, Salimans T (2021) Autoregressive diffusion models. ArXiv 2110.02037
[56] Dhariwal P, Nichol A (2021) Diffusion models beat GANs on image synthesis. ArXiv. http://arxiv.org/abs/2105.05233
[57] Nichol A, Dhariwal P (2021) Improved denoising diffusion probabilistic models. ArXiv. http://arxiv.org/abs/2102.09672
[58] Song Y, Ermon S (2019) Generative modeling by estimating gradients of the data distribution. In: Advances in neural information processing systems, pp 11895–11907
[59] Song Y, Garg S, Shi J, Ermon S (2019) Sliced score matching: a scalable approach to density and score estimation. In: Proceedings of the thirty-fifth conference on uncertainty in artificial intelligence, UAI 2019, Tel Aviv, Israel, July 22–25, 2019, p 204. http://auai.org/uai2019/proceedings/papers/204.pdf
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 6
Abstract
Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly
adopted by most deep learning fields, including computer vision. They measure the relationships between
pairs of input tokens (words in the case of text strings, parts of images for visual transformers), termed
attention. The cost is quadratic in the number of tokens. For image classification, the most common
transformer architecture uses only the transformer encoder in order to transform the various input tokens.
However, there are also numerous other applications in which the decoder part of the traditional
transformer architecture is also used. Here, we first introduce the attention mechanism (Subheading 1) and then
the basic transformer block, including the vision transformer (Subheading 2). Next, we discuss some
improvements of visual transformers to account for small datasets or less computation (Subheading 3).
Finally, we introduce visual transformers applied to tasks other than image classification, such as detection,
segmentation, generation, and training without labels (Subheading 4), and to other domains, such as video or
multimodality using text or audio data (Subheading 5).
1 Attention
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_6,
© The Author(s) 2023
193
194 Robin Courant et al.
1.1 The History of Attention

This notion of context is the motivation behind the introduction of
the attention mechanism in 2015 [1]. Before this, language translation
mostly relied on encoder-decoder architectures: recurrent
neural networks (RNNs) [2] and in particular long short-term
memory (LSTM) networks were used to model the relationships
among words [3]. Specifically, each word of an input sentence is
processed by the encoder sequentially. At each step, the past and
present information are summarized and encoded into a fixed-length
vector. In the end, the encoder has processed every word
and outputs a final fixed-length vector, which summarizes all input
information. This final vector is then decoded and finally translates
the input information into the target language.
However, the main issue of such structure is that all the infor-
mation is compressed into one fixed-length vector. Given that the
sizes of sentences vary and as the sentences get longer, a fixed-
length vector is a real bottleneck: it gets increasingly difficult not to
lose any information in the encoding process due to the vanishing
gradient problem [1].
As a solution to this issue, Bahdanau et al. [1] proposed the
attention module in 2015. The attention module allows the model
to consider the parts of the sentence that are relevant to predicting
the next word. Moreover, this facilitates the understanding of
relationships among words that are further apart.
1 Note that in the literature, there are two main attention functions: additive attention [1] and dot-product
attention (Eq. 1). In practice, the dot product is more efficient since it is implemented using highly optimized
matrix multiplication, compared to the feed-forward network of the additive attention; hence, the dot product is
the dominant one.
1.3 Types of Attention

There exist two dominant types of attention mechanisms: self-attention
and cross attention [4]. In self-attention, the queries,
keys, and values come from the same input, i.e., X = Y; in cross
attention, the queries come from a different input than the key and
value vectors, i.e., X ≠ Y. These are described below in Subheadings
1.3.1 and 1.3.2, respectively.
1.3.2 Cross Attention

Most real-world data are multimodal: for instance, videos contain
frames, audio, and subtitles; images come with captions; etc.
Therefore, models that can deal with such types of multimodal
information have become essential.
Transformers and Visual Transformers 197
1.4 Variation of Attention

Attention is typically employed in two ways: (1) multi-head self-attention
(MSA, Subheading 1.4) and (2) masked multi-head
attention (MMA, Subheading 1.4).
Attention Head We call attention head the mechanism presented
in Subheading 1.2, i.e., query-key-value projection, followed by
scaled dot product attention (Eqs. 1 and 2).
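As a sketch, a single attention head can be written in a few lines of NumPy; the token count, dimensions, and random projection matrices below are illustrative, not part of any trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Y, Wq, Wk, Wv):
    """Query-key-value projection followed by scaled dot-product attention.

    Self-attention: call with Y = X.  Cross attention: Y != X.
    """
    Q, K, V = X @ Wq, Y @ Wk, Y @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # rows sum to 1
    return A @ V

N, d = 4, 8                                  # illustrative sizes
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attention_head(X, X, Wq, Wk, Wv)       # self-attention: Y = X
print(out.shape)  # (4, 8)
```

Multi-head attention simply runs several such heads with independent projections and concatenates their outputs.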
When employing an attention-based model, relying only on a
single attention head can inhibit learning. Therefore, the multi-
head attention block is introduced [4].
2 Visual Transformers
2.1.1 Embedding

The first step of transformers consists in converting input tokens2
into embedding tokens, i.e., vectors with meaningful features. To
do so, following standard practice [6], each input is projected into
an embedding space to obtain embedding tokens Z_e. The embedding
space is structured in such a way that the distance between a pair of
vectors is relative to the semantic similarity of their associated
words. For the initial NLP case, this means that we get a vector for
each word, such that vectors that are closer together have
similar meanings.
2 Note that the initial transformer architecture was proposed for natural language processing (NLP), and therefore the inputs were words.
2.1.2 Positional Encoding

As discussed in Subheading 1.5, the attention mechanism is position
agnostic, which means that it does not store information
on the position of each input. However, in most cases, the order of
input tokens is relevant and should be taken into account: for
instance, the order of words in a sentence matters, as it may change
the meaning. Therefore, [4] introduced the positional encoding
PE ∈ ℝ^{N × d_x}, which adds a positional token to each embedding
token Z_e ∈ ℝ^{N × d_x}.
Sinusoidal Positional Encoding

The sinusoidal positional encoding [4] is the main positional
encoding method, which encodes the position of each token with
sinusoidal waves of multiple frequencies. For an embedding token
Z_e ∈ ℝ^{N × d_x}, its positional encoding PE ∈ ℝ^{N × d_x} is defined as:

PE(i, 2j) = sin(i / 10000^{2j/d})
PE(i, 2j + 1) = cos(i / 10000^{2j/d}),   ∀(i, j) ∈ [1, N] × [1, d].   (7)
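Equation 7 translates directly into a small function; the token count and embedding dimension below are illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(n_tokens, d):
    """PE[i, 2j] = sin(i / 10000**(2j/d)), PE[i, 2j+1] = cos(i / 10000**(2j/d))."""
    i = np.arange(n_tokens)[:, None]        # token positions, column vector
    two_j = np.arange(0, d, 2)[None, :]     # the even indices 2j = 0, 2, 4, ...
    angle = i / 10000 ** (two_j / d)
    pe = np.empty((n_tokens, d))
    pe[:, 0::2] = np.sin(angle)             # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)             # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(50, 64)
print(pe.shape)  # (50, 64)
```

Each position thus receives a unique pattern of slow and fast oscillations, and relative offsets between positions correspond to fixed linear transformations of these patterns.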
Learnable Positional Encoding

An orthogonal approach is to let the model learn the positional
encoding. In this case, PE ∈ ℝ^{N × d_x} becomes a learnable parameter.
This, however, increases the memory requirements, without
necessarily bringing improvements over the sinusoidal encoding.
Positional Embedding

After its computation, the positional encoding PE is either added
to the embedding tokens or concatenated with them, as follows:

Z_pe = Z_e + PE, or
Z_pe = Concat(Z_e, PE),   (8)

where Concat denotes vector concatenation. Note that concatenation
has the advantage of not altering the information contained
in Z_e, since the positional information is only added to the unused
dimensions. Nevertheless, it augments the input dimension, leading
to higher memory requirements. Instead, the addition preserves
the same input dimension while altering the content of the
embedding tokens. When the input dimension is high, this content
alteration is trivial, as most of the content is preserved. Therefore, in
practice, for high dimensions, summing positional encodings is
preferred, whereas for low dimensions, concatenating them prevails.
2.1.3 Encoder Block

The encoder block takes as input the embedding and positional
tokens and outputs features of the input, to be decoded by the
decoder block. It consists of a stack of L multi-head self-attention
(MSA) layers and a multilayer perceptron (MLP). Specifically, the
embedding and positional tokens, Z_pe ∈ ℝ^{N × d_x}, go through a
multi-head self-attention block. Then, a residual connection with
layer normalization is deployed; in the transformer, this operation
is performed after each sub-layer. Next, we feed its output to an
MLP and a normalization layer. This operation is performed
L times, and each time the output of each encoder block (of size
N × d) is the input of the subsequent block. At the L-th time, the
output of the normalization is the input of the cross-attention
block in the decoder (Subheading 2.1.4).
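One encoder block can be sketched as follows; single-head attention, the sizes, and the random initialization are simplifications for illustration (a real block uses multi-head attention and learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(Z, Wq, Wk, Wv, W1, W2):
    """Self-attention and MLP sub-layers, each with residual + layer norm."""
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    Z = layer_norm(Z + attn)                 # residual connection + normalization
    mlp = np.maximum(Z @ W1, 0.0) @ W2       # two-layer MLP with ReLU
    return layer_norm(Z + mlp)

N, d = 6, 16                                 # illustrative sizes
Z = rng.standard_normal((N, d))
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
W1 = 0.1 * rng.standard_normal((d, 4 * d))   # expansion factor 4, as in [4]
W2 = 0.1 * rng.standard_normal((4 * d, d))
for _ in range(3):                           # a stack of L = 3 blocks
    Z = encoder_block(Z, Wq, Wk, Wv, W1, W2)
print(Z.shape)  # (6, 16): the token count and dimension are preserved
```

Because the output shape equals the input shape, blocks can be stacked arbitrarily deep, exactly as the text describes.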
2.1.4 Decoder Block

The decoder has two inputs: first, an input that constitutes the
queries Q ∈ ℝ^{N × d} of the encoder, and, second, the output of the
encoder that constitutes the key-value pair K, V ∈ ℝ^{N × d}. Similarly
to Subheadings 2.1.1 and 2.1.2, the first step consists in encoding
the output token into an output embedding token and an output
positional token. These tokens are fed into the main part of the decoder,
which consists of a stack of L masked multi-head self-attention
(MMSA) layers, multi-head cross-attention (MCA) layers, and
multilayer perceptrons (MLP) followed by normalizations. Specifically,
the embedding and positional tokens, Z_pe ∈ ℝ^{N × d_y}, go
through an MMSA block. Then, a residual connection with layer
normalization follows. Next, an MCA layer (followed by normalization)
maps the queries to the encoded key values before forwarding
the output to an MLP. Finally, we project the output of the
L decoder blocks (of dimension N × d_y) through a linear layer and
get output probabilities through a softmax layer.
2.2 Advantages of Transformers

Since their introduction, transformers have had a significant
impact on deep learning approaches.
In natural language processing (NLP), before transformers,
most architectures used to rely on recurrent modules, such as
RNNs [2] and in particular LSTMs [3]. However, recurrent models
process the input sequentially, meaning that, to compute the current
state, they require the output of the previous state. This makes
them tremendously inefficient, as they are impossible to parallelize.
On the contrary, in transformers, each input is processed independently
of the others, and multi-head attention can perform
multiple attention computations at once. This makes transformers
highly efficient, as they are highly parallelizable.
This results in not only exceptional scalability, both in the
complexity of the model and the size of datasets, but also relatively
fast training. Notably, the recent Switch Transformer [7] was pretrained
on 34 billion tokens from the C4 dataset [8], scaling the
model to over 1 trillion parameters.
This scalability [7] is the principal reason for the power of the
transformer. While it was originally introduced for translation, it
refrains from introducing many inductive biases, i.e., the set of
assumptions that the user makes about the structure of the model
input. In doing so, the transformer relies on data to learn how they
are structured. Compared to its counterparts with more biases, the
transformer requires much more data to produce comparable
results [5]. However, if a sufficient amount of data is available,
the lack of inductive bias becomes a strength. By learning the
structure of the data from the data itself, the transformer is able to
learn better, without human assumptions hindering it [9].
In most tasks involving transformers, the model is first pre-
trained on a large dataset and then fine-tuned for the task at hand
on a smaller dataset. The pretraining phase is essential for
2.3 Vision Transformer

Transformers offer an alternative to CNNs, which have long held a
stranglehold on computer vision. Before 2020, most attempts to
use transformers for vision tasks were still highly reliant on CNNs,
either by using self-attention jointly with convolutions [13, 14] or
by keeping the general structure of CNNs while using self-attention
[15, 16].
The reason for this is rooted in the two main weaknesses of
transformers. First, the complexity of the attention operation is
high: as attention is a quadratic operation, the amount of computation
skyrockets quickly when dealing with visual data, i.e.,
images, and even more so with videos. For instance, in the case
of ImageNet [17], inputting a single image with 256 × 256 = 65,536
pixels into an attention layer would be too heavy computationally.
Second, transformers suffer from a lack of inductive biases. Since
CNNs were specifically created for vision tasks, their architecture
includes spatial inductive biases, like translation equivariance and
locality. Therefore, transformers have to be pretrained on a
significantly large dataset to achieve similar performance.
The vision transformer (ViT) [5] is the first systematic
approach that uses transformers directly for vision tasks by addressing
both aforementioned issues. It gets rid of convolutions
altogether, using a purely transformer-based architecture. In
doing so, it achieves state-of-the-art image recognition performance on
various datasets, including ImageNet [17] and CIFAR-100 [18].
Figure 4 illustrates the ViT architecture. The input image is first
split into 16 × 16 patches, which are flattened and mapped to the expected
dimension through a learnable linear projection. Since the input
sequence is now a modest number of patch tokens rather than individual
pixels, the complexity of the attention mechanism is no longer a
bottleneck. Then, ViT encodes the positional
information and attaches a learnable embedding to the front of the
sequence, similarly to BERT's classification token [10]. The output
of this token represents the entirety of the input: it encodes the
information from each part of the input. Then, this sequence is fed
into an encoder block, with the same structure as in the standard
transformer [4]. The output of the classification token is finally fed
into an MLP that outputs class probabilities.
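The patch-splitting step can be sketched as a pure reshape; the 224 × 224 RGB image size is illustrative:

```python
import numpy as np

def patchify(img, p=16):
    """Split an H x W x C image into N = (H/p)*(W/p) flattened patch tokens."""
    h, w, c = img.shape
    img = img.reshape(h // p, p, w // p, p, c)
    img = img.transpose(0, 2, 1, 3, 4)       # (patch rows, patch cols, p, p, c)
    return img.reshape(-1, p * p * c)        # one row per patch

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768): 14 * 14 patches, each 16 * 16 * 3 values
```

Each of the 196 rows is then mapped by the learnable linear projection into the transformer's embedding dimension.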
Fig. 4 The vision transformer architecture (ViT). First, the input image is split into patches (bottom), which are
linearly projected (embedding), and then concatenated with positional embedding tokens. The resulting tokens
are fed into a transformer, and finally the resulting classification token is passed through an MLP to compute
output probabilities. Figure inspired from [5]
3.1 Data Efficiency

As discussed in Subheading 2.3, the vision transformer (ViT) [5] is
pretrained on a massive proprietary dataset (JFT-300M), which
contains 300 million labeled images. This need arises with transformers
because the inductive biases are removed from the architecture,
compared to convolution-based networks. Indeed,
convolutions provide some translation equivariance. ViT does not
benefit from this property and thus has to learn such biases, requiring
more data. JFT-300M is an enormous dataset, and to make ViT
work in practice, better data efficiency is needed. Indeed, collecting
that amount of data is costly and can be infeasible for most tasks.
Data-Efficient Image Transformers (DeiT) [21]

The first work to achieve improved data efficiency is DeiT [21]. The main idea
of DeiT is to distill the inductive biases from a CNN into a transformer
(Fig. 5). DeiT adds another token that works similarly to the
class token. When training, ground truth labels are used to train the
network according to the class token output with a cross-entropy
(CE) loss. However, for the distillation token, the output labels
are compared to the labels provided by a teacher network with a
cross-entropy loss:

L = (1/2) CE(Ψ(Z_class), y) + (1/2) CE(Ψ(Z_distill), y_T),

with Ψ the softmax function, Z_class the class token output, Z_distill the
distillation token output, y the ground truth label, and y_T the teacher label
prediction.
Fig. 5 The DeiT architecture. The architecture features an extra token, the distillation token. This token is used similarly to the class token. Figure inspired from [21]
The teacher network is a Convolutional Neural Network
(CNN). The main idea is that the distillation head will provide
the inductive biases needed to improve the data efficiency of the
architecture. By doing this, DeiT achieves remarkable performance
on the ImageNet dataset, by training “only” on ImageNet-1K
[17], which contains 1.3 million images.
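The two-term training objective described above can be sketched in NumPy; the batch size, class count, and logits are illustrative, and this follows the hard-distillation variant in which the teacher's argmax prediction serves as a label:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def deit_hard_distillation_loss(z_class, z_distill, y, y_teacher):
    """Average of two cross-entropy terms: class token vs. ground truth,
    distillation token vs. the (hard) teacher prediction."""
    ce_class = -log_softmax(z_class)[np.arange(len(y)), y].mean()
    ce_distill = -log_softmax(z_distill)[np.arange(len(y_teacher)), y_teacher].mean()
    return 0.5 * (ce_class + ce_distill)

# Illustrative batch of 2 samples and 5 classes.
z_class = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                    [0.1, 2.0, 0.1, 0.1, 0.1]])   # class token logits
z_distill = z_class.copy()                        # distillation token logits
y = np.array([0, 1])                              # ground truth labels
y_teacher = np.array([0, 1])                      # hard labels from the CNN teacher
loss = deit_hard_distillation_loss(z_class, z_distill, y, y_teacher)
```

Because the second term pulls the distillation token toward the CNN teacher's decisions, the transformer inherits some of the teacher's convolutional inductive biases during training.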
Fig. 6 Compact convolutional transformers. This architecture features a convolutional-based patch extraction
to leverage a smaller transformer network, leading to higher data efficiency. Figure inspired from [23]
3.2 Computational Efficiency

The vision transformer architecture (Subheading 2.3) suffers from
an O(n²) complexity with respect to the number of tokens. When
considering small-resolution images or a big patch size, this is not a
limitation; for instance, for an image of 224 × 224 resolution with
16 × 16 patches, this amounts to 196 tokens. However, when one
needs to process larger images (for instance, 3D images in medical
imaging) or smaller patches, using and training
such models becomes prohibitive. For instance, in tasks such as
segmentation or image generation, representations more granular
than 16 × 16 patches are needed; hence, it is crucial
to solve this issue to enable more applications of vision transformers.
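The numbers in this paragraph follow directly from the token count; a quick calculation shows the quadratic growth (the 8-pixel patch size below is an illustrative comparison point):

```python
def n_tokens(height, width, patch):
    """Number of patch tokens for an image split into patch x patch tiles."""
    return (height // patch) * (width // patch)

n_small = n_tokens(224, 224, 16)   # 196 tokens, as in the text
n_big = n_tokens(224, 224, 8)      # 784 tokens: halving the patch size...
ratio = (n_big / n_small) ** 2     # ...multiplies the O(n^2) attention cost by 16
print(n_small, n_big, ratio)
```

Halving the patch side quadruples the token count and thus multiplies the attention cost by sixteen, which is why granular or 3D inputs quickly become prohibitive.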
Fig. 7 Shifting operation in the Swin transformer [24]. Between each attention operation, the attention window
is shifted so that each patch can communicate with different patches than before. This allows the network to
gain more global knowledge with the network’s depth. Figure inspired from [24]
Fig. 8 The perceiver architecture [25, 26]. A set of latent tokens retrieve information from the image through
cross attention. Self-attention is performed between the tokens to refine the learned representation. These
operations are linear with respect to the number of image tokens. Figure inspired from [25, 26]
4.1 Object Detection with Transformers

Detection is one of the early tasks that have seen improvements
thanks to transformers. Detection is a combined recognition and
localization problem; this means that a successful detection system
should both recognize whether an object is present in an image and
localize it spatially in the image. Carion et al. [14] propose the first
approach that uses transformers for detection.
4.2 Image Segmentation with Transformers

The goal of image segmentation is to assign to each pixel of an image
the label of the object it belongs to. The segmenter [27] is a purely
ViT-based approach addressing image segmentation. The idea is to first use
ViT to encode the image. Then, the segmenter learns a token per
³ Note that, in DETR, the transformer is not directly used to extract the visual representation. Instead, it focuses on refining the visual representation to extract the object information.
210 Robin Courant et al.
Fig. 9 The DETR architecture. It refines a CNN visual representation to extract object localization and classes.
Figure inspired from [14]
Fig. 10 The segmenter architecture. It is a purely ViT-based approach to perform semantic segmentation.
Figure inspired from [27]
semantic label. The encoded patch tokens and the semantic tokens are then fed to a second transformer. Finally, by computing the scalar product between the semantic tokens and the image tokens, the network assigns a label to each patch. Figure 10 illustrates this process.
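The final labeling step, a scalar product between class (semantic) tokens and patch tokens followed by an argmax, can be sketched as follows. Shapes and names are illustrative, not the segmenter's actual code:

```python
import numpy as np

def assign_labels(patch_tokens, class_tokens):
    """Per-patch label = argmax of the scalar products with the class tokens."""
    scores = patch_tokens @ class_tokens.T   # (num_patches, num_classes)
    return scores.argmax(axis=-1)

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((196, 64))  # output of the second transformer
class_tokens = rng.standard_normal((21, 64))   # one learned token per semantic label
labels = assign_labels(patch_tokens, class_tokens)
print(labels.shape)  # (196,)
```

Each of the 196 patches receives one of the 21 labels; upsampling these patch-level decisions to pixel resolution yields the final segmentation map.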
4.3 Training Transformers Without Labels

Visual transformers have initially been trained for classification tasks. However, this task requires access to massive amounts of labeled data, which can be hard to obtain (as discussed in Subheading 3.1). Subheadings 3.1 and 3.2 present ways to train ViTs more efficiently. However, it would also be interesting to be able to train this type of network with "cheaper" data. Therefore, the goal of this part is to introduce unsupervised learning with transformers, i.e., training transformers without any labels.
Transformers and Visual Transformers 211
Fig. 11 The DINO training procedure. It consists in matching the outputs of two networks (p1 and p2) that receive two different augmentations (X1 and X2) of the same image (X) as input. The parameters of the teacher model are updated with an exponential moving average (ema) of the student parameters. Figure inspired from [28]
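The ema teacher update can be sketched as follows. This is a schematic illustration with plain Python lists standing in for network parameters; the momentum value is illustrative:

```python
# Exponential moving average (ema): each teacher parameter slowly
# tracks the corresponding student parameter.

def ema_update(teacher_params, student_params, momentum=0.9):
    """teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [1.0, 2.0]
student = [0.0, 0.0]
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)  # [0.9, 1.8]
```

With momentum close to 1 the teacher changes slowly, which stabilizes the targets the student is trained to match.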
Fig. 12 DINO samples. Visualization of the attention matrix of ViT heads trained with DINO. The ViT discovers
the semantic structure of an image in an unsupervised way
Fig. 13 The MAE training procedure. After masking some tokens of an image, the remaining tokens are fed to
an encoder. Then a decoder tries to reconstruct the original image from this representation. Figure inspired
from [29]
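The random masking step can be sketched as follows. This is an illustrative numpy version; MAE's actual implementation differs in details such as batching and index bookkeeping:

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of tokens; return kept tokens and their indices."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep = rng.permutation(n)[:n_keep]
    return tokens[keep], keep

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))          # patch tokens of one image
visible, keep_idx = random_masking(tokens, mask_ratio=0.75, rng=rng)
print(visible.shape)  # (49, 64)
```

Only the 25% of visible tokens go through the (large) encoder, which is what makes this pretraining scheme cheap; the decoder then reconstructs the full image from the encoded visible tokens plus mask tokens.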
4.4 Image Generation with Transformers and Attention

Attention and vision transformers have also helped in developing fresh ideas and creating new architectures for generative models, in particular for generative adversarial networks (GANs).
GANsformers [30]. GANsformers are the most representative work combining GANs with transformers, as they rely on a hybrid architecture using both attention and CNNs. The GANsformer architecture is illustrated in Fig. 14. The model first splits the latent vector of a GAN into multiple tokens. Then, a cross-attention mechanism is used to improve the generated feature maps, while, at the same time, the GANsformer architecture retrieves information from the generated feature map to enrich the tokens. This mechanism gives the GAN better and richer semantic knowledge, which is shown to be useful for generating multimodal images.
Fig. 14 The GANsformer architecture. Latent tokens and generated image feature maps exchange information through two cross-attention operations, while convolutions produce the image
5.1.2 CLIP

Connecting Text and Images (CLIP) [35] is designed to address two major issues of deep learning models: costly datasets and inflexibility. While most deep learning models are trained on labeled datasets, CLIP is trained on 400 million text-image pairs scraped from the Internet. This removes the labor of manually labeling the millions of images required to train powerful deep learning models. Models trained on one specific dataset also tend to be difficult to extend to other applications. For instance, the accuracy of a model trained on ImageNet is generally limited to its own dataset and cannot be applied to real-world problems. To optimize training, CLIP models learn to perform a wide variety of tasks during pretraining, which allows for zero-shot transfer to many existing datasets. While there is still room for improvement, this approach is competitive with supervised models trained on specific datasets.
CLIP Architecture and Training

CLIP is used to measure the similarity between the text input and the image generated from a latent vector. At the core of the approach is the idea of learning perception from the supervision contained in natural language. Methods that work on natural language can learn passively from the supervision contained in the vast amount of text on the Internet.
Given a batch of N (image, text) pairs, CLIP is trained to
predict which of the N × N possible (image, text) pairings across a
batch actually occurred. To do this, CLIP learns a multimodal
embedding space by jointly training an image encoder and a text
encoder to maximize the cosine similarity of the image and text embeddings of the matching pairs.
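This symmetric contrastive objective can be sketched in numpy as follows. This is an illustrative, non-official version; the temperature value and shapes are placeholders:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix:
    the i-th image should match the i-th text, and vice versa."""
    # L2-normalize so that dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) pairwise similarities

    def cross_entropy(l):
        # The correct class for row i is column i (the diagonal of the matrix)
        log_softmax = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_softmax).mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.standard_normal((8, 64)),
                             rng.standard_normal((8, 64)))
print(loss > 0)
```

The N matching pairs sit on the diagonal of the similarity matrix, while the N² − N off-diagonal pairings act as negatives.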
5.1.3 DALL-E and DALL-E 2

DALL-E [38] is another example of the application of transformers in vision. It generates images from a natural language prompt—some examples include "an armchair in the shape of an avocado" and "a penguin made of watermelon." It uses a decoder-only model similar to GPT-3 [39]. DALL-E has 12 billion parameters and is pretrained on Conceptual Captions [34], with over 3.3 million text-image pairs. DALL-E 2 [40] is the upgraded version of DALL-E, based on diffusion models and CLIP (Subheading 5.1.2), and achieves better performance with more realistic and accurate generated images. In addition to producing more realistic results at a higher resolution than DALL-E, DALL-E 2 is also able to edit the outputs. Indeed, with DALL-E 2, one can realistically add or remove an element in the output and can also generate different variations of the same output. These two models clearly demonstrate the power and scalability of transformers, which are capable of efficiently processing a web-scale amount of data.
⁴ [SOS], start of sequence; [EOS], end of sequence.
5.1.4 Flamingo

Flamingo [41] is a visual language model (VLM) tackling a wide range of multimodal tasks based on few-shot learning. It is an adaptation of large language models (LLMs) handling an extra visual modality, with 80B parameters.
Flamingo consists of three main components: a vision encoder,
a perceiver resampler, and a language model. First, to encode
images or videos, a vision convolutional encoder [42] is pretrained
in a contrastive way, using image and text pairs.⁵ Then, inspired by
the perceiver architecture [25] (detailed in Subheading 1.3.2), the
perceiver resampler takes a variable number of encoded visual fea-
tures and outputs a fixed-length latent code. Finally, this visual
latent code conditions the language model by querying language
tokens through cross-attention blocks. Those cross-attention
blocks are interleaved with pretrained and frozen language model
blocks.
The whole model is trained using three different kinds of
datasets without annotations (text with image content from web-
pages [41], text and image pairs [41, 43], and text and video pairs
[41]). Once the model is trained, it is fine-tuned using few-shot
learning techniques to tackle specific tasks.
⁵ The text is encoded using a pretrained BERT model [10].
5.2.1 General Principle

Generally, inputs of video transformers are RGB video clips X ∈ ℝ^(F × H × W × 3), with F frames of size H × W.

To begin with, video transformers split the input video clip X into ST tokens x_i ∈ ℝ^K, where S and T are, respectively, the number of tokens along the spatial and temporal dimensions and K is the size of a token.

To do so, the simplest method extracts nonoverlapping 2D patches of size P × P from each frame [5], as used in TimeSformer [50]. This results in S = HW/P², T = F, and K = P².
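For illustration, these formulas can be evaluated for a hypothetical clip of 8 frames at 224 × 224 with 16 × 16 patches (following the convention K = P² stated above):

```python
# Token counts for 2D-patch extraction from a video clip.
# Illustrative values: F frames of H x W pixels, patches of side P.
F, H, W, P = 8, 224, 224, 16

S = (H * W) // (P * P)   # spatial tokens per frame
T = F                    # one temporal index per frame
K = P * P                # size of a raw token

print(S, T, S * T)  # 196 8 1568
```

Even a short 8-frame clip already yields 1568 tokens, eight times the 196 tokens of a single image, which motivates the efficient attention schemes discussed next.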
However, there exist more elegant and efficient token extraction methods for videos. For instance, in ViViT [49], the authors propose to extract 3D volumes from videos (implying T ≠ F) to capture spatiotemporal information within tokens. In TokenLearner [47], the authors propose a learnable token extractor that selects the most important parts of the video.
Once raw tokens x_i are extracted, transformer architectures aim to map them into d-dimensional embedding vectors Z ∈ ℝ^(ST × d) using a linear embedding E ∈ ℝ^(d × K):

Z = [z_cls, Ex_1, Ex_2, . . ., Ex_ST] + PE,    (10)

where z_cls ∈ ℝ^d is a classification token that encodes information
5.2.2 Full Space-Time Attention

As described in [49, 50], the full space-time attention mechanism is the most basic and direct spatiotemporal attention mechanism. As shown in Fig. 15, it consists in computing self-attention across all pairs of extracted tokens.

This method results in a heavy complexity of O(LS²T²) [49, 50]. This quadratic complexity quickly becomes memory-consuming, especially when considering videos. Therefore, using the full space-time attention mechanism is impractical [50].
5.2.3 Divided Space-Time Attention

A smarter and more efficient way to compute spatiotemporal attention is the divided space-time attention mechanism, first described in [50].

As shown in Fig. 16, it relies on computing spatial and temporal attention separately in each transformer layer. Indeed, we first compute the spatial attention, i.e., self-attention within each temporal index, and then the temporal attention, i.e., self-attention across all temporal indices.
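The reshaping at the heart of this scheme can be sketched in numpy: attention is batched over temporal indices for the spatial step and over spatial locations for the temporal step. This is an illustrative sketch with unprojected tokens, not TimeSformer's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z):
    """Batched self-attention over the second-to-last axis of z."""
    scores = z @ np.swapaxes(z, -1, -2) / np.sqrt(z.shape[-1])
    return softmax(scores) @ z

S, T, d = 196, 8, 32  # spatial tokens, temporal indices, embedding dim
z = np.random.default_rng(0).standard_normal((T, S, d))

# Spatial attention: self-attention within each temporal index
# (T independent batches of S tokens, cost ~ T * S^2)
z = self_attention(z)
# Temporal attention: self-attention across all temporal indices
# for each spatial location (S batches of T tokens, cost ~ S * T^2)
z = self_attention(np.swapaxes(z, 0, 1))
print(z.shape)  # (196, 8, 32)
```

The combined cost scales as T·S² + S·T² rather than the S²·T² of full space-time attention, which is what makes the divided scheme tractable for videos.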
5.2.4 Cross-Attention Bottlenecks

An even more refined way to reduce the computational cost of attention consists of using cross attention as a bottleneck. For instance, as shown in Fig. 17 and mentioned in Subheading 3.2, the perceiver [25] projects the extracted tokens x_i into a
5.2.5 Factorized Encoder

Lastly, the factorized encoder [49] architecture is the most efficient with respect to the complexity/performance trade-off.

As in divided space-time attention, the factorized encoder aims to compute spatial and temporal attention separately. Nevertheless, as shown in Fig. 18, instead of mixing spatiotemporal tokens in each transformer layer, there exist two separate encoders:
⁶ In practice, N ≤ 512 for the perceiver [25], against ST = 16 × 16 × (32/2) = 4096 for ViViT-L [49].
Fig. 18 Factorized encoder mechanism. First, a spatial transformer processes input tokens along the spatial
dimension. Then, a temporal transformer processes the resulting spatial embedding along the temporal
dimension. Figure inspired from [25]
5.3.1 TimeSformer

TimeSformer [50] is one of the first architectures with space-time attention that impacted the video classification field. It follows the structure and principle described in Subheading 5.2.1.

First, it takes as input an RGB video clip sampled at a rate of 1/32 and decomposed into 2D 16 × 16 patches.

As shown in Fig. 19, the TimeSformer architecture is based on the ViT architecture (Subheading 2.3), with 12 12-headed MSA layers. However, the added value compared to the ViT is that TimeSformer uses the divided space-time attention mechanism (Subheading 5.2.3). This attention mechanism captures high-level spatiotemporal features while taming the complexity of the model. Moreover, the authors introduce three variants of the architecture: (i) TimeSformer, the standard version of the model, which operates on 8 frames of 224 × 224; (ii) TimeSformer-HR, a configuration with high spatial resolution, which operates on 16 frames of 448 × 448; and (iii) TimeSformer-L, a long temporal range setup, which operates on 96 frames of 224 × 224.
5.3.2 ViViT

ViViT [49] is the main extension of the ViT [5] architecture (Subheading 2.3) for video classification.

First, the authors use a 16 × 16 × 2 tubelet embedding instead of a 2D patch embedding, as mentioned in Subheading 5.2.1. This alternative embedding method aims to capture the spatiotemporal
Fig. 20 ViViT architecture. ViViT first projects the input to embedding tokens, which are summed with positional embedding tokens. The resulting tokens are passed first through Ls spatial attention blocks and then through Lt temporal attention blocks. The resulting output is linearly projected to obtain output probabilities
⁷ ViT-B: 12 12-headed MSA layers; ViT-L: 24 16-headed MSA layers; and ViT-H: 32 16-headed MSA layers.
5.4 Multimodal Video Transformers

Nowadays, one of the main gaps between artificial and human intelligence is our ability to process multimodal signals and to enrich the analysis by mixing the different modalities. Moreover, until recently, deep learning models have focused mostly on very specific visual tasks, typically based on a single modality, such as image classification [5, 17, 18, 56, 57], audio classification [25, 52, 58, 59], and machine translation [10, 60–63]. These two factors combined have pushed researchers to take up multimodal challenges.

The default solution for multimodal tasks consists in first creating an individual model (or network) per modality and then fusing the resulting single-modal features together [64, 65]. Yet, this approach fails to model interactions or correlations among different modalities. However, the recent rise of attention [4, 5, 49] is promising for multimodal applications, since attention performs very well at combining multiple inputs [25, 52, 66, 67].
Here, we present two main ways of dealing with several modalities:

1. Concatenating tokens from different modalities into one vector [25, 66]. The multimodal video transformer (MM-ViT) [66] combines raw RGB frames, motion features, and audio spectrograms for video action recognition. To do so, the authors fuse tokens from all the different modalities into a single input embedding and pass it through transformer layers. However, a drawback of this method is that it fails to distinguish well one modality from another. To overcome this issue, the authors of the perceiver [25] propose to learn a modality embedding in addition to the positional embedding (see Subheadings 3.2 and 5.2.1). This allows associating each token
6 Conclusion
References
1. Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: International conference on learning representations
2. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Empirical methods in natural language processing, association for computational linguistics, pp 1724–1734
3. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, vol 27
4. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30
5. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16 × 16 words: transformers for image recognition at scale. In: International conference on learning representations
6. Press O, Wolf L (2017) Using the output embedding to improve language models. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics: volume 2, short papers, association for computational linguistics, pp 157–163
7. Fedus W, Zoph B, Shazeer N (2021) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:210103961
8. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21(140):1–67
9. Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M (2021) Transformers in vision: a survey. ACM Comput Surv 24:200
10. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), association for computational linguistics, pp 4171–4186
11. Wikimedia Foundation (2019) Wikimedia downloads. https://dumps.wikimedia.org
12. Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A, Fidler S (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: Proceedings of the international conference on computer vision, pp 19–27
13. Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7794–7803
14. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: Proceedings of the European conference on computer vision. Springer, Berlin, pp 213–229
15. Ramachandran P, Parmar N, Vaswani A, Bello I, Levskaya A, Shlens J (2019) Stand-alone self-attention in vision models. In: Advances in neural information processing systems, vol 32
16. Wang H, Zhu Y, Green B, Adam H, Yuille A, Chen LC (2020) Axial-deeplab: stand-alone axial-attention for panoptic segmentation. In: Proceedings of the European conference on computer vision. Springer, Berlin, pp 108–126
17. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 248–255
18. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Tech rep, University of Toronto, Toronto, ON
19. Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the international conference on computer vision, pp 843–852
20. Selva J, Johansen AS, Escalera S, Nasrollahi K, Moeslund TB, Clapés A (2022) Video transformers: a survey. arXiv preprint arXiv:220105991
21. Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2021) Training data-efficient image transformers & distillation through attention. In: International conference on machine learning. PMLR, pp 10347–10357
22. d'Ascoli S, Touvron H, Leavitt ML, Morcos AS, Biroli G, Sagun L (2021) Convit: improving vision transformers with soft convolutional inductive biases. In: International conference on machine learning. PMLR, pp 2286–2296
23. Hassani A, Walton S, Shah N, Abuduweili A, Li J, Shi H (2021) Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:210405704
24. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the international conference on computer vision
25. Jaegle A, Gimeno F, Brock A, Vinyals O, Zisserman A, Carreira J (2021) Perceiver: general perception with iterative attention. In: International conference on machine learning. PMLR, pp 4651–4664
26. Jaegle A, Borgeaud S, Alayrac JB, Doersch C, Ionescu C, Ding D, Koppula S, Zoran D, Brock A, Shelhamer E et al (2021) Perceiver IO: a general architecture for structured inputs & outputs. arXiv preprint arXiv:210714795
27. Strudel R, Garcia R, Laptev I, Schmid C (2021) Segmenter: transformer for semantic segmentation. In: Proceedings of the international conference on computer vision, pp 7262–7272
28. Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, Joulin A (2021) Emerging properties in self-supervised vision transformers. In: Proceedings of the international conference on computer vision
29. He K, Chen X, Xie S, Li Y, Dollár P, Girshick R (2021) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:211106377
30. Hudson DA, Zitnick L (2021) Generative adversarial transformers. In: International conference on machine learning. PMLR, pp 4487–4499
31. Zhang B, Gu S, Zhang B, Bao J, Chen D, Wen F, Wang Y, Guo B (2022) Styleswin: transformer-based GAN for high-resolution image generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition
32. Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in neural information processing systems, vol 32
33. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, vol 28
34. Sharma P, Ding N, Goodman S, Soricut R (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of ACL
35. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR, pp 8748–8763
36. He X, Peng Y (2017) Fine-grained image classification via combining vision and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5994–6002
37. Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), association for computational linguistics, pp 1715–1725
38. Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: International conference on machine learning. PMLR, pp 8821–8831
39. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A et al (2020) Language models are few-shot learners. In: Advances in neural information processing systems, vol 33, pp 1877–1901
40. Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125
41. Alayrac JB, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M et al (2022) Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:220414198
42. Brock A, De S, Smith SL, Simonyan K (2021) High-performance large-scale image recognition without normalization. In: International conference on machine learning. PMLR, pp 1059–1071
43. Jia C, Yang Y, Xia Y, Chen YT, Parekh Z, Pham H, Le Q, Sung YH, Li Z, Duerig T (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. PMLR, pp 4904–4916
44. Epstein D, Vondrick C (2021) Learning goals from failure. In: Proceedings of the IEEE conference on computer vision and pattern recognition
45. Marin-Jimenez MJ, Kalogeiton V, Medina-Suarez P, Zisserman A (2019) Laeo-net: revisiting people looking at each other in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition
46. Nagrani A, Yang S, Arnab A, Jansen A, Schmid C, Sun C (2021) Attention bottlenecks for multimodal fusion. In: Advances in neural information processing systems
47. Ryoo M, Piergiovanni A, Arnab A, Dehghani M, Angelova A (2021) Tokenlearner: adaptive space-time tokenization for videos. In: Advances in neural information processing systems, vol 34
48. Hara K, Kataoka H, Satoh Y (2018) Can spatiotemporal 3d CNNs retrace the history of 2d CNNs and imagenet? In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6546–6555
49. Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C (2021) Vivit: a video vision transformer. In: Proceedings of the international conference on computer vision, pp 6836–6846
50. Bertasius G, Wang H, Torresani L (2021) Is space-time attention all you need for video understanding? In: International conference on machine learning
51. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:160706450
52. Nagrani A, Yang S, Arnab A, Jansen A, Schmid C, Sun C (2021) Attention bottlenecks for multimodal fusion. In: Advances in neural information processing systems, vol 34
53. Feichtenhofer C, Fan H, Malik J, He K (2019) Slowfast networks for video recognition. In: Proceedings of the international conference on computer vision, pp 6202–6211
54. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308
55. Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P et al (2017) The kinetics human action video dataset. arXiv preprint arXiv:170506950
56. Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, Zhang L (2021) CvT: introducing convolutions to vision transformers. In: Proceedings of the international conference on computer vision, pp 22–31
57. Touvron H, Cord M, Sablayrolles A, Synnaeve G, Jégou H (2021) Going deeper with image transformers. In: Proceedings of the international conference on computer vision, pp 32–42
58. Gemmeke JF, Ellis DP, Freedman D, Jansen A, Lawrence W, Moore RC, Plakal M, Ritter M (2017) Audio set: an ontology and human-labeled dataset for audio events. In: IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 776–780
59. Gong Y, Chung YA, Glass J (2021) AST: audio spectrogram transformer. In: Proceedings of interspeech 2021, pp 571–575
60. Bojar O, Buck C, Federmann C, Haddow B, Koehn P, Leveling J, Monz C, Pecina P, Post M, Saint-Amand H et al (2014) Findings of the 2014 workshop on statistical machine translation. In: Proceedings of the 9th workshop on statistical machine translation, pp 12–58
61. Liu X, Duh K, Liu L, Gao J (2020) Very deep transformers for neural machine translation. arXiv preprint arXiv:200807772
62. Edunov S, Ott M, Auli M, Grangier D (2018) Understanding back-translation at scale. In: Empirical methods in natural language processing, association for computational linguistics, pp 489–500
63. Lin Z, Pan X, Wang M, Qiu X, Feng J, Zhou H, Li L (2020) Pre-training multilingual neural machine translation by leveraging alignment information. In: Empirical methods in natural language processing, association for computational linguistics, pp 2649–2663
64. Owens A, Efros AA (2018) Audio-visual scene analysis with self-supervised multisensory features. In: Proceedings of the European conference on computer vision, pp 631–648
65. Ramanishka V, Das A, Park DH, Venugopalan S, Hendricks LA, Rohrbach M, Saenko K (2016) Multimodal video description. In: Proceedings of the ACM international conference on multimedia, pp 1092–1096
66. Chen J, Ho CM (2022) Mm-vit: multi-modal video transformer for compressed video action recognition. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1910–1921
67. Narasimhan M, Rohrbach A, Darrell T (2021) Clip-it! language-guided video summarization. In: Advances in neural information processing systems, vol 34
68. Liu ZS, Courant R, Kalogeiton V (2022) FunnyNet: audiovisual learning of funny moments in videos. In: Proceedings of the Asian conference on computer vision. Springer, Macau, China, pp 3308–3325
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution,
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other
third-party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Part II
Data
Chapter 7
Abstract
The clinical evaluation of brain diseases strictly depends on the patient's complaint and the observation of their behavior. The specialist, often the neurologist, chooses whether and how to assess cognition, the motor system, sensory perception, and the autonomic nervous system. They may also decide to request a more in-depth examination, such as neuropsychological and language assessments and imaging or laboratory tests. From the synthesis of all these results, they will be able to make a diagnosis. The neuropsychological assessment in particular is based on the collection of medical history, on clinical observation, and on the administration of standardized cognitive tests validated in the scientific literature. It is therefore particularly useful when a neurological disease with cognitive and/or behavioral manifestations is suspected. The introduction of machine learning methods in neurology represents an important added value to the evaluation performed by the clinician, helping to increase diagnostic accuracy, track disease progression, and assess treatment efficacy.
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_7,
© The Author(s) 2023
234 Stéphane Epelbaum and Federica Cacciamani
1.2 Peculiarities of Clinical Assessment of Brain Disorders

The brain has functionally distinct regions, so there is a topographical correspondence between the location of a lesion in the brain and the symptom. The characterization of symptoms therefore makes it possible to trace which brain region is affected. This helps in identifying the underlying disease. The motor and sensory cortices are perfect examples of this functional topography, often depicted as homunculi [7].

Clinical evaluations for brain disorders thus follow a standardized procedure. In addition to the appraisal of symptoms and signs, the physician often makes an assumption as to where the nervous system is affected. This often overlaps with the syndromic definition: "frontotemporal dementia" implies that the lesions are in the frontotemporal cortices. However, this is not always the case, as some diseases and syndromes still bear the name of the physician who first described them. While most neurologists know that parkinsonism (or Parkinson syndrome) is due to basal ganglia lesions, this is not implied in its name.
2.1 General Information on the Neurological Examination

The neurological examination begins with the collection of anamnestic data, that is, the complete history recalled and recounted by the patient or their entourage, including the complaint, medical history, lifestyle, concurrent treatments, etc. During the collection of anamnestic data, the clinician also carefully observes the patient's behavior. The neurologist then proceeds with the examination of brain function, which is oriented by the complaint and often includes cognitive screening tests and an examination of the motor system, sensitivity, and the autonomic nervous system. Usually, this examination has more formal and structured parts (this can be, for example, a systematic evaluation of reflexes, always in the same order, or the use of a specific scale to assess sensory or cognitive function) and other
2.2 Clinical Interview
A clinical interview precedes any objective assessment. It is adapted to the patient's complaint and as standardized as possible, so as not to forget any question. It consists of:
– Personal and family history with, if necessary, a family tree.
– Lifestyle (including alcohol intake and smoking).
– Past or current treatments.
– A description of the illness by the patient and/or their informant, as accurate as possible. It is important to know the intensity of the symptoms, their frequency, the chronological order of their appearance, the explorations already carried out, and the treatments undertaken as well as their effectiveness.
In a formal evaluation, especially in cohort studies and clinical
trials, symptoms can be assessed thanks to different scales, some of
which will be presented in this chapter, depending on the clinical
variable of interest. These scales’ results will also be used to monitor
the disease evolution, notably in order to test new treatments.
The interview process is probably the most important part of the whole clinical assessment. It delineates the patient's medical issue, which in turn determines the next steps of the examination and management plan. It also creates a relationship of trust that is essential for the patient's future adherence to the physician's recommendations.
Clinical Assessment of Brain Disorders 237
2.3 Evaluation of Cognition and Behavior
The assessment of cognition and behavior can be carried out by the neurologist using more or less in-depth tests depending on the situation, or a complete neuropsychological assessment can be requested and carried out separately by a neuropsychologist (see Subheading 3 of this chapter). The assessment of cognition is
guided by the cognitive complaint of the patient and/or the infor-
mant [9]. However, on the one hand, it is possible that the patient
is not fully aware of their deficits. This is a symptom called anosog-
nosia (which literally means lack of knowledge of the disease) and is
typical of various forms of dementia, including AD and frontotem-
poral dementia, but also brain damage due, for example, to stroke
in certain regions of the brain. On the other hand, a cognitive
complaint can be due to anxiety, depression, and personality traits
and may have no neurological basis. The medical doctor can use
simple tests in their daily practice such as the Mini-Mental State
Examination (MMSE) [10]. For a more detailed description of the
MMSE, please refer to Subheading 3 of this chapter.
2.4 Evaluation of the Motor System
The examination of motor function starts as soon as the physician greets their patient in the waiting room. They will immediately observe the patient's gait and bodily movements. Then, in the office, the observation continues, searching, for example, for muscular atrophy or fasciculations (i.e., muscle twitches detected by looking at the patient's skin). This purely observational phase is followed by a formal examination, provoking objective signs.
One goal of motor assessment is to assess muscle strength. This
is done segmentally, that is, carried out by evaluating the function
of muscle groups that perform the same action, for example, the
muscles that allow the elbow to flex. The neurologist gives a score
ranging from 0 to 5, where 0 indicates that they did not detect any
movement and 5 indicates normal movement strength.
A second aspect which is assessed is muscle tone. It is explored by passively mobilizing the patient's joints. Hypertonia, or rigidity, is
an increase in the tone. When the neurologist moves the joint, it
may remain rigidly in that position (plastic or parkinsonian hyper-
tonia), or the limb may immediately return to the resting position
as soon as the neurologist stops manipulating it (spastic or elastic
hypertonia). Hypotonia is a reduction of muscle tone, i.e., lack of
tension or resistance to passive movement. This is observed in
cerebellar lesions and chorea.
Another goal of motor assessment is evaluating deep tendon
reflexes. Using a reflex hammer, the neurologist taps the tendons
(e.g., Achilles’ tendon for the Achillean reflex). The deep tendon
reflexes will be categorized as (1) normal, (2) increased and
polykinetic (i.e., a single tap provokes more than one movement),
(3) diminished or abolished (as in peripheral nervous system dis-
eases), and (4) pendular (as in cerebellar syndrome). Often evi-
denced in case of increased reflexes, Babinski's sign is the lazy and majestic extension of the big toe when the sole of the foot is stimulated.
Box 2 (continued)
6. Pronation/supination
7. Toe tapping
8. Leg agility
9. Arising from chair
10. Gait
11. Freezing of gait
12. Postural stability
13. Posture
14. Global spontaneity of movement
15. Postural tremor of hands
16. Kinetic tremor of hands
17. Rest tremor amplitude
18. Constancy of rest tremor
2.5 Evaluation of Sensitivity
Sensitivity is the ability to feel different tactile sensations: normal (or crude) touch, pain, heat, or cold. Once again, it depends on the anatomical regions and tracts affected by a pathological process. The anterior spinothalamic tract carries information about crude touch. The lateral spinothalamic tract conveys pain and temperature. Assessment includes measuring:
– Epicritic sensitivity: testing the patient's ability to discriminate two very close stimuli.
– Deep sensitivity: testing joint position sense through blind prehension. The doctor can also ask whether the patient feels the vibrations of a tuning fork placed on bony prominences (knee, elbow).
– Discrimination of hot and cold; sensitivity to pain.
2.6 Other Evaluations
The physician's evaluation will also assess the autonomic nervous system which, when impaired, can induce blood pressure disorders (hypo-/hypertension, orthostatic hypotension without compensatory acceleration of the pulse), diarrhea, sweating disorders, accommodation disorders, and sexual disorders. They will also evaluate cerebellar functions: balance, coordination (whose impairment causes ataxia), and tremor.
Finally, clinicians will assess the functions of the cranial nerves. Cranial nerves are those coming out of the brainstem and have various functions including olfaction, vision, eye movements, facial sensory and motor function, and swallowing. They are tested, once again in a standardized way, from the first to the twelfth.
2.7 Summary of the Neurological Evaluation
At the end of this examination, the signs and symptoms are described in the report, and the physician specifies:
– A syndromic group of signs and symptoms
– The presumed location of brain damage
3 Neuropsychological Assessment
Box 3 (continued)
Attention: Selective attention is the ability to select relevant information from the environment. Sustained attention is the ability to persist for a relatively long time on a certain task.
Visuospatial abilities: Estimation of spatial relationships between the individual and the objects and between the objects themselves, and identification of visual characteristics of a stimulus such as its orientation.
Language: Oral and written production and comprehension, at a phonological, morphological, syntactic, semantic, and pragmatic level.
Executive functions: Superior cognitive functions such as planning, organization, performance monitoring, decision-making, mental flexibility, etc.
Social cognition: Using information previously learned more or less explicitly to explain and predict one's own behavior and that of others in social situations.
3.2 Psychometric Properties of Neuropsychological Tests
The use of cognitive tests is the specificity of the neuropsychological assessment. Each new test is developed according to a rigid and rigorous methodology, based on scientific evidence and designed to minimize all possible sources of error or bias. For example, a test that aims to assess learning skills might include a list of words for the participant to memorize and then recall. These words are not randomly selected but carefully chosen based on characteristics such as frequency of use, length, phonology, etc. The procedures for administering neuropsychological tests are also standardized: the situation (i.e., materials, instructions, test conditions, etc.) is the same for all individuals and dictated by the administration manuals provided with each test.
All tests, before being published, are validated for their psycho-
metric properties and normed. A normative sample is selected
according to certain criteria which may change depending on the
situation [19]. In most cases, these are large samples of healthy
individuals from the general population, stratified by age, sex,
and/or level of education. In other cases, more specific samples
are preferred. The goal is to identify how the score is distributed in
the normative sample. In this way, we can determine if the score
obtained by a hypothetical patient is normal (i.e., around the
average of the normative distribution) or pathological (i.e., far
from the average). Establishing how far from the average an obser-
vation must be in order to be considered abnormal is a real matter
of debate [20]. Many neuropsychological scores, as well as many
biological or physical attributes, follow a normal distribution in the
general population. The most used metrics to determine pathology
thresholds are z scores and percentiles. For a given patient, the
neuropsychologist usually computes the z score by subtracting the
mean of the normative sample from the raw score obtained by the
patient and then dividing the result by the standard deviation
(SD) of the normative sample. The distribution of z scores will
have a mean of 0 and a SD of 1. We can also easily find the percentile
corresponding to the z score. Most often, a score below the fifth percentile (or z score = -1.65) or the second percentile (or z score = -2) is considered pathological. As an example, intelligence,
or intelligence quotient (IQ), is an attribute that follows a normal
distribution. It is conventionally measured with the Wechsler Adult
Intelligence Scale, also known as WAIS [21], or the Wechsler
Intelligence Scale for Children, also known as WISC [22]. The
distribution of IQs has a mean of 100 and a SD of 15 points.
Around 68% of individuals in the general population achieve an IQ of 100 ± 15 points. Scores between 85 and 115 are therefore considered to be average IQs (therefore normal). Ninety-five percent of individuals fall within 30 points of 100, i.e., between 70 and 130.
3.4 The Example of a Cognitive Test: The Mini-Mental State Examination (MMSE)
The Mini-Mental State Examination, also known as MMSE, is one of the most widely used tools in both clinical practice and research, validated in many languages and adapted for administration in many countries. It is a screening tool for adults, which allows assessing global cognition quickly and easily through a paper-and-pencil test lasting 5–10 min.
4.1.1 Neurodegenerative Disorders Affecting Mostly Cognition or Behavior
These include Alzheimer's disease, Lewy body and frontotemporal dementias, as well as rarer conditions such as primary progressive aphasias. This field relies heavily on neuropsychological evaluation. Although progress has been achieved in the diagnosis of these conditions (especially Alzheimer's disease) over the last decades, therapeutic unmet needs remain high.
4.1.2 Movement Disorders
These include Parkinson's disease but also dystonia, myoclonus, tics, and tremors. Different treatment options have emerged for this group of diseases in recent years. These include drugs that act on dopamine levels in the brain (dopamine being one of the main neurotransmitters for movement) and deep brain stimulation, which requires the implantation of electrodes to stimulate or inhibit specific regions of the basal ganglia.
4.1.3 Epilepsy
This broad term refers to abnormal electric activity of neurons in brain regions or in the whole brain, inducing seizures. Seizures are defined by the co-occurrence of symptoms or signs, and the underlying electric abnormalities are detected by electroencephalography (EEG). Many anti-epileptic drugs exist to decrease seizure frequency in these patients. Some patients nevertheless present with pharmacoresistant epilepsy. For such patients, surgery, which aims at resecting part of the brain in order to suppress seizures, can be a treatment option.
4.1.4 Stroke or Acute Neurovascular Diseases
Stroke is managed in stroke emergency units. A stroke can be either a brain infarction or a hemorrhage. Strokes are not primary diseases of the brain tissue but of the arteries, capillaries, and veins that irrigate it. Treatment options range from rapid clot removal in ischemia (whether by thrombolysis or neuroradiological intervention) to anti-aggregating or anticoagulation therapy and physical or speech rehabilitation.
4.1.5 Neuro-oncology
This specialty deals with brain tumors, which may be malignant or benign. There are close connections with neurosurgery units and with neuropathology, which plays a valuable role in analyzing the microstructure of the tumor in order to achieve a precise diagnosis. Treatments typically rely on a combination of surgery, radiotherapy, and chemotherapy.
4.1.6 Peripheral Nerve Diseases
These include all the diseases of the nerves outside of the brain, brainstem, or spine. These diseases induce motor, sensory, and autonomic impairments and are diagnosed through a combination of medical examination and electromyographic (EMG) recordings. Treatment options depend heavily on the cause of the disease, which can range from simple mechanical compression of a nerve requiring mild surgery (carpal tunnel syndrome) to hepatic graft in some rare conditions (TTR mutation causing familial transthyretin amyloidosis).
4.1.7 Headaches
Although headaches are highly prevalent, specialists are rare in university hospitals, as these conditions (including migraine) are often cared for in private practice offices, except for the most urgent causes, which are managed by emergency units. Treatments aim to decrease the frequency of attacks (preventative treatments) in the most severe cases or the pain during a given attack.
4.1.8 Sleep Disorders
Sleep disorders are managed sometimes by neurologists (for diseases like narcolepsy), sometimes by pneumologists (since sleep apnea is among the most frequent causes of sleep impairment), and sometimes by psychiatrists (tackling insomnia, often associated with psychiatric comorbidities). A sleep recording called polysomnography is sometimes performed.
4.1.9 Inflammatory and Demyelinating Brain Diseases
The most emblematic disease of this group is multiple sclerosis, in which the immune system turns against the individual, penetrates the blood–brain barrier, and attacks the myelin that allows the rapid conduction of the neuronal electric signal along the axons. This is one of the most advanced fields of neurology regarding treatment. Since the start of the twenty-first century, specific therapies preventing lymphocytes from crossing the blood–brain barrier have revolutionized the management of multiple sclerosis [37].
4.1.10 Neurogenetic Diseases
Neurogenetic diseases are a group of rare diseases (like Huntington's chorea) due to a genetic mutation. These diseases usually follow a Mendelian mode of inheritance. They have the particularity of being detectable (through genetic testing after specific counseling), which gives the opportunity to study them in their premorbid phase (i.e., before the onset of typical symptoms in a group of mutation carriers). Innovative gene therapies are currently being developed for some of these neurogenetic conditions [38]. Note that there also exist genetic forms of diseases that are mostly sporadic (e.g., familial forms of Alzheimer's disease).
4.1.11 Neuromuscular Disorders
These are diseases affecting the motor neurons, such as amyotrophic lateral sclerosis; the neuromuscular synapse, like myasthenia; or, specifically, the muscles, in myopathies. With the exception of myasthenia, few treatment options exist in this particular field of neurology.
4.2 Importance of a Correct and Timely Diagnostic Classification
Neurologists have a saying: "time is brain." The correct and timely identification of a neurological disease is indeed crucial to be able to mitigate and sometimes reverse the signs and symptoms. As such, machine learning techniques may be very useful tools, both in the context of slow-paced diseases such as Alzheimer's, which are often diagnosed quite late or not at all [39], and to optimize patient flow in emergency care, in case of stroke, for instance. This framework is theoretical, as in practice several diseases can interact to induce symptoms. For instance, dementia is often of mixed origin, due to the association of degenerative (Alzheimer's disease) and vascular alterations. A walking deficit can be due to Parkinson's disease but also in part to osteoarthritis, etc. The correct identification of a disease is in part probabilistic, and this can lead to heterogeneity in the data collected from the clinical assessment.
5 Conclusion
Acknowledgments
The research leading to these results has received funding from the
French government under management of Agence Nationale de la
Recherche as part of the “Investissements d’avenir” program, ref-
erence ANR-10-IAIHU-06 (Agence Nationale de la Recherche-
10-IA Institut Hospitalo-Universitaire-6).
References
1. Lage JMM (2006) 100 years of Alzheimer's disease (1906–2006). J Alzheimers Dis 9(s3):15–26
2. McKhann G, Drachman D, Folstein M et al (1984) Clinical diagnosis of Alzheimer's disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer's Disease. Neurology 34(7):939–944. https://doi.org/10.1212/wnl.34.7.939
3. Öhman F, Hassenstab J, Berron D et al (2021) Current advances in digital cognitive assessment for preclinical Alzheimer's disease. Alzheimers Dement 13(1):e12217
4. Dillenseger A, Weidemann ML, Trentzsch K et al (2021) Digital biomarkers in multiple sclerosis. Brain Sci 11(11):1519
5. Rodríguez-Gómez O, Rodrigo A, Iradier F et al (2019) The MOPEAD project: advancing patient engagement for the detection of "hidden" undiagnosed cases of Alzheimer's disease in the community. Alzheimers Dement 15(6):828–839
6. Daly T, Mastroleo I, Gorski D et al (2020) The ethics of innovation for Alzheimer's disease: the risk of overstating evidence for metabolic enhancement protocols. Theor Med Bioeth 41(5–6):223–237
7. Penfield W, Rasmussen T (1950) The cerebral cortex of man; a clinical study of localization of function. Macmillan
8. Tsai AC, Hong SY, Yao LH et al (2021) An efficient context-aware screening system for Alzheimer's disease based on neuropsychology test. Sci Rep 11(1):18570
36. Foreman MD (1987) Reliability and validity of mental status questionnaires in elderly hospitalized patients. Nurs Res 36(4):216–220
37. Polman CH, Uitdehaag BM (2003) New and emerging treatment options for multiple sclerosis. Lancet Neurol 2(9):563–566
38. Hinderer C, Miller R, Dyer C et al (2020) Adeno-associated virus serotype 1-based gene therapy for FTD caused by GRN mutations. Ann Clin Transl Neurol 7(10):1843–1853
39. Epelbaum S, Paquet C, Hugon J et al (2019) How many patients are eligible for disease-modifying treatment in Alzheimer's disease? A French national observational study over 5 years. BMJ Open 9(6)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 8
Abstract
Medical imaging plays an important role in the detection, diagnosis, and treatment monitoring of brain
disorders. Neuroimaging includes different modalities such as magnetic resonance imaging (MRI), X-ray
computed tomography (CT), positron emission tomography (PET), or single-photon emission computed
tomography (SPECT).
For each of these modalities, we will explain the basic principles of the technology, describe the type of
information the images can provide, list the key processing steps necessary to extract features, and provide
examples of their use in machine learning studies for brain disorders.
Key words Magnetic resonance imaging, Computed tomography, Positron emission tomography,
Single-photon emission computed tomography, Neuroimaging, Medical imaging, Machine learning,
Deep learning, Feature extraction, Preprocessing
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_8,
© The Author(s) 2023
254 Ninon Burgos
Fig. 2 Distribution by imaging modality (top) and brain disorder (bottom) of 1327 articles presenting a study
using machine learning. Note that these numbers should only be taken as rough indicators as they result from
a non-exhaustive literature search. The Scopus query and the resulting articles (after some manual filtering)
are available as a public Zotero library (https://www.zotero.org/groups/4623150/neuroimaging_with_ml_for_
brain_disorders/library)
2 Manipulating Neuroimages
alone will thus usually not provide a diagnosis. It will rather bring
arguments in favor, or against, a potential diagnosis (for instance, in
the exploration of a dementia syndrome, MRI can bring positive
arguments for a diagnosis of Alzheimer’s disease due to the
observed atrophy in specific areas or on the contrary exclude this
diagnosis by showing that the syndrome is due to a different cause
such as a brain tumor). Overall, the diagnosis will generally be made
by the neurologist or the psychiatrist based on a combination of
clinical examination and a set of multimodal data (clinical and
cognitive tests, radiological report, biomarkers, etc.).
However, the use of neuroimages goes way beyond visual
inspection and is subject to quantification using image processing
procedures. This is particularly true in research even though image
processing tools are also increasingly used in clinical routine. A
characteristic of these tools that differentiates them from general
purpose image processing tools is their ability to handle three-
dimensional (3D) images.
2.1 The Nature of 3D Medical Images
Most medical imaging devices acquire 3D images. This is the case for all the ones presented in this chapter (MRI, CT, PET, and SPECT). While 2D images are essentially 2D arrays of elements called pixels (for picture elements), 3D images are 3D arrays of elements called voxels (for volume elements). Depending on the imaging modality, voxel values will represent different properties of the underlying tissues. For example, in a CT image, they will be proportional to linear attenuation coefficients. The shape and size of a voxel will also depend on the imaging modality (or the type of sequence in MRI). When its three dimensions are of equal lengths, the voxel is isotropic; otherwise, it is anisotropic (see Fig. 3). For instance, a typical voxel size for a T1-weighted MR image is about 1 × 1 × 1 mm3, while it is about 3 × 3 × 3 mm3 for a functional MR image. Most neuroimaging modalities will have a voxel dimension between 0.5 mm and 5 mm.
Even though most neuroimages are 3D, they are visualized as
2D slices along different planes: axial, coronal, or sagittal (see
Fig. 4). Multiple tools exist to visualize neuroimages. Several are
available within suites such as FSLeyes,1 Freeview,2 or medInria,3
while others are independent such as Vinci,4 Mango,5 or Horos.6
Note that viewers may interpolate the images they display, which
may be misleading (see Fig. 5 for an illustration).
1 FSLeyes: https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FSLeyes.
2 Freeview: https://surfer.nmr.mgh.harvard.edu/fswiki/FreeviewGuide.
3 medInria: https://med.inria.fr.
4 Vinci: https://vinci.sf.mpg.de.
5 Mango: http://ric.uthscsa.edu/mango.
6 Horos: https://horosproject.org.
Neuroimaging in Machine Learning for Brain Disorders 257
Fig. 3 Most neuroimaging modalities are three-dimensional. Left: volume rendering of an excavated
T1-weighted MR image. Middle: voxel grid with isotropic, i.e., cubic, voxels overlaid on the MRI. Right:
voxel grid with anisotropic, i.e., rectangular, voxels overlaid on the MRI
Fig. 4 Axial, coronal, and sagittal slices extracted from a T1-weighted MR image
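In array terms, the notions of voxel grid and orthogonal slicing can be sketched as follows (NumPy, on a synthetic volume; the axis order and voxel sizes are assumptions for the example, as in practice both come from the image header and orientation metadata):

```python
import numpy as np

# A synthetic 3D image: a 4 x 5 x 6 array of voxels.
volume = np.arange(4 * 5 * 6, dtype=float).reshape(4, 5, 6)

# Physical size of one voxel in mm along each axis (assumed here;
# real tools read this from the image metadata).
voxel_size = (1.0, 1.0, 1.0)
is_isotropic = len(set(voxel_size)) == 1  # True: cubic voxels

# With an assumed (sagittal, coronal, axial) axis order, the three
# orthogonal 2D slices through the middle of the volume are:
sagittal = volume[volume.shape[0] // 2, :, :]  # shape (5, 6)
coronal = volume[:, volume.shape[1] // 2, :]   # shape (4, 6)
axial = volume[:, :, volume.shape[2] // 2]     # shape (4, 5)

# An anisotropic example: 3 mm slices with 1 mm in-plane resolution.
aniso_voxel = (1.0, 1.0, 3.0)
```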
2.2 Extracting Features from Neuroimages
When using machine learning to analyze images, one will often extract features. These features can be grouped into four categories, which we will now describe and which are illustrated in Fig. 6. Note that these features are conceptually the same for the different modalities but their actual content will differ (e.g., volume of a region for anatomical MRI vs average uptake in this region for PET). Modality-specific preprocessing and corrections often need to be applied before neuroimages can be analyzed; these will be described in Subheadings 3, 4, and 5.
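As a concrete illustration of the regional category, the sketch below (NumPy, on invented data) averages voxel values within each region of a label image; this is the kind of operation behind, e.g., a mean uptake per atlas region:

```python
import numpy as np

# Synthetic "image" and "atlas" of the same shape: the atlas assigns
# each voxel to a region (0 = background, 1 and 2 = two brain regions).
image = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
atlas = np.array([[0, 1, 1],
                  [2, 2, 2]])

def regional_means(image, atlas):
    """Mean voxel value per atlas region (background label 0 excluded)."""
    regions = [r for r in np.unique(atlas) if r != 0]
    return {int(r): float(image[atlas == r].mean()) for r in regions}

features = regional_means(image, atlas)
# region 1: mean of 2.0 and 3.0 -> 2.5; region 2: mean of 4, 5, 6 -> 5.0
```

The same boolean-mask pattern yields regional volumes (count the voxels and multiply by the voxel volume) or any other per-region statistic.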
Fig. 5 Axial slice of a T1-weighted MRI with an isotropic voxel size originally of 1 × 1 × 1 mm3 (left) and downsampled to 2 × 2 × 2 mm3 (right), displayed without interpolation or with linear interpolation. While the difference with or without interpolation is subtle at 1 × 1 × 1 mm3, it is clearly visible at 2 × 2 × 2 mm3
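The effect illustrated in Fig. 5 can be reproduced on any array. In the sketch below (SciPy, on a synthetic slice), nearest-neighbor upsampling only repeats existing voxel values, whereas linear interpolation invents intermediate intensities, which is why an interpolated display can be misleading about the true resolution:

```python
import numpy as np
from scipy.ndimage import zoom

# A synthetic low-resolution 2D slice.
slice_lowres = np.array([[0.0, 10.0],
                         [20.0, 30.0]])

# Upsample by a factor of 2 for display.
nearest = zoom(slice_lowres, 2, order=0)  # nearest neighbor: blocky
linear = zoom(slice_lowres, 2, order=1)   # linear: smooth but invented

# Nearest-neighbor only repeats the original intensities...
only_original = set(nearest.ravel()) <= set(slice_lowres.ravel())
# ...while linear interpolation creates values absent from the data.
has_new_values = not set(linear.ravel()) <= set(slice_lowres.ravel())
```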
Fig. 6 Examples of voxel, vertex, regional, and graph features that can be extracted from neuroimages. It is, for
instance, possible to extract voxel-based features from CT and SPECT images, vertex-based features from
anatomical T1-weighted (T1w) MRI or PET images, regional features from diffusion MRI, and graph-based
features from functional MRI. Note that the modalities are just examples. For instance, voxel-based features
can be extracted for any modality. See Subheadings 3, 4, and 5 for a description of the imaging modalities
2.3 Neuroimaging Software Tools
The features described above can be obtained using neuroimaging software tools. However, an important step before any preprocessing or analysis is to properly organize the data. The neuroimaging community proposed the Brain Imaging Data Structure (BIDS) [20], which specifies how to organize data in folders and sub-folders on disk and how to name the files. It also details the metadata necessary to describe neuroimaging experiments.
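A minimal sketch of such an organization is given below (plain Python; the subject label and file names are invented for the example but follow the BIDS pattern of entity-value pairs, and `Name` and `BIDSVersion` are the two mandatory fields of `dataset_description.json`):

```python
import json
from pathlib import Path
from tempfile import mkdtemp

# Root of a toy BIDS data set, created in a temporary directory.
root = Path(mkdtemp()) / "my_bids_dataset"
root.mkdir(parents=True)

# Mandatory top-level metadata describing the data set.
(root / "dataset_description.json").write_text(
    json.dumps({"Name": "Toy dataset", "BIDSVersion": "1.8.0"})
)

# One subject, one anatomical T1-weighted image: the folders encode the
# subject and the data type, and the file name repeats the subject label.
anat_dir = root / "sub-01" / "anat"
anat_dir.mkdir(parents=True)
t1w = anat_dir / "sub-01_T1w.nii.gz"
t1w.touch()  # placeholder for the actual NIfTI file
```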
Many tools exist to process neuroimages.8 The historical
generic frameworks include SPM9 [21], FSL10 [22], FreeSurfer11
[23], or ANTs12 [24]. Some tools are modality-specific such as
MRtrix13 [25], dedicated to diffusion MRI, or AFNI14 [26], dedi-
cated to functional MRI. Recent initiatives aim to make the use of
neuroimaging tools easier by distributing them in containers (e.g.,
BIDSApps15 [27]), by providing in a single environment tools from
preprocessing to machine learning (e.g., Nilearn16 [28]), or by
providing automatic pipelines that do not require a particular
7 List of atlases: https://www.lead-dbs.org/helpsupport/knowledge-base/atlasesresources.
8 List of open source medical imaging software tools: https://idoimaging.com.
9 SPM: https://www.fil.ion.ucl.ac.uk/spm.
10 FSL: https://fsl.fmrib.ox.ac.uk.
11 FreeSurfer: https://surfer.nmr.mgh.harvard.edu.
12 ANTs: http://stnava.github.io/ANTs.
13 MRtrix: https://www.mrtrix.org.
14 AFNI: https://afni.nimh.nih.gov.
15 BIDSApps: https://bids-apps.neuroimaging.io/apps.
16 Nilearn: https://nilearn.github.io.
3.1 Anatomical MRI
3.1.1 Basic Principles
In MRI, most images are obtained by exploiting a magnetic property, called spin, of the hydrogen atomic nuclei found in the water molecules present in the human body. In the absence of a strong external magnetic field, the directions of the protons' spins are random, thus cancelling each other out (Fig. 7a). When the spins enter a strong external magnetic field (B0), they align either parallel or antiparallel, and they all precess around the B0 axis, referred to as the z axis (Fig. 7b). As a result, they cancel each other out in the transverse (x, y) plane, but they add up along the z axis. The result of this vector addition, called net magnetization M0, is proportional to the proton density (Fig. 7c). With the application of a radio frequency pulse denoted as B1, the system of spins and the net magnetization are tipped by an angle determined by the strength and duration of the radio frequency pulse. For a 90° radio frequency pulse, the magnetization along the z axis (Mz) becomes zero and the magnetization in the transverse plane (Mxy) becomes equal to M0 (Fig. 7d). As this radio frequency pulse provides energy to, or excites, the spins, we also talk of radio frequency excitation.
17 Clinica: https://www.clinica.run.
18 MONAI: https://monai.io.
19 TorchIO: https://torchio.readthedocs.io.
20 ClinicaDL: https://clinicadl.readthedocs.io.
Fig. 7 MRI physics in a nutshell. (a) In the absence of a magnetic field, the directions of the proton’s spins are
random. (b) When the spins enter a strong external magnetic field (B0), they align either parallel or antiparallel,
and they all precess around the B0 axis. (c) The net magnetization M0 is proportional to the proton density. (d)
With the application of a radio frequency pulse, the system of spins is tipped
When the radio frequency pulse is then turned off, two phe-
nomena occur. First, the system of spins relaxes back to its preferred
energy state of being parallel with B0 in a time T1, called longitu-
dinal or spin-lattice relaxation time, and the longitudinal magneti-
zation Mz slowly recovers to its original magnitude M0. Second,
each spin starts precessing at a frequency that is slightly different from that of its neighboring spins, because the field of the MRI scanner is not uniform and because each spin is influenced by the small magnetic fields of the neighboring spins. When the spins are
completely dephased, they are evenly spread in the transverse plane,
and Mxy becomes zero. Mxy decreases at a much faster rate than
that at which Mz recovers to M0. The transverse relaxation time T2,
also called spin-spin relaxation time, describes the Mxy decrease
because of interference from neighboring spins, while T2*
describes the decrease because of both spin-spin interactions and
nonuniformities of B0. Finally, the MRI signal is obtained by
measuring the transverse magnetization as an electrical current by
induction.
The contrast in MR images depends on three main parameters:
the proton density, the longitudinal relaxation time T1, and the
transverse relaxation time T2. These parameters can be adjusted by
changing the time at which the signal is recorded, called echo time,
and the interval between successive excitation pulses, called repeti-
tion time. A T1-weighted image is created by choosing a short
repetition time, a T2-weighted image by choosing a long echo
time, and a proton density (PD)-weighted image by minimizing
both T1 and T2 weighting of the image (long repetition time and
short echo time). The corresponding images are referred to as
T1-weighted MRI, T2-weighted MRI, and PD-weighted MRI.
Note that many variations of these sequences exist (for instance,
gradient-echo vs spin-echo), and the corresponding images can exhibit slightly different contrasts.
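The weighting rules above can be made concrete with the idealized spin-echo signal equation, S ∝ PD · (1 − e^(−TR/T1)) · e^(−TE/T2). The tissue parameters and TR/TE choices below are illustrative round numbers in the spirit of 1.5 T textbook values, not measurements:

```python
import math

def spin_echo_signal(pd, t1, t2, tr, te):
    """Idealized spin-echo signal: S = PD * (1 - exp(-TR/T1)) * exp(-TE/T2)."""
    return pd * (1.0 - math.exp(-tr / t1)) * math.exp(-te / t2)

# Illustrative tissue parameters (times in ms, PD in arbitrary units).
white_matter = dict(pd=0.70, t1=600.0, t2=80.0)
gray_matter = dict(pd=0.80, t1=1000.0, t2=100.0)

# T1-weighting: short TR, short TE -> white matter (shorter T1) is brighter.
t1w = {name: spin_echo_signal(tr=500.0, te=15.0, **p)
       for name, p in [("WM", white_matter), ("GM", gray_matter)]}

# T2-weighting: long TR, long TE -> gray matter (longer T2) is brighter.
t2w = {name: spin_echo_signal(tr=4000.0, te=100.0, **p)
       for name, p in [("WM", white_matter), ("GM", gray_matter)]}
```

The white/gray matter contrast flips sign between the two settings, which is exactly the effect the echo and repetition times are chosen to produce.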
Neuroimaging in Machine Learning for Brain Disorders 263
Fig. 8 Example of anatomical MR images. T1-weighted, T2-weighted, and T2-FLAIR images of a patient with
multiple sclerosis from the MSSEG MICCAI 2016 challenge data set [32, 33]
3.1.2 Extracting Features from Anatomical MRI
Several preprocessing steps are often necessary before analyzing anatomical MR images to correct imperfections and ease their comparison.
Bias Field Correction
MR images can be corrupted by a low-frequency and smooth signal caused by magnetic field inhomogeneity. This bias field induces variations in the intensity of the
same tissue in different locations of the image, which deteriorates
the performance of image analysis algorithms such as registration or
segmentation. Several methods exist to correct these intensity inho-
mogeneities, the most popular being the N4 algorithm [34] avail-
able in ANTs [24].
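N4 is the method of choice in practice; the toy sketch below only illustrates the underlying idea — treating the heavily smoothed, low-frequency component of the log-image as an estimate of the multiplicative bias and dividing it out — and is not the actual N4 algorithm:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Synthetic "tissue" image (two intensity classes) corrupted by a smooth,
# multiplicative, low-frequency bias field, as described in the text.
true = np.where(np.arange(128)[None, :] < 64, 1.0, 2.0) * np.ones((128, 128))
yy = np.arange(128)[:, None] / 127.0
bias = np.exp(0.5 * (yy - 0.5))               # smooth multiplicative field
image = true * bias

# Crude correction: a heavily smoothed log-image approximates the log-bias;
# subtracting it (i.e., dividing in intensity space) flattens the field.
log_img = np.log(image)
est_log_bias = gaussian_filter(log_img, sigma=30, mode="nearest")
corrected = np.exp(log_img - est_log_bias)

# Within a single tissue, intensity variation should shrink after correction.
patch = (slice(None), slice(5, 40))           # region inside the left tissue
before_std = np.std(image[patch])
after_std = np.std(corrected[patch])
```

Real bias fields interact with tissue boundaries in ways this naive filter handles poorly, which is precisely why iterative, B-spline-based methods such as N4 are preferred.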
21 HD-BET: https://github.com/MIC-DKFZ/HD-BET.
22 SynthStrip: https://surfer.nmr.mgh.harvard.edu/docs/synthstrip.
3.1.3 Examples of Applications in Machine Learning Studies
T1-weighted MRI is the sequence most commonly found in machine learning studies applied to brain disorders. Several features can be extracted from T1-weighted MRI such as the volume of the whole brain or of regions of interest; the density of a particular tissue, e.g., gray matter; or the local cortical thickness and surface area. All these features, as well as the raw T1-weighted MR images, have, for example, largely been used for the computer-aided diagnosis of dementia, in particular Alzheimer's disease, as they highlight atrophy, i.e., the neuronal loss that is a marker of neurodegenerative diseases [46–49].
T1-weighted MR images acquired with and without the injec-
tion of a contrast agent are often used in the context of brain tumor
detection and segmentation, progression assessment, and survival
prediction as they allow distinguishing active tumor structures
[50]. Such tasks also typically rely on another sequence called
T2-weighted fluid-attenuated inversion recovery (T2-FLAIR) that
allows visualizing a wide range of lesions in addition to tumors [51],
such as those appearing with multiple sclerosis [52, 53] or
age-related white matter hyperintensities (also called leukoaraiosis,
which is linked to small vessel disease).
3.2 Diffusion-Weighted MRI
3.2.1 Basic Principles
Diffusion MRI [54, 55] allows visualizing tissue microarchitecture, thanks to the diffusion of water molecules. Depending on their surroundings, water molecules are able to either move freely, e.g., in the extracellular space, or move following surrounding constraints, e.g., within a neuron. In the former situation, the diffusion is isotropic, while in the latter it is anisotropic. Contrast in a diffusion MR image originates from the fact that, following the application of an excitation pulse, water molecules that move in a particular direction, and so the protons they contain, do not have the same magnetic properties as the ones that move randomly but not far from their origin point. The excitation pulse is parametrized by a weighting coefficient b: the higher the b-value, the stronger the diffusion weighting of the resulting image.
Fig. 9 Example of diffusion-weighted MR images. Top: diffusion volumes acquired using different b-values
(0 and 1000 s/mm2) and gradient directions. Bottom: parametric maps resulting from diffusion tensor
modeling (fractional anisotropy, FA; axial diffusivity, AD; radial diffusivity, RD; and mean diffusivity, MD)
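Assuming the standard monoexponential model S(b) = S(0) · exp(−b · ADC), an apparent diffusion coefficient can be estimated from two acquisitions such as the b = 0 and b = 1000 s/mm² volumes of Fig. 9. A minimal sketch on two synthetic "voxels" (the diffusivity values are illustrative):

```python
import numpy as np

def adc_from_two_shells(s0, sb, b):
    """ADC = -ln(S_b / S_0) / b, from the model S(b) = S(0) * exp(-b * ADC)."""
    eps = 1e-12                      # avoid log(0) in background voxels
    return -np.log(np.maximum(sb, eps) / np.maximum(s0, eps)) / b

# Two synthetic voxels: freely diffusing water vs. a restricted compartment
# (illustrative diffusivities in mm^2/s, not measured data).
s0 = np.array([1000.0, 1000.0])
sb = s0 * np.exp(-1000.0 * np.array([3.0e-3, 0.7e-3]))  # b = 1000 s/mm^2
adc = adc_from_two_shells(s0, sb, b=1000.0)
# adc recovers [3.0e-3, 0.7e-3]: higher where diffusion is unconstrained.
```

Tensor-derived maps such as FA, AD, RD, and MD generalize this idea by fitting a direction-dependent model to many gradient directions.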
3.2.2 Extracting Features from Diffusion MRI
Diffusion MR images are typically acquired with echo-planar imaging, a technique that spatially encodes the MRI signal in a way that enables fast acquisitions with a relatively high signal-to-noise ratio.
However, echo-planar imaging induces geometric distortions and
signal losses known as magnetic susceptibility artifacts. Other arti-
facts include eddy currents (due to the rapid switching of diffusion
gradients), intensity inhomogeneities (as for anatomical MRI), and
potential movements of the subject during the acquisition. These
artifacts need to be corrected before further analyzing the images.
Various methods exist to do so; they are reviewed in [56]. Two
widely used tools enabling the preprocessing of diffusion MR
images are FSL [22] and MRtrix [25], but others exist23 [56].
23 List of tools and software packages to process diffusion MRI: https://github.com/dmripreprocessing/neuroimage-review-2022.
3.2.3 Examples of Applications in Machine Learning Studies
Machine learning studies have mainly used diffusion MRI to assess white matter integrity. This has been done in a very wide variety of disorders. For example, fractional anisotropy and mean diffusivity
have been used to differentiate cognitively normal subjects from
patients with mild cognitive impairment or Alzheimer’s disease
[60, 61]. Diffusion MRI has also been exploited to perform
tumor grading or subtyping [62] following the assumption that
the cellular structure may differ between cancerous and healthy
tissues.
3.3 Functional MRI
3.3.1 Basic Principles
When a region of the brain gets activated by a cognitive task, two phenomena occur: a local increase in cerebral blood flow and changes in oxygenation concentration [63]. Functional MRI (fMRI) is used to measure the latter phenomenon. The blood-oxygen-level-dependent (BOLD) contrast originates from the fact that hemoglobin molecules that carry oxygen have different magnetic properties than hemoglobin molecules that do not carry oxygen.
3.3.2 Extracting Features from Functional MRI
The preprocessing of functional MR images has two main objectives: limit the effect of nonneural sources of variability and correct
acquisition-related artifacts [65]. Preprocessing steps can, for
example, include susceptibility distortion correction (as for diffu-
sion MRI); head motion correction, by registering each volume in
the time series to a reference volume (e.g., the first volume); slice-
timing correction, to eliminate differences between the time of
acquisition of each slice in the volume; or physiologic noise correc-
tion, by temporal filtering [63, 65]. These preprocessing steps can
be performed using tools such as SPM [21], FSL [22], or AFNI
[26], but also using the dedicated fMRIPrep workflow [65].
The majority of machine learning studies in brain disorders
focuses on resting-state rather than task fMRI [66]. This can be
explained by the fact that the resting-state protocol is simpler and
allows multi-site studies (as it is less sensitive to changes in local
experimental settings) [66], which should result in larger samples.
Depending on the application, preprocessed resting-state fMRI
data may be further processed to extract features. One can directly
use voxel-based features (or vertex-based features by projecting the
functional MRI signal onto the cortical surface) [67]. Nevertheless,
to the best of our knowledge, the most common features are graph-
based. Indeed, most supervised algorithms for classification or
regression use brain networks extracted from resting-state time
series. In these networks, also called connectomes, the vertices
correspond to brain regions, whose size may vary, and the edges
encode the functional connectivity strength, which corresponds to
the correlation between time series.
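The graph-based features described above can be sketched as follows: starting from regional time series, Pearson correlation gives the edge weights of the connectome. This is a minimal illustration on synthetic signals; real pipelines add parcellation, confound regression, and often a Fisher z-transformation:

```python
import numpy as np

rng = np.random.default_rng(42)
n_timepoints, n_regions = 200, 4

# Synthetic regional time series: regions 0 and 1 share an underlying
# signal, regions 2 and 3 are independent noise.
shared = rng.standard_normal(n_timepoints)
ts = rng.standard_normal((n_timepoints, n_regions))
ts[:, 0] += 3.0 * shared
ts[:, 1] += 3.0 * shared

# Connectome: vertices = regions, edge weights = correlations of time series.
conn = np.corrcoef(ts, rowvar=False)
np.fill_diagonal(conn, 0.0)          # self-connections are not informative
```

The resulting symmetric matrix (or vectorized upper triangle) is what most classification and regression studies feed to their models.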
3.3.3 Examples of Applications in Machine Learning Studies
Machine learning methods exploiting resting-state fMRI data have been used to investigate brain development and aging, but also neurodegenerative and psychiatric disorders [66]. Functional connectivity patterns have, for instance, been used to distinguish patients with schizophrenia from healthy controls [68] or to discriminate schizophrenia and bipolar disorder from healthy controls [69].
4 X-Ray Imaging
4.1 X-Ray and Angiography
When an X-ray beam passes through the body, part of its energy is absorbed or scattered: the number of X-ray photons is reduced by
attenuation (Fig. 10, left). On the opposite side of the body,
detectors capture the remaining X-ray photons, and an image is
generated. In an X-ray image, the contrast, defined as the relative
intensity change produced by an object, originates from the varia-
tions in linear attenuation coefficient with tissue type and density.
X-ray imaging provides excellent contrast between bone, air,
and soft tissue but very little contrast between the different types of
soft tissue, hence its limited use when studying brain disorders.
However, coupled with the injection of an iodine-based contrast
agent, X-ray imaging enables visualizing cerebral blood vessels and
detecting potential abnormalities such as an aneurysm. This tech-
nique is called X-ray angiography.
4.2 Computed Tomography
4.2.1 Basic Principles
Although the X-ray images produced were originally in 2D, X-ray computed tomography enables the reconstruction of 3D images by rotating the X-ray source and detectors around the body (Fig. 10, right). Rather than using the absolute values of the linear attenuation coefficients, CT image intensities are expressed on a standardized scale, the Hounsfield scale.
Fig. 10 Left: attenuation of X-rays by matter. As it passes through a material of thickness Δx and linear
attenuation coefficient μ, the X-ray beam is attenuated. Its intensity decreases exponentially with the distance
travelled: Io = Ii e^(−μΔx), where Ii and Io are the input and output X-ray intensities. Right: third-generation CT. A
3D image is created by rotating the X-ray source and detectors around the body
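The Beer–Lambert attenuation of Fig. 10 composes multiplicatively across successive tissue layers, so the exponents simply add. A small numerical illustration (the attenuation coefficients and thicknesses are made-up round numbers, not tabulated values):

```python
import math

def attenuate(i_in, layers):
    """Apply Io = Ii * exp(-mu * dx) successively for each (mu, dx) layer."""
    return i_in * math.exp(-sum(mu * dx for mu, dx in layers))

# A beam crossing skull-brain-skull; mu in 1/cm, thickness in cm (illustrative).
layers = [(0.50, 0.5),   # bone: strong attenuation, thin layer
          (0.18, 15.0),  # soft tissue: weak attenuation, long path
          (0.50, 0.5)]
i_out = attenuate(100.0, layers)   # fraction of the beam reaching the detector
```

Because the exponents add, the layer order does not matter, which is why a single line integral of μ along each ray is what CT reconstruction actually recovers.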
4.2.2 Extracting Features from CT Images
Contrary to MRI, CT images usually do not require extensive preprocessing steps [71]. It can however be useful to separate the head from the hardware elements visible in the image (e.g., the bed or pillow) or to extract the brain. This can be done using thresholding and morphological operators. Another common step is spatial normalization.
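The thresholding-and-morphology recipe mentioned above can be sketched on a synthetic 2D "CT slice" in Hounsfield-like units; `scipy.ndimage` stands in here for a real pipeline, and the intensity values are illustrative:

```python
import numpy as np
from scipy import ndimage

# Synthetic axial slice: air background (-1000 HU), a skull ring (~1000 HU),
# brain tissue (~40 HU), and a small hardware blob touching the field of view.
yy, xx = np.mgrid[0:128, 0:128]
r = np.hypot(yy - 64, xx - 64)
slice_hu = np.full((128, 128), -1000.0)
slice_hu[r < 50] = 1000.0        # skull
slice_hu[r < 45] = 40.0          # brain
slice_hu[5:15, 5:15] = 60.0      # e.g., a headrest

# 1) keep soft-tissue intensities, 2) clean up with morphological opening,
# 3) keep the largest connected component (the brain, not the hardware).
soft = (slice_hu > 0) & (slice_hu < 100)
soft = ndimage.binary_opening(soft, iterations=2)
labels, n = ndimage.label(soft)
sizes = ndimage.sum(soft, labels, range(1, n + 1))
brain_mask = labels == (1 + int(np.argmax(sizes)))
brain_mask = ndimage.binary_fill_holes(brain_mask)
```

The same three steps (threshold, clean, keep the largest component) carry over directly to 3D volumes.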
In the context of stroke, non-contrast CT is useful to detect an
intracranial hemorrhage, which appears brighter than the sur-
rounding tissues, or to estimate the extent of early ischemic injury,
which results in a loss of gray-white matter differentiation. CT
angiography can help identify a potential intracranial arterial occlu-
sion, and CT perfusion allows differentiating the regions with
nonviable/non-salvageable tissue, which have very low cerebral
blood flow and volume, from the viable and potentially salvageable
regions [70]. These techniques may also be employed in the con-
text of brain tumors. In particular, contrast-enhanced CT can
detect areas presenting a blood-brain barrier breakdown [72]. An
example of CT acquired before and after contrast injection is dis-
played in Fig. 11.
To the best of our knowledge, CT is most often used in
machine learning in the form of voxel-based features (the image
intensities after some minimal preprocessing steps).
Fig. 11 Example of CT images. Non-contrast CT images, whose window levels were adjusted to better
visualize bone or brain tissues and contrast-enhanced CT image of a patient with lymphoma. Case courtesy of
Dr Yair Glick, Radiopaedia.org, rID: 94844
5 Nuclear Imaging
Fig. 12 PET annihilation. When a positron (e+) and an electron (e-) collide, they
annihilate and create a pair of collinear gamma rays (γ)
Fig. 13 Illustration of PET data detection. Without time-of-flight, the annihilation is located with equal
probability along the line of response, while with time-of-flight it is located in a limited portion of the line of
response
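Time-of-flight localizes the annihilation from the arrival-time difference of the two gamma photons: the offset from the midpoint of the line of response is Δx = c·Δt/2. A quick computation (the 500 ps timing resolution below is an illustrative figure, not a value from this chapter):

```python
C_MM_PER_PS = 0.2998  # speed of light: roughly 0.3 mm per picosecond

def tof_offset_mm(delta_t_ps):
    """Offset from the LOR midpoint given the photon arrival-time difference."""
    return C_MM_PER_PS * delta_t_ps / 2.0

# With a 500 ps timing resolution, the annihilation is localized within
# roughly +/- 75 mm of the midpoint, instead of anywhere along the line.
uncertainty = tof_offset_mm(500.0)
```

This is why, in Fig. 13, time-of-flight restricts the probable annihilation site to a limited portion of the line of response.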
the tracer time to diffuse in the body and accumulate in the target regions. The subject is then placed in the scanner, and the acquisition typically lasts around 15 min. In the dynamic protocol, the subject is first installed in the scanner, and the acquisition starts at the same time the tracer is being injected. This allows recording how the tracer diffuses in the body. Dynamic acquisitions are less common than static ones because of their duration of 60–90 min, which reduces patient throughput. In both static and dynamic protocols, the acquisition is often split into frames of fixed (in the static case) or increasing (in the dynamic case) duration. A static acquisition of 15 min can typically be split into three frames of 5 min, resulting in three PET volumes, each corresponding to the average amount of radioactivity detected at each voxel during the time frame.
18F-fluorodeoxyglucose (FDG) is the most widely used PET radiopharmaceutical [77, 78]. As an analogue of glucose, FDG is
transported to a cell, but, unlike glucose, it remains trapped in the
cell. This radiopharmaceutical is an excellent marker of changes in
glucose metabolism. In the brain, FDG acts as an indirect marker of
synaptic dysfunction and is part of the diagnosis of epilepsy and
neurodegenerative diseases, such as Alzheimer’s disease [79].
While 18F-FDG is a nonspecific tracer, other radiopharmaceuticals
target specific molecular or biological processes and are thus pref-
erentially used for studying specific diseases. Amyloid tracers, such
as the 11C Pittsburgh compound B, 18F-florbetapir, 18F-florbetaben, and 18F-flutemetamol, which bind to fibrillar Aβ plaques, or
tau tracers, such as 18F-flortaucipir, and 18F-MK-6240, which bind
to neurofibrillary tangles, are, for example, used in the diagnosis of
dementia syndromes [80]. Examples are displayed in Fig. 14. Of
note, the so-called amyloid tracers are in fact not specific to amyloid
and also bind to myelin in the white matter, making them of potential interest beyond amyloid imaging.
Fig. 14 Example of PET images. Left: 18F-FDG PET displaying brain glucose metabolism. Middle: 18F-flortau-
cipir PET displaying the presence of tau neurofibrillary tangles. Right: 18F-florbetapir PET displaying the
presence of amyloid plaques. All the images correspond to the same Alzheimer’s disease patient from the
ADNI study [83]
5.1.2 Extracting Features from PET Images
The reconstruction procedure of the PET signal already includes several corrections (e.g., attenuation and scatter corrections), but
several processing steps can be performed before further analyzing
PET images. The first one is often motion correction. This is
typically done by rigidly registering each frame to a reference
frame. The registered frames are then averaged to form a single
volume. To allow for intersubject comparison, brain PET images
need to be intensity normalized, for example, to compensate for
variations in the patients’ weight or dose injected. Standardized
uptake value ratios (SUVRs) are generated by dividing a PET image
by the mean uptake in a reference region. This region can be
obtained from an atlas, and in this case chosen depending on the
tracer and disorder suspected, or in a data-driven manner [84]. Par-
tial volume correction can be performed to limit the spill out of
activity outside of the region where the tracer is meant to accumu-
late [85] using tools such as PETPVC [86]. Finally, PET images
can also be spatially normalized. If an anatomical image (preferably
MRI but also CT) of the subject is available, the PET image is
rigidly registered to the anatomical image, and the anatomical
image is registered to a template, often in standard space. By
composing the two transformations, the PET image is spatially
normalized. Alternatively, if no anatomical image is available, the
PET image can directly be registered to a PET template, for exam-
ple, as implemented in SPM [87]. Dynamic PET images are further
processed to extract quantitative physiological data using kinetic
modeling, which is introduced in [77, 78].
One can then obtain different types of features, as described in
Subheading 2.2. Voxel-based features will very often be the SUVR
at each voxel, usually after spatial normalization. Vertex-based fea-
tures will generally be the SUVR projected onto the cortical surface
[88]. Regional features will usually correspond to the average
SUVR in each region of a parcellation. Graph-based features are
less used than for diffusion or functional MRI but are still employed
to study the so-called metabolic connectivity [89].
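The SUVR computation described above reduces to a division by the mean uptake in the reference region, and regional features then follow from a parcellation. A minimal sketch on a toy 1D "image" (the region labels and uptake values are arbitrary):

```python
import numpy as np

def compute_suvr(pet, reference_mask):
    """Divide a PET image by the mean uptake in a reference region."""
    return pet / pet[reference_mask].mean()

def regional_means(img, parcellation):
    """Average of `img` within each parcellation label (label 0 = background)."""
    return {lab: img[parcellation == lab].mean()
            for lab in np.unique(parcellation) if lab != 0}

# Voxels 0-4 belong to the reference region (e.g., cerebellum), voxels 5-9
# to a target region with twice its uptake.
pet = np.array([2.0, 2.1, 1.9, 2.0, 2.0, 4.0, 4.2, 3.8, 4.0, 4.0])
parcellation = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
suvr = compute_suvr(pet, parcellation == 1)
features = regional_means(suvr, parcellation)
# By construction the reference region averages to 1.0 and the target to 2.0.
```

The same two functions, applied to 3D arrays and an atlas parcellation, yield the voxel-based and regional features used in the studies cited below.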
5.1.3 Examples of Applications in Machine Learning Studies
Machine learning studies have mainly exploited brain PET images in the context of dementia [90]. For example, the usefulness of 18F-FDG PET to differentiate patients with Alzheimer's disease from healthy controls and patients with stable mild cognitive impairment from those who subsequently progressed to Alzheimer's disease has been shown in [48, 91, 92]. 18F-FDG PET has also been used to differentiate frontotemporal dementia from
Fig. 16 Examples of SPECT images. Left: 99mTc-HMPAO SPECT images of a normal control and an epileptic
patient (http://spect.yale.edu) [100]. Right: 123I-FP-CIT SPECT images of a normal control and a patient with
Parkinson’s disease from the PPMI study [101]
5.2.2 Extracting Features from SPECT Images
After the reconstruction of a SPECT image, which includes several corrections, two processing steps are typically performed: intensity normalization and spatial normalization [97, 98]. As for PET, the
intensity of a SPECT image can be normalized using a reference
region, and the image can be spatially normalized by directly regis-
tering it to a SPECT template or by registering it first to an
anatomical image.
As for PET, the most common feature types are voxel-based
(the normalized signal at each voxel) and regional features (often
the average normalized signal within a region). To the best of our
knowledge, vertex-based and graph-based features are rarely used
although they could in principle be computed.
5.2.3 Examples of Applications in Machine Learning Studies
Machine learning studies have mainly exploited brain SPECT images for the computer-aided diagnosis of Parkinsonian syndromes [102]. 123I-FP-CIT SPECT was, for instance, used to
distinguish Parkinson’s disease from healthy controls [103, 104],
predict future motor severity [105], discriminate Parkinson’s dis-
ease from non-Parkinsonian tremor [104], or identify patients
clinically diagnosed with Parkinson’s disease but who have scans
without evidence of dopaminergic deficit [104].
In studies targeting dementia, both 99mTc-HMPAO [106] and 99mTc-ECD [107] tracers were used to differentiate between
images from healthy subjects and images from Alzheimer’s disease
patients.
Fig. 17 Example of 18F-FDG PET, CT, T1-weighted MRI, and fused images
6 Conclusion
Acknowledgements
The author would like to thank the editor for useful suggestions.
The research leading to these results has received funding from the
French government under management of Agence Nationale de la
Recherche as part of the “Investissements d’avenir” program, ref-
erence ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and refer-
ence ANR-10-IAHU-0006 (Agence Nationale de la Recherche-
10-IA Institut Hospitalo-Universitaire-6).
References
1. Ambrose J (1973) Computerized transverse axial scanning (tomography): Part 2. Clinical application. Br J Radiol 46(552):1023–1047. https://doi.org/10.1259/0007-1285-46-552-1023
2. Hounsfield GN (1973) Computerized transverse axial scanning (tomography): Part 1. Description of system. Br J Radiol 46(552):1016–1022. https://doi.org/10.1259/0007-1285-46-552-1016
3. Röntgen WC (1896) On a new kind of rays. Science 3(59):227–231
4. Ter-Pogossian MM, Phelps ME, Hoffman EJ, Mullani NA (1975) A positron-emission transaxial tomograph for nuclear imaging (PETT). Radiology 114(1):89–98. https://doi.org/10.1148/114.1.89
5. Jaszczak RJ, Murphy PH, Huard D, Burdine JA (1977) Radionuclide emission computed tomography of the head with 99mTc and a scintillation camera. J Nucl Med 18(4):373–380
6. Keyes JW, Orlandea N, Heetderks WJ, Leonard PF, Rogers WL (1977) The Humongotron—a scintillation-camera transaxial tomograph. J Nucl Med 18(4):381–387
7. Becquerel H (1903) Recherches sur une propriété nouvelle de la matière: activité radiante spontanée ou radioactivité de la matière. Mémoires de l'Académie des Sciences de l'Institut de France, L'Institut de France
8. Young IR, Bailes DR, Burl M, Collins AG, Smith DT, McDonnell MJ, Orr JS, Banks LM, Bydder GM, Greenspan RH, Steiner RE (1982) Initial clinical evaluation of a whole body nuclear magnetic resonance (NMR) tomograph. J Comput Assist Tomogr 6(1):1–18. https://doi.org/10.1097/00004728-198202000-00001
9. Bloch F (1946) Nuclear induction. Phys Rev 70(7–8):460–474. https://doi.org/10.1103/PhysRev.70.460
10. Townsend DW, Beyer T, Blodgett TM (2003) PET/CT scanners: a hardware approach to image fusion. Semin Nucl Med 33(3):193–204. https://doi.org/10.1053/snuc.2003.127314
11. Schlemmer HP, Pichler B, Wienhard K, Schmand M, Nahmias C, Townsend D, Heiss WD, Claussen C (2007) Simultaneous MR/PET for brain imaging: first patient scans. J Nucl Med 48(supplement 2):45P–45P
12. Schmand M, Burbar Z, Corbeil J, Zhang N, Michael C, Byars L, Eriksson L, Grazioso R, Martin M, Moor A, Camp J, Matschl V, Ladebeck R, Renz W, Fischer H, Jattke K, Schnur G, Rietsch N, Bendriem B, Heiss WD (2007) BrainPET: first human tomograph for simultaneous (functional) PET and MR imaging. J Nucl Med 48(supplement 2):45P–45P
13. Patton JA, Delbeke D, Sandler MP (2000) Image fusion using an integrated, dual-head coincidence camera with X-ray tube-based attenuation maps. J Nucl Med 41(8):1364–1368
14. Hutton BF, Occhipinti M, Kuehne A, Máthé D, Kovács N, Waiczies H, Erlandsson K, Salvado D, Carminati M, Montagnani GL et al (2018) Development of clinical simultaneous SPECT/MRI. Br J Radiol 91(1081):20160690. https://doi.org/10.1259/bjr.20160690
15. Smith-Bindman R, Kwan ML, Marlow EC, Theis MK, Bolch W, Cheng SY, Bowles EJA, Duncan JR, Greenlee RT, Kushi LH, Pole JD, Rahm AK, Stout NK, Weinmann S, Miglioretti DL (2019) Trends in use of medical imaging in US health care systems and in Ontario, Canada, 2000–2016. JAMA 322(9):843–856. https://doi.org/10.1001/jama.2019.11456
16. Evans AC, Janke AL, Collins DL, Baillet S (2012) Brain templates and atlases. Neuroimage 62(2):911–922. https://doi.org/10.1016/j.neuroimage.2012.01.024
17. Dale AM, Fischl B, Sereno MI (1999) Cortical surface-based analysis: I. Segmentation and surface reconstruction. Neuroimage 9(2):179–194. https://doi.org/10.1006/nimg.1998.0395
18. Fischl B, Sereno MI, Dale AM (1999) Cortical surface-based analysis: II. Inflation, flattening, and a surface-based coordinate system. Neuroimage 9(2):195–207. https://doi.org/10.1006/nimg.1998.0396
19. de Vico Fallani F, Richiardi J, Chavez M, Achard S (2014) Graph analysis of functional brain networks: practical issues in translational neuroscience. Philos Trans R Soc B Biol Sci 369(1653):20130521. https://doi.org/10.1098/rstb.2013.0521
20. Gorgolewski KJ, Auer T, Calhoun VD, Craddock RC, Das S, Duff EP, Flandin G, Ghosh SS, Glatard T, Halchenko YO, Handwerker DA, Hanke M, Keator D, Li X, Michael Z, Maumet C, Nichols BN, Nichols TE, Pellman J, Poline JB, Rokem A, Schaefer G, Sochat V, Triplett W, Turner JA, Varoquaux G, Poldrack RA (2016) The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Sci Data 3:sdata201644. https://doi.org/10.1038/sdata.2016.44
21. Penny WD, Friston KJ, Ashburner JT, Kiebel SJ, Nichols TE (2011) Statistical parametric mapping: the analysis of functional brain images. Elsevier, Amsterdam
22. Jenkinson M, Beckmann CF, Behrens TE, Woolrich MW, Smith SM (2012) FSL.
next in diffusion MRI preprocessing. NeuroImage 249:118830. https://doi.org/10.1016/j.neuroimage.2021.118830
57. Soares JM, Marques P, Alves V, Sousa N (2013) A hitchhiker's guide to diffusion tensor imaging. Front Neurosci 7:31. https://doi.org/10.3389/fnins.2013.00031
58. Jeurissen B, Descoteaux M, Mori S, Leemans A (2019) Diffusion MRI fiber tractography of the brain. NMR Biomed 32(4):e3785. https://doi.org/10.1002/nbm.3785
59. Zhang H, Schneider T, Wheeler-Kingshott CA, Alexander DC (2012) NODDI: practical in vivo neurite orientation dispersion and density imaging of the human brain. NeuroImage 61(4):1000–1016. https://doi.org/10.1016/j.neuroimage.2012.03.072
60. Maggipinto T, Bellotti R, Amoroso N, Diacono D, Donvito G, Lella E, Monaco A, Scelsi MA, Tangaro S (2017) DTI measurements for Alzheimer's classification. Phys Med Biol 62(6):2361. https://doi.org/10.1088/1361-6560/aa5dbe
61. Wen J, Samper-González J, Bottani S, Routier A, Burgos N, Jacquemont T, Fontanella S, Durrleman S, Epelbaum S, Bertrand A, Colliot O (2021) Reproducible evaluation of diffusion MRI features for automatic classification of patients with Alzheimer's disease. Neuroinformatics 19(1):57–78. https://doi.org/10.1007/s12021-020-09469-5
62. Park YW, Oh J, You SC, Han K, Ahn SS, Choi YS, Chang JH, Kim SH, Lee SK (2019) Radiomics and machine learning may accurately predict the grade and histological subtype in meningiomas using conventional and diffusion tensor imaging. Eur Radiol 29(8):4068–4076. https://doi.org/10.1007/s00330-018-5830-3
63. Glover GH (2011) Overview of functional magnetic resonance imaging. Neurosurg Clin 22(2):133–139. https://doi.org/10.1016/j.nec.2010.11.001
64. Fox MD, Raichle ME (2007) Spontaneous fluctuations in brain activity observed with functional magnetic resonance imaging. Nat Rev Neurosci 8(9):700–711. https://doi.org/10.1038/nrn2201
65. Esteban O, Markiewicz CJ, Blair RW, Moodie CA, Isik AI, Erramuzpe A, Kent JD, Goncalves M, DuPre E, Snyder M, Oya H, Ghosh SS, Wright J, Durnez J, Poldrack RA, Gorgolewski KJ (2019) fMRIPrep: a robust preprocessing pipeline for functional MRI. Nat Methods 16(1):111–116. https://doi.org/10.1038/s41592-018-0235-4
66. Khosla M, Jamison K, Ngo GH, Kuceyeski A, Sabuncu MR (2019) Machine learning in resting-state fMRI analysis. Magn Reson Imaging 64:101–121. https://doi.org/10.1016/j.mri.2019.05.031
67. Fischl B, Sereno MI, Tootell RB, Dale AM (1999) High-resolution intersubject averaging and a coordinate system for the cortical surface. Hum Brain Mapp 8(4):272–284. https://doi.org/10.1002/(SICI)1097-0193(1999)8:4%3C272::AID-HBM10%3E3.0.CO;2-4
68. Kim J, Calhoun VD, Shim E, Lee JH (2016) Deep neural network with weight sparsity control and pre-training extracts hierarchical features and enhances classification performance: evidence from whole-brain resting-state functional connectivity patterns of schizophrenia. NeuroImage 124:127–146. https://doi.org/10.1016/j.neuroimage.2015.05.018
69. Rashid B, Arbabshirani MR, Damaraju E, Cetin MS, Miller R, Pearlson GD, Calhoun VD (2016) Classification of schizophrenia and bipolar patients using static and dynamic resting-state fMRI brain connectivity. NeuroImage 134:645–657. https://doi.org/10.1016/j.neuroimage.2016.04.051
70. Wannamaker R, Buck B, Butcher K (2019) Multimodal CT in acute stroke. Curr Neurol Neurosci Rep 19(9):63. https://doi.org/10.1007/s11910-019-0978-z
71. Muschelli J (2019) Recommendations for processing head CT data. Front Neuroinform 13:61. https://doi.org/10.3389/fninf.2019.00061
72. Chourmouzi D, Papadopoulou E, Marias K, Drevelegas A (2014) Imaging of brain tumors. Surg Oncol Clin 23(4):629–684. https://doi.org/10.1016/j.soc.2014.07.004
73. Yeo M, Tahayori B, Kok HK, Maingard J, Kutaiba N, Russell J, Thijs V, Jhamb A, Chandra RV, Brooks M, Barras CD, Asadi H (2021) Review of deep learning algorithms for the automatic detection of intracranial hemorrhages on computed tomography head imaging. Journal of NeuroInterventional Surgery 13(4):369–378. https://doi.org/10.1136/neurintsurg-2020-017099
74. Buchlak QD, Milne MR, Seah J, Johnson A, Samarasinghe G, Hachey B, Esmaili N, Tran A, Leveque JC, Farrokhi F, Goldschlager T, Edelstein S, Brotchie P (2022) Charting the potential of brain computed tomography deep learning systems. J
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 9
Abstract
In this chapter, we present the main characteristics of electroencephalography (EEG) and magnetoencephalography (MEG). More specifically, this chapter is dedicated to the presentation of the data and the way they can be acquired and analyzed. Then, we present the main features that can be extracted and their applications to brain disorders, with concrete examples to illustrate them. Additional materials associated with this chapter are available in the dedicated GitHub repository.
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_9,
© The Author(s) 2023
286 Marie-Constance Corsi
2 Basic Principles
2.1 Origin of the Signals
Neurons create electrical signals, transmitted to other cells via synapses. First, an action potential (AP) arrives at a synaptic cleft
(step 1 in Fig. 1) where it will transmit chemical information via
neurotransmitters (step 2 in Fig. 1) that generate postsynaptic
potentials (PSPs) and local currents (step 3 in Fig. 1). A PSP will
create a current sink and will propagate to the cell body, generating
a current source (step 4 in Fig. 1). As a result, the PSP
creates an electrical dipole consisting of a negative pole (i.e., the
sink) and a positive pole (i.e., the source). This dipole will generate
primary (intracellular) currents and secondary (extracellular) cur-
rents. M/EEG signals result from postsynaptic potentials. More
specifically, M/EEG signals result from the spatial and temporal
summation of the activity of a large population of synchronous
neurons. But notable differences exist between MEG and EEG.
Table 1
Main features to compare MEG and EEG
2.2 Evoked and Oscillatory Activity
There are two main types of electrophysiological activity of interest in the M/EEG domain: evoked and oscillatory activity. Evoked responses are weak variations of electromagnetic activity resulting from a stimulation (for instance, in response to a task performed by the participant). Given their low amplitude, it is often necessary to average signals over chunks of signal, referred to as epochs, to reduce noise. To identify and describe these evoked responses, there is a specific way to name them according to their latency, their amplitude, their shape, and the
Fig. 2 Evoked activity. Results from a simulation where two sources, located in the visual area, generated activity after a stimulus. On the right, we plotted the associated time course over the scalp (synthetic signals), resulting from the averaging of 1000 repetitions. One can notably observe a positive wave around t = 300 ms. The code to generate this figure is available in the dedicated GitHub repository
polarity. Let us take an example (see Fig. 2), which represents evoked responses from a study in which we simulated a visual stimulation. We first see a positive deflection occurring 300 ms after the presentation of the stimulus, which is referred to as the P300.
can reflect different mechanisms: the early components are mostly
exogenous and are related to the stimulus characteristics; the late
components are endogenous and are related to the performed task
and to the subject’s state.
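The effect of averaging across epochs can be illustrated with a short simulation (a sketch with NumPy only; the sampling rate, amplitudes, and noise level are arbitrary choices for illustration, not values taken from the chapter's figures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "evoked response": a positive wave peaking 300 ms post-stimulus
# (hypothetical sampling rate and amplitudes, for illustration only).
fs = 250                                  # sampling frequency, Hz
t = np.arange(0, 0.8, 1 / fs)             # 0-800 ms post-stimulus
true_evoked = 2e-6 * np.exp(-((t - 0.3) ** 2) / (2 * 0.05 ** 2))  # ~2 uV P300-like peak

# Single trials: the evoked wave is buried in ongoing background activity.
n_trials = 1000
noise = rng.normal(0.0, 10e-6, size=(n_trials, t.size))  # ~10 uV background
epochs = true_evoked + noise

# Averaging across epochs attenuates the noise by ~1/sqrt(n_trials),
# which is why the P300 emerges only after averaging many repetitions.
evoked = epochs.mean(axis=0)

peak_latency = t[np.argmax(evoked)]
print(f"peak latency: {peak_latency * 1000:.0f} ms")
```

With 1000 trials, the residual noise in the average is roughly 30 times smaller than in a single trial, so the simulated P300-like peak becomes clearly visible.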
The oscillatory activity, or induced activity, results from the summation of the activity in a given brain region. These rhythms are mainly defined by their frequency, their amplitude, their shape, their location, and their duration. In Fig. 3, we provide examples of the main rhythms found in the literature. Each frequency band is referred to by a Greek letter. Delta (0.5–3 Hz) and Theta (3–7 Hz) rhythms are detected in deep and light sleep, respectively. Alpha (8–12 Hz, in posterior areas) and Mu (7–13 Hz, in central areas) rhythms are both observed during quiet wakefulness and resting state (with the eyes closed for Alpha). The Beta rhythm (13–30 Hz) is detected during active wakefulness and during cognitive tasks such as motor imagery, for instance. The Gamma rhythm (divided into two sub-rhythms: slow, 30–70 Hz, and fast, beyond 70 Hz) is observed during specific cognitive processing.
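The band boundaries above translate directly into a band-power computation. The sketch below (NumPy/SciPy; the synthetic signal and the exact band edges are illustrative, as boundaries vary across studies) estimates the power spectral density with Welch's method and integrates it within each band:

```python
import numpy as np
from scipy.signal import welch
from scipy.integrate import trapezoid

# Canonical frequency bands from the text (Hz); exact boundaries vary across studies.
BANDS = {"delta": (0.5, 3), "theta": (3, 7), "alpha": (8, 12),
         "beta": (13, 30), "gamma_slow": (30, 70)}

def band_power(signal, fs, band):
    """Average power of `signal` within `band`, integrated from a Welch periodogram."""
    freqs, psd = welch(signal, fs=fs, nperseg=fs * 2)
    lo, hi = band
    mask = (freqs >= lo) & (freqs <= hi)
    return trapezoid(psd[mask], freqs[mask])

# Synthetic "eyes-closed" signal: a 10 Hz (alpha) oscillation plus noise.
fs = 256
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 10 * t) + 0.2 * rng.normal(size=t.size)

powers = {name: band_power(x, fs, b) for name, b in BANDS.items()}
dominant = max(powers, key=powers.get)
print(dominant)  # → alpha
```

The same computation, applied band by band, underlies many of the spectral markers discussed later in the chapter (e.g., alpha power changes in Alzheimer's disease).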
3 M/EEG Experiments
Fig. 3 Time courses associated with the main rhythms that one can observe in M/EEG recordings. These plots were obtained from synthetic signals. The code to generate this figure is available in the dedicated GitHub repository
3.1 Instrumentation
3.1.1 EEG
EEG signals are recorded through electrodes placed over the scalp; EEG relies on differences of potentials. The first EEG recordings were performed by Hans Berger in 1924. He described an oscillatory activity at 8 Hz occurring in the posterior area of the scalp when the subject is awake with the eyes closed.
There are different types of electrodes: wet/dry electrodes and active/passive electrodes. Wet electrodes are generally made of tin, silver, or silver chloride (Ag/AgCl). They need an electrolytic gel to enable conduction between the skin and the electrode. Dry electrodes are made of stainless steel, which acts as a conductor between the skin and the electrode. Active electrodes contain an electronic module that pre-amplifies the signal to ensure the stability of the system against changes in impedance and noise. Passive electrodes do not include such a pre-amplification module.
Fig. 4 M/EEG instrumentation. (a) EEG experimental setup. (b) Example of EEG montage. For illustrative purposes, each color corresponds to a brain area. Each circle represents either a sensor or a landmark. Sensors appear in color, while landmarks appear in gray. Sensors are designated with a letter and a number: the letter is indicative of the brain region, and odd numbers correspond to the left hemisphere, even ones to the right hemisphere. (c) MEG experimental setup
EEG and MEG 291
3.1.2 MEG
Sensors and Main Devices. The difficulty here is to detect signals that are 10⁹ times weaker than the Earth's magnetic field. Current devices rely on superconducting quantum interference devices (SQUIDs), which can detect such small MEG signals [3]. One of the first proofs of concept was made by D. Cohen in the 1970s [4]. SQUIDs have a sensitivity, defined here as the smallest variation of magnetic field that can be detected by the sensor, of 1 fT/√Hz. To obtain such performance, a magnetically shielded room is required to remove the environmental noise, and part of the device needs to be cooled via a cryogenic system (see Fig. 4c). Two types of
3.2 Data Acquisition
Depending on the tasks and on the hardware used, the duration of an M/EEG experiment may vary. This section presents the main steps that constitute the data acquisition.
The first step consists in preparing all the materials needed to perform the experiment. For EEG, this means cleaning the locations where the electrodes will be in contact with the skin (e.g., forehead and mastoids). The electrodes and the EEG cap are then placed. Several key distances can be measured to verify that the cap is well placed, or to record fiducial points to be matched to other modalities afterward (e.g., MRI). Then, the experimenter needs to ensure that the electrical contact between the electrodes and the scalp is established. For that purpose, the impedance of each electrode is assessed: the lower, the better. In the case of wet electrodes, the experimenter injects gel at each sensor location. Once the impedances are below a certain threshold, typically a few kOhms, the experiment can start. Regarding MEG, the experimenter places head-tracking coils to measure the head position before each recording. This helps to account for large head movements.
Fig. 5 Examples of artifacts in M/EEG. (a) Cardiac artifacts recorded with magnetometers. (b) Ocular artifacts recorded with EEG. (c) Power line noise recorded with gradiometers. Given its characteristics, plotting the power spectra makes it easy to identify. The code to generate this figure is available in the dedicated GitHub repository
4 Data Analysis
4.1 Types of Artifacts/Noise
The notion of artifact depends strongly on the signal of interest. Here, we consider as artifacts the signals that make the recording more difficult and may hamper the analysis of the brain activity recorded with EEG and/or MEG. Such artifacts can be divided into two categories: neurophysiological artifacts and environmental noise. This section presents their main features.
Fig. 6 Data analysis in M/EEG: general workflow. Raw signals first undergo head movement compensation and environmental noise removal (MEG only), then data inspection and artifact removal; source reconstruction (based on the individual MRI or a template) then enables advanced analyses such as connectivity and network studies. Each step is followed by a quality check (QC). Source reconstruction is not compulsory but advisable in specific cases. The code to generate this figure is accessible via the dedicated GitHub repository
4.1.2 Environmental Noise
This category refers to the artifacts generated by the environment that surrounds the experimental setup. They can be magnetic (e.g., magnetized devices that interfere with the MEG sensors), linked with mechanical vibrations (e.g., a tramway passing nearby), or simply associated with the power line (occurring at 50 Hz or 60 Hz; see Fig. 5c). We do not aim at being exhaustive; we simply want the reader to be aware of the possible sources of environmental noise when analyzing M/EEG signals, even though the Faraday cage and the magnetically shielded room, used in EEG and MEG respectively, can partly prevent them.
4.1.3 System Noise
This category refers to artifacts generated by the sensors themselves. For example, in MEG, one can observe SQUID jumps or saturation. In both MEG and EEG, one can have broken sensors.
4.2 Preprocessing
This section presents the main steps of the preprocessing pipeline, dedicated to artifact removal. This is probably the most crucial part of M/EEG data analysis: the point is to remove noise without eliminating information of interest or distorting the signal. Attention must be paid to building the pipeline best suited to the dataset and to the scientific question to be addressed. As such, the first thing to do when working with a new dataset is to study it extensively, in particular by inspecting the M/EEG signals but also the associated broadband power spectra. This preliminary step enables one to identify most of the artifacts and, more importantly, whether they have a specific temporal and/or frequency signature (e.g., presence of periodic artifacts). From this point, it is possible to choose a specific strategy to remove the observed noise. In the case of cardiac and ocular artifacts, given their clear pattern, an efficient way to isolate and reduce them consists in applying independent component analysis (ICA) [13]. One can visually identify the components to be removed from both their time courses and their topographies (to avoid removing too many components) and manually select them. Another, more reproducible possibility consists of using biosignals (e.g., the electrocardiogram and the electrooculogram) and computing correlations between time series. This technique ensures the robustness of the decision to remove a component.
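The correlation-based selection of components can be sketched as follows (NumPy only; the ICA decomposition itself is assumed to have been computed already, e.g., with a dedicated toolbox, and both the function name and the 0.8 threshold are purely illustrative and must be tuned per dataset):

```python
import numpy as np

def flag_artifact_components(sources, reference, threshold=0.8):
    """Flag independent components correlated with a reference biosignal.

    sources   : (n_components, n_times) ICA component time series
    reference : (n_times,) reference channel, e.g., an EOG or ECG recording
    Returns indices of components whose |Pearson r| exceeds `threshold`.
    """
    ref = (reference - reference.mean()) / reference.std()
    src = (sources - sources.mean(axis=1, keepdims=True)) / sources.std(axis=1, keepdims=True)
    r = src @ ref / ref.size  # Pearson correlation of each component with the reference
    return np.flatnonzero(np.abs(r) > threshold)

# Toy example: component 1 is essentially a copy of a (hypothetical) EOG channel.
rng = np.random.default_rng(2)
eog = rng.normal(size=5000)
comps = np.vstack([rng.normal(size=5000),
                   eog + 0.1 * rng.normal(size=5000)])
print(flag_artifact_components(comps, eog))  # → [1]
```

The flagged components would then be zeroed out before reconstructing the sensor signals, which is exactly what toolbox-level ICA "apply" steps do.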
In the case of artifacts at a specific frequency (e.g., power line noise at 50 Hz or 60 Hz), one can consider applying notch filters. In the same spirit, in the case of muscular activity, applying a low-pass filter with a cutoff frequency at 40 Hz can be of interest.
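These two filtering steps can be sketched with SciPy (the sampling rate, filter order, and notch quality factor below are illustrative choices, not prescriptions):

```python
import numpy as np
from scipy.signal import iirnotch, butter, filtfilt, welch

fs = 500  # sampling frequency in Hz (illustrative)

# Notch filter at the power-line frequency (50 Hz here; 60 Hz in other regions).
b_notch, a_notch = iirnotch(w0=50, Q=30, fs=fs)

# 4th-order Butterworth low-pass at 40 Hz to attenuate muscular activity.
b_lp, a_lp = butter(N=4, Wn=40, btype="low", fs=fs)

# Synthetic signal: 10 Hz brain rhythm + 50 Hz line noise + broadband noise.
rng = np.random.default_rng(3)
t = np.arange(0, 10, 1 / fs)
x = (np.sin(2 * np.pi * 10 * t)
     + 0.8 * np.sin(2 * np.pi * 50 * t)
     + 0.3 * rng.normal(size=t.size))

# filtfilt applies each filter forward and backward (zero-phase filtering).
x_clean = filtfilt(b_lp, a_lp, filtfilt(b_notch, a_notch, x))

# Verify the attenuation: compare 50 Hz power before and after filtering.
freqs, p_raw = welch(x, fs=fs, nperseg=1024)
_, p_clean = welch(x_clean, fs=fs, nperseg=1024)
i50 = np.argmin(np.abs(freqs - 50))
print(p_clean[i50] / p_raw[i50])  # much smaller than 1
```

Zero-phase filtering (forward-backward) is commonly preferred offline because it does not shift the latencies of evoked components.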
4.3 Source Reconstruction
It is possible to directly analyze the signals recorded by the sensors; in that case, the analysis is said to be performed in sensor space. However, it is also possible to go one step further and estimate the activity within the brain. This processing step, called source reconstruction, consists in estimating the locations of the neural sources underlying the M/EEG signals. It can be performed when one wants access to a higher spatial resolution, to provide a more accurate description, and interpretation, of the neurophysiological phenomena at play. For that purpose, both the direct and the inverse problems need to be solved [15, 16].
4.3.1 Direct Problem
Here, we aim at modeling the electromagnetic field produced by a cerebral source with known characteristics. For that purpose, it is necessary to consider both a physical model of the sources and a model that predicts how these sources generate electromagnetic fields at the scalp level. The simplest model is the spherical model, which considers the head as an ensemble of spheres. Each sphere corresponds to a given tissue (brain, cerebrospinal fluid, skull, or skin) characterized by a given conductivity. Even though it is possible to adjust the spheres to the geometry of the head or to restrict them to a limited number of regions of interest, this model is an oversimplification of the head geometry. More realistic models rely on geometrical reconstructions of the different layers that form the head tissues, directly extracted from the anatomical magnetic resonance imaging (MRI) data of, ideally, the participant (the MRI thus needs to be acquired separately) or from a dedicated template (e.g., MNI Colin 27). They consist in building meshes of the interfaces between different tissues. We can cite three approaches: the boundary element method (BEM) [17], which is the most widely used, the finite difference method (FDM), and the finite element method (FEM). Another model, called overlapping spheres [18], consists of fitting a given sphere under each sensor.
Even though there are no guidelines regarding the choice of the method, we can provide some elements of recommendation: given the high sensitivity of EEG to variations in conductivity, the BEM model can be a tool of choice. As for MEG, which is less sensitive to changes in conductivity, the overlapping spheres can be considered.
4.3.2 Inverse Problem
One of the main challenges of the inverse problem lies in the nonuniqueness of its solution: a large number of brain activity patterns could generate the same signature detected at the sensor level. Therefore, constraints or assumptions are essential to obtain a unique solution that best reflects the acquired data [15, 16]. In this section, we provide a short overview of the methods most used in routine practice.
The dipole modeling methods rely on modeling the sources via a reduced number of equivalent dipoles, each of which represents a source activity. As a result, such methods are based on an a priori hypothesis on the required number of sources.
Scanning methods, such as the MUSIC approach [19], consist in estimating the probability of the presence of a current dipole inside each voxel. Among them are the beamformer methods [20], which consist in applying a spatial filter to estimate the source activity at each location. We can cite the linearly constrained minimum variance (LCMV) beamformer and synthetic aperture magnetometry (SAM) [21] as examples of beamformer methods [22].
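For illustration, the LCMV spatial filter for a single source location can be written in a few lines: the weights minimize the output variance subject to unit gain on that location's forward field. This is a minimal sketch with a random, hypothetical leadfield; real pipelines obtain the leadfield by solving the direct problem and regularize the covariance more carefully:

```python
import numpy as np

def lcmv_weights(leadfield, cov):
    """Spatial filter weights for one source location (scalar LCMV).

    leadfield : (n_sensors,) forward field of a unit dipole at that location
    cov       : (n_sensors, n_sensors) sensor-space data covariance
    The weights minimize output variance subject to unit gain: w @ leadfield == 1.
    """
    c_inv_l = np.linalg.solve(cov, leadfield)
    return c_inv_l / (leadfield @ c_inv_l)

# Toy example with a random (hypothetical) leadfield and a regularized covariance.
rng = np.random.default_rng(4)
n_sensors = 32
l = rng.normal(size=n_sensors)
data = rng.normal(size=(n_sensors, 1000))
cov = data @ data.T / 1000 + 1e-6 * np.eye(n_sensors)  # regularization avoids singularity

w = lcmv_weights(l, cov)
print(float(w @ l))  # ≈ 1.0: the unit-gain constraint holds by construction
```

Scanning the whole source grid amounts to recomputing these weights for each candidate location and applying them to the sensor time series.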
¹ Originating from the mixture of signals engendered by different sources recorded at a given sensor.
6.1 Clinical Applications of M/EEG
The spatial and temporal resolutions of M/EEG enable the observation of a large number of processes; notably, they can detect both evoked responses and oscillatory activity. As such, using this information could pave the way to biomarkers of brain disorders. To illustrate this point, we focus our presentation on two specific clinical applications: epilepsy and Alzheimer's disease. Nevertheless, M/EEG can be useful for a wider range of applications, both in neurological and in psychiatric disorders [42, 43].
6.1.2 Alzheimer's Disease
Alzheimer's disease is the most common dementia, accounting for 60–80% of cases. The first symptoms are a deficit in short-term memory and concentration, followed later by a decline in linguistic skills, visuospatial orientation, and abstract reasoning and judgment. As the pathophysiological process of the disease starts many years before the occurrence of symptoms [52], it is crucial to identify biomarkers that allow a diagnosis as early as possible. Efforts have been made to describe mild cognitive impairment (MCI) and Alzheimer's disease (AD) with M/EEG. These studies essentially focus on oscillatory activity and on interactions between brain areas [53, 54]. In particular, patients present a reduced synchrony [55], and a decrease of the alpha power (i.e., between 8 and 12 Hz) correlates with lower cognitive status and hippocampal atrophy. Studies performed with MEG in the preclinical and prodromal stages of AD showed that the effects of amyloid-beta deposition were associated with an increase of prefrontal alpha power and that altered connectivity in the default mode network was present in normal individuals at risk for AD [56, 57].
A recent EEG study showed that the effects of neurodegeneration were focused on frontocentral regions, with an increase in high-frequency bands (beta and gamma) and a decrease in lower-frequency bands (delta) [58]. In particular, EEG patterns differ depending on the degree of amyloid burden, suggesting a compensatory mechanism: delta power follows a U-shaped curve, while the other tested metrics follow an inverted U-shaped curve.
6.2.3 Current Challenges and Perspectives
Despite being beneficial for patients, controlling a BCI system is a learned skill that 15–30% of users cannot develop, even after several training sessions. This phenomenon, called "BCI inefficiency" [73], has been presented as one of the main limitations to a wider use of BCIs. From the machine learning perspective, the main challenges to overcome in current BCI paradigms relying on EEG recordings are: the low signal-to-noise ratio of the signals; the non-stationarity over time, mainly resulting from the difference between calibration and feedback sessions; the reduced amount of data available to train the classifier, explained by the number of classes to be discriminated and/or the need to avoid the subject's tiredness; and the lack of robustness and reliability of BCI systems, in particular when decoding the users' mental commands.
To tackle these challenges, efforts have been made to improve the classification algorithms. They can be divided into three main groups: adaptive classifiers, transfer learning techniques, and matrix- or tensor-based algorithms. Adaptive classifiers aim at dealing with EEG non-stationarity by taking into account changes in signal properties, and in feature distributions, over time. Their parameters are updated when new EEG signals become available [74]. Even though most adaptive classifiers rely on a supervised approach, the unsupervised one has proven to outperform classifiers that cannot capture temporal dynamics [75]. Besides, it can be a valuable tool to reduce the training duration and potentially to remove the calibration part. Nevertheless, adaptive classifiers present one main pitfall: their lack of online validation with a user in most of the current literature. This leads to two potential issues: the difficulty of finding a trade-off between fully retraining the classifier and updating some key parameters, and an adaptation that may not follow the actual user's intent by being too fast or too slow [76].
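The idea of an adaptive classifier can be illustrated with a toy nearest-mean model whose class means are updated exponentially as new trials arrive. This is a deliberately simplified sketch: the class structure, feature dimension, and learning rate are all illustrative, and classifiers actually used in BCIs are considerably richer:

```python
import numpy as np

class AdaptiveNearestMean:
    """Toy adaptive classifier: class means are updated as new trials arrive.

    `alpha` controls how fast the model tracks non-stationarity (an
    illustrative value; in practice it is tuned, and updates may be
    supervised or unsupervised).
    """

    def __init__(self, n_classes, n_features, alpha=0.1):
        self.means = np.zeros((n_classes, n_features))
        self.alpha = alpha

    def fit(self, X, y):
        for k in range(self.means.shape[0]):
            self.means[k] = X[y == k].mean(axis=0)
        return self

    def predict(self, X):
        # Assign each trial to the class with the nearest mean.
        d = np.linalg.norm(X[:, None, :] - self.means[None, :, :], axis=2)
        return d.argmin(axis=1)

    def update(self, x, k):
        """Exponential update of class k's mean with one new labeled trial x."""
        self.means[k] = (1 - self.alpha) * self.means[k] + self.alpha * x

# Hypothetical feature vectors for two well-separated classes.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.1, (50, 4)), rng.normal(1, 0.1, (50, 4))])
y = np.repeat([0, 1], 50)
clf = AdaptiveNearestMean(n_classes=2, n_features=4).fit(X, y)
clf.update(X[0], 0)  # incorporate a newly observed trial during feedback
```

The `update` step is the essential difference from a static classifier: the decision boundary drifts with the incoming data instead of remaining frozen after calibration.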
Transfer learning consists here in exploiting changes in EEG signal properties over time and subjects to extract knowledge. More specifically, it relies on classifiers that are trained on one task (called a domain here) and adapted to another task with little or no new training data [77]. For example, it can be applied to a dataset formed by two motor imagery tasks performed by two different subjects. There is a plethora of methods to solve the transfer learning problem [78]. The most common in the EEG-based BCI domain consist in learning a transformation that corrects the mismatch between the domains (occurring, for instance, when one domain corresponds to a hand motor imagery and the other to a foot motor imagery), finding a common feature representation for the domains, or learning a transformation of the data to make their
Box 2: To go further
² See, for example, the 6 competitions won by A. Barachant: http://alexandre.barachant.org/challenges/.
• Puce, A. and Hämäläinen, M. S. (2017). A Review of Issues Related to Data Acquisition and Analysis in EEG/MEG Studies. Brain Sci, 7(6).
7 Conclusion
EEG and MEG are key modalities for the study of brain disorders.
In particular, EEG is relatively cheap and widely available, and is thus a common tool in neurology. When dealing with EEG and
MEG data, it is important to understand the origin of the signals as
well as the different steps in their preprocessing and feature extrac-
tion. Machine learning is increasingly used on EEG and MEG data,
in particular for BCI but also for computer-aided diagnosis and
prognosis of brain disorders.
Acknowledgements
References
1. Seeck M, Koessler L, Bast T, Leijten F, Michel C, Baumgartner C, He B, Beniczky S (2017) The standardized EEG electrode array of the IFCN. Clin Neurophysiol 128(10):2070–2077. https://doi.org/10.1016/j.clinph.2017.06.254
2. Casson AJ (2019) Wearable EEG and beyond. Biomed Eng Lett 9(1):53–71. https://doi.org/10.1007/s13534-018-00093-6
3. Hämäläinen M, Hari R, Ilmoniemi RJ, Knuutila J, Lounasmaa OV (1993) Magnetoencephalography: theory, instrumentation, and applications to noninvasive studies of the working human brain. Rev Mod Phys 65(2):413–497. https://doi.org/10.1103/RevModPhys.65.413
4. Cohen D (1972) Magnetoencephalography: detection of the brain's electrical activity with a superconducting magnetometer. Science 175(4022):664–666
5. Vrba J, Robinson SE (2001) Signal processing in magnetoencephalography. Methods 25(2):249–271. https://doi.org/10.1006/meth.2001.1238
6. Corsi MC (2015) Magnétomètres à pompage optique à Hélium 4: développement et preuve de concept en magnétocardiographie et en magnétoencéphalographie. PhD thesis, Grenoble Alpes. http://www.theses.fr/2015GREAT082
23. Fuchs M, Wagner M, Köhler T, Wischmann HA (1999) Linear and nonlinear current density reconstructions. J Clin Neurophysiol 16(3):267–295
24. Lin FH, Witzel T, Ahlfors SP, Stufflebeam SM, Belliveau JW, Hämäläinen MS (2006) Assessing and improving the spatial accuracy in MEG source localization by depth-weighted minimum-norm estimates. NeuroImage 31(1):160–171. https://doi.org/10.1016/j.neuroimage.2005.11.054
25. Pascual-Marqui RD (2002) Standardized low-resolution brain electromagnetic tomography (sLORETA): technical details. Methods Find Exp Clin Pharmacol 24(Suppl D):5–12
26. Brodu N, Lotte F, Lécuyer A (2011) Comparative study of band-power extraction techniques for motor imagery classification. In: 2011 IEEE symposium on computational intelligence, cognitive algorithms, mind, and brain (CCMB), pp 1–6. https://doi.org/10.1109/CCMB.2011.5952105
27. Lotte F, Bougrain L, Cichocki A, Clerc M, Congedo M, Rakotomamonjy A, Yger F (2018) A review of classification algorithms for EEG-based brain-computer interfaces: a 10-year update. J Neural Eng 15(3):031005. https://doi.org/10.1088/1741-2552/aab2f2
28. McFarland DJ, McCane LM, David SV, Wolpaw JR (1997) Spatial filter selection for EEG-based communication. Electroencephalogr Clin Neurophysiol 103(3):386–394. https://doi.org/10.1016/S0013-4694(97)00022-2
29. Ramoser H, Muller-Gerking J, Pfurtscheller G (2000) Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Trans Rehabil Eng 8(4):441–446. https://doi.org/10.1109/86.895946
30. Rivet B, Souloumiac A, Attina V, Gibert G (2009) xDAWN algorithm to enhance evoked potentials: application to brain–computer interface. IEEE Trans Biomed Eng 56(8):2035–2043. https://doi.org/10.1109/TBME.2009.2012869
31. Ang KK, Chin ZY, Zhang H, Guan C (2008) Filter bank common spatial pattern (FBCSP) in brain-computer interface. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), pp 2390–2397. https://doi.org/10.1109/IJCNN.2008.4634130
32. Yger F, Berar M, Lotte F (2017) Riemannian approaches in brain-computer interfaces: a review. IEEE Trans Neural Syst Rehabil Eng 25(10):1753–1762. https://doi.org/10.1109/TNSRE.2016.2627016
33. Gonzalez-Astudillo J, Cattai T, Bassignana G, Corsi MC, De Vico Fallani F (2020) Network-based brain computer interfaces: principles and applications. J Neural Eng 18(1):011001. https://doi.org/10.1088/1741-2552/abc760
34. Bastos AM, Schoffelen JM (2016) A tutorial review of functional connectivity analysis methods and their interpretational pitfalls. Front Syst Neurosci 9:175. https://doi.org/10.3389/fnsys.2015.00175
35. Fornito A, Zalesky A, Bullmore E (2016) Fundamentals of brain network analysis, reprint edn. Academic Press, Amsterdam
36. Gramfort A, Luessi M, Larson E, Engemann DA, Strohmeier D, Brodbeck C, Parkkonen L, Hämäläinen MS (2014) MNE software for processing MEG and EEG data. NeuroImage 86:446–460. https://doi.org/10.1016/j.neuroimage.2013.10.027
37. Jayaram V, Barachant A (2018) MOABB: trustworthy algorithm benchmarking for BCIs. J Neural Eng 15(6):066011. https://doi.org/10.1088/1741-2552/aadea0
38. Delorme A, Makeig S (2004) EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J Neurosci Methods 134(1):9–21. https://doi.org/10.1016/j.jneumeth.2003.10.009
39. Oostenveld R, Fries P, Maris E, Schoffelen JM (2010) FieldTrip: open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci 2011:e156869. https://doi.org/10.1155/2011/156869
40. Tadel F, Baillet S, Mosher J, Pantazis D, Leahy R (2011) Brainstorm: a user-friendly application for MEG/EEG analysis. Comput Intell Neurosci 2011:1–13. https://doi.org/10.1155/2011/879716
41. Penny WD, Friston KJ, Ashburner JT, Kiebel SJ, Nichols TE (eds) (2006) Statistical parametric mapping: the analysis of functional brain images, illustrated edn. Academic Press, Amsterdam
42. Hari R, Baillet S, Barnes G, Burgess R, Forss N, Gross J, Hämäläinen M, Jensen O, Kakigi R,
58. Gaubert S, Raimondo F, Houot M, Corsi MC, Naccache L, Diego Sitt J, Hermann B, Oudiette D, Gagliardi G, Habert MO, Dubois B, De Vico Fallani F, Bakardjian H, Epelbaum S (2019) EEG evidence of compensatory mechanisms in preclinical Alzheimer's disease. Brain 142(7):2096–2112. https://doi.org/10.1093/brain/awz150
59. Kashihara K (2014) A brain-computer interface for potential non-verbal facial communication based on EEG signals related to specific emotions. Front Neurosci 8:244. https://doi.org/10.3389/fnins.2014.00244
60. King CE, Wang PT, Chui LA, Do AH, Nenadic Z (2013) Operation of a brain-computer interface walking simulator for individuals with spinal cord injury. J Neuroeng Rehabil 10:77. https://doi.org/10.1186/1743-0003-10-77
61. Feigin VL, Forouzanfar MH, Krishnamurthi R, Mensah GA, Connor M, Bennett DA, Moran AE, Sacco RL, Anderson L, Truelsen T, O'Donnell M, Venketasubramanian N, Barker-Collo S, Lawes CMM, Wang W, Shinohara Y, Witt E, Ezzati M, Naghavi M, Murray C (2014) Global and regional burden of stroke during 1990–2010: findings from the Global Burden of Disease Study 2010. Lancet 383(9913):245–254. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4181600/
62. Houwink A, Roorda LD, Smits W, Molenaar IW, Geurts AC (2011) Measuring upper limb capacity in patients after stroke: reliability and validity of the stroke upper limb capacity scale. Arch Phys Med Rehabil 92(9):1418–1422. https://doi.org/10.1016/j.apmr.2011.03.028
63. Kavanagh S, Knapp M, Patel A (1999) Costs and disability among stroke patients. J Public Health 21(4):385–394. https://doi.org/10.1093/pubmed/21.4.385
64. Haute Autorité de Santé (2012) Accident vasculaire cérébral: méthodes de rééducation de la fonction motrice chez l'adulte. Tech. rep. https://www.has-sante.fr/jcms/c_1334330/fr/accident-vasculaire-cerebral-methodes-de-reeducation-de-la-fonction-motrice-chez-l-adulte
65. Prasad G, Herman P, Coyle D, McDonough S, Crosbie J (2010) Applying a brain-computer interface to support motor imagery practice in people with stroke for upper limb recovery: a feasibility study. J Neuroeng Rehabil 7:60. https://doi.org/10.1186/1743-0003-7-60
66. Pfurtscheller G, Lopes da Silva FH (1999) Event-related EEG/MEG synchronization and desynchronization: basic principles. Clin Neurophysiol 110(11):1842–1857. https://doi.org/10.1016/S1388-2457(99)00141-8
67. Cervera MA, Soekadar SR, Ushiba J, Millán JdR, Liu M, Birbaumer N, Garipelli G (2018) Brain-computer interfaces for post-stroke motor rehabilitation: a meta-analysis. Ann Clin Transl Neurol 5(5):651–663. https://doi.org/10.1002/acn3.544
68. Sharma N, Pomeroy VM, Baron JC (2006) Motor imagery: a backdoor to the motor system after stroke? Stroke 37(7):1941–1952. https://doi.org/10.1161/01.STR.0000226902.43357.fc
69. Pichiorri F, Morone G, Petti M, Toppi J, Pisotta I, Molinari M, Paolucci S, Inghilleri M, Astolfi L, Cincotti F, Mattia D (2015) Brain-computer interface boosts motor imagery practice during stroke recovery. Ann Neurol 77(5):851–865. https://doi.org/10.1002/ana.24390
70. Sharma N, Baron JC, Rowe JB (2009) Motor imagery after stroke: relating outcome to motor network connectivity. Ann Neurol 66(5):604–616. https://doi.org/10.1002/ana.21810
71. Dubovik S, Pignat JM, Ptak R, Aboulafia T, Allet L, Gillabert N, Magnin C, Albert F, Momjian-Mayor I, Nahum L, Lascano AM, Michel CM, Schnider A, Guggisberg AG (2012) The behavioral significance of coherent resting-state oscillations after stroke. NeuroImage 61(1):249–257. https://doi.org/10.1016/j.neuroimage.2012.03.024
72. Mottaz A, Corbet T, Doganci N, Magnin C, Nicolo P, Schnider A, Guggisberg AG (2018) Modulating functional connectivity after stroke with neurofeedback: effect on motor deficits in a controlled cross-over study. NeuroImage Clin 20:336–346. https://doi.org/10.1016/j.nicl.2018.07.029
73. Allison BZ, Neuper C (2010) Could anyone use a BCI? In: Tan DS, Nijholt A (eds) Brain-computer interfaces, human-computer interaction series. Springer, London, pp 35–54. http://link.springer.com/chapter/10.1007/978-1-84996-272-8_3
74. Schlögl A, Vidaurre C, Müller KR (2010) Adaptive methods in BCI research—an
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 10
Abstract
Nowadays, generating omics data is a common activity for laboratories in biology. Experimental protocols
to prepare biological samples are well described, and technical platforms to generate omics data from these
samples are available in most research institutes. Furthermore, manufacturers constantly propose technical
improvements, simultaneously decreasing the cost of experiments and increasing the amount of omics data
obtained in a single experiment. In this context, biologists are facing the challenge of dealing with large
omics datasets, also called “big data” or “data deluge.” Working with omics data raises issues usually
handled by computer scientists, and thus cooperation between biologists and computer scientists has
become essential to efficiently study cellular mechanisms in their entirety, as omics data promise. In this
chapter, we define omics data, explain how they are produced, and, finally, present some of their applications
in fundamental and medical research.
Key words Genomics, Transcriptomics, Proteomics, Metabolomics, Big data, Computer science,
Bioinformatics
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_10,
© The Author(s) 2023
313
314 Thibault Poinsignon et al.
Fig. 1 The four main -omes and an analogy of their functions. The genome designates all of a cell's DNA molecules. The transcriptome, the proteome, and the metabolome refer, respectively, to the cell's whole set of RNAs, proteins, and metabolites at a given time
Omics Data 315
2.1 Results from High-Throughput Studies Written in Multiple Binary and Text Files

To describe the files used to store omics information, it is necessary to consider genomics and transcriptomics on one side and proteomics and metabolomics on the other side. Indeed, these files are generated by different experimental techniques, which are, respectively, sequencing (for genomics and transcriptomics) and mass spectrometry (for proteomics and metabolomics) (see Fig. 2). For each group, two types of files must be distinguished: the ones that are directly obtained after the application of experimental protocols, i.e., the raw omics data files, and the ones that are generated by downstream informatic analyses, i.e., the processed omics data files (see Fig. 2). Experimental protocols and the informatic treatments applied to raw data files will be detailed in the next section.
Genomics and transcriptomics raw data files are essentially
nucleotide sequence files. In that respect, the FASTA and the
FASTQ text formats are commonly used. FASTA was created by
Lipman and Pearson in 1985 as an input for their software [2] and
became a de facto standard, without any clear statement acknowl-
edging it [3]. This probably explains the absence of a common file
extension (e.g., .fasta, .fna, .faa) even if FASTA is a unified file type.
FASTA files contain one or several sequences. A sequence begins
with a description line starting with the character “>”. NCBI
databases (see next sections) have unified rules to write this line.1
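The layout just described (a ">" description line followed by one or more sequence lines) is simple enough that a minimal parser fits in a few lines of Python. The sketch below is illustrative only, and the two records in it are invented:

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (description, sequence) pairs.

    Each record starts with a '>' description line; the sequence may
    span several lines, which are concatenated into one string.
    """
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records

# Two made-up records for illustration:
example = """>seq1 hypothetical example
ATGCGT
ACGT
>seq2 another example
GGGTTT"""
print(parse_fasta(example))
# [('seq1 hypothetical example', 'ATGCGTACGT'), ('seq2 another example', 'GGGTTT')]
```

Real-world parsing (e.g., of NCBI files) is usually delegated to established libraries, but the format itself is no more complicated than this.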
Fig. 2 Omics data are assessments of -ome populations. Raw omics data are generated through sequencing
(for DNA and cDNA) or mass spectrometry (for proteins and metabolites)
1 https://www.ncbi.nlm.nih.gov/genbank/fastaformat/
2 Sequence Alignment/Map Format Specification
3 HUPO Proteomics Standards Initiative
4 mzML 1.1.0 Specification | HUPO Proteomics Standards Initiative
5 mzIdentML | HUPO Proteomics Standards Initiative
6 mzTab Specifications | HUPO Proteomics Standards Initiative
Note that many other file formats exist. One of the most critical
for omics data analyses concerns the annotations of features on a
DNA, RNA, or protein sequence. They are shared through the
General Feature Format (GFF7), a text file with nine tab-separated fields: sequence, source of the annotation, feature, start of the feature on the sequence, end of the feature, score, strand, phase, and attributes.
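Given this fixed nine-field layout, a GFF line can be parsed directly. The sketch below uses a made-up annotation line, not one taken from a real genome:

```python
# The nine tab-separated GFF fields, in order:
GFF_FIELDS = ["seqid", "source", "feature", "start", "end",
              "score", "strand", "phase", "attributes"]

def parse_gff_line(line):
    """Parse one (non-comment) GFF line into a dict of its nine fields,
    converting the start/end coordinates to integers."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(GFF_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record

# A made-up example line describing a gene annotation:
line = "chr1\texample_db\tgene\t1300\t9000\t.\t+\t.\tID=gene0001;Name=hypotheticalGene"
record = parse_gff_line(line)
print(record["feature"], record["start"], record["end"])
# gene 1300 9000
```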
2.2 Results from High-Throughput Studies Shared Through Multiple Public Databases

The set of public biological databases hosting omics data is large and constantly evolving. Omics terminology started being regularly used in the 2000s. Between 1991 and 2016 (25 years), more than 1500 “molecular biology” databases were presented in publications, with a proliferation rate of more than 100 new databases each year [12]. These numbers are only the visible part of existing databases. How many have been created without being published?
The creation of around 500 of those databases roughly coincided with the appearance of the World Wide Web, the very Internet application that made online databases possible. The availability of molecular biology databases decreased by only 3.8% per year from 2001 to 2016 [12]. This shows a sustained motivation from the community to create and maintain public platforms to share data. But it
also highlights that this motivation comes more from a shared need
for easy access to data rather than a supervised effort to coordinate
approaches and unify sources. Such efforts indeed exist, for exam-
ple, the ELIXIR project started in 2013 as an effort to unify all
European centers and core bioinformatics resources into a single,
coordinated infrastructure [13]. This effort notably produced the ELIXIR Core Data Resources (created in 2017), a set of selected European databases meeting defined requirements, and the website “bio.tools”, a comprehensive registry of available software programs and bioinformatics tools. The US National Center for Biotechnology Information (NCBI8) databases are also a main reference.
Given the “raw” nature of omics datasets, they are stored in archive data repositories: the raw data underlying scientific articles, shared in databases that are easily accessible for reproducibility. Except for the Sequence Read Archive (SRA), the databases cited here are mixed: they host both raw archive data and knowledge extracted from them. For genomics datasets, the NCBI database Genome [14] and the EMBL-EBI (member of ELIXIR) database Ensembl [15] are references. They organize genome sequences together with annotations and include sequence comparison and visual exploration tools. Transcriptomics data can be deposited into several databases,
like Gene Expression Omnibus (GEO) [16] initially dedicated to
microarray datasets, which is structured into samples forming
7 GFF/GTF File Format
8 NCBI
9 https://zenodo.org/
10 https://www.go-fair.org/fair-principles/
11 NCBI Insights: The wait is over... NIH’s Public Sequence Read Archive is now open access on the cloud
Fig. 3 Illumina and MinION sequencing technologies. Illumina is a sequencing by synthesis technology that
allows massive parallel sequencing of small DNA molecules. MinION is a nanopore-based technology that
allows the sequencing of longer DNA molecules
3.2 Mass Spectrometry Technologies

Since the first use of a mass spectrometer for protein sequencing in 1966 by Biemann,12 improvements in mass spectrometers have been closely linked to the development of proteomics and metabolomics [40]. Metabolites and proteins cannot be read as templates like DNA or RNA, and so they can be neither amplified nor sequenced by synthesis. To access their sequence, the main tool is the mass spectrometer. In the classical bottom-up approach, proteins are
digested into small peptides, which pass through a chromatography
column. They are then sequentially sprayed as ions into the spec-
trometer. Migration through the spectrometer allows separation of
the peptides according to their mass-to-charge ratio. For each
fraction exiting the column, an abundance is calculated. In a data-dependent acquisition (DDA), a few peptides with an intensity above a given threshold are isolated one at a time. They are fragmented, and additional spectra (mass-to-charge ratio and
are fragmented, and additional spectra (mass-to-charge ratio and
intensity) are generated for each fragmented ion. In a data-
independent acquisition (DIA), a spectrum is generated for all
fractions coming out of the chromatography column. Obtained
spectra are a combination of spectra corresponding to each peptide
present in each original fraction. Comparison with a peptide spec-
trum library generated in silico is therefore required to allow the
deconvolution of those complex spectra. All this information
(abundances in fractions, mass-to-charge ratios, intensities) is
stored into .raw files, which can only be read by dedicated software
(see Subheading 2.1).
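The DDA selection step described above (keep only peptides above an intensity threshold, then fragment the most intense ones) can be sketched in a few lines. The (m/z, intensity) pairs below are made up for illustration and do not come from a real spectrum:

```python
def select_precursors_dda(peaks, intensity_threshold, top_n=3):
    """Mimic DDA precursor selection: keep peaks above an intensity
    threshold, then pick the top_n most intense ones for fragmentation.

    `peaks` is a list of (mass_to_charge, intensity) tuples.
    """
    candidates = [p for p in peaks if p[1] > intensity_threshold]
    candidates.sort(key=lambda p: p[1], reverse=True)
    return candidates[:top_n]

# Hypothetical MS1 scan: (m/z, intensity) pairs
scan = [(445.12, 8.0e5), (512.30, 2.1e6), (633.85, 4.0e4),
        (702.41, 9.5e5), (881.07, 1.2e6)]
print(select_precursors_dda(scan, intensity_threshold=1.0e5))
# [(512.3, 2100000.0), (881.07, 1200000.0), (702.41, 950000.0)]
```

Real acquisition software also applies dynamic exclusion and charge-state filters, which this toy sketch omits.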
3.3 Single-Cell Strategies

Most omics experiments are bulk experiments: they are an average measure over a population of cells that is more or less homogeneous. Single-cell omics allow a more precise measurement,
highlighting the plasticity of the cell system. Single-cell techniques
started with manual separation of a single cell under a microscope in
2009 [41] and quickly evolved toward techniques allowing the
12 HUPO—Proteomics Timeline
13 Augustus/ABOUT.md at master
14 https://www.cancer.gov/tcga
5 Conclusion
Acknowledgments
References
1. Hasin Y, Seldin M, Lusis A (2017) Multi-omics approaches to disease. Genome Biol 18:83
2. Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227:1435–1441
3. Cock PJA, Fields CJ, Goto N et al (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38:1767–1771
4. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079
5. Danecek P, Bonfield JK, Liddle J et al (2021) Twelve years of SAMtools and BCFtools. GigaScience 10:giab008
6. Hsi-Yang Fritz M, Leinonen R, Cochrane G et al (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 21:734–740
7. Deutsch EW (2012) File formats commonly used in mass spectrometry proteomics. Mol Cell Proteomics 11:1612–1621
8. Deutsch EW, Lane L, Overall CM et al (2019) Human proteome project mass spectrometry data interpretation guidelines 3.0. J Proteome Res 18:4108–4116
9. Martens L, Chambers M, Sturm M et al (2011) mzML—a community standard for mass spectrometry data. Mol Cell Proteomics 10:R110.000133
10. Ma B, Zhang K, Hendrie C et al (2003) PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 17:2337–2342
11. Välikangas T, Suomi T, Elo LL (2017) A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation. Brief Bioinform
12. Imker HJ (2018) 25 years of molecular biology databases: a study of proliferation, impact, and maintenance. Front Res Metr Anal 3:18
13. Harrow J, Drysdale R, Smith A et al (2021) ELIXIR: providing a sustainable infrastructure for life science data at European scale. Bioinformatics 37:2506–2511
14. Benson DA, Karsch-Mizrachi I, Lipman DJ et al (2010) GenBank. Nucleic Acids Res 38:D46–D51
15. Howe KL, Achuthan P, Allen J et al (2021) Ensembl 2021. Nucleic Acids Res 49:D884–D891
16. Barrett T, Wilhite SE, Ledoux P et al (2012) NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res 41:D991–D995
17. Leinonen R, Sugawara H, Shumway M et al (2011) The sequence read archive. Nucleic Acids Res 39:D19–D21
18. Perez-Riverol Y, Csordas A, Bai J et al (2019) The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res 47:D442–D450
19. Haug K, Cochrane K, Nainala VC et al (2019) MetaboLights: a resource evolving in response to the needs of its scientific community. Nucleic Acids Res gkz1019
20. Rigden DJ, Fernández XM (2021) The 2021 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res 49:D1–D9
21. Lan Y, Sun R, Ouyang J et al (2021) AtMAD: Arabidopsis thaliana multi-omics association database. Nucleic Acids Res 49:D1445–D1451
22. Aging Atlas Consortium, Liu G-H, Bao Y et al (2021) Aging Atlas: a multi-omics database for aging biology. Nucleic Acids Res 49:D825–D830
23. Karger BL, Guttman A (2009) DNA sequencing by CE. Electrophoresis 30:S196–S202
24. International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931–945
25. Schena M, Shalon D, Davis RW et al (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470
26. Gershon D (1997) Bioinformatics in a post-genomics age. Nature 389:417–418
27. Spellman PT, Sherlock G, Zhang MQ et al (1998) Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9:25
28. DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM (1996) Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet 14(4):457–460
29. Metzker ML (2010) Sequencing technologies — the next generation. Nat Rev Genet 11:31–46
30. Deamer D, Akeson M, Branton D (2016) Three decades of nanopore sequencing. Nat Biotechnol 34:518–524
31. Dobin A, Davis CA, Schlesinger F et al (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21
32. Kim D, Pertea G, Trapnell C et al (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14:R36
33. Patro R, Duggal G, Love MI et al (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417–419
34. Liao Y, Smyth GK, Shi W (2014) featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30:923–930
35. Trapnell C, Roberts A, Goff L et al (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7:562–578
36. Teng M, Love MI, Davis CA et al (2016) A benchmark for RNA-seq quantification pipelines. Genome Biol 17:74
37. The RGASP Consortium, Engström PG, Steijger T et al (2013) Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods 10:1185–1191
38. Schaarschmidt S, Fischer A, Zuther E et al (2020) Evaluation of seven different RNA-Seq alignment tools based on experimental data from the model plant Arabidopsis thaliana. Int J Mol Sci 21:1720
39. Baruzzo G, Hayer KE, Kim EJ et al (2017) Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat Methods 14:135–139
40. Biemann K, Tsunakawa S, Sonnenbichler J et al (1966) Structure of an odd nucleoside from serine-specific transfer ribonucleic acid. Angew Chem Int Ed Engl 5:590–591
41. Tang F, Barbacioru C, Wang Y et al (2009) mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods 6:377–382
42. Svensson V, Vento-Tormo R, Teichmann SA (2018) Exponential scaling of single-cell RNA-seq in the past decade. Nat Protoc 13:599–604
43. Macosko EZ, Basu A, Satija R et al (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161:1202–1214
44. Zheng GXY, Terry JM, Belgrader P et al (2017) Massively parallel digital transcriptional profiling of single cells. Nat Commun 8:14049
45. Stein CM, Weiskirchen R, Damm F et al (2021) Single-cell omics: overview, analysis, and application in biomedical science. J Cell Biochem 122:1571–1578
46. Jehan Z (2019) Single-cell omics: an overview. In: Single-cell omics. Elsevier, pp 3–19
47. Kelly RT (2020) Single-cell proteomics: progress and prospects. Mol Cell Proteomics 19:1739–1748
48. Stanke M, Morgenstern B (2005) AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res 33:W465–W467
49. Quest for Orthologs consortium, Altenhoff AM, Boeckmann B et al (2016) Standardized benchmarking in the quest for orthologs. Nat Methods 13:425–430
50. Yuan Y, Zhang Y, Zhang P et al (2021) Comparative genomics provides insights into the aquatic adaptations of mammals. Proc Natl Acad Sci 118:e2106080118
51. Van den Berge K, Hembach KM, Soneson C, Tiberi S, Clement L, Love M, Patro R, Robinson MD (2019) RNA sequencing data: Hitchhiker’s guide to expression analysis. Annu Rev Biomed Data Sci 2:139–173
52. Chen G, Shi T, Shi L (2017) Characterizing and annotating the genome using RNA-seq data. Sci China Life Sci 60:116–125
53. Arivaradarajan P, Misra G (eds) (2018) Omics approaches, technologies and applications: integrative approaches for understanding OMICS data. Springer Singapore, Singapore
54. Veenstra TD (2021) Omics in systems biology: current progress and future outlook. Proteomics 21:2000235
55. Tam V, Patel N, Turcotte M et al (2019) Benefits and limitations of genome-wide association studies. Nat Rev Genet 20:467–484
56. Uitterlinden A (2016) An introduction to genome-wide association studies: GWAS for dummies. Semin Reprod Med 34:196–204
57. Hampel H, Nisticò R, Seyfried NT et al (2021) Omics sciences for systems biology in Alzheimer’s disease: state-of-the-art of the evidence. Ageing Res Rev 69:101346
58. Kohler BA, Sherman RL, Howlader N et al (2015) Annual report to the nation on the status of cancer, 1975–2011, featuring incidence of breast cancer subtypes by race/ethnicity, poverty, and state. JNCI J Natl Cancer Inst 107
59. Liu J, Adhav R, Xu X (2017) Current progresses of single cell DNA sequencing in breast cancer research. Int J Biol Sci 13:949–960
60. Zhang J, Bajari R, Andric D et al (2019) The International Cancer Genome Consortium data portal. Nat Biotechnol 37:367–369
61. Overmyer KA, Shishkova E, Miller IJ et al (2021) Large-scale multi-omic analysis of COVID-19 severity. Cell Syst 12:23–40.e7
62. Naumova OY, Lee M, Rychkov SY et al (2013) Gene expression in the human brain: the current state of the study of specificity and spatiotemporal dynamics. Child Dev 84:76–88
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 11
Abstract
Electronic health records (EHRs) are the collection of all digitized information regarding an individual’s health. EHRs are not only the base for storing clinical information for archival purposes, but they are also the bedrock on which clinical research and data science thrive. In this chapter, we describe the main aspects of good-quality EHR systems and some of the standard practices in their implementation, and then conclude with details and reflections on their governance and private management.
Key words Electronic health records, Data science, Machine learning, Data quality, Coding schemes,
SNOMED-CT, ICD, UMLS, Data governance, GDPR
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_11,
© The Author(s) 2023
331
332 Wenjuan Wang et al.
Primary Care Database (NIVEL-PCD) [6, 7]. Belgium also has its
Intego Network [7, 8]. France has the Système National des Données
de Santé [9, 10] and the data warehouse of Assistance Publique-
Hôpitaux de Paris (AP-HP) [11]. Sweden has numerous and
extensive nationwide registries [12]. These databases provide valu-
able information about the use of health services and developments
in population health. In the United States, there has not been a
tradition of using routine anonymized data, largely because the
Health Insurance Portability and Accountability Act (HIPAA) reg-
ulations place restrictions on the linkage of health data from differ-
ent sources without consent [13–15] and because small office
practices have not been widely computerized. Instead, the focus
has been mainly on secondary care (hospital) data, facilitated by the
National Institutes of Health’s (NIH) Clinical Translational Sci-
ence Awards (CTSA) [16]. Use or reuse of administrative data for
research purposes is becoming more restricted in Europe as well,
partly as a consequence of the European General Data Protection
Regulation (GDPR) that was established in 2016 [17, 18]. In addi-
tion, data owners increasingly want control over the use of their
data, making it more difficult to construct large centralized
databases.
2.1 Data Accuracy

Data accuracy can be conceptualized as how accurate or truthful the
data captured through the EHR system is. In other words, it is the
degree to which the value in the EHR is a true representation of the
real-world value [20, 23, 24] (e.g., whether a medication list accu-
rately reflects the number, dose, and specific drugs a patient is
currently taking [21]). A pilot study evaluated information accuracy
in a primary care setting in Australia and confirmed that errors and
inaccuracies exist in EHRs [26]. This pilot study showed that high
levels of accuracy were found in the area of demographic informa-
tion and moderately high levels of accuracy were reported for
allergies and medications. A considerable percentage of
non-recorded information was also present. Sources of data inaccuracy include mistakes made by clinicians (e.g., improper use of the “cut and paste” function in electronic systems [27]) and errors, loss, or destruction of data during a transfer [27]. Ways to improve data accuracy at collection include avoiding
EHR pitfalls (e.g., fine-tuning preference lists, being careful when
copying data, modifying templates as needed, documenting what
was done, etc.) and being proactive (e.g., conducting regular inter-
nal audits, training staff, maintaining a compliance folder, etc.).
Data accuracy can be assessed via different approaches [20]. One can compare a given variable within the dataset to other variables, which is referred to as internal validity, e.g., using medication to confirm the status of the disease. Internal validation can also be done by looking for unrealistic values (e.g., a blood pressure that is too high or too low [28]), which can be checked by identifying outliers. One can also use different data sources or datasets to cross-check data accuracy, which is referred to as external validity, e.g., a patient was registered in a stroke registry but recorded as not
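Range-based internal validation of the kind just described can be sketched in a few lines. The plausibility bounds and blood-pressure values below are illustrative assumptions, not clinical standards:

```python
def flag_unrealistic(values, low, high):
    """Internal validation sketch: return the indices of values that are
    missing or fall outside a plausible physiological range."""
    return [i for i, v in enumerate(values)
            if v is None or not (low <= v <= high)]

# Hypothetical systolic blood pressure readings (mmHg); the plausible
# range used here is an illustrative assumption, not a clinical rule.
systolic = [118, 132, 460, 95, None, 12]
print(flag_unrealistic(systolic, low=60, high=260))
# [2, 4, 5]
```

Flagged records would then be reviewed or cross-checked against another source rather than silently corrected.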
2.2 Data Completeness

Data completeness refers to the degree and nature of the absence of certain data fields for certain variables or participants. Generally, these absent values are called missing data. Missing data
is very common in all kinds of studies, which can limit the outcomes
to be studied, the number of explanatory factors considered, and
even the size of the population included [28] and thus reduce the
statistical power of a study and produce biased estimates, leading to
invalid conclusions [29]. Data may be missing due to a variety of
reasons. Some data might not be collected due to the design of the
study. For example, in some questionnaires, certain questions are
only for females to answer which leads to a blank for males for that
question. Some data may be missing simply because of the break-
down of certain machines at a certain time. Data can also be missing
because the participant did not want to answer. Some data might be
missing due to mistakes during data collection or data entry. Thus,
knowing how and why the data are missing is important for
subsequent handling and for analyzing the mechanism underlying
missing data.
Depending on the underlying reason, missing data can be
categorized into three types [30] (Fig. 1): missing completely at
random (MCAR), missing at random (MAR), and missing not at
random (MNAR). Data is MCAR when its missingness is related neither to any other variable nor to the variable itself. Examples of
MCAR are failures in recording observations due to random fail-
ures with experimental instruments. The reasons for its absence are
normally external and not related to the observations themselves.
For MCAR, it is typically safe to remove observations with missing
values. The results will not be biased, but the test loses power as the number of cases is reduced. This assumption is unrealistic, however, and hardly ever holds in practice. For missing data that
are MAR, missingness is not random and can be related to the
observed data but not to the value of this given variable [31]. For
example, a male participant may be less likely to complete a survey
about depression severity than a female participant [32]. The data is
missing because of gender rather than because of the depression
severity itself. In this case, the results will be biased if we remove
patients with missing values as most completed observations are
Electronic Health Records 335
Fig. 1 Summary of missing-data mechanisms with definitions, examples, and acceptable approaches for handling the missing values
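The distinction between MCAR and MAR can be made concrete with a toy simulation, sketched below under entirely made-up probabilities: under MCAR every value has the same chance of being dropped, while under MAR the chance depends on another observed variable (gender, as in the depression-survey example) but not on the value itself.

```python
import random

def simulate_missingness(scores, genders, mechanism, seed=0):
    """Toy illustration of MCAR vs. MAR missingness.

    MCAR: every score is dropped with the same probability (0.3).
    MAR: the drop probability depends on another observed variable
    (0.5 for 'M', 0.1 for 'F') but not on the score itself.
    All probabilities are arbitrary choices for illustration.
    """
    rng = random.Random(seed)
    observed = []
    for score, gender in zip(scores, genders):
        if mechanism == "MCAR":
            p_missing = 0.3
        else:  # MAR
            p_missing = 0.5 if gender == "M" else 0.1
        observed.append(None if rng.random() < p_missing else score)
    return observed

# Hypothetical depression-severity scores and genders:
scores = [21, 35, 14, 28, 40, 19]
genders = ["M", "F", "M", "F", "M", "F"]
print(simulate_missingness(scores, genders, "MAR"))
# [21, 35, None, 28, 40, 19] with this seed
```

MNAR, by contrast, would make the drop probability depend on the score itself, which is why it cannot be diagnosed from the observed data alone.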
2.3 Other Challenges and General Practice Recommendations

There are other challenges in EHR data. For example, some data may be recorded without specifying units of measurement, which makes these data hard to interpret [28]. In this case, an understanding of the data collection process and background knowledge can
be helpful in interpreting the data. There might be inconsistencies
in data collection and coding across institutions and over time
[28]. Some inconsistencies can be easily identified from the data,
e.g., a measure was started to be recorded only after a certain time.
On the other hand, some inconsistencies may be hard to identify
and require an understanding of how data are collected geographi-
cally and over time. Last but not least, unstructured text data
residing in the EHR causes poor accessibility and other data quality
issues such as a lack of objectivity, consistency, or completeness
[28]. Data extraction techniques such as natural language proces-
sing (NLP) are being used to identify information directly from text
notes.
Quality data is the basis for a valid research outcome, and whether the quality is sufficient depends on the purpose of the study. Currently, there are no fixed criteria for deciding whether data quality is sufficient, but careful analysis of the data quality should help researchers decide if the data at hand are useful for the study [28]. Three general practices were
3.1 Motivation

Recording clinical data using free text and local terminology creates major barriers to conducting effective data analysis for health
research [43]. Clinical coding systems significantly alleviate this
problem, and so are of great usefulness to researchers and analysts
when carrying out such work. Medical concepts are naturally
described by linguistic terminology and are often associated with
a descriptive text. Linguistic data is however loosely structured, and
the same underlying medical concept might be expressed differ-
ently by different healthcare professionals. Clinical concepts can
usually be expressed in a multitude of ways, both due to synonyms
in individual terms and simply through different ways of combining
and arranging words into a description. Processing large amounts
of such data in order to perform modern computer-assisted data
analysis, such as training machine learning models, would therefore
require the use of natural language processing (NLP) techniques
3.2 Common Characteristics

Clinical coding systems can vary significantly in their descriptive scope, depending on their intended usage. The DSM-5 [45], for
instance, limits its scope entirely to psychiatric diagnoses, while
SNOMED-CT [46, 47] seeks to be as comprehensive as possible,
including concept codes relating to, for example, body structure,
physical objects, and environment. Both of these coding schemes
describe concepts relevant at the level of individual patients, though
codes can exist for broader or more fine-grained scopes such as
public health or microbiology.
Typically, clinical coding schemes are arranged hierarchically, as
this reflects the categorical relationship between clinical concepts
well while also providing an intuitive means to find relevant con-
cepts. This hierarchical structuring can be reflected in the identifiers
used to encode clinical concepts, further aiding in their compre-
hension. In the ICD scheme [48], for example, codes begin with a
character that identifies the relevant chapter in the ICD manual,
and subsequent characters provide identification of finer and finer
degrees of specification.
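This positional encoding can be exploited programmatically: truncating an ICD-10-style identifier yields increasingly general ancestor categories. The sketch below illustrates the idea only and is not an official ICD parsing routine:

```python
def icd10_ancestors(code):
    """Return the increasingly general prefixes of an ICD-10-style code.

    The first character identifies the chapter block, and each further
    character narrows the specification; dotted subcategories such as
    'I63.9' also include their three-character stem.
    """
    stem, _, subcategory = code.partition(".")
    prefixes = [stem[:i] for i in range(1, len(stem) + 1)]
    if subcategory:
        prefixes.append(code)
    return prefixes

print(icd10_ancestors("I63.9"))  # ['I', 'I6', 'I63', 'I63.9']
```

Grouping codes by such prefixes is a common way to aggregate fine-grained diagnoses into broader categories for analysis.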
Another useful way to classify a clinical coding system is by whether it is compositional or enumerative [49, Chapter 22]. In a compositional scheme, concepts can be encoded by combining more basic conceptual units together. This reduces the burden on scheme designers of specifying lists of distinct concepts large enough to comprehensively cover all necessary clinical concepts. This is in contrast to enumerative systems, which
instead aim to achieve completeness by having a unique identifier
for every concept within the scope of the scheme.
Clinical coding schemes can encode many kinds of relationships
between concepts that are more specific than the simple parent-
child relationship in basic hierarchies. These reflect the more
nuanced kinds of relationships present in clinical concepts. Coiera
[49, Chapter 22] outlines three main kinds of conceptual relation-
ships: Part-Whole, Is-A, and Causal. Part-Whole
3.3 Notable Coding Systems

Here we provide summaries of commonly used coding systems that are likely to be encountered when performing analysis on EHR data. However, this is by no means an exhaustive list. Many more
are in use, and some datasets or corpora might use their own coding
systems. In these cases, the data provider will usually specify map-
pings to more common systems such as ICD or SNOMED-CT. For
example, in the case of the Clinical Practice Research Datalink
(CPRD) [50], unique codes are provided for medical terms with
mappings to Read Codes (a now largely legacy coding system in the
United Kingdom), and unique treatment codes with links to the
NHS Dictionary of Medicines and Devices (dm+d) [51] and the
British National Formulary (BNF) [52], which provide codes relat-
ing specifically to medical products and prescribing.
Table 1
The top-level hierarchical categories in the SNOMED-CT system
Hierarchy
Body structure
Clinical finding
Event
Observable entity
Organism
Pharmaceutical/biologic product
Physical object
Procedure
Qualifier value
Situation with explicit context
Social context
Substance
Table 2
The chapters of ICD-10
Table 3
The hierarchical structure of multiple sclerosis within the ICD-11
3.3.3 UMLS

“The Unified Medical Language System (UMLS) is something like the Rosetta Stone of international terminologies”—Coiera [49, Chapter 23]
1 https://www.nlm.nih.gov/research/umls/index.html.
tools that rely on the UMLS, such as PubMed2 and other clinical
software systems such as EHR software and analysis pipelines. The
most common uses are for extracting clinical terminologies from
text and translating between coding systems [55].
3.3.4 Read Codes

Read Codes [56, 57] were used exclusively by the United Kingdom until 2018, when they were replaced by SNOMED-CT. Read
Codes are organized hierarchically; however, the identifiers them-
selves do not indicate where in a hierarchy the concept belongs as
they do in ICD. Version 3 (CTV-3) is the most recent version of
Read Codes, and introduced compositionality to the system, while
becoming less strictly hierarchical. Read Codes were intended to
provide digital operability in primary care settings, but are no
longer used in primary care in England (though they are still in
use in Scotland at the time of writing and may be used in secondary
care in England). Read Codes map well to ICD concepts. The Read
Codes Drug and Appliance Dictionary is an extension of the Read
Codes system to include pharmacological products, foods, and
medical appliances for use in EHR software and prescribing
systems.
3.4 Challenges and Limitations

The usefulness of clinical coding schemes is dependent upon their usage by healthcare professionals being thorough and appropriate.
Improper usage of coding systems can occur, contributing to data
quality issues such as incompleteness, inconsistency, and inaccuracy
[58]. Further challenges can arise for researchers where data may
contain multiple coding systems; this can happen if the data is
collected from multiple different sources where different coding
systems are in use, or if the period of data collection covers a change
in the preferred coding system, such as the change from CTV-3 to
SNOMED-CT in the United Kingdom. In these cases, the
researcher must ensure that they consider relevant concepts from
each different scheme or implement a mapping from one scheme to
another. Most coding schemes provide good mapping support to
ICD codes, and the UMLS coding system is designed to provide a
means of translating between different schemes. Additionally, some
sources of data may provide their own coding schemes that are not
in use (and thus not documented) elsewhere.
2 https://pubmed.ncbi.nlm.nih.gov/
Electronic Health Records 345
4.1 Data Protection in a Nutshell
The explosive evolution of digital technologies and of our ability to collect, store, and process data is dramatically changing how we should think about privacy and data protection. In particular, the advent of artificial intelligence (AI) and advanced mathematical modeling tools has made it necessary to reform national and international data protection and governance rules, to better protect the people who generate such data and to give them more control over what can be done with it. Although valuable independent contributions to healthcare data protection guidelines exist, such as the Goldacre Review [59, 60], we will focus mainly on the most recent and structured action published at the international level: the European General Data Protection Regulation, or GDPR [17].
The GDPR was adopted by the European Union in 2016 to set guidelines that all member states must apply in their national legislation regarding data protection. Although its legal validity is limited to members of the European Economic Area (EEA), its effects extend to European Union (EU) candidate countries and to the United Kingdom, which embraced the regulation through the UK GDPR [18] and kept it as part of its legislation even after renouncing EU membership. It is worth noting that the effects of the GDPR are not limited to data management and governance carried out within the countries that adopt the regulation; they follow the persons to whom the data belong. This means that the GDPR guidelines must be followed by any entity worldwide when dealing with data belonging to individuals from countries where the GDPR applies. The GDPR defines as personal data any piece of information that can be related to a person; Box 1 enumerates the three main agents required in any endeavor involving personal data management.
To contextualize these concepts in a healthcare scenario: if a non-European controller (e.g., an Australian hospital) aims at collecting, storing, or processing healthcare data from an individual protected by the GDPR or equivalent legislation, for example in an international multicenter clinical trial, it must still respect all requirements of the GDPR for those data specifically.
The GDPR reads: "personal data processing should be designed to serve mankind," and the right to the protection of such data "is not an absolute right" but "must be considered in relation to its function in society." Let us then consider this from two angles: data governance and operation, and the purpose of EHR data in the AI era.
4.1.1 Governance and Operation
One of the main requirements of the GDPR is that data should be as anonymized (or de-identified) and as minimal as possible for a given application. This means that the data controller shall specify in detail which data will be needed and why, and collect only those required data, anonymously where possible. Moreover, the data should be stored for as long as the application requires it but no longer, unless authorized by the data subject.
This process should minimize as much as possible the identifiability of individuals, especially in cases where the content of the data carries very sensitive information such as health status, religious faith, or political affiliation. Indeed, one of the main reasons why the use of free-text clinical notes in natural language processing (NLP) applications carries additional complications is that information that could identify individuals is often expressed in an unstructured way in text (e.g., a specific reference to a person's habits, rare diseases, physical aspect, etc.) [61]. A similar issue arises with imaging applications, where the content of a medical imaging examination could contain personal information about its owner (e.g., a name printed on an X-ray).
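A minimal sketch of why free text is troublesome: pattern-based scrubbing (the `scrub` helper and its regexes below are simplistic illustrations, not a production de-identifier) catches well-formatted identifiers such as phone numbers and dates, but unstructured clues — a habit, a rare disease, a physical description — pass straight through:

```python
import re

# Pattern-based scrubbing handles formatted identifiers, but prose
# clues slip through, which is why NLP de-identification is hard.
PATTERNS = [
    (re.compile(r"\b\d{3}[- ]\d{3}[- ]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\bMrs?\.?\s+[A-Z][a-z]+"), "[NAME]"),
]

def scrub(note):
    """Replace each matched identifier with a placeholder token."""
    for pattern, token in PATTERNS:
        note = pattern.sub(token, note)
    return note
```

A note such as "Mr Smith called 555-123-4567 on 3/4/2021" comes back fully tokenized, but "the retired lighthouse keeper with Fabry disease" would survive untouched, even though it may identify the patient just as surely.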
Focusing more closely on EHRs in their common tabular structure, identification of individuals can go beyond names and unique identifiers: if the combination of other fields can lead to identification (e.g., address, sex, physical characteristics, profession, etc.), then the EHR is not technically anonymized. A step forward is pseudo-anonymization, a process in which the identifiable information fields are replaced with artificially created alternatives that encode or encrypt this information without direct disclosure. It is important to note that, although this approach is valid in healthcare applications, it still allows a post hoc reconstruction of the identifiable data and should be implemented carefully. Note that, in the specific case of brain images, the medical image may in principle allow reidentification of the patient (for
image may in principle allow reidentification of the patient (for
instance, mainly through recognition of facial features such as
the nose). For this reason, “defacing” (a procedure that modifies
the image to remove facial features while preserving the content of
the brain) is increasingly used. According to the Health Insurance Portability and Accountability Act of 1996 (HIPAA)³, issued by the US Department of Health and Human Services, 18 elements have to be deleted for an electronic health record to be considered de-identified; these include names, geographic subdivisions smaller than a state, all elements of dates (except year) directly related to an individual, telephone numbers, social security numbers, and license numbers. This list can be used internationally as a rule of thumb to ensure appropriate anonymization in all healthcare-related applications.
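A common way to implement pseudo-anonymization is a keyed hash: identifiers are replaced by stable artificial values that still allow record linkage, while dates are coarsened to the year, in the spirit of the HIPAA element list. The field names and key handling below are illustrative assumptions, not a prescription:

```python
import hashlib
import hmac

SECRET_KEY = b"store-me-separately"  # held only by the data controller

def pseudonymize(patient_id):
    """Replace an identifier with a keyed hash: stable, so records can
    still be linked, but reversible only with access to the key —
    hence 'pseudo'-anonymization rather than true anonymization."""
    digest = hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

def generalize_record(record):
    """Drop direct identifiers and coarsen the birth date to its year."""
    return {
        "pid": pseudonymize(record["nhs_number"]),   # illustrative field
        "birth_year": record["date_of_birth"][:4],   # ISO date -> year
        "diagnosis": record["diagnosis"],
    }
```

Because the same input always yields the same pseudonym, longitudinal analyses remain possible; because the key can brute-force the mapping back, the key itself must be governed as carefully as the raw identifiers.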
With respect to the many stages that comprise the analysis and processing of healthcare data, data protection can be handled in different and more flexible ways. Assuming a high level of internal protection at healthcare institutions (e.g., firewalls and encrypted servers), as long as the data remain within the institution's secured information system, the majority of threats can be blocked and mitigated at the institutional level. Examples of such threats are malicious access to, and modification of, data with the objective of compromising individuals' health or disrupting the operation of the hospital itself. The main exposure arises when the data need to be transferred to another institution to carry out the required analyses. In this rather common case, the anonymization (or pseudo-anonymization) process should be carefully applied, and the data should never reside on a non-secured storage device or communication channel. To prevent this exposure while preserving the possibility of leveraging the collected data for AI applications and statistical studies, the federated learning methodology has been developed in recent years; it is described further in Subheading 4.2.
The data subject has the right to have their own data deleted by the controller when, for example, the accuracy of the data is contested by the data subject, or when the controller no longer needs the data for its purposes. Similarly, the data subject also has the right to receive their personal data from the controller in a commonly used, machine-readable format, and then to transfer such data to another controller in a direct way, when technically feasible. These aspects introduce operational constraints on EHR management, as records must be stored in an identifiable way (so as to allow their post hoc management, deletion, or transfer).
3 https://www.hhs.gov/hipaa/index.html
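The erasure and portability rights can be sketched as two simple operations over a record store; the `subject_id` field and the flat-list storage below are illustrative assumptions, with JSON standing in for "a commonly used and machine-readable format":

```python
import json

def export_subject(records, subject_id):
    """Portability: bundle one subject's data in a machine-readable
    format suitable for transfer to another controller."""
    mine = [r for r in records if r["subject_id"] == subject_id]
    return json.dumps({"subject_id": subject_id, "records": mine})

def erase_subject(records, subject_id):
    """Erasure: drop a subject's records once no longer needed or
    when the subject contests their accuracy."""
    return [r for r in records if r["subject_id"] != subject_id]
```

Both operations presuppose that records can still be looked up by subject, which is exactly the operational constraint noted above: fully anonymized storage would make honoring these rights impossible.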
4.1.2 The Purpose of EHR in the Era of AI
The main conundrum here is whether a specific use of healthcare data serves a societal benefit, a very difficult question given its highly subjective interpretation. Indeed, as we continue producing beneficial applications, the opportunities to develop malevolent ones increase: hostile actors may use private healthcare data and AI for personal profit, policy control, and other malicious ends. The availability of new tools suddenly sheds light on problems we did not know we had, and this is happening with AI and its application to healthcare. Machine learning and deep learning are by far the most successful technologies changing how we conceive the value of data and the importance of its quality [62]. For these computing tools, the more data, the better; but not only that: for each application, the data collected and processed should be as representative as possible of the learning task. This is a rather challenging requirement, considering the amount of human intervention in clinical data collection (especially in free-text annotations) and the inherent biases in the data distribution over the available population. Current regulations require data controllers to clearly communicate, and obtain the data subject's explicit agreement to, any use they may make of the data; this is a fundamental protection of each individual's right to choose when and where their data can be used. It matters particularly in healthcare scenarios, where misuse and abuse of patients' data can result in unethical advantage and/or enrichment for the institutions or individuals capable of making the most of such abundant data.
Ethical approval for the use of clinical datasets is usually granted by hospitals' ethics committees, through detailed processes that every study has to undergo in its design phase. However, with the increased focus on AI technologies in medicine, the challenge becomes to contextualize within these ethics frameworks the new technologies, the potential they carry, and the risks they may represent. An integrated approach between clinical experts and AI/ML specialists is therefore needed to give more transparency, cohesion, and consistency to the use of data in health research.
4.2 Privacy-Preserving EHR Data Analysis
Training any kind of AI-based predictive model requires as much data as possible, and given the nature of clinical data (costly, with high human intervention), a single healthcare institution is often not enough to produce the data needed to build a predictive model. This is particularly true when the distribution of the patient population within the hospital is not representative of the general population, or at least of the population of patients for whom the predictive models will be used.
The most straightforward practice to overcome this limitation consists in gathering data from multiple institutions in one single center and preprocessing them so as to integrate everything into one single training dataset. This unifies the contributions of all healthcare institutions and therefore yields a more comprehensive, heterogeneous, and representative training dataset. However, transferring clinical data from one hospital to another brings many privacy- and security-related problems, including the proper anonymization, or pseudo-anonymization, of clinical records and the encryption of the data en route to the other institution.
These technical difficulties can dominate the potential of a scalable, efficient, and secure data science pipeline that properly uses EHRs to extract new knowledge and train predictive models. One of the most elegant solutions to these problems was initially proposed by Google with the federated learning methodology [63]. In this approach, designed primarily for deep neural networks, instead of transferring the data between institutions and collecting everything into one unique dataset, the models to be trained are sent to every institution that participates in the federation; once one or more training steps have been executed, the trained models are gathered at one central computing node (which can be one of the institutions) and compiled into one comprehensive solution that represents the common knowledge produced.
Federated learning was designed for a task very different from clinical applications, namely the automatic completion of smartphone keyboards, but its principles translate to the healthcare environment very effectively. The main benefits are that clinical data never leave the owner's secured information system, and that anonymization and encryption of the data themselves cease to be major problems. Moreover, involving multiple centers in one training process requires a software infrastructure that can then be reused many more times for further learning tasks.
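The idea of federated averaging [63] can be sketched on a toy one-parameter least-squares model: each "site" computes a local gradient step on its own data, and only the resulting parameters travel to the aggregator, which averages them weighted by site size. This is a deliberately minimal illustration of the aggregation pattern, not the full deep-learning algorithm:

```python
# Sketch of federated averaging: each site trains locally; only model
# parameters travel, raw records never leave the site.
def local_step(weights, site_data, lr=0.1):
    """One gradient step of a 1-D least-squares model y ~ w*x,
    computed entirely inside the institution owning site_data."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in site_data) / len(site_data)
    return [w - lr * grad]

def federated_round(weights, sites):
    """Send the model out, train at each site, average weighted by size."""
    updates = [(local_step(weights, data), len(data)) for data in sites]
    n = sum(size for _, size in updates)
    return [sum(w[0] * size for w, size in updates) / n]

# Two toy "hospitals" whose data both follow y = 2x.
sites = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
weights = [0.0]
for _ in range(200):
    weights = federated_round(weights, sites)
```

In this toy run, the federation converges to w ≈ 2 without either site's records ever leaving it; with neural networks, `local_step` becomes several epochs of local training and `weights` the full parameter vector.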
4.3 Challenges Ahead
In the context of federated learning for EHR analysis, many challenges remain to be addressed, in terms of both data quality and governance and of learning methodologies.
5 Conclusion
Acknowledgements
References
1. CPRD (n.d.) Clinical practice research datalink. https://cprd.com/
2. QResearch (n.d.) QResearch. https://www.qresearch.org/
3. ResearchOne (n.d.) Transforming data into knowledge. http://www.researchone.org/
4. Alliance UHDR (2020) HDRUK innovation gateway — homepage. https://www.healthdatagateway.org/
5. Verheij R, van der Zee J (2018) Collecting information in general practice: "just by pressing a single button"? Morbidity, Performance and Quality in Primary Care, pp 265–272. https://doi.org/10.1201/9781315383248-36
6. Nivel (n.d.) Nivel primary care database. https://www.nivel.nl/en/nivel-zorgregistraties-eerste-lijn/nivel-primary-care-database
7. Schweikardt C, Verheij RA, Donker GA, Coppieters Y (2016) The historical development of the Dutch sentinel general practice network from a paper-based into a digital primary care monitoring system. J Public Health (Germany) 24:545–562. https://doi.org/10.1007/S10389-016-0753-4/TABLES/3
8. Bartholomeeusen S, Kim CY, Mertens R, Faes C, Buntinx F (2005) The denominator in general practice, a new approach from the intego database. Fam Pract 22:442–447. https://doi.org/10.1093/FAMPRA/CMI054
9. SNDS (n.d.) Système national des données de santé. https://www.bordeauxpharmacoepi.eu/en/snds-presentation/
10. Bezin J, Duong M, Lassalle R, Droz C, Pariente A, Blin P, Moore N (2017) The national healthcare system claims databases in France, SNIIRAM and EGB: powerful tools for pharmacoepidemiology. Pharmacoepidemiol Drug Saf 26(8):954–962
11. Daniel C, Salamanca E (2020) Hospital databases: AP-HP data warehouse. In: Nordlinger B, Villani C, Rus D (eds) Healthcare and artificial intelligence. Springer, Berlin, pp 57–67
12. Ludvigsson JF, Almqvist C, Bonamy AKE, Ljung R, Michaëlsson K, Neovius M, Stephansson O, Ye W (2016) Registers of the Swedish total population and their use in medical research. Eur J Epidemiol 31(2):125–136
13. Serda M (2013) Synteza i aktywność biologiczna nowych analogów tiosemikarbazonowych chelatorów żelaza
14. Gliklich RE, Dreyer NA, Leavy MB (2014) Registries for evaluating patient outcomes. AHRQ Publication 1:669. https://www.ncbi.nlm.nih.gov/books/NBK208616/
15. Fleurence RL, Beal AC, Sheridan SE, Johnson LB, Selby JV (2017) Patient-powered research networks aim to improve patient care and health research. Health Aff 33(7):1212–1219. https://doi.org/10.1377/HLTHAFF.2014.0113
16. CTSA (n.d.) CTSA Central. http://www.ctsacentral.org/
17. GDPR (2016) EU General Data Protection Regulation. http://data.europa.eu/eli/reg/2016/679/oj
18. UK GDPR (2018) UK General Data Protection Regulation updated for Brexit — UK GDPR. https://uk-gdpr.org/
19. Fundation TM (2006) Background issues on data quality. In: The connecting for health common framework. https://bok.ahima.org/PdfView?oid=63654
20. Feder SL (2018) Data quality in electronic health records research: quality domains and assessment methods. West J Nurs Res 40(5):753–766. https://doi.org/10.1177/0193945916689084
21. Chan KS, Fowles JB, Weiner JP (2010) Review: electronic health records and the reliability and validity of quality measures: a review of the literature. Med Care Res Rev 67(5):503–527. https://doi.org/10.1177/1077558709359007
22. Kahn M, Raebel M, Glanz J, Riedlinger K, Steiner J (2012) A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care 50(Suppl):S21–9. https://doi.org/10.1097/MLR.0b013e318257dd67
23. Wand Y, Wang RY (1996) Anchoring data quality dimensions in ontological foundations. Commun ACM 39(11):86–95. https://doi.org/10.1145/240455.240479
24. Weiskopf NG, Weng C (2013) Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc 20(1):144–151. https://doi.org/10.1136/amiajnl-2011-000681
25. Ahmad F, Rasmussen L, Persell S, Richardson J, Liss D, Kenly P, Chung I, French D, Walunas T, Schriever A, Kho A (2019) Challenges to electronic clinical quality measurement using third-party platforms in primary care practices: the healthy hearts in the heartland experience. JAMIA Open 2(4):423–428. https://doi.org/10.1093/jamiaopen/ooz038
26. Tse J, You W (2011) How accurate is the electronic health record?—a pilot study evaluating information accuracy in a primary care setting. Stud Health Technol Inform 168:158–64
27. Ozair F, Nayer J, Sharma A, Aggarwal P (2015) Ethical issues in electronic health records: a general overview. Perspect Clin Res 6:73–6. https://doi.org/10.4103/2229-3485.153997
28. Bayley K, Belnap T, Savitz L, Masica A, Shah N, Fleming N (2013) Challenges in using electronic health record data for CER: experience of 4 learning organizations and solutions applied. Med Care 51:S80–S86. https://doi.org/10.1097/MLR.0b013e31829b1d48
29. Hyun K (2013) The prevention and handling of the missing data. Korean J Anesthesiol 64(5):402–406. https://doi.org/10.4097/kjae.2013.64.5.402
30. Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592. http://www.jstor.org/stable/2335739
31. Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR (2009) Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 338. https://doi.org/10.1136/bmj.b2393
32. Smith WG (2008) Does gender influence online survey participation? A record-linkage analysis of university faculty online survey response behavior. Online Submission
33. Little R, Rubin D (2002) Statistical analysis with missing data. Wiley series in probability and mathematical statistics. Wiley, London
34. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Methodol 39(1):1–38. http://www.jstor.org/stable/2984875
35. Dziura JD, Post LA, Zhao Q, Fu Z, Peduzzi P (2013) Strategies for dealing with missing data in clinical trials: from design to analysis. Yale J Biol Med 86(3):343–358. https://europepmc.org/articles/PMC3767219
36. Jakobsen JC, Gluud C, Wetterslev J, Winkel P (2017) When and how should multiple imputation be used for handling missing data in randomised clinical trials—a practical guide with flowcharts. BMC Med Res Methodol 17(1):1–10
37. Zhang Y, Flórez ID, Lozano LEC, Aloweni FAB, Kennedy SA, Li A, Craigie SM, Zhang S, Agarwal A, Lopes LC, Devji T, Wiercioch W, Riva JJ, Wang M, Jin X, Fei Y, Alexander PE, Morgano GP, Zhang Y, Carrasco-Labra A, Kahale LA, Akl EA, Schünemann HJ, Thabane L, Guyatt GH (2017) A systematic survey on reporting and methods for handling missing participant data for continuous outcomes in randomized controlled trials. J Clin Epidemiol 88:57–66
38. Jørgensen AW, Lundstrøm LH, Wetterslev J, Astrup A, Gøtzsche PC (2014) Comparison of results from different imputation techniques for missing data from an anti-obesity drug trial. PLoS One 9(11):1–7. https://doi.org/10.1371/journal.pone.0111964
39. Sinharay S, Stern H, Russell D (2001) The use of multiple imputation for the analysis of missing data. Psychol Methods 6:317–29. https://doi.org/10.1037/1082-989X.6.4.317
40. Azur M, Stuart E, Frangakis C, Leaf P (2011) Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res 20:40–9. https://doi.org/10.1002/mpr.329
41. Horton NJ, Lipsitz SR (2001) Multiple imputation in practice: comparison of software packages for regression models with missing variables. Am Stat 55(3):244–254. http://www.jstor.org/stable/2685809
42. Little RJA, Wang Y (1996) Pattern-mixture models for multivariate incomplete data with covariates. Biometrics 52(1):98–111. http://www.jstor.org/stable/2533148
43. Elkin PL, Trusko BE, Koppel R, Speroff T, Mohrer D, Sakji S, Gurewitz I, Tuttle M, Brown SH (2010) Secondary use of clinical data. Stud Health Technol Inform 155:14–29
44. Koleck TA, Dreisbach C, Bourne PE, Bakken S (2019) Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. J Am Med Inform Assoc 26(4):364–379. https://doi.org/10.1093/jamia/ocy173
45. Association AP (ed) (2013) Diagnostic and statistical manual of mental disorders: DSM-5, 5th edn. American Psychiatric Association, Arlington, VA. OCLC:830807378
46. SNOMED International (2022) SNOMED CT. https://www.nlm.nih.gov/healthit/snomedct/index.html. Publisher: U.S. National Library of Medicine
47. Lee D, de Keizer N, Lau F, Cornet R (2014) Literature review of SNOMED CT use. J Am Med Inform Assoc 21(e1):e11–e19. https://doi.org/10.1136/amiajnl-2013-001636
48. World Health Organisation (2022) International Classification of Diseases (ICD). https://www.who.int/standards/classifications/classification-of-diseases
49. Coiera E (2015) Guide to health informatics. CRC Press, Boca Raton
50. Medicines and Healthcare products Regulatory Agency (2022) Clinical Practice Research Datalink | CPRD. https://www.cprd.com
51. NHS (2022) Dictionary of medicines and devices (dm+d) — NHSBSA. https://www.nhsbsa.nhs.uk/pharmacies-gp-practices-and-appliance-contractors/dictionary-medicines-and-devices-dmd
52. Committee JF (2022) BNF (British national formulary) — NICE. https://bnf.nice.org.uk/
53. Organisation WH (ed) (2019) International statistical classification of diseases and related health problems, 11th edn. World Health Organization, New York. https://icd.who.int/
54. Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Database issue):D267–D270. https://doi.org/10.1093/nar/gkh061
55. Amos L, Anderson D, Brody S, Ripple A, Humphreys BL (2020) UMLS users and uses: a current overview. J Am Med Inform Assoc 27(10):1606–1611. https://doi.org/10.1093/jamia/ocaa084
56. NHS (2020) Read Codes. https://digital.nhs.uk/services/terminology-and-classifications/read-codes
57. N B (1994) What are the Read Codes? Health Libr Rev 11(3):177–182. https://doi.org/10.1046/j.1365-2532.1994.1130177.x
58. Botsis T, Hartvigsen G, Chen F, Weng C (2010) Secondary use of EHR: data quality issues and informatics opportunities. Summit on Translational Bioinformatics 2010:1–5. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041534/
59. Ben Goldacre et al (2022a) Better, broader, safer: using health data for research and analysis — gov.uk. https://www.gov.uk/government/publications/better-broader-safer-using-health-data-for-research-and-analysis
60. Ben Goldacre et al (2022b) Home — Goldacre review. https://www.goldacrereview.org/
61. Pan X, Zhang M, Ji S, Yang M (2020) Privacy risks of general-purpose language models. Proceedings—IEEE Symposium on Security and Privacy 2020(May):1314–1331. https://doi.org/10.1109/SP40000.2020.00095
62. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, Cui C, Corrado G, Thrun S, Dean J (2019) A guide to deep learning in healthcare. Nat Med 25(1):24–29. https://doi.org/10.1038/s41591-018-0316-z
63. McMahan B, Moore E, Ramage D, Hampson S, Arcas BAy (2017) Communication-efficient learning of deep networks from decentralized data. In: Singh A, Zhu J (eds) Proceedings of the 20th international conference on artificial intelligence and statistics, PMLR, Proceedings of Machine Learning Research, vol 54, pp 1273–1282. https://proceedings.mlr.press/v54/mcmahan17a.html
64. McCloskey M, Cohen NJ (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In: Bower GH (ed) Psychology of learning and motivation, vol 24. Academic Press, New York, pp 109–165. https://doi.org/10.1016/S0079-7421(08)60536-8
65. Zhu H, Xu J, Liu S, Jin Y (2021) Federated learning on non-IID data: a survey. Neurocomputing 465:371–390. https://doi.org/10.1016/j.neucom.2021.07.098. arXiv:2106.06843
66. Geiping J, Bauermeister H, Dröge H, Moeller M (2020) Inverting gradients—how easy is it to break privacy in federated learning? In: Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H (eds) Advances in neural information processing systems, vol 33. Curran Associates Inc, New York, pp 16937–16947. https://proceedings.neurips.cc/paper/2020/file/c4ede56bbd98819ae6112b20ac6bf145-Paper.pdf
67. Bagdasaryan E, Veit A, Hua Y, Estrin D, Shmatikov V (2020) How to backdoor federated learning. In: Chiappa S, Calandra R (eds) Proceedings of the twenty third international conference on artificial intelligence and statistics, PMLR, Proceedings of Machine Learning Research, vol 108, pp 2938–2948. https://proceedings.mlr.press/v108/bagdasaryan20a.html
68. Lyu L, Yu H, Zhao J, Yang Q (2020) Threats to federated learning. Lecture Notes in Computer Science, vol 12500. Springer, pp 3–16. https://doi.org/10.1007/978-3-030-63076-8_1. arXiv:2003.02133
69. Mothukuri V, Parizi RM, Pouriyeh S, Huang Y, Dehghantanha A, Srivastava G (2021) A survey on security and privacy of federated learning. Futur Gener Comput Syst 115:619–640. https://doi.org/10.1016/j.future.2020.10.007
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 12
Abstract
Brain disorders are a leading cause of global disability. With the increasing global proliferation of smart devices and connected objects, applying these technologies to research and clinical trials for brain disorders has the potential to improve our understanding of these conditions and to create applications aimed at prevention, early diagnosis, monitoring, and tailored help for patients. This chapter provides an overview of the data these technologies offer, examples of how the same sensors are applied in different applications across different brain disorders, and the limitations and considerations that should be taken into account when designing a solution using smart devices, connected objects, and sensors.
Key words Smartphone, Mobile devices, Wearables, Connected objects, Brain disorders, Digital
psychiatry, Digital neurology, Digital phenotyping, Machine learning, Human activity recognition
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_12,
© The Author(s) 2023
356 Sirenia Lizbeth Mondragón-González et al.
2 Data Available from Mobile Devices, Sensors, and Connected Objects for Brain Disorders
Fig. 1 Active and passive sensing. Mobile devices and wearable sensors provide
metrics on various aspects of the mental and behavioral states through active
(requiring an action from the user, often following a prompt) or passive (auto-
matically without intentional action from the user) data collection. This is
possible through (a) direct interaction with the device, (b) active use of a device
for assessment of internal insight, (c) passive use of inertial and positioning
systems, (d) passive interaction with sensors embedded in the environment, and
(e) passive or active interaction between users through devices
2.1 Active Data Probing
In active data probing, the person of interest must execute a specific action to supply the data, meaning that the quantity and quality of the acquired data depend directly on the user's compliance. These actions usually involve direct interaction with the device. Because the subject needs to spend time and energy collecting the data, the number of action steps necessary to enter the data must be minimized to avoid user fatigue. It is also essential to guard against feature overload by focusing on usability rather than raw utility and by thoughtfully circumscribing the scope of questions or inputs. The amount of information requested and the response frequency are essential aspects to plan ahead in order to maximize continuous use of the device. If there is an intermediate user interface, following standard UX/UI (user experience/user interface) guidelines is a good starting point for optimization, but it might not be sufficient for the target population group. It is crucial to design without making assumptions, instead getting patients' early feedback through co-construction or participatory design [3–5]. In summary, there are several considerations to plan before deploying a solution using active probing, involving not only the device itself but also how the user interacts with it.
2.1.1 Interaction with the User

Recording the subject's response can provide unique information about the occurrence of experiences and the cognitive processes that unfold over time. We can record the user's feedback at specific points in time or continuously by taking advantage of the interaction between the user and a device (see Fig. 1a).
Manual devices: response buttons, switches, and touchscreens. These devices capture conventional key or screen presses via switches or touchscreens, usually operated by hand. A switch connects or disconnects the conducting path in an electrical circuit, allowing current to pass through its contacts, and thereby lets a subject send a control or log signal to a system. Such devices have been widely used, for several decades, in computer-based experiments in psychology, psychophysiology, behavioral research, and functional magnetic resonance imaging (fMRI). The metrics commonly obtained are discrete on/off responses (pressed or not) and reaction time [6]. It is often necessary to measure a person's reaction time to the nearest millisecond, which requires dedicated response pads. Indeed, general-purpose commercial keyboards and mice have variable response delays ranging from 20 to 70 ms, a range comparable to or lower than human reaction times in a simple detection task [7]. Dedicated computerized testing devices, on the other hand, are designed to have smaller and less variable response delays. They introduce less variation and bias into timing measurements [7] by addressing problems such as mechanical lags, debouncing, scanning, polling, and event handling. Commercially available response-button boxes (e.g., Psychology Software Tools, Inc., Sharpsburg, PA, USA; Cedrus Corporation, San Pedro, CA, USA; Empirisoft Corporation, New York, NY, USA; Engineering
Mobile Devices, Connected Objects, and Sensors 359
2.1.2 Subjective Assessments

With current knowledge and technologies, data that reflect psychological states such as emotions and thoughts can only be obtained by actively probing the patient or an informant, usually a partner, family member, or caregiver (see Fig. 1b). The long history of psychological assessment provides rich conceptual and methodological frameworks for collecting valid measures of subjective
2.2 Passive Data Probing

In passive data probing, the data is collected without explicitly asking the subject to provide it, giving an objective representation of the subject's state over time. In scenarios where data must be acquired multiple times a day, passive collection is a more practical and ecological way to proceed. It allows the duration and frequency of specific events, and their evolution in time, to be measured objectively. In contrast with active data probing, it can provide more samples over a given period. Since meaningful events might be embedded anywhere in the collected data, this probing type requires reviewing historical logs or using computer applications to extract the information of interest.
2.2.1 Inertial and Positioning Systems

Detection of whole-body activities (such as walking, running, and bicycling), as well as of fine-grained hand activity (such as smartphone scrolling, typing, and handwashing), can ease the arduous task of studying and monitoring human behavior, which is of great value for understanding, preventing, and diagnosing brain diseases as well as for providing care and support to the patient. Changes in physical activity and its intensity, the detection of sleep disorders, fall detection, and the evolution or detection of a particular behavior are some of the possibilities that can be assessed with inertial sensors.

Identifying the specific activities of a person based on sensor data is the main focus of the broad field of study called human activity recognition (HAR). A widely adopted vehicle for achieving HAR's goal is passive sensor-based systems that use inertial sensors (see Fig. 1c), which transduce inertial force into electrical signals to measure the acceleration, inclination, and vibration of a subject or object (see Fig. 2a). These systems are commonly included in today's portable electronic devices such as mobile phones, smartwatches, videogame controllers, clothes, and cameras, and in non-portable objects like cars and furniture. Besides offering the advantage, due to their reduced size, of being embeddable in
Fig. 2 Inertial sensors. (a) Representation of an inertial measurement unit (IMU) depicting the sensing axes and
the corresponding yaw, pitch, and roll rotations. (b) Exemplar accelerometer profiles of two hand gestures
(hand rubbing and key locking) for three subjects showing the similar periodic nature of the hand movements.
(c) Operating principle of a MEMS accelerometer. When a force is detected due to a compressive or extensive
movement, it is possible to determine the displacement x and acceleration since the mass and spring
constants are known. (d) Representation of a simple gyroscope model. (e) The magnetic field generated by
electric currents, magnetic materials, and the Earth’s magnetic force exerts a magnetic force detectable by a
magnetometer sensor
$v_m(t) = \sqrt{A_x(t)^2 + A_y(t)^2 + A_z(t)^2}$    (1)
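Equation 1 can be computed directly from the three accelerometer axes; a minimal sketch in Python (function name and sample values are illustrative):

```python
import numpy as np

def vector_magnitude(ax, ay, az):
    """Orientation-independent magnitude v_m(t) of the 3-axis
    acceleration signal, as in Eq. 1."""
    return np.sqrt(np.asarray(ax) ** 2
                   + np.asarray(ay) ** 2
                   + np.asarray(az) ** 2)

# A device at rest measures only gravity (about 1 g), whatever its orientation.
vm = vector_magnitude([0.0, 0.6], [0.0, 0.8], [1.0, 0.0])
print(vm)  # -> [1. 1.]
```

Because the magnitude is invariant to how the device is worn, it is a common first preprocessing step before feature extraction.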
When changes in angular velocity are faster than the sampling frequency can capture, they go undetected and the integration error accumulates over time. This error is called drift. The sampling rate should therefore be chosen carefully, since gyroscopes are vulnerable to drifting over the long term.
Magnetic Sensitive Sensors (e.g., Hall Sensor)

Magnetic sensors measure the strength and direction of the Earth's magnetic field and are affected by electric currents and magnetic materials (see Fig. 2e). Most MEMS magnetic sensors rely on magnetoresistance to measure the surrounding magnetic field: their resistance changes with changes in the Earth's and nearby magnetic fields. They can detect the vector, characterized by strength and direction, pointing toward the Earth's magnetic north, from which one can estimate one's heading. This vector is vertical at the Earth's magnetic poles (inclination of 90°) and horizontal at the magnetic equator (inclination of 0°). When used with accelerometers and gyroscopes, a magnetometer can help determine the absolute heading.
Table 1
Accelerometer features for machine learning applied to human activity recognition
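Accelerometer features of the kind referred to in Table 1 are typically computed over sliding windows of the signal; a minimal sketch (window length, sampling rate, and the feature subset are illustrative assumptions):

```python
import numpy as np

def window_features(signal, fs, win_s=2.0):
    """Per-window accelerometer features commonly used in HAR
    (mean, standard deviation, min, max; an illustrative subset)."""
    n = int(win_s * fs)
    feats = []
    for start in range(0, len(signal) - n + 1, n):
        w = signal[start:start + n]
        feats.append([w.mean(), w.std(), w.min(), w.max()])
    return np.array(feats)

fs = 50  # Hz, a common wearable sampling rate (assumed)
t = np.arange(0, 10, 1 / fs)
acc = 1.0 + 0.2 * np.sin(2 * np.pi * 2 * t)  # gravity plus a 2 Hz gait-like oscillation
X = window_features(acc, fs)
print(X.shape)  # -> (5, 4): five 2-second windows, four features each
```

Each row of `X` can then be fed to a classifier as one training example.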
Data Fusion

Data fusion is the concept of combining data from multiple sources to create a result more accurate than that obtained from any single source. IMUs can simultaneously provide the linear acceleration and angular velocity of the same event, as well as the device's heading. Data fusion techniques combine such complementary information to improve human activity recognition. Importantly, the sensors can also be used to correct each other, since each IMU sensor has different strengths and weaknesses. Accelerometers provide a stable gravity reference over the long term but are sensitive to transient disturbances such as spikes. Gyroscopes can be trusted for a few seconds of relative orientation changes, but their output drifts over longer time intervals, and magnetometers are less stable in environments with magnetic interference.
Data fusion techniques can be divided into three levels of application: sensor-level fusion, feature-level fusion, and decision-level fusion [20]. In sensor-level fusion, the raw signals from multiple sensors are combined before feature extraction. For instance, accelerometers are sensitive to sharp jerks, while gyroscopes tend to drift over the long term; sensor-level fusion mitigates both problems. This is achieved via signal processing algorithms, the most popular being the Kalman filter [21] and the complementary filter [22]. The former is an iterative filter that relates current and previous states; the latter combines low- and high-pass filtering to remove gyroscope drift and accelerometer spikes. Feature fusion refers to combining multiple features from different sensors before entering them into a machine learning algorithm, through feature selection and reduction methods such as principal component analysis (PCA) and singular value decomposition (SVD). Feature fusion helps in identifying correlations between features and in working with a smaller set of variables. In decision fusion, the results of multiple models (e.g., multiple classifiers) are combined into a single, more accurate decision. The aim is to implement fusion rules that reach a consensus, improving the algorithm's accuracy and generalization. These rules include majority voting, boosting, and stacking [23].
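As an illustration of sensor-level fusion, a one-axis complementary filter can be sketched as follows: the gyroscope rate is integrated (reliable short-term but drifting) and blended with the accelerometer-derived tilt angle (noisy short-term but drift-free). The blending coefficient and the toy data are assumptions for illustration, not a tuned implementation:

```python
def complementary_filter(acc_angles, gyro_rates, dt, alpha=0.98):
    """Fuse a tilt angle (degrees): high-pass the gyroscope path,
    low-pass the accelerometer path. alpha near 1 favors the gyroscope."""
    angle = acc_angles[0]
    fused = []
    for acc_a, rate in zip(acc_angles, gyro_rates):
        angle = alpha * (angle + rate * dt) + (1 - alpha) * acc_a
        fused.append(angle)
    return fused

# Static device tilted 10 degrees: the accelerometer reads a constant 10,
# while a biased gyroscope (0.5 deg/s) alone would drift to 12.5 over 5 s.
est = complementary_filter([10.0] * 500, [0.5] * 500, dt=0.01)
# est[-1] stays near 10 instead of drifting away
```

The same structure generalizes to a Kalman filter by replacing the fixed `alpha` with a gain updated from the state and measurement covariances.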
Benchmark Databases

In the context of HAR, there are several advantages to having access to inertial databases. The most obvious is that they allow several solutions to be compared. Inertial databases can also be used to focus rapidly on the development of signal processing and machine learning solutions before spending time on developing and deploying hardware sensing solutions. In the past years, several public databases for smartphone and wearable studies have appeared in the literature. The main databases of inertial sensors used in research are gathered in the reviews by Lima and colleagues [19] and by Sprager and Juric [24].
Global Positioning System: Geospatial Activity

The Global Positioning System (GPS) is a global navigation system based on a network of GPS satellites, ground control stations, and receivers that work together to determine an accurate geographic position at any point on the Earth's surface. The widespread integration of GPS into everyday objects such as smartphones, navigation systems, and wearables (GPS watches) has enabled the objective measurement of a person's location and mobility with minimal retrieval burden and recall bias [25]. At a basic level, raw GPS data provide latitude, longitude, and time [26]. These data can be further processed into objective measurements of location and time, such as trajectories and locations in specific environments. Newer GPS receivers can provide variables such as elevation, indoor/outdoor state, and speed. GPS devices have proven to be useful tools for studying and monitoring physical activity [27]. When combined with inertial sensors, it is possible to identify activity patterns and their spatial context [28]. The spatial analysis can then be contextualized with environmental attributes (presence of green space, street connectivity, cycling infrastructure, etc.). The data is often analyzed using commercial or open-source geographic information systems (GIS), software for data management, spatial analysis, and cartographic design. According to Krenn and colleagues [28], the main limitation of using GPS in health research is the loss of data quality: urban architecture and dense vegetation can lead to signal dropouts.
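Raw latitude/longitude fixes can be turned into simple mobility measurements such as total distance traveled; a minimal sketch using the haversine great-circle formula (the track coordinates are illustrative):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) fixes."""
    r = 6371000.0  # mean Earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def total_distance_m(track):
    """Sum of consecutive-fix distances over a (lat, lon) trajectory."""
    return sum(haversine_m(*a, *b) for a, b in zip(track, track[1:]))

# Three fixes a few hundred meters apart (illustrative values)
track = [(48.8566, 2.3522), (48.8570, 2.3530), (48.8580, 2.3545)]
dist = total_distance_m(track)
```

Derived quantities such as time spent per location or radius of gyration build on the same pairwise distances.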
2.2.2 Interaction with the Environment

When a residence uses a controller to integrate various connected objects or home automation systems, we refer to the home as a smart home. In a smart home, the role of the home controller is to integrate the home automation systems and enable them to communicate with each other. In this approach, the subject does not need to carry a device; instead, the environment is equipped with devices that collect the required data (see Fig. 1d). In smart homes, we can find diverse appliances with some degree of automation. Perhaps the most popular commercial device is the smart speaker, equipped with a virtual assistant that responds to voice commands. Increasingly common virtual assistant technologies have expanded the use of speech processing, and the so-called vocal
2.2.3 Interaction Between Users

The massive adoption of Internet of Things (IoT) devices has made it possible to have a network of interconnected devices that interact to collect and analyze data, using an Internet connection for remote computing (see Fig. 1e). Interactions between devices may be used as a proxy for the collective and individual behavior of the user(s) carrying them. This interaction is possible thanks to the numerous wireless technologies that enable communication among devices, such as Wi-Fi and Bluetooth. Moreover, a richer picture of the social world may be obtained from the traces of interactions in cyberspace, such as the analysis of individual devices' communication. Research using the phone call detail records of a sample of elderly participants in France demonstrated that such
3.1 Prevention

The blooming market of mobile technologies for well-being and self-quantification, from basic logging to deep personal analytics, represents an opportunity to promote and assist health-enhancing behaviors. For instance, as many as 85% of US adults own a smartphone [42] and 21% an activity tracker [42]. Digital prevention uses these mobile technologies to advise users and anticipate a decline in health, the goal being to prevent health threats and predict the aggravation of events by continuously monitoring patient status and warning indicators.

Examples of digital prevention in the psychiatric domain include specific tools to prevent burnout, depression, and suicide. Web-based and mobile applications have been shown to be interesting tools for mitigating these severe psychiatric issues. For instance, a recent study [43] showed how the combination of a smartphone app with a wearable activity tracker was used to prevent the recurrence of mood disorders. With passive monitoring of patients' circadian rhythm behaviors, the authors' ML algorithm was able to detect irregular life patterns and alert the patients, reducing the number and duration of depressive, manic or hypomanic, and mood episodes by more than 95%.
In specific contexts known to be risk-prone with respect to mental health, e.g., jobs with high psychological demands, as well as in more general professional settings, organizations have started to deploy workplace prevention campaigns using digital technologies [44]. In a study by Deady and colleagues [45], the authors developed a smartphone app designed to reduce and prevent depressive symptoms among a group of workers. The control group had a version of the app with a monitoring component, while the intervention group had a version that added a behavioral activation and mindfulness intervention to the monitoring component. The study showed that the smartphone app helped prevent incident depression in the intervention group, which exhibited fewer depressive symptoms and a lower prevalence of depression over 12 months compared to the control group. Both examples show how smartphones and wearable devices can reduce symptoms and potentially prevent mental health decline.
3.2 Early Diagnosis

Although most ML applications for diagnosis in the brain care literature use anatomical, morphological, or connectivity data derived from neuroimaging [46], there is a growing body of evidence indicating that common sensors could, in some cases, detect behavioral and/or motor changes preceding the clinical manifestations of diverse brain diseases by several years. Indeed, in neurodegenerative diseases like AD [47], PD [48], and motor neuron disease (MND) [49], symptoms manifest once a substantial loss of neurons has already occurred, making early diagnosis challenging. Because of this, with the increasing adoption of ML in research and clinical trials, directed efforts have been made to diagnose neurodegenerative diseases early. As an example, in PD, a study used smartphone IMUs to characterize gait in the senior population, detecting gait disturbances, an early sign of PD; the feasibility of the approach was shown with a patient who exhibited step length and frequency disturbances and was later formally diagnosed with PD [50]. Apathy, conventionally defined as an "absence or lack of feeling, emotion, interest or concern" [51], is one of the most frequent behavioral symptoms in neurological and psychiatric diseases. In the daily life of patients, apathy results in reduced daily activities and social interactions. These behavioral alterations may be detected as a reduction in the second-order moment (variance) of location data (as tracked with GPS measurements [52]) and in the first-order moment (average) of accelerometer measures (e.g., [53] in the context of patients with schizophrenia).
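The two moments just mentioned are simple to compute from raw streams; a minimal sketch (the feature definitions are deliberately simplified for illustration):

```python
import numpy as np

def location_variance(lats, lons):
    """Second-order moment of GPS fixes: variance of latitude plus
    variance of longitude, a simple proxy for mobility range."""
    return float(np.var(lats) + np.var(lons))

def mean_activity(acc_magnitude):
    """First-order moment (average) of accelerometer magnitude,
    a simple proxy for overall motor activity."""
    return float(np.mean(acc_magnitude))

# Fixes clustered around one location yield a lower variance than a
# mobile trajectory (illustrative coordinates).
home = location_variance([48.85, 48.85, 48.85], [2.35, 2.35, 2.35])
out = location_variance([48.85, 48.90, 48.80], [2.35, 2.40, 2.30])
print(home < out)  # -> True
```

A sustained drop in either quantity relative to a patient's own baseline, rather than any absolute threshold, is what such studies track.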
Sensors can also be used to differentiate between disorders that share symptoms, accelerating diagnosis and treatment. For instance, a study [54] using wrist-worn devices containing accelerometers analyzed measures of sleep, circadian rhythmicity, and amplitude fluctuations to distinguish, with 83% accuracy, pediatric bipolar disorder (BD) from attention-deficit hyperactivity disorder (ADHD), two common psychiatric disorders that share clinical features such as hyperactivity.
3.3 Symptom and Treatment Monitoring

Monitoring day-to-day activities and the evolution of symptoms is impossible for health providers outside the clinic without automated detection of events of interest and the deployment of mobile interventions. Much like apathy, described above, many other psychological constructs may be sensed through continuous monitoring of behavioral parameters, such as agitation or aberrant mobile behavior [55, 56].

Sleep is one activity that cannot be monitored ecologically in any other way than with passive data probing. Monitoring sleep is relevant when studying sleep disturbance, a core diagnostic feature of depressive disorder, anxiety disorders, bipolar disorder, and schizophrenia spectrum disorder. In this sense, sleep patterns have been scored using light sensors in mobile devices and
3.4 Tailored Help for Patients and Augmented Therapies

Personalized or precision medicine consists in using collected data to refine the diagnosis and treatment of individual patients. In this sense, connected devices and mobile technologies could contribute to tailoring patients' care. Moreover, personalized or augmented therapies can benefit from smart devices and connected objects as additional assistance to classic therapeutic approaches.
An example is epilepsy, a central nervous system disorder that causes seizures. Not only is the unpredictability of seizure occurrence distressing for patients and a contributor to social isolation, but for unattended patients with recurrent generalized tonic–clonic seizures (GTCS), it may lead to severe injuries and constitutes the main risk factor for sudden unexpected death. This is why, in the epilepsy research field, much effort has been put into developing ambulatory monitoring with alarms for automated seizure detection, with most real-time application studies using wrist accelerometers, video monitoring, surface electromyography (sEMG), or under-mattress movement monitors based on electromechanical films [68]. The general purpose of these sensors is to detect unpredictable changes in motor activity or changes in autonomic parameters characteristic of seizures.
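The motor-detection principle behind such alarms can be caricatured as a threshold on sustained movement intensity; the threshold and duration below are arbitrary illustrative values, a toy criterion rather than a validated seizure detector:

```python
def sustained_movement_alarm(acc_mag_g, fs_hz, thresh_g=2.0, min_dur_s=10.0):
    """Return True if acceleration magnitude stays above thresh_g for at
    least min_dur_s consecutive seconds (toy convulsive-movement rule)."""
    min_len = int(min_dur_s * fs_hz)
    run = 0
    for sample in acc_mag_g:
        run = run + 1 if sample > thresh_g else 0
        if run >= min_len:
            return True
    return False

fs = 10  # Hz (assumed wrist-device rate)
quiet = [1.0] * 300                    # 30 s of rest
shaking = [1.0] * 100 + [3.0] * 150    # then 15 s of strong movement
print(sustained_movement_alarm(quiet, fs),
      sustained_movement_alarm(shaking, fs))  # -> False True
```

Deployed detectors replace this single rule with multimodal features and trained classifiers to keep false-alarm rates acceptable.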
Another illustrative case of the interest of mobile technology for helping patients in their everyday life concerns fall detection in older and/or gait-impaired persons: wireless versions of inertial and
4.1 Related to Sensor Function

4.1.1 Sample Rate Stability

We refer to sample rate stability as the regularity of the time spans between consecutive samples. In a reliable device, the difference between successive inter-sample intervals is close to zero. When this is not the case, the true measurement time of the sensor and the timestamp registered by the application differ. Common sources of sample rate instability are the inherent jitter of non-real-time operating systems, which cannot guarantee critical execution times or access to resources, and the additional communication delay between devices and applications.
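Sample rate stability can be quantified directly from recorded timestamps: the standard deviation of the inter-sample intervals should be near zero for a stable stream. A sketch with simulated OS-induced jitter (the rates and noise level are illustrative):

```python
import numpy as np

def sampling_stats(timestamps_s):
    """Mean inter-sample interval and its standard deviation (jitter)."""
    dt = np.diff(np.asarray(timestamps_s))
    return float(dt.mean()), float(dt.std())

rng = np.random.default_rng(0)
nominal = np.full(1000, 0.02)                         # ideal 50 Hz stream
ts = np.cumsum(nominal + rng.normal(0, 0.002, 1000))  # jittered timestamps
mean_dt, jitter = sampling_stats(ts)
# mean_dt stays near 0.02 s; jitter is non-zero, unlike an ideal stream
```

Reporting such statistics alongside a dataset lets downstream users decide whether resampling onto a uniform grid is needed.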
4.1.2 The Choice of Technology

Sensors are usually input devices that take part in a bigger system, sending information to a processing unit so that the signals can be analyzed. When choosing a technology to work with, all of the parts must be studied carefully beforehand to avoid issues in usability and signal quality, since these will have an impact on the difficulty of development and deployment, as well as on the long-term use of the technology. For instance, if we need to record
4.1.3 Power Consumption

One of the main problems preventing the massive expansion and adoption of HAR applications is excessive battery power consumption [75]. Indeed, the major cause of data loss is empty batteries, and the main sources of high power consumption are the high data processing load and the continuous use of sensors. Several strategies can be adopted to minimize energy consumption, although they imply a tradeoff between energy consumption, signal richness, and the accuracy of classification models. The first strategy consists of on-demand activation of sensors only when necessary, in contrast to continuous sampling; this requires a supplementary routine that automatically determines when it is appropriate to interrogate the sensor(s). Tech companies have dealt with this problem by integrating "sensor hubs," i.e., low-power coprocessors dedicated to reading, buffering, and processing continuous sensor data for specific functions such as step counting and spoken word detection (for instance, detecting the popular voice commands "Hello Google" and "Alexa" for Google's and Amazon's vocal assistants). The second strategy consists of choosing the sampling frequency of data collection. The higher the sampling frequency, the more energy the sensors, the processor, and the memory unit use. Prior knowledge of the signals and of the frequency content necessary to capture the events of interest is needed to select a sampling frequency that is a good tradeoff between capturing relevant signal information and avoiding unnecessary battery drain. The third strategy applies when the data is processed on the device and consists of strategically selecting lightweight features
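The sampling-frequency tradeoff of the second strategy can be illustrated with a naive decimation sketch; in practice one would low-pass filter before decimating to avoid aliasing, and the factor here is an arbitrary example:

```python
import numpy as np

def decimate_mean(signal, factor):
    """Naive decimation by block averaging: fewer samples to sense,
    store, and process, at the cost of temporal resolution and
    high-frequency content."""
    n = (len(signal) // factor) * factor
    return np.asarray(signal[:n]).reshape(-1, factor).mean(axis=1)

x = np.arange(100.0)      # e.g., one second of a 100 Hz stream
y = decimate_mean(x, 4)   # emulate a 25 Hz stream
print(len(y))  # -> 25
```

Choosing `factor` amounts to deciding how much signal bandwidth the application can afford to lose in exchange for battery life.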
4.2 Related to the Data

4.2.1 Data Collection

Data collection consists of data acquisition, data labeling, and the improvement of existing databases. It is a critical challenge in machine learning and often the most time-consuming step in an end-to-end machine learning application, due to the time spent collecting, cleaning, labeling (for supervised learning), and visualizing the data. The data required by machine learning models can be experimental, retrospective, observational, and, in some cases, synthetic. While retrospective data collection methods such as surveys and interviews are easy to deploy, they are subject to recall and self-selection biases, and they might add tedious collection logistics if tools and programs on mobile devices are not deployed. Retrospective data collection is nonetheless sometimes the only means to capture subjective experiences in daily life. Observational methods such as video-camera surveillance can be impractical for large-scale deployment and are primarily used in small-sample applications. Generation of synthetic data is sometimes necessary to overcome the lack of data in some domains, notably annotated medical data. This kind of data is created to improve AI models through data augmentation, from models that simulate outcomes given specific inputs, such as bio-inspired data [74], physical simulations, or AI-driven generative models [75]. One issue is the lack of regulatory frameworks covering synthetic data and its monitoring. Its evaluation could be done with a Turing test, yet this may be prone to inter- and intra-observer variability. Moreover, data curation protocols can be as tedious and laborious as collecting and labeling real data.
The availability of large-scale, curated scientific datasets is crucial for developing helpful machine learning benchmarks for scientific problems [76], especially for supervised learning solutions, where data volume and modality are relevant [77]. Even though machine learning has been used in many domains, there is still a broad panel of applications and fields, such as neuroscience and psychiatry, with few or even no training databases. This is the case for datasets derived from connected devices and sensors for brain disorder research. In contrast, larger neuroimaging and biological databases are available nowadays, e.g., the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Allen Brain
4.2.2 Database Validation and Signal Richness

For applications in medicine and healthcare, the datasets used to train ML models should undergo detailed examination because they are central to understanding a model's biases and pitfalls. Before adopting an openly available dataset or creating one, there are some considerations to keep in mind. Firstly, ensure a minimal chance of sample selection bias in the database (for instance, data acquired with particular equipment or in a particular setting). Errors from sample selection bias become evident when the model is deployed in settings different from those used for training. Secondly, we must be aware of the class imbalance problem, which often occurs where data is rare (for instance, the small samples associated with rare diseases) and can negatively affect models designed for prognosis and early diagnosis. A few techniques can be adopted to help with class imbalance, such as resampling, adding synthetic data, or working directly on the model, for example by weighting the cost function of neural networks.
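One of the resampling techniques mentioned, random oversampling of the minority class, can be sketched as follows (the helper name and toy 8:2 dataset are illustrative):

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Random oversampling: duplicate minority-class samples until all
    classes reach the size of the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        idx.append(members)
        extra = target - len(members)
        if extra > 0:
            idx.append(rng.choice(members, size=extra, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)      # 8:2 imbalance, e.g., a rare disease
Xb, yb = oversample_minority(X, y)
print(np.bincount(yb))  # -> [8 8]
```

Oversampling should be applied only to the training split, never before the train/test separation, or the evaluation will leak duplicated samples.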
The data, often obtained from scientific experiments, should be rich enough to allow different analysis and exploration methods, and carefully labeled when required. For instance, a semantic discrepancy in the labels can dilute the training pool and confuse the classifier [78, 79]. In contrast with free-form text or audio annotations of the activities, imperfect labeling by the users can occur when scoring the samples with fixed labels. For instance, labeling similar activities from IMU systems, such as running and jogging, under the single fixed label "running" can induce errors in feature extraction because of their inter-activity similarity.

For signal richness, it is also important to consider subject variability and differences in gender, age, and any other characteristic that could lead to improper data representation. Naïve assumptions can cause actual harm by stigmatizing a population subgroup when there is an implicit bias in data collection, selection, and processing [80]. These issues can be addressed by extending inclusion to all levels and carefully auditing all stages of the development pipeline.
4.3 Deployment for Real-Life Applications

Real-world applicability requires that accuracy be tested outside laboratory settings, considering real-life factors beyond technology function and data collection. Before deploying a solution using connected objects and sensors, these real-life considerations should be addressed without being downplayed. Indeed, factors such as user acceptance and behavior around the devices might be even more important than a highly promising technology. Here we briefly present three factors to keep in mind: thinking ahead about privacy issues and how to handle them, the potential degree of adoption, and wearability and instrumentation unobtrusiveness.
4.3.1 Health Privacy

Using mobile devices, connected objects, and sensors to collect data for machine learning in health applications is a process that generates data from human lives. In this sense, privacy is a common concern with health data. The concept of privacy in health refers to the contextual rules around generated data or information: how it flows depending on the actors involved, the process by which it is accessed, the frequency of access, and the purpose of access [81].
The machine learning community has generally valued and embraced the concept of openness. It is common for code and datasets to be publicly released and for paper preprints to be available on dedicated archival services before an article is published (or despite its rejection). Therefore, regulatory bodies should encourage and require data holders to collect and provide data under clear legal protection. To ensure data security, these regulations might suggest adopting different solutions: not transmitting raw data, having an isolated sensor network, transmitting encrypted data, and controlling data access authorizations [82]. While individual countries decide where to draw the line regarding regulation, this line is sometimes more or less difficult to define depending on the data type. For instance, there might be clearer limits on the exploitation and use of patient video recordings because the patient's identity is explicitly and easily accessible with image processing. This reasoning is less straightforward with other types of data. For instance, even though inertial sensor data might be sufficient to identify a person from their biometric movement patterns, these sensors are currently not perceived as particularly sensitive by the public, partly because their privacy implications are less well understood [83]. Thus, they tend to be much less protected (e.g., in wearable devices and mobile apps) than other sensors such as GPS, cameras, and microphones. Therefore, requiring proper permission, consciously advertised participation, and explicit consent from the user is essential, no matter the nature of the data collected.
4.3.2 Perception and Adoption

The perception and adoption of mobile devices, connected objects, and sensors refer to the negative or positive way the deployed solutions are regarded, understood, and interpreted by users. This perception directly affects adherence to a protocol and the long-term use or adoption of the solution. It is one of the most important factors to evaluate ahead of deployment in real-life scenarios, and it implies a conscious effort to understand the patient's situation and point of view. It can be overlooked by developers and researchers who focus primarily on the technical or scientific challenges to overcome or who, out of naivety or distance from the patient's reality, unwarily leave these considerations out of their designs.
The Technology Acceptance Model (TAM) [84], which can be
applied to mobile devices, connected objects, and sensors, postu-
lates that two factors predict technology acceptance. The first one is
the perceived usefulness or the degree to which a person believes a
particular solution will enhance or improve the performance of a
specific task. The second factor is perceived ease of use or the
degree to which a person believes the solution proposed will be
free of effort. The perceived ease of use and usefulness may vary with the target population and should be studied carefully before deployment. For instance, perceived ease of use is essential for the elderly [85], who are not core consumers of mobile wireless healthcare technology. There are, of course, other models
and theories [86, 87] that have been published since the TAM was
proposed, and they include other essential factors to take into
account, such as social influence, performance and effort expec-
tancy, and facilitating conditions, or the perceptions of the
resources and support available. Although these models have sev-
eral limitations [88], the identified factors are a good starting point
to consider when designing a solution involving wearable, mobile
devices, and connected objects. In addition to those factors, the cost-benefit ratio of the technology must be clearly communicated, since it is one of the main barriers to acceptance. In that sense, the scientific and healthcare community is responsible for approaching patients effectively and clearly explaining the expected positive outcomes as well as the advantages and disadvantages of the device ecosystem.
4.3.3 Wearability and Instrumentation Unobtrusiveness
Wearability refers to the locations where the sensors are placed and how they are attached to those locations. Wearable devices are typically attached to the body or embedded in clothes and accessories. They include smartwatches and activity-tracking bracelets, smart jewelry, smart clothing, head-mounted devices, and ear devices [89]. Wearability is an aspect to consider because of its
direct impact on data collection, signal richness, and quality. The
goal is to ensure the device’s prolonged and correct use.
380 Sirenia Lizbeth Mondragón-González et al.
4.4 Incorporation into Clinical Care
Although there is great potential for connected devices and sensors to prevent, diagnose early, monitor, and create tailored help for patients suffering from brain diseases, there is still a gap to fill to
drive transformational changes in health. Besides the challenges
mentioned in this section, significant barriers to clinical adoption
include the lack of evidence in support of clinical use, the rapid
technological development and obsolescence, and the lack of reim-
bursement models. These problems are often highlighted in pre-
liminary reports of government proposals [92–94] and
publications related to mobile health challenges [95, 96].
There is a need for an extensive collection of real-world patient-generated data to reinforce the clinical evidence that will change healthcare delivery. To date, the available pilot datasets are few and underpowered, which makes studies difficult to compare and therefore hinders the adoption of these new technologies in the clinical field. Indeed, sensor datasets come mainly from actigraphy and are not as numerous as the available neuroimaging, MEG, or EEG datasets.
In contrast to the scarcity of large-scale patient-generated evidence, the number of connected device and sensor solutions with added features continues to grow every year. This rapid development of technologies represents a challenge to clinicians, who may perceive difficulty in the feasibility and scalability of real-world implementations within the clinical workflow, especially since devices noticeably become obsolete, outdated, or no longer useful very quickly. Another negative impact of the higher number
5 Discussion
Acknowledgments
References
1. Sim I (2019) Mobile devices and health. N Engl J Med 381(10):956–968. https://doi.org/10.1056/NEJMra1806949
2. Laput G et al (2021) Methods and apparatus for detecting individual health related events. US20210063434A1, 04 Mar 2021. Accessed: 04 Oct 2022. [Online]. Available: https://patents.google.com/patent/US20210063434A1/en
3. Merkel S, Kucharski A (2019) Participatory design in gerontechnology: a systematic literature review. The Gerontologist 59(1):e16–e25. https://doi.org/10.1093/geront/gny034
4. SENSE-PARK Consortium et al (2015) Participatory design in Parkinson's research with focus on the symptomatic domains to be measured. J Parkinsons Dis 5(1):187–196. https://doi.org/10.3233/JPD-140472
5. Thabrew H, Fleming T, Hetrick S, Merry S (2018) Co-design of eHealth interventions with children and young people. Front Psychiatry 9. Accessed: 04 Oct 2022. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fpsyt.2018.00481
6. Szul MJ, Bompas A, Sumner P, Zhang J (2020) The validity and consistency of continuous joystick response in perceptual decision-making. Behav Res Methods 52(2):681–693. https://doi.org/10.3758/s13428-019-01269-3
7. Li X, Liang Z, Kleiner M, Lu Z-L (2010) RTbox: a device for highly accurate response time measurements. Behav Res Methods 42(1):212–225. https://doi.org/10.3758/BRM.42.1.212
8. Spivey MJ, Dale R (2006) Continuous dynamics in real-time cognition. Curr Dir Psychol Sci 15(5):207–211. https://doi.org/10.1111/j.1467-8721.2006.00437.x
9. Piwek L, Ellis DA, Andrews S, Joinson A (2016) The rise of consumer health wearables: promises and barriers. PLoS Med 13(2):
… 2019 IEEE international conference on pervasive computing and communications workshops (PerCom Workshops), Mar 2019, pp 943–948. https://doi.org/10.1109/PERCOMW.2019.8730876
31. De Miguel K, Brunete A, Hernando M, Gambao E (2017) Home camera-based fall detection system for the elderly. Sensors 17(12):12. https://doi.org/10.3390/s17122864
32. Koutli M, Theologou N, Tryferidis A, Tzovaras D (2019) Abnormal behavior detection for elderly people living alone leveraging IoT sensors. In: 2019 IEEE 19th international conference on Bioinformatics and Bioengineering (BIBE), Oct 2019, pp 922–926. https://doi.org/10.1109/BIBE.2019.00173
33. Aubourg T, Demongeot J, Renard F, Provost H, Vuillerme N (2019) Association between social asymmetry and depression in older adults: a phone Call Detail Records analysis. Sci Rep 9(1):1. https://doi.org/10.1038/s41598-019-49723-8
34. Davies N, Friday A, Newman P, Rutlidge S, Storz O (2009) Using bluetooth device names to support interaction in smart environments. In: Proceedings of the 7th international conference on mobile systems, applications, and services, New York, NY, USA, pp 151–164. https://doi.org/10.1145/1555816.1555832
35. Barthe G et al (2022) Listening to bluetooth beacons for epidemic risk mitigation. Sci Rep 12(1):1. https://doi.org/10.1038/s41598-022-09440-1
36. Box-Steffensmeier JM et al (2022) The future of human behaviour research. Nat Hum Behav 6(1):15–24. https://doi.org/10.1038/s41562-021-01275-6
37. Kourtis LC, Regele OB, Wright JM, Jones GB (2019) Digital biomarkers for Alzheimer's disease: the mobile/wearable devices opportunity. NPJ Digit Med 2(1):9. https://doi.org/10.1038/s41746-019-0084-2
38. Channa A, Popescu N, Ciobanu V (2020) Wearable solutions for patients with Parkinson's disease and neurocognitive disorder: a systematic review. Sensors 20(9):2713. https://doi.org/10.3390/s20092713
39. Asadi-Pooya AA, Mirzaei Damabi N, Rostaminejad M, Shahisavandi M, Asadi-Pooya A (2021) Smart devices/mobile phone in patients with epilepsy? A systematic review. Acta Neurol Scand 144(4):355–365. https://doi.org/10.1111/ane.13492
40. Marziniak M, Brichetto G, Feys P, Meyding-Lamadé U, Vernon K, Meuth SG (2018) The use of digital and remote communication technologies as a tool for multiple sclerosis management: narrative review. JMIR Rehabil Assist Technol 5(1):e7805. https://doi.org/10.2196/rehab.7805
41. Torous J, Onnela J-P, Keshavan M (2017) New dimensions and new tools to realize the potential of RDoC: digital phenotyping via smartphones and connected devices. Transl Psychiatry 7(3):e1053. https://doi.org/10.1038/tp.2017.25
42. Pew Research Center (2021) Mobile fact sheet. Pew Research Center: Internet, Science & Tech. 07 Apr 2021. https://www.pewresearch.org/internet/fact-sheet/mobile/ (accessed 06 Oct 2022)
43. Cho C-H et al (2020) Effectiveness of a smartphone app with a wearable activity tracker in preventing the recurrence of mood disorders: prospective case-control study. JMIR Ment Health 7(8):e21283. https://doi.org/10.2196/21283
44. Brassey J, Güntner A, Isaak K, Silberzahn T (2021) Using digital tech to support employees' mental health and resilience. McKinsey & Company
45. Deady M et al (2022) Preventing depression using a smartphone app: a randomized controlled trial. Psychol Med 52(3):457–466. https://doi.org/10.1017/S0033291720002081
46. Vogels EA (2020) About one-in-five Americans use a smart watch or fitness tracker. Pew Research Center. 09 Jan 2020. https://www.pewresearch.org/fact-tank/2020/01/09/about-one-in-five-americans-use-a-smart-watch-or-fitness-tracker/ (accessed 06 Oct 2022)
47. Donev R, Kolev M, Millet B, Thome J (2009) Neuronal death in Alzheimer's disease and therapeutic opportunities. J Cell Mol Med 13(11–12):4329–4348. https://doi.org/10.1111/j.1582-4934.2009.00889.x
48. Michel PP, Hirsch EC, Hunot S (2016) Understanding dopaminergic cell death pathways in Parkinson disease. Neuron 90(4):675–691. https://doi.org/10.1016/j.neuron.2016.03.038
49. Boillée S, Vande Velde C, Cleveland DW (2006) ALS: a disease of motor neurons and their nonneuronal neighbors. Neuron 52(1):39–59. https://doi.org/10.1016/j.neuron.2006.09.018
50. Lan K-C, Shih W-Y (2014) Early diagnosis of Parkinson's disease using a smartphone. Procedia Comput Sci 34:305–312. https://doi.org/10.1016/j.procs.2014.07.028
51. Levy R, Dubois B (2006) Apathy and the functional anatomy of the prefrontal cortex–basal ganglia circuits. Cereb Cortex 16(7):916–928. https://doi.org/10.1093/cercor/bhj043
52. Saeb S et al (2015) Mobile phone sensor correlates of depressive symptom severity in daily-life behavior: an exploratory study. J Med Internet Res 17(7):e175. https://doi.org/10.2196/jmir.4273
53. Kluge A et al (2018) Combining actigraphy, ecological momentary assessment and neuroimaging to study apathy in patients with schizophrenia. Schizophr Res 195:176–182. https://doi.org/10.1016/j.schres.2017.09.034
54. Faedda GL et al (2016) Actigraph measures discriminate pediatric bipolar disorder from attention-deficit/hyperactivity disorder and typically developing controls. J Child Psychol Psychiatry 57(6):706–716. https://doi.org/10.1111/jcpp.12520
55. Favela J, Cruz-Sandoval D, Morales-Tellez A, Lopez-Nava IH (2020) Monitoring behavioral symptoms of dementia using activity trackers. J Biomed Inform 109:103520. https://doi.org/10.1016/j.jbi.2020.103520
56. Manley NA et al (2020) Long-term digital device-enabled monitoring of functional status: implications for management of persons with Alzheimer's disease. Alzheimers Dement Transl Res Clin Interv 6(1):e12017. https://doi.org/10.1002/trc2.12017
57. Wainberg M et al (2021) Association of accelerometer-derived sleep measures with lifetime psychiatric diagnoses: a cross-sectional study of 89,205 participants from the UK Biobank. PLoS Med 18(10):e1003782. https://doi.org/10.1371/journal.pmed.1003782
58. Fellendorf FT et al (2021) Monitoring sleep changes via a smartphone app in bipolar disorder: practical issues and validation of a potential diagnostic tool. Front Psychiatry 12:641241. https://doi.org/10.3389/fpsyt.2021.641241
59. Gillani N, Arslan T (2021) Intelligent sensing technologies for the diagnosis, monitoring and therapy of Alzheimer's disease: a systematic review. Sensors 21(12):4249. https://doi.org/10.3390/s21124249
60. Karakostas A et al (2020) A French-Greek cross-site comparison study of the use of automatic video analyses for the assessment of autonomy in dementia patients. Biosensors 10(9):E103. https://doi.org/10.3390/bios10090103
61. Lyons BE et al (2015) Pervasive computing technologies to continuously assess Alzheimer's disease progression and intervention efficacy. Front Aging Neurosci 7:102. https://doi.org/10.3389/fnagi.2015.00102
62. Cullen A, Mazhar MKA, Smith MD, Lithander FE, Breasail MÓ, Henderson EJ (2022) Wearable and portable GPS solutions for monitoring mobility in dementia: a systematic review. Sensors 22(9):3336. https://doi.org/10.3390/s22093336
63. Boukhechba M, Chow P, Fua K, Teachman BA, Barnes LE (2018) Predicting social anxiety from global positioning system traces of college students: feasibility study. JMIR Ment Health 5(3):e10101. https://doi.org/10.2196/10101
64. Chen B-R et al (2011) A web-based system for home monitoring of patients with Parkinson's disease using wearable sensors. IEEE Trans Biomed Eng 58(3):831–836. https://doi.org/10.1109/TBME.2010.2090044
65. Morgiève M et al (2020) A digital companion, the Emma app, for ecological momentary assessment and prevention of suicide: quantitative case series study. JMIR MHealth UHealth 8(10):e15741. https://doi.org/10.2196/15741
66. Seppälä J et al (2019) Mobile phone and wearable sensor-based mHealth approaches for psychiatric disorders and symptoms: systematic review. JMIR Ment Health 6(2):e9819. https://doi.org/10.2196/mental.9819
67. Cain AE, Depp CA, Jeste DV (2009) Ecological momentary assessment in aging research: a critical review. J Psychiatr Res 43(11):987–996. https://doi.org/10.1016/j.jpsychires.2009.01.014
68. Rugg-Gunn F (2020) The role of devices in managing risk. Epilepsy Behav 103. https://doi.org/10.1016/j.yebeh.2019.106456
69. Usmani S, Saboor A, Haris M, Khan MA, Park H (2021) Latest research trends in fall detection and prevention using machine learning: a systematic review. Sensors 21(15):5134. https://doi.org/10.3390/s21155134
70. Omberg L et al (2022) Remote smartphone monitoring of Parkinson's disease and individual response to therapy. Nat Biotechnol 40(4):4. https://doi.org/10.1038/s41587-021-00974-9
71. Robert P et al (2021) Efficacy of serious exergames in improving neuropsychiatric symptoms in neurocognitive disorders: results of the X-TORP cluster randomized trial. Alzheimers Dement Transl Res Clin Interv 7(1). https://doi.org/10.1002/trc2.12149
72. Balaskas A, Schueller SM, Cox AL, Doherty G (2021) Ecological momentary interventions for mental health: a scoping review. PLoS One
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Part III
Methodologies
Chapter 13
Abstract
Image segmentation plays an essential role in medical image analysis as it provides automated delineation of
specific anatomical structures of interest and further enables many downstream tasks such as shape analysis
and volume measurement. In particular, the rapid development of deep learning techniques in recent years
has had a substantial impact in boosting the performance of segmentation algorithms by efficiently
leveraging large amounts of labeled data to optimize complex models (supervised learning). However,
the difficulty of obtaining manual labels for training can be a major obstacle for the implementation of
learning-based methods for medical images. To address this problem, researchers have investigated many
semi-supervised and unsupervised learning techniques to relax the labeling requirements. In this chapter,
we present the basic ideas for deep learning-based segmentation as well as some current state-of-the-art
approaches, organized by supervision type. Our goal is to provide the reader with some possible solutions
for model selection, training strategies, and data manipulation given a specific segmentation task and
dataset.
Key words Image segmentation, Deep learning, Semi-supervised method, Unsupervised method,
Medical image analysis
1 Introduction
Authors Han Liu, Dewei Hu, and Hao Li contributed equally to this chapter
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_13,
© The Author(s) 2023
392 Han Liu et al.
2 Methods
Fig. 1 U-Net architecture. Blue boxes are the feature maps. Channel numbers are denoted above each box,
while the tensor sizes are denoted on the lower left. White boxes show the concatenations and arrows indicate
various operations. ©2015 Springer Nature. Reprinted, with permission, from [1]
Fig. 2 Attention U-Net architecture. Hi, Wi, and Di represent the height, width, and depth of the feature map at
the ith layer of the U-Net structure. Fi indicates the number of feature map channels. Replicated from [4]
(CC BY 4.0)
emphasized with larger weights from the CNN during the training.
This leads the model to achieve higher accuracy on target structures
with various shapes and sizes. In addition, AGs are easy to integrate
into the existing popular CNN architectures. The details of the
attention mechanism and attention gates are discussed in Subhead-
ing 2.1.2. More details on attention can also be found in Chap. 6.
nnU-Net is a medical image segmentation pipeline that can
achieve a self-configuring network architecture based on the differ-
ent datasets and tasks it is given, without any manual intervention.
According to the dataset and task, nnU-Net will generate one of
(1) 2D U-Net, (2) 3D U-Net, and (3) cascaded 3D U-Net for the
segmentation network. For cascaded 3D U-Net, the first network
takes downsampled images as inputs, and the second network uses
the image at full resolution as input to refine the segmentation
accuracy. The nnU-Net is often used as a baseline method in
many medical image segmentation challenges, because of its robust
performance across various target structures and image properties.
The details of nnU-Net can be found in [6].
2.1.2 Attention Modules
Although the U-Net architecture described in Subheading 2.1.1
has achieved remarkable success in medical image segmentation,
the downsampling steps included in the encoder path can induce
poor segmentation accuracy for small-scale anatomical structures
(e.g., tumors and lesions). To tackle this issue, the attention mod-
ules are often applied so that the salient features are enhanced by
higher weights, while the less important features are ignored. This
subsection will introduce two types of attention mechanisms: addi-
tive attention and multiplicative attention.
Medical Image Segmentation Using Deep Learning 395
$$ s_{\text{att}}^{l} = \psi^{\top}\,\sigma_1\!\left( W_x^{\top} x_i^{l} + W_g^{\top} g_i + b_g \right) + b_{\psi} \qquad (3) $$
where $b_g$ and $b_\psi$ represent the biases and $W_x$, $W_g$, $\psi$ are linear transformations. The output dimension of the linear transformations is $F_{\text{int}}$, where $F_{\text{int}}$ is a self-defined integer. Denote these
Fig. 3 The structure of the additive attention gate. $x_i^l$ is the ith feature vector at the lth level of the U-Net structure and $g_i$ is the corresponding gating signal. $W_x$ and $W_g$ are the linear transformation matrices applied to $x_i^l$ and $g_i$, respectively. The sum of the resultant vectors is activated by ReLU, and then its dot product with a vector $\psi$ is computed. The sigmoid function normalizes the resulting scalar to the [0, 1] range, giving the gating coefficient $\alpha_i$. The weighted feature vector is denoted by $\hat{x}_i^l$. Adapted from [4] (CC BY 4.0)
396 Han Liu et al.
Fig. 4 (a) The dot-product attention gate. $k_i$ are the keys and $q$ is the query vector. $s_i$ are the outputs of the attention function. By using the softmax $\sigma_3$, the attention coefficients $\alpha_i$ are normalized to the [0, 1] range. The output is the weighted sum of the values $v_i$. (b) The multi-head attention implemented in transformers. The input values, keys, and query are linearly projected to different spaces. Then the dot-product attention is applied on each space. The resultant vectors are concatenated by channel and passed through another linear transformation. Image (b) is adapted from [10]. Permission to reuse was kindly granted by the authors
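The dot-product attention function of Fig. 4a can be sketched as follows. This is an illustration with our own naming; the $1/\sqrt{d}$ scaling is the usual transformer convention and an assumption here, not taken from the figure.

```python
import numpy as np

def dot_product_attention(q, K, V):
    """q: query vector, shape (d,); K: keys, shape (n, d); V: values, shape (n, d_v).
    Returns the attention output (weighted sum of values) and the coefficients."""
    s = K @ q / np.sqrt(q.shape[0])   # scaled dot-product scores s_i
    alpha = np.exp(s - s.max())       # softmax (sigma_3), numerically stabilized
    alpha /= alpha.sum()              # coefficients in [0, 1], summing to 1
    return alpha @ V, alpha           # weighted sum of values v_i

rng = np.random.default_rng(1)
K, V, q = rng.normal(size=(5, 4)), rng.normal(size=(5, 3)), rng.normal(size=4)
out, alpha = dot_product_attention(q, K, V)
```

Multi-head attention, as in Fig. 4b, repeats this computation on several linear projections of the queries, keys, and values and concatenates the results.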
2.1.3 Loss Functions for Segmentation Tasks
This section summarizes some of the most widely used loss functions for medical image segmentation (Fig. 6) and describes their
usage in different scenarios. A complementary reading material for
an extensive list of loss functions can be found in [16, 17]. In the
following, the predicted probability by the segmentation model
and the ground truth at the ith pixel/voxel are denoted as pi and
gi, respectively. N is the number of voxels in the image.
Fig. 5 The architecture of TransUNet. The transformer layer represented by the yellow box shows the application of multi-head attention (MSA). MLP represents the multilayer perceptron. In general, the feature vectors in the bottleneck of the U-Net are set as the input to the stack of n transformer layers. As these layers do not change the dimension of the features, they are easy to implement and do not affect other parts of the U-Net model. Replicated from [15] (CC BY 4.0)
Fig. 6 Loss functions for medical image segmentation. WCE: weighted cross-entropy loss. DPCE: distance
map penalized cross-entropy loss. ELL: exponential logarithmic loss. SS: sensitivity-specificity loss. GD:
generalized Dice loss. pGD: penalty loss. Asym: asymmetric similarity loss. IoU: intersection over union
loss. HD: Hausdorff distance loss. ©2021 Elsevier. Reprinted, with permission, from [16]
Here, $w_{y_k}$ is the coefficient for the $k$th class. Suppose there are 5 positive samples and 12 negative samples in a binary classification training set. By setting $w_0 = 1$ and $w_1 = 2$, the loss would be as if there were ten positive samples.
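The worked example can be checked numerically with a sum-reduced binary weighted cross-entropy. This is an illustrative sketch of the weighting idea, not the chapter's exact formulation; the function name is ours.

```python
import numpy as np

def weighted_cross_entropy(p, g, w_pos=1.0, w_neg=1.0, eps=1e-7):
    """Binary weighted cross-entropy with sum reduction.
    p: predicted probabilities; g: binary ground truth (same shape)."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.sum(w_pos * g * np.log(p) + w_neg * (1 - g) * np.log(1 - p))

# 5 positives weighted by 2 contribute like 10 positives weighted by 1
p5,  g5  = np.full(17, 0.5), np.array([1] * 5  + [0] * 12, dtype=float)
p10, g10 = np.full(22, 0.5), np.array([1] * 10 + [0] * 12, dtype=float)
assert np.isclose(weighted_cross_entropy(p5, g5, w_pos=2.0),
                  weighted_cross_entropy(p10, g10))
```

With a constant prediction of 0.5, each weighted positive contributes $w_1 \log 2$ to the loss, so 5 samples at weight 2 match 10 samples at weight 1, as stated above.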
$$ \mathcal{L}_{\text{GDL}} = 1 - 2\,\frac{\sum_{l=1}^{2} w_l \sum_{i}^{N} p_i g_i}{\sum_{l=1}^{2} w_l \sum_{i}^{N} \left( p_i + g_i \right)} \qquad (12) $$
Here $w_l = \frac{1}{\left( \sum_{i}^{N} g_i^{l} \right)^2}$ is used to provide invariance to different region sizes, i.e., the contribution of each region is corrected by the inverse of its volume.
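As a concrete illustration, Eq. 12 can be implemented in a few lines of NumPy. This is a sketch with our own naming, not a reference implementation; classes are stored along the first axis and pixels along the second.

```python
import numpy as np

def generalized_dice_loss(p, g, eps=1e-7):
    """Generalized Dice loss (Eq. 12).
    p: predicted per-class probabilities, shape (L, N)
    g: one-hot ground truth, shape (L, N)"""
    w = 1.0 / (g.sum(axis=1) ** 2 + eps)        # inverse-volume class weights w_l
    num = (w * (p * g).sum(axis=1)).sum()       # weighted overlap
    den = (w * (p + g).sum(axis=1)).sum()       # weighted total mass
    return 1.0 - 2.0 * num / den

# a perfect one-hot prediction gives (near-)zero loss
g = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
loss = generalized_dice_loss(g, g)
```

Because the weights are the inverse squared class volumes, a small structure (e.g., a lesion) influences the loss as much as the large background class.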
Finally, note that other loss functions have also been proposed
to introduce prior knowledge about size, topology, or shape, for
instance [20].
2.1.4 Early Stopping
Given a loss function, a simple strategy for training is to stop the
training process once a predetermined maximum number of itera-
tions are reached. However, too few iterations would lead to an
under-fitting problem, while over-fitting may occur with too many
iterations. “Early stopping” is a potential method to avoid such
issues. The training set is split into training and validation sets when
using the early stopping condition. The early stopping condition is
based on the performance on the validation set. For example, if the
validation performance (e.g., average Dice score) does not increase
for a number of iterations, the early stopping condition is triggered.
In this situation, the best model with the highest performance on
the validation set is saved and used for inference. Of course, one should not report the validation performance as the final evaluation of the model. Instead, one should use a separate test set, kept unseen during training, for an unbiased evaluation.
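The early-stopping loop described above can be sketched as follows. This is a minimal illustration: `train_step` and `validate` are placeholders for one optimization step and a validation evaluation (e.g., the average Dice score), not any particular framework's API.

```python
def train_with_early_stopping(train_step, validate, max_iters=1000, patience=20):
    """Run training until the validation score stops improving.

    Returns the iteration index and score of the best model; in a real
    pipeline, a checkpoint would be saved whenever the score improves."""
    best_score, best_iter, wait = float("-inf"), -1, 0
    for it in range(max_iters):
        train_step(it)                 # one optimization step on the training set
        score = validate(it)           # e.g., average Dice on the validation set
        if score > best_score:
            best_score, best_iter, wait = score, it, 0   # save checkpoint here
        else:
            wait += 1
            if wait >= patience:       # no improvement for `patience` iterations
                break                  # early stopping triggered
    return best_iter, best_score

# toy run: validation score peaks at iteration 10, then degrades (over-fitting)
best_iter, best_score = train_with_early_stopping(
    train_step=lambda it: None,
    validate=lambda it: 1.0 - abs(it - 10) * 0.01,
    max_iters=100, patience=5)
```

In the toy run, training stops a few iterations after the peak, and the model from iteration 10 would be the one kept for inference.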
2.1.5 Evaluation Metrics for Segmentation Tasks
Various metrics can quantitatively evaluate different aspects of a segmentation algorithm. In a binary segmentation task, a true
positive (TP) indicates that a pixel in the target object is correctly
predicted as target. Similarly, a true negative (TN) represents a
background pixel that is correctly identified as background. On
the other hand, a false positive (FP) and a false negative
(FN) refer to a wrong prediction for pixels in the target and
background, respectively. Most of the evaluation metrics are based
upon the number of pixels in these four categories.
Sensitivity measures the completeness of positive predictions
with regard to the positive ground truth (TP + FN). It thus shows
the model’s ability to identify target pixels. It is also referred to as
recall or true-positive rate (TPR). It is defined as
$$ \text{Sensitivity} = \frac{TP}{TP + FN} \qquad (14) $$
As the negative counterpart of sensitivity, specificity describes
the proportion of negative pixels that are correctly predicted. It is
also referred to as true-negative rate (TNR). It is defined as
$$ \text{Specificity} = \frac{TN}{TN + FP} \qquad (15) $$
Specificity can be difficult to interpret because TN is usually very
large. It can even be misleading as TN can be made arbitrarily large
by changing the field of view. This is due to the fact that the metric
is computed over pixels and not over patients/controls like in
classification tasks (the number of controls is fixed). In order to
provide meaningful measures of specificity, it is preferable to define
a background region that has an anatomical definition (for instance,
the brain mask from which the target is subtracted) and does not
include the full field of view of the image.
Positive predictive value (PPV), also known as precision, mea-
sures the correct rate among pixels that are predicted as positives:
$$ \text{PPV} = \frac{TP}{TP + FP} \qquad (16) $$
For clinical interpretation of segmentation, it is often useful to have a more direct estimation of false positives. To that purpose, one can report the false discovery rate:
$$ \text{FDR} = 1 - \text{PPV} = \frac{FP}{TP + FP} \qquad (17) $$
which is redundant with PPV but may be more intuitive for clin-
icians in the context of segmentation.
Dice similarity coefficient (DSC) measures the proportion of
spatial overlap between the ground truth (TP+FN) and the pre-
dicted positives (TP+FP). Dice similarity is the same as the F1 score,
which computes the harmonic mean of sensitivity and PPV:
$$ \text{DSC} = \frac{2\,TP}{2\,TP + FN + FP} \qquad (18) $$
Accuracy is the ratio of correct predictions:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (19) $$
As was the case in specificity, we note that there are many segmen-
tation tasks where the target anatomical structure is very small (e.g.,
subcortical structures); hence, the foreground and background
have unbalanced number of pixels. In this case, accuracy can be
misleading and display high values for poor segmentations. More-
over, as for the case of specificity, one needs to define a background
region in order for TN, and thus accuracy, not to vary arbitrarily
with the field of view.
The Jaccard index (JI), also known as the intersection over
union (IoU), measures the percentage of overlap between the
ground truth and positive prediction relative to the union of
the two:
$$ \text{JI} = \frac{TP}{TP + FP + FN} \qquad (20) $$
JI is closely related to the DSC. However, it is always lower than the DSC and tends to penalize poor segmentations more severely.
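As an illustration, the count-based metrics of Eqs. 14-20 can all be computed from a pair of binary masks. This is an illustrative sketch (the function name is ours, and the denominators are assumed nonzero); as noted above, specificity and accuracy only make sense once a background region has been fixed.

```python
import numpy as np

def overlap_metrics(pred, gt):
    """Count-based segmentation metrics for two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # target pixels correctly predicted
    fp = np.sum(pred & ~gt)     # background pixels predicted as target
    fn = np.sum(~pred & gt)     # target pixels missed
    tn = np.sum(~pred & ~gt)    # background pixels correctly predicted
    return {
        "sensitivity": tp / (tp + fn),                    # Eq. 14
        "specificity": tn / (tn + fp),                    # Eq. 15
        "ppv":         tp / (tp + fp),                    # Eq. 16
        "dsc":         2 * tp / (2 * tp + fn + fp),       # Eq. 18
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),   # Eq. 19
        "jaccard":     tp / (tp + fp + fn),               # Eq. 20
    }

m = overlap_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
```

On this toy example (one TP, FP, FN, and TN each), the DSC is 0.5 while the Jaccard index is 1/3, illustrating that JI is always the lower of the two.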
There are also distance measures of segmentation accuracy
which are especially relevant when the accuracy of the boundary is
critical. These include the average symmetric surface distance
(ASSD) and the Hausdorff distance (HD). Suppose the surfaces of the ground truth and the predicted segmentation are $S$ and $S'$, respectively. For any point $p \in S$, the distance from $p$ to the surface $S'$ is defined by the minimum Euclidean distance:

$$ d(p, S') = \min_{p' \in S'} \left\| p - p' \right\|_2 \qquad (21) $$
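Given point sets sampled from the two surfaces, Eq. 21 and the derived ASSD and HD can be sketched with brute-force pairwise distances. This is an illustration only; production implementations typically use spatial data structures such as k-d trees rather than the full distance matrix.

```python
import numpy as np

def surface_distances(S, Sp):
    """S, Sp: arrays of surface points, shapes (m, d) and (n, d).
    Returns (ASSD, HD) based on d(p, S') of Eq. 21."""
    # full pairwise Euclidean distance matrix between the two point sets
    D = np.linalg.norm(S[:, None, :] - Sp[None, :, :], axis=-1)
    d_S_to_Sp = D.min(axis=1)   # d(p, S') for every p in S
    d_Sp_to_S = D.min(axis=0)   # d(p', S) for every p' in S'
    assd = 0.5 * (d_S_to_Sp.mean() + d_Sp_to_S.mean())  # average symmetric surface distance
    hd = max(d_S_to_Sp.max(), d_Sp_to_S.max())          # Hausdorff distance
    return assd, hd

# two parallel segments one unit apart: both distances equal 1
S  = np.array([[0.0, 0.0], [1.0, 0.0]])
Sp = np.array([[0.0, 1.0], [1.0, 1.0]])
assd, hd = surface_distances(S, Sp)
```

The HD reports the single worst boundary error, whereas the ASSD averages errors over both surfaces, which is why HD is the more sensitive of the two to outlier points.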
2.1.6 Pre-processing for Segmentation Tasks
Image pre-processing is a set of sequential steps taken to improve the data and prepare it for subsequent analysis. Appropriate image
pre-processing steps often significantly improve the quality of fea-
ture extraction and the downstream image analysis. For deep
learning methods, they can also help the training process converge
faster and achieve better model performance. The following sec-
tions will discuss some of the most widely used image
pre-processing techniques.
$$ I_{ws}(x) = \frac{I(x) - \mu}{\sigma_\tau} \qquad (28) $$
Compared to the whole-brain normalization, the white stripe
normalization may work better and have better interpretation,
especially for applications where intensity outliers such as lesions
are expected.
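Assuming the stripe voxels have already been identified, Eq. 28 reduces to a rescaling by the stripe statistics. This is a sketch with our own naming; locating the normal-appearing white matter intensity peak that defines the stripe is outside this snippet.

```python
import numpy as np

def white_stripe_normalize(img, stripe_mask):
    """Apply Eq. 28: I_ws(x) = (I(x) - mu) / sigma_tau, where mu and sigma_tau
    are the mean and standard deviation of intensities inside the 'white
    stripe' (a narrow band around the white matter peak), here given as a
    precomputed boolean mask."""
    stripe = img[stripe_mask]
    mu, sigma_tau = stripe.mean(), stripe.std()
    return (img - mu) / sigma_tau

# toy example: normalize a synthetic image using an arbitrary stripe mask
rng = np.random.default_rng(0)
img = rng.normal(5.0, 2.0, size=(10, 10))
stripe_mask = img > 5.0
out = white_stripe_normalize(img, stripe_mask)
```

After normalization, the stripe voxels have zero mean and unit variance, while lesion intensities, which lie outside the stripe, are left free to remain outliers.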
2.3 Supervised Methods
2.3.1 Background
In supervised learning, a model is presented with a given dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ of inputs $x$ and associated labels $y$. This $y$ can take several forms, depending on the learning task. In particular, for fully convolutional neural network-based segmentation applications, $y$ is a segmentation map. In supervised learning, the model can learn from labeled training data by minimizing the loss function
Fig. 7 Overview of the supervision settings for medical image segmentation. Best viewed in color
Fig. 8 Examples of single-path (top) and multipath (bottom) networks. In the multipath network, the inputs for
the two pathways are centered at the same location. The top pathway is equivalent to the single-path network
and takes the normal resolution image as input, while the bottom pathway takes a downsampled image with
larger field of view as input. Replicated from [47] (CC BY 4.0)
contrast, local features are extracted in the local path. The global
path thus extracts global features and tends to locate the position of
the target structure. In contrast, the shape, size, texture, boundary,
and other details of the target structure are identified by the local
path. However, the performance of this type of network is easily
affected by the size and design of input patches: for example, too
small patches would not provide enough information, while too
large patches would be computationally prohibitive.
2.3.4 Framework Configuration
A single network mainly focuses on a single task during training and may ignore other potentially useful information. To improve the segmentation accuracy, frameworks with multiple encoders and decoders have been proposed [53, 81, 82].
Fig. 9 Example of multi-task framework. The model takes four 3D MRI sequences (T1w, T1c, T2w, and FLAIR)
as input. The U-Net structure (the top pathway with skip connection) serves as the segmentation network, and
the output contains the segmentation maps of the three subregions (whole tumor (WT), tumor core (TC), and
enhancing tumor (ET)). An auxiliary VAE branch (the bottom decoder) that reconstructs the input images is
applied in the training stage to regularize the shared encoder. ©2019 Springer Nature. Reprinted, with
permission, from [57]
2.3.5 Multiple Modalities and Timepoints
Many neuroimaging studies contain multiple modalities or multiple timepoints per subject. This additional information is clearly valuable and can be leveraged to improve segmentation performance.
Fig. 10 Example of cascaded networks. WNet segments the whole tumor from the input multimodal 3D MRI.
Then based upon the segmentation, a bounding box (yellow dash line) can be obtained and used to crop the
input. The TNet takes the cropped image to segment the tumor core. Similarly, the ENet segments the
enhancing tumor core by taking the cropped images determined by the segmentation from the previous stage.
©2018 Springer Nature. Reprinted, with permission, from [50]
Fig. 11 SLANT-27: An example of ensemble networks. The whole brain is split into 27 overlapping subspaces
with regard to their spatial locations (yellow cube). For each location, there is an independent 3D fully
convolutional network (FCN) for segmentation (blue cube). The ensemble is achieved by label fusion on
overlapping locations. ©2019 Elsevier. Reprinted, with permission, from [82]
2.4.2 Overview of Semi-supervised Techniques
In the semi-supervised learning literature, most of the techniques
are originally designed and validated in the context of classification
tasks. However, these methods can be readily adapted to segmentation
tasks, since a segmentation task can be viewed as pixel-wise
classification. In this chapter, we mainly categorize the SSL
approaches into three techniques, namely, (1) consistency
regularization, (2) entropy minimization, and (3) self-training.
416 Han Liu et al.
Table 1
Summary of classic semi-supervised learning methods

Method                     Consistency regularization   Entropy minimization   Self-training
Pseudo-label [94]          No                           Yes                    Yes
Π model [95]               Yes                          No                     Yes
Temporal ensembling [95]   Yes                          No                     Yes
Mean teacher [96]          Yes                          No                     No
UDA [97]                   Yes                          Yes                    No
MixMatch [98]              Yes                          Yes                    No
FixMatch [99]              Yes                          Yes                    No
Fig. 12 An illustration of the MTANS framework. The blue solid lines indicate the path of unlabeled data, while
the labeled data follows the black lines. The two segmentation models provide the segmentation map and the
signed distance map (SDM). The discriminator is applied to check the consistency of the outputs from the
teacher and student models. The parameters of the teacher model are updated according to the student model
using the exponential moving average (EMA). ©2021 Elsevier. Reprinted, with permission, from [101]
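The teacher update used by mean-teacher-style methods such as MTANS is a simple exponential moving average of the student's parameters. A minimal sketch (the decay value is illustrative, not taken from [101]):

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Exponential-moving-average update of the teacher parameters
    from the student parameters (mean-teacher scheme)."""
    return {k: decay * teacher[k] + (1 - decay) * student[k] for k in teacher}

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, decay=0.9)
# teacher["w"] is now 0.9*0 + 0.1*1 = [0.1, 0.1, 0.1]
```

The teacher therefore changes slowly and acts as a temporally smoothed version of the student, which is what makes its outputs usable as consistency targets.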
2.4.5 Self-training
Self-training is an iterative training process in which the network uses
the high-confidence pseudo-labels of the unlabeled data from previous
training steps. Interestingly, it has been shown that self-training is
equivalent to a version of the classification EM algorithm [105]. The
ideas of self-training and consistency regularization are very similar.
Here, we differentiate these two concepts as follows: for consistency
regularization, the supervision signals of the unlabeled data are
generated online, i.e., from the current training epoch; in contrast,
for self-training, the pseudo-labels of unlabeled data are generated
offline, i.e., from the previous training epoch(s). Typically, in
self-training, the pseudo-labels produced in previous epochs need to be
carefully processed before being used as supervision, as they are
crucial to the effectiveness of self-training methods. In the SSL
literature, pseudo-label [94] is a representative method that uses
self-training. In pseudo-label, the network is first trained on the
labeled data only. Then the pseudo-labels of the unlabeled data are
obtained by feeding them to the trained model. Next, the top K
predictions on the unlabeled data are used as the pseudo-labels for the
next epoch. The training objective function of pseudo-label is as follows:

  ℒ_PL = Σ_{(x_l, y_l) ∈ D_L} ℒ_S(x_l, y_l) + α(t) Σ_{x_u ∈ D_U} ℒ_S(x_u, ỹ_u)    (32)

where ℒ_S is the supervised loss, D_L and D_U are the labeled and
unlabeled sets, ỹ_u is the pseudo-label of x_u, and α(t) is a
time-dependent weight.
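The two ingredients of Eq. 32 can be sketched directly: a ramp-up schedule for the weight α(t), and selection of the top-K most confident predictions as pseudo-labels. The schedule constants below (t1, t2, and the final weight) are illustrative choices in the spirit of [94], not the exact published values:

```python
import numpy as np

def alpha(t, t1=10, t2=60, a_f=0.3):
    """Ramp-up schedule for the pseudo-label weight: zero early in training,
    then linearly increasing to a final constant value a_f."""
    if t < t1:
        return 0.0
    if t < t2:
        return a_f * (t - t1) / (t2 - t1)
    return a_f

def select_pseudo_labels(probs, k):
    """Keep the k most confident unlabeled predictions as pseudo-labels."""
    conf = probs.max(axis=1)           # confidence per unlabeled sample
    idx = np.argsort(-conf)[:k]        # indices of the top-k confident samples
    return idx, probs[idx].argmax(axis=1)

# Three unlabeled samples, two classes; samples 0 and 2 are most confident.
probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
idx, labels = select_pseudo_labels(probs, k=2)
# idx → [0, 2], pseudo-labels → [0, 1]
```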
Fig. 13 The workflow of the 4S framework. Based on the assumption that consecutive images are strongly
correlated, the manual annotations (true labels) are provided for the first few slices. These labeled data are
used for the initial training. Then the model can provide the pseudo-labels for the next few slices which can be
applied for retraining. Adapted from [106] (CC BY 4.0)
2.5 Unsupervised Methods

2.5.1 Background
As suggested in Subheadings 2.3 and 2.4, most deep segmentation
models learn to map the input image x to the manually annotated
ground truth y. Although semi-supervised approaches can drastically
reduce the need for labels, low availability of ground truth is
still a primary concern for the development of learning-based models.
Another disadvantage of supervised learning approaches
becomes evident when considering the anomaly detection/
segmentation task: a model can only recognize anomalies that are
similar to those in the training dataset and will likely fail with rare
findings that may not appear in the training data [109].
Unsupervised anomaly detection (UAD) methods have been
developed in recent years to tackle these problems. Since no ground
truth labels are provided, the models are designed to capture the
inherent discrepancy between healthy and pathological data distri-
butions. The general idea is to represent the distribution of normal
brain anatomy by a deep model that is trained exclusively on healthy
subjects [109]. Consequently, the pathological subjects are out of
Fig. 14 The general idea of unsupervised anomaly detection (UAD) realized by an auto-encoder. (a) Train the
model with only healthy subjects. (b) Test with pathological samples. The residual image depicts the
anomalies. ©2021 Elsevier. Reprinted, with permission, from [109]
2.5.2 Auto-encoders
The auto-encoder (AE) (Fig. 15a) is the simplest encoder-decoder
structure. Let f_θ be an encoder and g_ϕ a decoder, where θ, ϕ are model
parameters. Given a healthy input image X_h ∈ ℝ^{D×H×W}, the
encoder learns to project it to a lower-dimensional latent space,
z = f_θ(X_h), z ∈ ℝ^L. Then the decoder recovers the original image
from the latent vector as X̂_h = g_ϕ(z). The model is trained by

  argmin_{θ,ϕ} ℒ(X_h, X̂_h) = ‖X_h − X̂_h‖_n    (33)
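A linear auto-encoder trained with the reconstruction loss in Eq. 33 recovers the principal subspace of the healthy training data, so the residual-based anomaly score can be illustrated with a PCA-style sketch. The data here are synthetic, and a real UAD model would of course use a deep nonlinear AE; this only demonstrates the train-on-healthy / score-by-residual idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Healthy" training data: samples lying near a 2-D subspace of R^5.
W = rng.normal(size=(5, 2))
healthy = rng.normal(size=(200, 2)) @ W.T + 0.01 * rng.normal(size=(200, 5))

# A linear AE with MSE loss learns the PCA subspace, so we can fit the
# encoder/decoder pair directly via SVD instead of gradient descent.
mean = healthy.mean(axis=0)
_, _, Vt = np.linalg.svd(healthy - mean, full_matrices=False)

def encode(X):                      # f_theta: project to 2-D latent space
    return (X - mean) @ Vt[:2].T

def decode(Z):                      # g_phi: reconstruct from latent vector
    return Z @ Vt[:2] + mean

def residual(X):                    # anomaly score = reconstruction error
    return np.abs(X - decode(encode(X))).sum()

# A healthy-like sample reconstructs well; an off-subspace "anomaly" does not.
x_healthy = rng.normal(size=(1, 2)) @ W.T
x_anom = x_healthy + 3.0 * rng.normal(size=(1, 5))
```

The pixelwise residual |X − X̂| plays the role of the residual image in Fig. 14, which is then thresholded to segment the anomaly.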
Fig. 15 Variations of auto-encoder. (a) The auto-encoder. (b) The variational auto-encoder. (c) The adversarial
auto-encoder includes a discriminator that provides constraint on the distribution of the latent vector z. (d)
Anomaly detection VAEGAN introduces a discriminator to check whether the reconstructed image lies in the
same distribution as the healthy image. ©2021 Elsevier. Reprinted, with permission, from [109]
2.5.3 Variational Auto-encoders
In some applications, instead of relying on the lack of generalizability
of the model, we want to modify the latent vector z to further
guarantee that the reconstructed testing image X̂_a looks closer to a
healthy image.
2.5.4 Variational Auto-encoders with Generative Adversarial Networks
A generative adversarial network (GAN) consists of two modules, a
generator G and a discriminator D. Similar to the decoder in a VAE,
the generator G models the mapping from a latent vector to the
image space, z ↦ X, where z ~ N(0, I). The discriminator D can be
viewed as a trainable loss function that judges whether the generated
image G(z) lies in the image space X. Combining the GAN discriminator
with the VAE backbone has become a common idea in UAD
problems. More details on GANs can be found in Chap. 5.
We note that D can be used as an additional loss in either the latent
or the image space. In the adversarial auto-encoder (AAE) discussed
above, the discriminator checks whether the latent vector is
drawn from the multivariate normal distribution. In contrast, Baur
et al. [117] propose the AnoVAEGAN model (Fig. 15d), in which
the discriminator is applied in the image space to check whether the
reconstructed image lies in the distribution of healthy data.
3.1 Popular Segmentation Challenges
Medical image segmentation challenges aim to find better solutions
to specific tasks, and they also provide researchers with benchmarks
and baseline methods for future development. Furthermore, these
developments are driven by the need to address clinical problems.
3.2 Brain Tumor Segmentation Challenge
The brain tumor segmentation (BraTS) challenge is an annual challenge
held since 2012 [104, 120–123]. The participants are provided
with a comprehensive dataset that includes annotated, multisite,
and multi-parametric MR images. It is worth noting that the dataset
has grown from 30 cases in 2012 to 2000 cases in 2021 [123].
Brain tumor segmentation is a difficult task for a variety of
reasons [124], including morphological and location uncertainty
of the tumor, class imbalance between foreground and background,
low contrast of MR images, and annotation bias. BraTS focuses
on the segmentation of the enhancing tumor (ET), tumor core (TC),
and whole tumor (WT). The Dice score, 95% Hausdorff distance,
sensitivity, and specificity are used as evaluation metrics.
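Of these metrics, the Dice score is the most widely reported; for two binary masks it is twice the overlap divided by the total foreground. A minimal sketch:

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    # Convention: two empty masks are a perfect match.
    return 2.0 * inter / denom if denom else 1.0

p = np.array([[1, 1], [0, 0]])
g = np.array([[1, 0], [0, 0]])
# dice(p, g) = 2*1 / (2+1) = 2/3
```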
BraTS 2021
There are two tasks in BraTS 2021, one of which is the segmentation
of brain tumor subregions (Task 1) [123].
Fig. 16 BraTS 2021 dataset. The images and ground truth labels of enhancing tumor, tumor core, and whole
tumor are shown in the panels A (T1w with gadolinium injection), B (T2w), and C (T2-FLAIR), respectively.
Panel D shows the combined segmentations to generate the final tumor subregion labels. Replicated from
[123] (CC BY 4.0)
4 Conclusion
References
1. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, pp 234–241
2. Milletari F, Navab N, Ahmadi SA (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV). IEEE, Piscataway, pp 565–571
3. Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, et al (2018) Attention U-net: learning where to look for the pancreas. Preprint. arXiv:180403999
4. Schlemper J, Oktay O, Schaap M, Heinrich M, Kainz B, Glocker B, Rueckert D (2019) Attention gated networks: learning to leverage salient regions in medical images. Med Image Anal 53:197–207
5. Isensee F, Jäger PF, Kohl SA, Petersen J, Maier-Hein KH (2019) Automated design of deep learning methods for biomedical image segmentation. Preprint. arXiv:190408128
6. Isensee F, Jaeger PF, Kohl SA, Petersen J, Maier-Hein KH (2021) nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 18(2):203–211
7. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, pp 424–432
8. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
9. Zhang J, Jiang Z, Dong J, Hou Y, Liu B (2020) Attention gate resU-Net for automatic MRI brain tumor segmentation. IEEE Access 8:58533–58545
T1-weighted MRI. Magn Reson Imaging 63:70–79
32. Li H, Zhu Q, Hu D, Gunnala MR, Johnson H, Sherbini O, Gavazzi F, D'Aiello R, Vanderver A, Long JD, et al (2022) Human brain extraction with deep learning. In: Medical Imaging 2022: Image Processing, vol 12032. SPIE, Bellingham, pp 369–375
33. Kalavathi P, Prasath VS (2016) Methods on skull stripping of MRI head scan images—a review. J Digit Imaging 29(3):365–379
34. Juntu J, Sijbers J, Van Dyck D, Gielen J (2005) Bias field correction for MRI images. In: Computer Recognition Systems. Springer, Berlin, pp 543–551
35. Tustison NJ, Avants BB, Cook PA, Zheng Y, Egan A, Yushkevich PA, Gee JC (2010) N4ITK: improved N3 bias correction. IEEE Trans Med Imaging 29(6):1310–1320
36. Fortin JP, Parker D, Tunç B, Watanabe T, Elliott MA, Ruparel K, Roalf DR, Satterthwaite TD, Gur RC, Gur RE, et al (2017) Harmonization of multi-site diffusion tensor imaging data. NeuroImage 161:149–170
37. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2223–2232
38. Reinhold JC, Dewey BE, Carass A, Prince JL (2019) Evaluating the impact of intensity normalization on MR image synthesis. In: Medical Imaging 2019: Image Processing, vol 10949. SPIE, Bellingham, pp 890–898
39. Shinohara RT, Sweeney EM, Goldsmith J, Shiee N, Mateen FJ, Calabresi PA, Jarso S, Pham DL, Reich DS, Crainiceanu CM, et al (2014) Statistical normalization techniques for magnetic resonance imaging. NeuroImage Clin 6:9–19
40. Brett M, Johnsrude IS, Owen AM (2002) The problem of functional localization in the human brain. Nat Rev Neurosci 3(3):243–249
41. Shin H, Kim H, Kim S, Jun Y, Eo T, Hwang D (2022) COSMOS: cross-modality unsupervised domain adaptation for 3D medical image segmentation based on target-aware domain translation and iterative self-training. Preprint. arXiv:220316557
42. Dong H, Yu F, Zhao J, Dong B, Zhang L (2021) Unsupervised domain adaptation in semantic segmentation based on pixel alignment and self-training. Preprint. arXiv:210914219
43. Liu H, Fan Y, Cui C, Su D, McNeil A, Dawant BM (2022) Unsupervised domain adaptation for vestibular schwannoma and cochlea segmentation via semi-supervised learning and label fusion. Preprint. arXiv:220110647
44. Isensee F, Petersen J, Kohl SA, Jäger PF, Maier-Hein KH (2019) nnU-Net: breaking the spell on successful medical image segmentation. Preprint 1:1–8. arXiv:190408128
45. Birenbaum A, Greenspan H (2016) Longitudinal multiple sclerosis lesion segmentation using multi-view convolutional neural networks. In: Deep Learning and Data Labeling for Medical Applications. Springer, Berlin, pp 58–67
46. Zhang H, Valcarcel AM, Bakshi R, Chu R, Bagnato F, Shinohara RT, Hett K, Oguz I (2019) Multiple sclerosis lesion segmentation with tiramisu and 2.5D stacked slices. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, pp 338–346
47. Kamnitsas K, Ledig C, Newcombe VF, Simpson JP, Kane AD, Menon DK, Rueckert D, Glocker B (2017) Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med Image Anal 36:61–78
48. Dolz J, Desrosiers C, Ayed IB (2018) 3D fully convolutional networks for subcortical segmentation in MRI: a large-scale study. NeuroImage 170:456–470
49. Li H, Zhang H, Hu D, Johnson H, Long JD, Paulsen JS, Oguz I (2020) Generalizing MRI subcortical segmentation to neurodegeneration. In: Machine Learning in Clinical Neuroimaging and Radiogenomics in Neuro-oncology. Springer, Berlin, pp 139–147
50. Wang G, Li W, Ourselin S, Vercauteren T (2017) Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks. In: International MICCAI Brainlesion Workshop. Springer, Berlin, pp 178–190
51. Li H, Hu D, Zhu Q, Larson KE, Zhang H, Oguz I (2021) Unsupervised cross-modality domain adaptation for segmenting vestibular schwannoma and cochlea with data augmentation and model ensemble. Preprint. arXiv:210912169
52. Zhang L, Wang X, Yang D, Sanford T, Harmon S, Turkbey B, Wood BJ, Roth H, Myronenko A, Xu D, et al (2020) Generalizing deep learning for medical image segmentation to unseen domains via deep stacked transformation. IEEE Trans Med Imaging 39(7):2531–2540
53. Li H, Zhang H, Johnson H, Long JD, Paulsen JS, Oguz I (2021) MRI subcortical segmentation in neurodegeneration with cascaded 3D CNNs. In: Medical Imaging 2021: Image Processing, International Society for Optics and Photonics, vol 11596, p 115960W
54. Dong H, Yang G, Liu F, Mo Y, Guo Y (2017) Automatic brain tumor detection and segmentation using U-Net based fully convolutional networks. In: Annual Conference on Medical Image Understanding and Analysis. Springer, Berlin, pp 506–517
55. Beers A, Chang K, Brown J, Sartor E, Mammen C, Gerstner E, Rosen B, Kalpathy-Cramer J (2017) Sequential 3D U-nets for biologically-informed brain tumor segmentation. Preprint. arXiv:170902967
56. Zhang H, Li H, Oguz I (2021) Segmentation of new MS lesions with tiramisu and 2.5D stacked slices. In: MSSEG-2 Challenge Proceedings: Multiple Sclerosis New Lesions Segmentation Challenge Using a Data Management and Processing Infrastructure, p 61
57. Myronenko A (2018) 3D MRI brain tumor segmentation using autoencoder regularization. In: International MICCAI Brainlesion Workshop. Springer, Berlin, pp 311–320
58. Pérez-García F, Sparks R, Ourselin S (2021) TorchIO: a python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. Comput Methods Prog Biomed 208:106236
59. Kamnitsas K, Ferrante E, Parisot S, Ledig C, Nori AV, Criminisi A, Rueckert D, Glocker B (2016) DeepMedic for brain tumor segmentation. In: International Workshop on Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Springer, Berlin, pp 138–149
60. Havaei M, Davy A, Warde-Farley D, Biard A, Courville A, Bengio Y, Pal C, Jodoin PM, Larochelle H (2017) Brain tumor segmentation with deep neural networks. Med Image Anal 35:18–31
61. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3431–3440
62. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
63. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4700–4708
64. Isensee F, Kickingereder P, Wick W, Bendszus M, Maier-Hein KH (2017) Brain tumor segmentation and radiomics survival prediction: contribution to the BraTS 2017 challenge. In: International MICCAI Brainlesion Workshop. Springer, Berlin, pp 287–297
65. Chang PD (2016) Fully convolutional deep residual neural networks for brain tumor segmentation. In: International Workshop on Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Springer, Berlin, pp 108–118
66. Castillo LS, Daza LA, Rivera LC, Arbeláez P (2017) Volumetric multimodality neural network for brain tumor segmentation. In: 13th International Conference on Medical Information Processing and Analysis, vol 10572. International Society for Optics and Photonics, Bellingham, p 105720E
67. Jégou S, Drozdzal M, Vazquez D, Romero A, Bengio Y (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 11–19
68. Wang G, Shapey J, Li W, Dorent R, Demitriadis A, Bisdas S, Paddick I, Bradford R, Zhang S, Ourselin S, et al (2019) Automatic segmentation of vestibular schwannoma from T2-weighted MRI by deep spatial attention with hardness-weighted loss. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, pp 264–272
69. Zhang H, Zhang J, Zhang Q, Kim J, Zhang S, Gauthier SA, Spincemaille P, Nguyen TD, Sabuncu M, Wang Y (2019) RSANet: recurrent slice-wise attention network for multiple sclerosis lesion segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, pp 411–419
70. Hou B, Kang G, Xu X, Hu C (2019) Cross attention densely connected networks for multiple sclerosis lesion segmentation. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, Piscataway, pp 2356–2361
71. Islam M, Vibashan V, Jose VJM, Wijethilake N, Utkarsh U, Ren H (2019) Brain tumor segmentation and survival prediction using 3D attention UNet. In: International MICCAI Brainlesion Workshop. Springer, Berlin, pp 262–272
72. Zhou T, Ruan S, Guo Y, Canu S (2020) A multi-modality fusion network based on attention mechanism for brain tumor segmentation. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, Piscataway, pp 377–380
73. Sinha A, Dolz J (2020) Multi-scale self-guided attention for medical image segmentation. IEEE J Biomed Health Inform 25(1):121–130
74. Shamshad F, Khan S, Zamir SW, Khan MH, Hayat M, Khan FS, Fu H (2022) Transformers in medical imaging: a survey. Preprint. arXiv:220109873
75. Hatamizadeh A, Nath V, Tang Y, Yang D, Roth H, Xu D (2022) Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. Preprint. arXiv:220101266
76. Peiris H, Hayat M, Chen Z, Egan G, Harandi M (2021) A volumetric transformer for accurate 3D tumor segmentation. Preprint. arXiv:211113300
77. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M (2021) Swin-UNET: Unet-like pure transformer for medical image segmentation. Preprint. arXiv:210505537
78. Zhang Y, Liu H, Hu Q (2021) Transfuse: fusing transformers and CNNs for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, pp 14–24
79. Li H, Hu D, Liu H, Wang J, Oguz I (2022) CATS: complementary CNN and transformer encoders for segmentation. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). IEEE, Piscataway, pp 1–5
80. Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM (2021) Medical transformer: gated axial-attention for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, pp 36–46
81. McKinley R, Wepfer R, Aschwanden F, Grunder L, Muri R, Rummel C, Verma R, Weisstanner C, Reyes M, Salmen A, et al (2019) Simultaneous lesion and neuroanatomy segmentation in multiple sclerosis using deep neural networks. Preprint. arXiv:190107419
82. Huo Y, Xu Z, Xiong Y, Aboud K, Parvathaneni P, Bao S, Bermudez C, Resnick SM, Cutting LE, Landman BA (2019) 3D whole brain segmentation using spatially localized atlas network tiles. NeuroImage 194:105–119
83. Kamnitsas K, Bai W, Ferrante E, McDonagh S, Sinclair M, Pawlowski N, Rajchl M, Lee M, Kainz B, Rueckert D, et al (2017) Ensembles of multiple models and architectures for robust brain tumour segmentation. In: International MICCAI Brainlesion Workshop. Springer, Berlin, pp 450–462
84. Kao PY, Ngo T, Zhang A, Chen JW, Manjunath B (2018) Brain tumor segmentation and tractographic feature extraction from structural MR images for overall survival prediction. In: International MICCAI Brainlesion Workshop. Springer, Berlin, pp 128–141
85. Zhao X, Wu Y, Song G, Li Z, Zhang Y, Fan Y (2017) 3D brain tumor segmentation through integrating multiple 2D FCNNs. In: International MICCAI Brainlesion Workshop. Springer, Berlin, pp 191–203
86. Zhang D, Huang G, Zhang Q, Han J, Han J, Yu Y (2021) Cross-modality deep feature learning for brain tumor segmentation. Pattern Recogn 110:107562
87. Havaei M, Guizard N, Chapados N, Bengio Y (2016) HeMIS: hetero-modal image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, pp 469–477
88. Liu H, Fan Y, Li H, Wang J, Hu D, Cui C, Lee HH, Oguz I (2022) ModDrop++: a dynamic filter network with intra-subject co-training for multiple sclerosis lesion segmentation with missing modalities. Preprint. arXiv:220304959
89. Wang Y, Zhang Y, Liu Y, Lin Z, Tian J, Zhong C, Shi Z, Fan J, He Z (2021) ACN: adversarial co-training network for brain tumor segmentation with missing modalities. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, pp 410–420
90. Azad R, Khosravi N, Merhof D (2022) SMU-Net: style matching U-Net for brain tumor segmentation with missing modalities. Preprint. arXiv:220402961
91. Gao Y, Phillips JM, Zheng Y, Min R, Fletcher PT, Gerig G (2018) Fully convolutional structured LSTM networks for joint 4D medical image segmentation. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, Piscataway, pp 1104–1108
92. Li H, Zhang H, Johnson H, Long JD, Paulsen JS, Oguz I (2021) Longitudinal subcortical segmentation with deep learning. In: Medical Imaging 2021: Image Processing, International Society for Optics and Photonics, vol 11596, p 115960D
93. Van Engelen JE, Hoos HH (2020) A survey on semi-supervised learning. Mach Learn 109(2):373–440
94. Lee DH, et al (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on Challenges in Representation Learning, ICML, vol 3, p 896
95. Laine S, Aila T (2016) Temporal ensembling for semi-supervised learning. Preprint. arXiv:161002242
96. Tarvainen A, Valpola H (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. Preprint. arXiv:170301780
97. Xie Q, Dai Z, Hovy E, Luong MT, Le QV (2019) Unsupervised data augmentation for consistency training. Preprint. arXiv:190412848
98. Berthelot D, Carlini N, Goodfellow I, Papernot N, Oliver A, Raffel C (2019) MixMatch: a holistic approach to semi-supervised learning. Preprint. arXiv:190502249
99. Sohn K, Berthelot D, Li CL, Zhang Z, Carlini N, Cubuk ED, Kurakin A, Zhang H, Raffel C (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. Preprint. arXiv:200107685
100. Cubuk ED, Zoph B, Shlens J, Le QV (2020) RandAugment: practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 702–703
101. Chen G, Ru J, Zhou Y, Rekik I, Pan Z, Liu X, Lin Y, Lu B, Shi J (2021) MTANS: multi-scale mean teacher combined adversarial network with shape-aware embedding for semi-supervised brain lesion segmentation. NeuroImage 244:118568
102. Carass A, Roy S, Jog A, Cuzzocreo JL, Magrath E, Gherman A, Button J, Nguyen J, Prados F, Sudre CH, et al (2017) Longitudinal multiple sclerosis lesion segmentation: resource and challenge. NeuroImage 148:77–102
103. Maier O, Menze BH, von der Gablentz J, Häni L, Heinrich MP, Liebrand M, Winzeck S, Basit A, Bentley P, Chen L, et al (2017) ISLES 2015—a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. Med Image Anal 35:250–269
104. Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, Burren Y, Porz N, Slotboom J, Wiest R, et al (2014) The multimodal brain tumor image segmentation benchmark (BraTS). IEEE Trans Med Imaging 34(10):1993–2024
105. Amini MR, Gallinari P (2002) Semi-supervised logistic regression. In: ECAI, vol 2, p 11
106. Takaya E, Takeichi Y, Ozaki M, Kurihara S (2021) Sequential semi-supervised segmentation for serial electron microscopy image with small number of labels. J Neurosci Methods 351:109066
107. Arganda-Carreras I, Turaga SC, Berger DR, Cireşan D, Giusti A, Gambardella LM, Schmidhuber J, Laptev D, Dwivedi S, Buhmann JM, et al (2015) Crowdsourcing the creation of image segmentation algorithms for connectomics. Front Neuroanat 9:142
108. Takeichi Y, Uebi T, Miyazaki N, Murata K, Yasuyama K, Inoue K, Suzaki T, Kubo H, Kajimura N, Takano J, et al (2018) Putative neural network within an olfactory sensory unit for nestmate and non-nestmate discrimination in the Japanese carpenter ant: the ultra-structures and mathematical simulation. Front Cell Neurosci 12:310
109. Baur C, Denner S, Wiestler B, Navab N, Albarqouni S (2021) Autoencoders for unsupervised anomaly segmentation in brain MR images: a comparative study. Med Image Anal, p 101952
110. Pawlowski N, Lee MC, Rajchl M, McDonagh S, Ferrante E, Kamnitsas K, Cooke S, Stevenson S, Khetani A, Newman T, et al (2018) Unsupervised lesion detection in brain CT using bayesian convolutional autoencoders. MIDL
111. Kingma DP, Welling M (2013) Auto-encoding variational bayes. Preprint. arXiv:13126114
112. Chen X, Konukoglu E (2018) Unsupervised detection of lesions in brain MRI using constrained adversarial auto-encoders. Preprint. arXiv:180604972
113. Pinaya WHL, Tudosiu PD, Gray R, Rees G, Nachev P, Ourselin S, Cardoso MJ (2021) Unsupervised brain anomaly detection and segmentation with transformers. Preprint. arXiv:210211650
114. Van Den Oord A, Vinyals O, et al (2017) Neural discrete representation learning. Adv Neural Inf Proces Syst 30
115. Dilokthanakul N, Mediano PA, Garnelo M, Lee MC, Salimbeni H, Arulkumaran K, Shanahan M (2016) Deep unsupervised clustering with gaussian mixture variational autoencoders. Preprint. arXiv:161102648
116. You S, Tezcan KC, Chen X, Konukoglu E (2019) Unsupervised lesion detection via image restoration with a normative prior. In: International Conference on Medical Imaging with Deep Learning, PMLR, pp 540–556
117. Baur C, Wiestler B, Albarqouni S, Navab N (2018) Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. In: International MICCAI Brainlesion Workshop. Springer, Berlin, pp 161–169
118. Antonelli M, Reinke A, Bakas S, Farahani K, Kopp-Schneider A, Landman BA, Litjens G, Menze B, Ronneberger O, Summers RM, et al (2022) The medical segmentation decathlon. Nat Commun 13(1):1–13
119. Dorent R, Kujawa A, Ivory M, Bakas S, Rieke N, Joutard S, Glocker B, Cardoso J, Modat M, Batmanghelich K, et al (2022) CrossMoDA 2021 challenge: benchmark of cross-modality domain adaptation techniques for vestibular schwannoma and cochlea segmentation. Preprint. arXiv:220102831
120. Ghaffari M, Sowmya A, Oliver R (2019) Automated brain tumor segmentation using multimodal brain scans: a survey based on models submitted to the BraTS 2012–2018 challenges. IEEE Rev Biomed Eng 13:156–168
121. Bakas S, Reyes M, Jakab A, Bauer S, Rempfler M, Crimi A, Shinohara RT, Berger C, Ha SM, Rozycki M, et al (2018) Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BraTS challenge. Preprint. arXiv:181102629
122. Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby J, Freymann J, Farahani K, Davatzikos C (2017) Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection. Cancer Imaging Arch 286
123. Baid U, Ghodasara S, Mohan S, Bilello M, Calabrese E, Colak E, Farahani K, Kalpathy-Cramer J, Kitamura FC, Pati S, et al (2021) The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. Preprint. arXiv:210702314
124. Liu Z, Chen L, Tong L, Zhou F, Jiang Z, Zhang Q, Shan C, Wang Y, Zhang X, Li L, et al (2020) Deep learning based brain tumor segmentation: a survey. Preprint. arXiv:200709479
125. Luu HM, Park SH (2021) Extending nn-Unet for brain tumor segmentation. Preprint. arXiv:211204653
126. Wang H, Zhu Y, Green B, Adam H, Yuille A, Chen LC (2020) Axial-deeplab: stand-alone axial-attention for panoptic segmentation. In: European Conference on Computer Vision. Springer, Berlin, pp 108–126
127. Ho J, Kalchbrenner N, Weissenborn D, Salimans T (2019) Axial attention in multidimensional transformers. Preprint. arXiv:191212180
128. Zhang H, Oguz I (2020) Multiple sclerosis lesion segmentation—a survey of supervised CNN-based methods. Preprint. arXiv:201208317
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 14
Abstract
Registration is the process of establishing spatial correspondences between images. It allows for the
alignment and transfer of key information across subjects and atlases. Registration is thus a central
technique in many medical imaging applications. This chapter first introduces the fundamental concepts
underlying image registration. It then presents recent developments based on machine learning, specifically
deep learning, which have advanced the three core components of traditional image registration methods—
the similarity functions, transformation models, and cost optimization. Finally, it describes the key applica-
tion of these techniques to brain disorders.
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_14,
© The Author(s) 2023
435
436 Min Chen et al.
Fig. 1 Shown is an example of an atlas alignment using image registration between two different brain
magnetic resonance images. The atlas image (top left) is transformed (top right) to be aligned with the fixed
subject image (center). The transformation allows the anatomical labels from the atlas (bottom left) to be
directly transferred (bottom right) to label the subject image
Fig. 2 Block diagram of the general registration framework. The coloring represents the main pieces of the
framework: the input images (green), the output image (purple), the similarity cost function (orange), the
transformation model (blue), and the optimizer (yellow)
Global Transformation Models

One common choice for the transformation model is to represent v entirely through a global transformation on the image coordinate system. Here v is described by a single linear transformation matrix M and a translation vector t = (t_x, t_y, t_z):

$$ v(x) = M x + t. \quad (3) $$
The transformation matrix M determines the restrictiveness of the
model, which is often referred to as the model’s degrees of freedom
(dof). Algorithms that only allow translations and rotations (6 dof1)
are referred to as rigid registrations. In such cases, M is the product
of three rotation matrices (one for each axis):
$$
M_{\mathrm{rigid}} =
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & -\sin\theta_x \\ 0 & \sin\theta_x & \cos\theta_x \end{pmatrix}
\begin{pmatrix} \cos\theta_y & 0 & \sin\theta_y \\ 0 & 1 & 0 \\ -\sin\theta_y & 0 & \cos\theta_y \end{pmatrix}
\begin{pmatrix} \cos\theta_z & -\sin\theta_z & 0 \\ \sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{pmatrix},
\quad (4)
$$
where θx, θy, and θz determine the amount of rotation around each
axis. If global scaling is also allowed (7 dof in total), then the
algorithm becomes a similarity registration, and Mrigid is multiplied
with an additional scaling matrix:
$$
M_{\mathrm{similarity}} =
\begin{pmatrix} s & 0 & 0 \\ 0 & s & 0 \\ 0 & 0 & s \end{pmatrix} M_{\mathrm{rigid}},
\quad (5)
$$
¹ Here, dof are given for the 3D case, since the vast majority of medical images are 3D.
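The rigid and similarity models of Eqs. (3)–(5) are straightforward to assemble numerically. Below is a minimal numpy sketch; the function names are illustrative, not taken from any particular registration package:

```python
import numpy as np

def rotation_matrix(theta_x, theta_y, theta_z):
    """Product of the three per-axis rotation matrices of Eq. (4)."""
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    cy, sy = np.cos(theta_y), np.sin(theta_y)
    cz, sz = np.cos(theta_z), np.sin(theta_z)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def global_transform(x, M, t):
    """Apply v(x) = Mx + t of Eq. (3) to one or more 3D coordinates."""
    return np.asarray(x) @ M.T + t

# Rigid model (6 dof): rotation plus translation.
M_rigid = rotation_matrix(0.1, -0.2, 0.3)
# Similarity model (7 dof, Eq. 5): an isotropic scaling s applied to M_rigid.
M_similarity = 1.5 * M_rigid
```

A rotation matrix is orthonormal with determinant 1, so the rigid model preserves distances; the similarity model scales them uniformly by s.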
2.2.2 Types of Cost Functions

The purpose of the similarity cost function is to quantify how closely aligned the transformed moving image and the fixed image are to each other. Since it drives the optimization of the transformation model, the characteristics of the cost function determine what kinds of images can be aligned, the achievable accuracy, and the ease of optimization. In this section, we will mainly discuss the three most popular intensity-based cost functions, which are available in most algorithms. Naturally, a large number of other cost functions have been proposed in the literature; a more complete list can be found in [2].
Sum of Square Differences

Sum of square differences (SSD), or equivalently mean squared error (MSE), between image intensities is one of the earliest and most basic cost functions used for evaluating the similarity between two images. It consists of computing the intensity difference at each voxel between the two images, squaring the difference, and then summing across all the voxels in the entire image. This can be described using
$$
C_{\mathrm{SSD}}(T, \tilde{S}) = \sum_{x \in T} \left( T(x) - \tilde{S}(x) \right)^2. \quad (8)
$$
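In code, SSD reduces to a one-liner over the voxel grid. A sketch, where `fixed` and `moving_warped` stand for T and S̃ as numpy arrays:

```python
import numpy as np

def ssd(fixed, moving_warped):
    """Sum of squared differences (Eq. 8): lower is better, 0 is a perfect match."""
    return float(np.sum((fixed - moving_warped) ** 2))
```

Because SSD compares raw intensities directly, it implicitly assumes that the two images share the same intensity scale, which is why it is best suited to mono-modal registration.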
Normalized Cross Correlation

The cross correlation (CC) function is a concept borrowed from signal processing theory for comparing the similarity between waveforms. It requires vectorizing the image (reshaping the 3D image grid into a single vector), subtracting the mean of each image, and then computing the dot product between the image vectors. The value is then divided by the product of the magnitudes of both mean-subtracted vectors. This can be described by
$$
C_{\mathrm{CC}}(T, \tilde{S}) = \frac{(T - \mu_T) \cdot (\tilde{S} - \mu_{\tilde{S}})}{\lVert T - \mu_T \rVert \, \lVert \tilde{S} - \mu_{\tilde{S}} \rVert}, \quad (9)
$$
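A direct transcription of Eq. (9), again as a sketch with illustrative names:

```python
import numpy as np

def ncc(fixed, moving_warped):
    """Normalized cross correlation (Eq. 9); 1 means perfectly correlated."""
    t = fixed.ravel() - fixed.mean()              # vectorize and subtract the mean
    s = moving_warped.ravel() - moving_warped.mean()
    return float(np.dot(t, s) / (np.linalg.norm(t) * np.linalg.norm(s)))
```

Since NCC is invariant to global linear intensity changes, it tolerates contrast and brightness differences that would break SSD; like the measures below, it is negated when used inside a minimization framework.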
$$
\mathrm{NMI}(T, \tilde{S}) = \frac{H(T) + H(\tilde{S})}{H(T, \tilde{S})}. \quad (15)
$$
We see that this measure ranges from one to two, where two
indicates a perfect alignment. Hence, we must again negate the
measure when using it as a cost function to fit our minimization
framework.
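A histogram-based sketch of Eq. (15). In practice, registration packages use finer binning, Parzen windowing, or other density estimators, so treat this only as an illustration:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a (possibly unnormalized) probability vector."""
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log(p)))

def nmi(fixed, moving_warped, bins=32):
    """Normalized mutual information (Eq. 15) from a joint histogram."""
    joint, _, _ = np.histogram2d(fixed.ravel(), moving_warped.ravel(), bins=bins)
    h_t = entropy(joint.sum(axis=1))   # marginal entropy H(T)
    h_s = entropy(joint.sum(axis=0))   # marginal entropy H(S~)
    h_ts = entropy(joint.ravel())      # joint entropy H(T, S~)
    return (h_t + h_s) / h_ts
```

Perfect alignment of an image with itself gives NMI = 2, while statistically independent images give values near 1.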
The main drawback of mutual information comes from its
probabilistic nature. The measure relies on an accurate estimate of
the probability density of the image intensities. As a result, its
effectiveness decreases significantly when working with small
regions within the image, where there are not enough intensity
samples to accurately estimate such densities. Likewise, the measure
is ineffective when facing areas of the image that have poor statisti-
cal consistency or lack clear structure [15]. Examples of this include
cases where there is overwhelming noise or conversely, when the
area has very homogeneous intensities and provides very little
information. As a result, mutual information must be calculated
over a relatively large region of the image, which reduces the
measure’s local acuity and diminishes its ability to handle small
changes between the moving and fixed images. Lastly, as men-
tioned before, mutual information is almost entirely calculated
from counts of intensity pairs, where the actual intensity value
does not matter. While this is useful for addressing multimodal
relationships, it also introduces inherent ambiguity into the mea-
sure. Given a moving and fixed image, their intensities can be paired
in multiple ways to give the exact same mutual information after the
transformation. Hence, the measure depends heavily on having a
good initialization where the objects being registered are aligned
well enough to give the correct intensity pairings at the start of the
optimization. Otherwise, mutual information can cause the algo-
rithm to align intensity pairs that incorrectly represent the corre-
spondence between the images, resulting in registration
errors [16].
From the previous sections, we can see that there are numerous
avenues where machine learning models can potentially be
employed to address specific parts of the registration problem. We
can build models to estimate the similarity between images, find
anatomical correspondences in images, speed up the optimization,
or even learn to estimate the transformations directly. As with most
learning models, these techniques can be very broadly categorized
into supervised and unsupervised techniques.
Supervised image registration within the context of machine
learning entails utilizing sufficiently large training data sets of input
Image Registration 445
3.1 Feature Extraction

Much of the early work incorporating machine learning into solving image registration problems involved detecting corresponding features and then using that information to determine the correspondence relationship between the spatial domains. These approaches included training models to find key landmarks [22] or segmentations of structures [23], and then fitting established transformation models to provide a full transformation between the images. Unsurprisingly, adaptations of these ideas carried through to deep learning approaches. For example, at the start of the current era of deep learning in image-related research, the authors of [24] proposed point correspondence detection using multiple feed-forward neural networks, each of which is trained to detect a single feature. These networks are relatively simple, consisting of two hidden layers with 60 neurons each, and output the probability that a specific feature lies at the center of a small image neighborhood. The detected point correspondences are then used to estimate the full affine transformation with the RANSAC algorithm [25]. Similarly, DeepFlow [26] uses CNNs to detect matching features (called deep matching), which are then used as additional information in the large displacement optical flow framework [27]. A relatively small architecture, consisting of six layers, is used to detect features at different convolution sizes, which are then matched across scales. Two algorithms for more traditional computer vision applications are proposed in [28] and [29], both based on the VGG architecture [30] for 2D homography estimation. The former framework includes both a regression network for determining corner correspondences and a classification network for providing confidence estimates of those predictions. The work in [29], which is publicly available, uses image patch pairs in the input layer and the ℓ1 photometric loss between them to remove the need for direct supervision. Finally, in the category of feature learning, Wu et al. use nested auto-encoders (AE) to map patchwise image content to learned feature vectors [31]. These
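The landmark-plus-model-fitting pipeline described above (detect correspondences, then robustly fit an affine transformation with RANSAC [25]) can be sketched as follows; the feature detector itself is omitted and all names are illustrative:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map: dst ~ src @ M.T + t for (N, 3) point sets."""
    A = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coordinates
    X, *_ = np.linalg.lstsq(A, dst, rcond=None)    # solve A @ X = dst
    return X[:3].T, X[3]                           # M is 3x3, t has length 3

def ransac_affine(src, dst, n_iter=200, tol=1.0, seed=0):
    """Fit an affine transform while ignoring outlier correspondences."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        sample = rng.choice(len(src), size=4, replace=False)  # minimal set
        M, t = fit_affine(src[sample], dst[sample])
        residual = np.linalg.norm(src @ M.T + t - dst, axis=1)
        inliers = residual < tol
        if inliers.sum() > best.sum():
            best = inliers
    return fit_affine(src[best], dst[best])        # refit on all inliers
```

With clean correspondences the least-squares fit alone suffices; the RANSAC loop matters when the detector produces spurious matches.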
3.3 Transformation Learning

Many of the methods described so far have centered on using learning models to establish spatial correspondences between images and then fitting traditional transformation models to align the images. An alternative approach is to directly learn and predict the transformation between images. Earlier work [49] employed CNN-based regression to estimate the 2D/3D rigid alignment between 3D X-ray attenuation maps derived from CT and corresponding 2D digitally reconstructed radiograph (DRR) images.
The transformation space is partitioned into distinct zones where
each zone corresponds to a CNN-based regressor which learns
transformation parameters in a hierarchical fashion. The loss func-
tion is the mean squared error on the transformation parameters.
A novel deep learning perspective was given in [50] where
displacement fields are assumed to form low-dimensional manifolds
and are represented in the proposed fully connected network as
low-dimensional vectors. From the input vector, the network gen-
erates a 2D displacement field used to warp the moving image using
bilinear interpolation. The absolute intensity difference is used to optimize the parameters of the network and the latent vectors. Instead of
explicit regularization of the displacement field, the sum of squares
of the network weights is included with the intensity error term in
the loss function. Instead of training with a loss function based on
similarity measures between fixed and moving images, the works of
[51, 52] formulate the loss in terms of the squared difference
between ground truth and predicted transformation parameters.
In terms of network architecture, [51] employs a variant of U-net
for training/prediction based on reference deformations provided
by registration of previously segmented ROIs for cardiac matching, where the priority is the alignment of the epicardium and endocardium.
Displacement fields are parameterized by stationary velocity fields
[53]. In contrast, [52] uses a smaller version of the VGG architec-
ture to learn the parameters of a 6 × 6 × 6 thin-plate spline grid.
In 2015, Jaderberg and co-authors described a powerful new module, known as the spatial transformer network (STN) [54],² which now features prominently in many contemporary deep

² Note that these networks are different from the transformers and visual transformers described in Chap. 6.
Fig. 4 Diagrammatic illustration of the spatial transformer network. The STN can be placed anywhere within a
CNN to provide spatial invariance for the input feature map. Core components include the localization network
used to learn/predict the parameters which transform the input feature map. The transformed output feature
map is generated with the grid generator and sampler. ©2019 Elsevier. Reprinted, with permission, from [21]
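The grid generator and sampler of Fig. 4 can be illustrated with a small numpy sketch of a 2D affine STN. No learning happens here; `theta` would normally be predicted by the localization network:

```python
import numpy as np

def affine_grid(theta, height, width):
    """Grid generator: map output pixel coordinates through a 2x3 affine theta.

    Coordinates are normalized to [-1, 1], as in the STN paper.
    """
    ys, xs = np.meshgrid(np.linspace(-1, 1, height),
                         np.linspace(-1, 1, width), indexing="ij")
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3)
    return coords @ theta.T                                 # (H, W, 2)

def bilinear_sample(image, grid):
    """Sampler: bilinearly interpolate `image` at the grid locations."""
    h, w = image.shape
    # Back to pixel coordinates, clipped to the valid range.
    x = np.clip((grid[..., 0] + 1) * (w - 1) / 2, 0, w - 1)
    y = np.clip((grid[..., 1] + 1) * (h - 1) / 2, 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    return (image[y0, x0] * (1 - wx) * (1 - wy) + image[y0, x1] * wx * (1 - wy)
            + image[y1, x0] * (1 - wx) * wy + image[y1, x1] * wx * wy)
```

Because bilinear interpolation is differentiable in both the image values and the sampling locations, gradients can flow through this module back into the localization network.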
4.1 Spatial Normalization and Atlasing

Normative and disease-specific atlases play an important role in the characterization of a disease. By registering images from different subjects into a common atlas space (i.e., spatial normalization), we
can remove typical variability between subjects, such as brain size,
to allow for more sensitive detection of disease-driven differences
between subjects. Learning-based registration can enable higher
throughput registration during atlas construction [74], thus allow-
ing more subjects to be included into the atlas and better encom-
passing the variability within a cohort. Various models have been
proposed to embed these advantages directly into the network,
such as [75], which uses a joint learning framework where image
attributes are used to learn conditional templates, and an efficient
deformation to these templates is jointly learned. In addition,
learning models have been used to provide priors for the atlas
[76] and establish groupwise correspondence within a cohort [77].
4.3 Morphometry

Voxel-based [84] and tensor-based [85] morphometry analyze the transformation resulting from an image registration to study the shape and structural characteristics of a disease. In these
approaches, a disease cohort is spatially normalized into a common
space and the warped images and resulting deformation fields from
each registration are statistically compared on a voxel level to reveal
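Tensor-based morphometry typically summarizes each deformation field by its voxel-wise Jacobian determinant before running the voxel-level statistics. A minimal sketch, assuming a dense displacement field u with phi(x) = x + u(x):

```python
import numpy as np

def jacobian_determinant(disp):
    """Voxel-wise Jacobian determinant of a 3D displacement field.

    disp: (3, X, Y, Z) displacement field for the deformation phi(x) = x + disp(x).
    Values > 1 indicate local expansion, values < 1 local contraction,
    and values <= 0 a folding of space.
    """
    # Jacobian of phi = identity + spatial gradients of the displacement.
    grad = np.stack([np.stack(np.gradient(disp[i]), axis=0) for i in range(3)])
    jac = grad + np.eye(3)[:, :, None, None, None]   # shape (3, 3, X, Y, Z)
    return np.linalg.det(np.moveaxis(jac, (0, 1), (-2, -1)))
```

Determinant maps above 1 flag local tissue expansion and values below 1 flag contraction, which is how atrophy patterns are mapped in, e.g., Alzheimer's disease studies [85].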
5 Conclusion
Acknowledgements
References
1. Anuta PE (1970) Spatial registration of multispectral and multitemporal digital imagery using fast Fourier transform techniques. IEEE Trans Geosci Electron 8(4):353–368
2. Sotiras A, Davatzikos C, Paragios N (2013) Deformable medical image registration: a survey. IEEE Trans Med Imaging 32(7):1153–1190
3. Pluim JP, Maintz JA, Viergever MA (2003) Mutual-information-based registration of medical images: a survey. IEEE Trans Med Imaging 22(8):986–1004
4. Bookstein FL (1989) Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans Pattern Anal Mach Intell 11(6):567–585
5. Rohde GK, Aldroubi A, Dawant BM (2003) The adaptive bases algorithm for intensity-based nonrigid image registration. IEEE Trans Med Imaging 22(11):1470–1479
6. Gee JC, Bajcsy RK (1998) Elastic matching: continuum mechanical and probabilistic analysis. Brain Warping 2
7. Christensen GE, Rabbitt RD, Miller MI (1996) Deformable templates using large deformation kinematics. IEEE Trans Image Process 5(10):1435–1447
8. Thirion JP (1998) Image matching as a diffusion process: an analogy with Maxwell's demons. Med Image Anal 2(3):243–260
9. Beg MF, Miller MI, Trouvé A, Younes L (2005) Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int J Comput Vis 61(2):139–157
10. Ashburner J, Friston KJ (2000) Voxel-based morphometry—the methods. NeuroImage 11(6):805–821
11. Davatzikos C, Genc A, Xu D, Resnick SM (2001) Voxel-based morphometry using the RAVENS maps: methods and validation using simulated longitudinal atrophy. NeuroImage 14(6):1361–1369
12. Wells III WM, Viola P, Atsumi H, Nakajima S, Kikinis R (1996) Multi-modal volume registration by maximization of mutual information. Med Image Anal 1(1):35–51
13. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423
14. Studholme C, Hill DL, Hawkes DJ (1999) An overlap invariant entropy measure of 3D medical image alignment. Pattern Recogn 32(1):71–86
15. Andronache A, von Siebenthal M, Székely G, Cattin P (2008) Non-rigid registration of multi-modal images using both mutual information and cross-correlation. Med Image Anal 12(1):3–15
16. Maes F, Vandermeulen D, Suetens P (2003) Medical image registration using mutual information. Proc IEEE 91(10):1699–1722
17. Fan J, Cao X, Yap PT, Shen D (2019) BIRNet: brain image registration using dual-supervised fully convolutional networks. Med Image Anal 54:193–206
18. Wang J, Zhang M (2020) DeepFlash: an efficient network for learning-based medical image registration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4444–4452
19. Hinton GE, Zemel RS (1994) Autoencoders, minimum description length and Helmholtz free energy. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems. Morgan-Kaufmann, Burlington, pp 3–10
20. Zhao S, Dong Y, Chang EIC, Xu Y (2019) Recursive cascaded networks for unsupervised medical image registration. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV)
21. Tustison NJ, Avants BB, Gee JC (2019) Learning image-based spatial transformations via convolutional neural networks: a review. Magn Reson Imaging 64:142–153
22. Ozuysal M, Calonder M, Lepetit V, Fua P (2009) Fast keypoint recognition using random ferns. IEEE Trans Pattern Anal Mach Intell 32(3):448–461
23. Powell S, Magnotta VA, Johnson H, Jammalamadaka VK, Pierson R, Andreasen NC (2008) Registration and machine learning-based automated segmentation of subcortical and cerebellar brain structures. NeuroImage 39(1):238–247
24. Sergeev S, Zhao Y, Linguraru MG, Okada K (2012) Medical image registration using machine learning-based interest point detector. In: Proceedings of the SPIE
25. Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm ACM 24(6):381–395
26. Weinzaepfel P, Revaud J, Harchaoui Z, Schmid C (2013) DeepFlow: large displacement optical flow with deep matching. In: Proceedings of the IEEE international conference on computer vision, pp 1385–1392. https://doi.org/10.1109/ICCV.2013.175
27. Brox T, Malik J (2011) Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans Pattern Anal Mach Intell 33(3):500–513. https://doi.org/10.1109/TPAMI.2010.143
28. DeTone D, Malisiewicz T, Rabinovich A (2016) Deep image homography estimation. arXiv:160603798
29. Nguyen T, Chen SW, Shivakumar SS, Taylor CJ, Kumar V (2018) Unsupervised deep homography: a fast and robust homography estimation model. In: Proceedings of IEEE robotics and automation letters
30. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556
31. Wu G, Kim M, Wang Q, Munsell BC, Shen D (2016) Scalable high-performance image registration framework by unsupervised deep feature representations learning. IEEE Trans Biomed Eng 63(7):1505–1516. https://doi.org/10.1109/TBME.2015.2496253
32. Wang Q, Wu G, Yap PT, Shen D (2010) Attribute vector guided groupwise registration. NeuroImage 50(4):1485–1496. https://doi.org/10.1016/j.neuroimage.2010.01.040
33. Shen D, Davatzikos C (2002) HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Trans Med Imaging 21(11):1421–1439. https://doi.org/10.1109/TMI.2002.803111
34. Cao X, Yang J, Zhang J, Nie D, Kim MJ, Wang Q, Shen D (2017) Deformable image registration based on similarity-steered CNN regression. In: Proceedings of the international conference on medical image computing and computer-assisted intervention 10433:300–308. https://doi.org/10.1007/978-3-319-66182-7_35
35. Hu Y, Modat M, Gibson E, Li W, Ghavami N, Bonmati E, Wang G, Bandula S, Moore CM, Emberton M, Ourselin S, Noble JA, Barratt DC, Vercauteren T (2018) Weakly-supervised convolutional neural networks for multimodal image registration. Med Image Anal 49:1–13. https://doi.org/10.1016/j.media.2018.07.002
36. He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. http://arxiv.org/abs/1512.03385
37. Simonovsky M, Gutierrez-Becker B, Mateus D, Navab N, Komodakis N (2016) A deep metric for multimodal registration. In: Proceedings of the international conference on medical image computing and computer-assisted intervention
38. Yoo TS, Metaxas DN (2005) Open science—combining open data and open source software: medical image analysis with the Insight Toolkit. Med Image Anal 9(6):503–506. https://doi.org/10.1016/j.media.2005.04.008
39. Chen M, Carass A, Jog A, Lee J, Roy S, Prince JL (2017) Cross contrast multi-channel image registration using image synthesis for MR brain images. Med Image Anal 36:2–14
40. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems
41. Yi X, Walia E, Babyn P (2018) Generative adversarial network in medical imaging: a review. Preprint
42. Radford A, Metz L, Chintala S (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the international conference on learning representations
43. Mahapatra D, Antony B, Sedai S, Garnavi R (2018) Deformable medical image registration using generative adversarial networks. In: Proceedings of IEEE 15th international symposium on biomedical imaging
44. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
45. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE international conference on computer vision
46. Hu Y, Gibson E, Ghavami N, Bonmati E, Moore CM, Emberton M, Vercauteren T, Noble JA, Barratt DC (2018) Adversarial deformation regularization for training image registration neural networks. In: Proceedings of the international conference on medical image computing and computer-assisted intervention
47. Hu Y, Modat M, Gibson E, Ghavami N, Bonmati E, Moore CM, Emberton M, Noble JA, Barratt DC, Vercauteren T (2018) Label-driven weakly-supervised learning for multimodal deformable image registration. In: Proceedings of IEEE 15th international symposium on biomedical imaging
48. Fan J, Cao X, Xue Z, Yap PT, Shen D (2018) Adversarial similarity network for evaluating image alignment in deep learning based registration. In: Proceedings of the international conference on medical image computing and computer-assisted intervention
49. Miao S, Wang ZJ, Liao R (2016) A CNN regression approach for real-time 2D/3D registration. IEEE Trans Med Imaging 35(5):1352–1363. https://doi.org/10.1109/TMI.2016.2521800
50. Sheikhjafari A, Noga M, Punithakumar K, Ray N (2018) Unsupervised deformable image registration with fully connected generative neural network. In: Proceedings of medical imaging with deep learning
51. Rohé MM, Datar M, Heimann T, Sermesant M, Pennec X (2017) SVF-Net: learning deformable image registration using shape matching. In: Descoteaux M, Maier-Hein L, Franz A, Jannin P, Collins DL, Duchesne S (eds) Proceedings of the international conference on medical image computing and computer-assisted intervention. Springer International Publishing, Cham, pp 266–274
52. Eppenhof KAJ, Lafarge MW, Moeskops P, Veta M, Pluim JPW (2018) Deformable image registration using convolutional neural networks. In: Proceedings of the SPIE: medical imaging: image processing
53. Arsigny V, Commowick O, Pennec X, Ayache N (2006) A log-Euclidean framework for statistics on diffeomorphisms. In: Proceedings of the international conference on medical image computing and computer-assisted intervention, vol 9, pp 924–931
54. Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K (2015) Spatial transformer networks. In: Neural information processing systems
55. Lin CH, Lucey S (2017) Inverse compositional spatial transformer networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
56. Detlefsen NS, Freifeld O, Hauberg S (2018) Deep diffeomorphic transformer networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
57. Baker S, Matthews I (2004) Lucas-Kanade 20 years on: a unifying framework. Int J Comput Vis 56(3):221–255. https://doi.org/10.1023/B:VISI.0000011205.11775.fd
58. Beg MF, Miller MI, Trouvé A, Younes L (2004) Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int J Comput Vis 61(2):139–157
59. Freifeld O, Hauberg S, Batmanghelich K, Fisher JW (2017) Transformations based on continuous piecewise-affine velocity fields. IEEE Trans Pattern Anal Mach Intell 39(12):2496–2509. https://doi.org/10.1109/TPAMI.2016.2646685
60. Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV (2018) An unsupervised learning model for deformable medical image registration. In: Proceedings of the IEEE conference on computer vision and pattern recognition
61. Dalca AV, Balakrishnan G, Guttag J, Sabuncu MR (2018) Unsupervised learning for fast probabilistic diffeomorphic registration. In: Frangi AF, Schnabel JA, Davatzikos C, Alberola-López C, Fichtinger G (eds) Proceedings of the international conference on medical image computing and computer-assisted intervention. Springer International Publishing, Cham, pp 729–738
62. Nazib A, Fookes C, Perrin D (2018) A comparative analysis of registration tools: traditional vs deep learning approach on high resolution tissue cleared data. Preprint. arXiv
63. Rueckert D, Sonoda LI, Hayes C, Hill DL, Leach MO, Hawkes DJ (1999) Nonrigid registration using free-form deformations: application to breast MR images. IEEE Trans Med Imaging 18(8):712–721. https://doi.org/10.1109/42.796284
64. Woods RP, Mazziotta JC, Cherry SR (1993) MRI-PET registration with automated algorithm. J Comput Assist Tomogr 17(4):536–546
65. Klein S, Staring M, Murphy K, Viergever MA, Pluim JPW (2010) Elastix: a toolbox for intensity-based medical image registration. IEEE Trans Med Imaging 29(1):196–205. https://doi.org/10.1109/TMI.2009.2035616
66. Avants BB, Tustison NJ, Song G, Cook PA, Klein A, Gee JC (2011) A reproducible evaluation of ANTs similarity metric performance in brain image registration. NeuroImage 54(3):2033–2044. https://doi.org/10.1016/j.neuroimage.2010.09.025
67. Modat M, Ridgway GR, Taylor ZA, Lehmann M, Barnes J, Hawkes DJ, Fox NC, Ourselin S (2010) Fast free-form deformation using graphics processing units. Comput Methods Prog Biomed 98(3):278–284. https://doi.org/10.1016/j.cmpb.2009.09.002
68. Kim B, Kim DH, Park SH, Kim J, Lee JG, Ye JC (2021) CycleMorph: cycle consistent unsupervised deformable image registration. Med Image Anal 71:102036
69. Krebs J, Mansi T, Mailhé B, Ayache N, Delingette H (2018) Unsupervised probabilistic deformation modeling for robust diffeomorphic registration. In: Proceedings of the 4th international workshop, DLMIA and 8th international workshop, ML-CDS
70. Sohn K, Lee H, Yan X (2015) Learning structured output representation using deep conditional generative models. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in neural information processing systems 28. Curran Associates Inc., Red Hook, pp 3483–3491
71. Kingma DP, Welling M (2014) Auto-encoding variational Bayes. In: Proceedings of the 2nd international conference on learning representations (ICLR)
72. Wu Y, Jiahao TZ, Wang J, Yushkevich PA, Hsieh MA, Gee JC (2022) NODEO: a neural ordinary differential equation based optimization framework for deformable image registration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20804–20813
73. Huang J, Wang H, Yang H (2020) Int-Deep: a deep learning initialized iterative method for nonlinear problems. J Comput Phys 419:109675
74. Iqbal A, Khan R, Karayannis T (2019) Developing a brain atlas through deep learning. Nat Mach Intell 1(6):277–287
75. Dalca A, Rakic M, Guttag J, Sabuncu M (2019) Learning conditional deformable templates with convolutional networks. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems, vol 32. Curran Associates Inc., Red Hook. https://proceedings.neurips.cc/paper/2019/file/bbcbff5c1f1ded46c25d28119a85c6c2-Paper.pdf
76. Wu N, Wang J, Zhang M, Zhang G, Peng Y, Shen C (2022) Hybrid atlas building with deep registration priors. In: 2022 IEEE 19th international symposium on biomedical imaging (ISBI). IEEE, Piscataway, pp 1–5
77. Yang J, Küstner T, Hu P, Liò P, Qi H (2022) End-to-end deep learning of non-rigid groupwise registration and reconstruction of dynamic MRI. Front Cardiovasc Med 9
78. Sinclair M, Schuh A, Hahn K, Petersen K, Bai Y, Batten J, Schaap M, Glocker B (2022) Atlas-ISTN: joint segmentation, registration and atlas construction with image-and-spatial transformer networks. Med Image Anal 78:102383
79. Xu Z, Niethammer M (2019) DeepAtlas: joint semi-supervised learning of image registration and segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, Berlin, pp 420–429
80. Wang S, Cao S, Wei D, Wang R, Ma K, Wang L, Meng D, Zheng Y (2020) LT-Net: label transfer by learning reversible voxel-wise correspondence for one-shot medical image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)
81. Mao X, Li Q, Xie H, Lau RY, Wang Z, Paul Smolley S (2017) Least squares generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2794–2802
82. Wang H, Yushkevich PA (2013) Multi-atlas segmentation with joint label fusion and corrective learning—an open source implementation. Front Neuroinform 7:27
83. Ding Z, Han X, Niethammer M (2019) VoteNet: a deep learning label fusion method for multi-atlas segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, Berlin, pp 202–210
84. Ashburner J, Friston KJ (2000) Voxel-based morphometry—the methods. NeuroImage 11(6 Pt 1):805–821. https://doi.org/10.1006/nimg.2000.0582
85. Hua X, Leow AD, Parikshak N, Lee S, Chiang MC, Toga AW, Jack Jr CR, Weiner MW, Thompson PM, Alzheimer's Disease Neuroimaging Initiative, et al (2008) Tensor-based morphometry as a neuroimaging biomarker for Alzheimer's disease: an MRI study of 676 AD, MCI, and normal subjects. NeuroImage 43(3):458–469
86. Pahuja G, Prasad B (2022) Deep learning architectures for Parkinson's disease detection by using multi-modal features. Comput Biol Med 105610
87. Huang H, Zheng S, Yang Z, Wu Y, Li Y, Qiu J, Cheng Y, Lin P, Lin Y, Guan J, et al (2022) Voxel-based morphometry and a deep learning model for the diagnosis of early Alzheimer's disease based on cerebral gray matter changes. Cereb Cortex 33(3):754–763
Chapter 15
Abstract
Computer-aided methods have shown added value for diagnosing and predicting brain disorders and can
thus support decision making in clinical care and treatment planning. This chapter will provide insight into
the types of methods, how they work, their input data (such as cognitive tests, imaging, and genetic data), and
the types of output they provide. We will focus on specific use cases for diagnosis, i.e., estimating the current
“condition” of the patient, such as early detection and diagnosis of dementia, differential diagnosis of brain
tumors, and decision making in stroke. Regarding prediction, i.e., estimation of the future “condition” of
the patient, we will zoom in on use cases such as predicting the disease course in multiple sclerosis and
predicting patient outcomes after treatment in brain cancer. Furthermore, based on these use cases, we will
assess the current state-of-the-art methodology and highlight current efforts on benchmarking of these
methods and the importance of open science therein. Finally, we assess the current clinical impact of
computer-aided methods and discuss the required next steps to increase clinical impact.
Authors Stefan Klein and Esther E. Bron have contributed equally to this chapter
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_15, © The Author(s) 2023
1 Introduction
460 Vikram Venkatraghavan et al.
prediction are crucial not only for decision making in clinical care
and treatment planning but also for managing the expectations of
patients and their caregivers. This is particularly important in brain
disorders as they may strongly affect life expectancy and quality of
life, as symptoms of the disorder and side effects of the treatment
can have a major impact on the patient’s cognitive skills, daily
functioning, social interaction, and general well-being. In clinical
practice, diagnosis and prediction are typically performed using
multiple sources of information, such as symptomatology, medical
history, cognitive tests, brain imaging, electroencephalography
(EEG), magnetoencephalography (MEG), blood tests, cerebrospi-
nal fluid (CSF) biomarkers, histopathological or molecular find-
ings, and lifestyle and genetic risk factors. These various pieces of
information are integrated by the treating clinician, often in con-
sensus with other experts at a multidisciplinary team meeting, in
order to reach a final diagnosis and/or treatment plan. The aim of
computer-aided methods for diagnosis and prediction is to support
this process, in order to achieve more accurate, objective, and
efficient decision making.
In the literature, numerous examples of computer-aided meth-
ods for diagnosis and prediction in brain disorders can be found.
Most of the state-of-the-art methods use some form of machine
learning to construct a model that maps (often high-dimensional)
input data to the output variable of interest. There exists a large
variation in machine learning technology, types of input data, and
output variables. Chapters 1–6 introduced the main machine
learning technologies used for computer-aided diagnosis and pre-
diction. These include, on the one hand, classical methods such as
linear models, support vector machines, and random forests, and
on the other hand, deep learning methods such as convolutional
neural networks and recurrent neural networks. These methods can
be implemented either as classification models (estimating discrete
labels) or as regression models (estimating continuous quantities),
possibly specialized for survival (or “time-to-event”) analysis. In
addition, Chapter 17 highlights the category of disease progression
modeling techniques, which could be considered as a specialized
form of machine learning incorporating models of the disease
evolution over time. Chapters 7–12 described the main types of
input data used in machine learning for brain disorders: clinical
evaluations, neuroimaging, EEG/MEG, genetics and omics data,
electronic health records, and smartphone and sensor data. The
current chapter focuses on the choice of the output variable, i.e.,
the diagnosis or prediction of interest (Fig. 1).
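The distinction drawn above between classification models (discrete labels) and regression models (continuous quantities) can be made concrete with a short sketch. The data, feature count, and labels below are entirely synthetic, and scikit-learn is used only as an example library, not as the tooling of any study discussed in this chapter:

```python
# Illustrative sketch: the same tabular inputs can feed either a
# classification model (discrete diagnostic label) or a regression model
# (continuous quantity such as a cognitive score). All data is synthetic.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                  # 200 subjects, 10 input features
y_class = (X[:, 0] > 0).astype(int)             # discrete label: patient vs control
y_score = X[:, 0] + 0.1 * rng.normal(size=200)  # continuous outcome variable

clf = SVC(kernel="linear").fit(X, y_class)      # classification model
reg = SVR(kernel="linear").fit(X, y_score)      # regression model

pred_labels = clf.predict(X)                    # discrete estimates
pred_scores = reg.predict(X)                    # continuous estimates
```

The same feature matrix drives both models; only the target type (and hence the estimator) changes.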
To illustrate the various ways in which machine learning could
aid diagnosis and prediction, we focus on representative use cases
Computer-Aided Diagnosis and Prediction in Brain Disorders 461
Fig. 1 Overview of the topics covered in this chapter, in the context of the other chapters in this book
1.1 Diagnosis
Diagnosis aims to determine the current “condition” of the patient
to inform patient care and treatment decisions. Here, we introduce
three categories of diagnostic tasks that occur in clinical practice
and describe why and how computer-aided models have or could
have added value.
Box 1: Diagnosis
Categories of diagnostic tasks that occur in clinical practice in
which computer-aided models have or could have added
value, with brain disorders for which this is relevant as
examples:
• Early diagnosis: Dementia and MS
• Differential diagnosis: Dementia and brain cancer
• Decision making for treatment: Stroke
Box 2: Prediction
Categories of prediction targets for which computer-aided
models have or could have added value, with example brain
disorders for which this is relevant as discussed in this chapter:
• Natural disease course: Dementia and MS
• Patient outcomes after treatment: MS, brain cancer, and stroke
2 Method Evaluation
2.1 State-of-the-Art Methodology for Diagnosis and Prediction
For early diagnosis in dementia, a large body of research has been published on classification of subjects into AD, mild cognitive impairment (MCI), and normal aging [36, 113, 145]. Overall, classification methods show high performance for classification of
AD patients and cognitively normal controls with an area under the
receiver operating characteristic curve (AUC) of 85–98%.
Reported performances are somewhat lower for early diagnosis in
patients with MCI, i.e., prediction of imminent conversion to AD
(AUC: 62–82%). Dementia classification is usually based on clinical diagnosis as a reference standard for training and validation
[87], but biological diagnosis based on assessment of amyloid
pathology with PET imaging or CSF has been increasingly used
over the last years [53, 129]. Structural T1-weighted (T1w) MRI
to quantify neuronal loss is the most commonly used biomarker,
whereas the support vector machine (SVM) is the most commonly
used classifier. For T1w, both voxel-based maps (e.g., voxel-based
morphometry maps quantifying local gray matter density [62]) and
region-based features [78] have been frequently used. While using
only region-based volumes may limit performance, combining
those with regional shape and texture has been shown to perform
competitively with using voxel-wise maps [13, 15, 24]. Using mul-
timodal imaging such as FDG-PET or DTI in addition to structural
MRI may have added value over structural MRI only, but limited
data is available [76, 150]. Following the trends and successes in
medical image analysis and machine learning, neural network classifiers, in particular convolutional neural networks (CNNs), have been used increasingly in recent years [16, 145], but have not been shown to significantly outperform conventional classifiers.
In addition, data-driven disease progression models are being
developed [101], which do not rely on a priori defined labels but
instead derive disease progression in a data-driven way.
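The AUC figures quoted above come from scoring a classifier's continuous decision values against the reference diagnosis. The following hedged sketch shows the mechanics on purely synthetic "patient vs. control" features; scikit-learn is an assumed tooling choice here, not the software used by the cited studies:

```python
# Sketch of AUC evaluation: train a linear SVM, then score its continuous
# decision values (not hard labels) with the area under the ROC curve.
# Labels and features are synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=300)                   # synthetic reference labels
X = rng.normal(size=(300, 20)) + 0.8 * y[:, None]  # class-dependent feature shift

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
svm = SVC(kernel="linear").fit(X_tr, y_tr)
auc = roc_auc_score(y_te, svm.decision_function(X_te))  # threshold-free metric
```

Because the AUC is computed from continuous scores on held-out subjects, it is independent of any particular decision threshold, which is why it is the standard summary metric in this literature.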
Regarding differential diagnosis in dementia, studies focus
mostly on discriminating AD from other types of dementia. Differ-
ential diagnosis based on CSF and PET biomarkers of AD pathol-
ogy has shown good performance for distinguishing AD from
FTLD with sensitivities of 0.83 (p-tau/amyloid-β ratio from
CSF) and 0.87 (amyloid PET) [48, 111, 117]. In addition,
machine learning approaches have been published based on either
structural or multimodal MRI as region-wise or voxel-wise imaging
features and generally SVM as a classifier, similar to those used for
early diagnosis in dementia. These methods focused mostly on
differential diagnosis of AD and FTLD and reported performances
in the range of AUC = 0.75–0.85 [12, 14, 92, 110]. A few
studies addressed differential diagnosis of AD and vascular
Fig. 2 Summary of tumor segmentation methods (a), types of imaging features (b), means of internal
validation (c), and external validation (d) used by studies (n = 44) investigating machine learning models for
predicting genetic subtypes of glioma. VASARI, Visually Accessible Rembrandt Imaging. Recreated from
[55]. Permission to reuse was kindly granted by the publishers
Fig. 3 Evolution over time of the use of various algorithms for predicting the progression of mild cognitive impairment. SVMs with an unknown kernel are simply noted as “SVM.” OPLS, orthogonal partial least squares; SVM, support vector machine. Reproduced from [5]. Permission to reuse was kindly granted by the publishers
2.2 Benchmarks and Challenges
For 15 years, grand challenges have been organized in the biomedical image analysis research field. These are international benchmarks in competition form that have the goal of objectively comparing algorithms for a specific task on the same clinically
3 Clinical Impact
1. https://scikit-learn.org/
2. https://monai.io/
3. quantib.com/solutions/quantib-nd
4. combinostics.com/cdsi
5. www.docker.com
6. osf.io/zyacb
6 Conflict of Interest
Acknowledgements
References
1. Akkus Z, Ali I, Sedlář J, Agrawal JP, Parney IF, Giannini C, Erickson BJ (2017) Predicting deletion of chromosomal arms 1p/19q in low-grade gliomas from MR images using machine intelligence. J Digit Imaging 30(4):469–476. https://doi.org/10.1007/s10278-017-9984-3
2. Albert MS, DeKosky ST, Dickson D, Dubois B, Feldman HH, Fox NC, Gamst A, Holtzman DM, Jagust WJ, Petersen RC, Snyder PJ, Carrillo MC, Thies B, Phelps CH (2011) The diagnosis of mild cognitive impairment due to Alzheimer's disease: recommendations from the National Institute on Aging-Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimers Dement 7(3):270–279. https://doi.org/10.1016/j.jalz.2011.03.008
3. Allen GI, Amoroso N, Anghel C, Balagurusamy V, Bare CJ et al (2016) Crowdsourced estimation of cognitive decline and resilience in Alzheimer's disease. Alzheimers Dement 12(6):645–653. https://doi.org/10.1016/j.jalz.2016.02.006
4. Alzheimer's Association Report (2020) Alzheimer's disease facts and figures. Alzheimers Dement 16(3):391–460. https://doi.org/10.1002/alz.12068
5. Ansart M, Epelbaum S, Bassignana G, Boône A, Bottani S, Cattai T, Couronné R, Faouzi J, Koval I, Louis M, Thibeau-Sutre E, Wen J, Wild A, Burgos N, Dormont D, Colliot O, Durrleman S (2021) Predicting the progression of mild cognitive impairment using machine learning: a systematic, quantitative and critical review. Med Image Anal 67:101848. https://doi.org/10.1016/j.media.2020.101848
6. Baid U, Ghodasara S, Mohan S, Bilello M, Calabrese E et al (2021) The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification
7. Bakas S, Reyes M, Jakab A, Bauer S, Rempfler M et al (2019) Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BraTS challenge
8. Bakas S, Baid U, Calabrese E, Carr C, Colak E, Farahani K, Flanders AE, Kalpathy-Cramer J, Kitamura FC, Menze B, Mongan J, Prevedello L, Rudie J, Shinohara RT (2021) RSNA-MICCAI brain tumor radiogenomic classification. Link to the Kaggle challenge
9. Bentley P, Ganesalingam J, Carlton Jones AL, Mahady K, Epton S, Rinne P, Sharma P, Halse O, Mehta A, Rueckert D (2014) Prediction of stroke thrombolysis outcome using CT brain machine learning. NeuroImage Clin 4:635–640. https://doi.org/10.1016/j.nicl.2014.02.003
10. Bi WL, Hosny A, Schabath MB, Giger ML, Birkbak NJ, Mehrtash A, Allison T, Arnaout O, Abbosh C, Dunn IF, Mak RH, Tamimi RM, Tempany CM, Swanton C, Hoffmann U, Schwartz LH, Gillies RJ, Huang RY, Aerts HJWL (2019) Artificial intelligence in cancer imaging: clinical challenges and applications. CA Cancer J Clin 69(2):127–157. https://doi.org/10.3322/caac.21552
11. Bilgel M, Jedynak BM, Initiative ADN (2019) Predicting time to dementia using a quantitative template of disease progression. Alzheimers Dement Diagn Assess Dis Monit 11(1):205–215. https://doi.org/10.1016/j.dadm.2019.01.005
12. Bouts MJRJ, Möller C, Hafkemeijer A, van Swieten JC, Dopper E, van der Flier WM, Vrenken H, Wink AM, Pijnenburg YAL, Scheltens P, Barkhof F, Schouten TM, de Vos F, Feis RA, van der Grond J, de Rooij M, Rombouts SARB (2018) Single subject classification of Alzheimer's disease and behavioral variant frontotemporal dementia using anatomical, diffusion tensor, and resting-state functional magnetic resonance imaging. J Alzheimer's Dis 62(4):1827–1839. https://doi.org/10.3233/jad-170893
13. Bron EE, Steketee RM, Houston GC, Oliver RA, Achterberg HC, Loog M, van Swieten JC, Hammers A, Niessen WJ, Smits M, Klein S, for the Alzheimer's Disease Neuroimaging Initiative (2014) Diagnostic classification of arterial spin labeling and structural MRI in presenile early stage dementia. Hum Brain Mapp 35(9):4916–4931. https://doi.org/10.1002/hbm.22522
14. Bron E, Smits M, Papma J, Steketee R, Meijboom R, De Groot M, van Swieten J, Niessen W, Klein S (2016) Multiparametric computer-aided differential diagnosis of Alzheimer's disease and frontotemporal dementia using structural and advanced MRI. Eur Radiol 27(8):1–11. https://doi.org/10.1007/s00330-016-4691-x
15. Bron EE, Smits M, van der Flier WM, Vrenken H, Barkhof F et al (2015) Standardized evaluation of algorithms for computer-aided diagnosis of dementia based on structural MRI: the CADDementia challenge. NeuroImage 111:562–579. https://doi.org/10.1016/j.neuroimage.2015.01.048
16. Bron EE, Klein S, Papma JM, Jiskoot LC, Venkatraghavan V, Linders J, Aalten P, De Deyn PP, Biessels GJ, Claassen JA, Middelkoop HA, Smits M, Niessen WJ, van Swieten JC, van der Flier WM, Ramakers IH, van der Lugt A (2021) Cross-cohort generalizability of deep and conventional machine learning for MRI-based diagnosis and prediction of Alzheimer's disease. NeuroImage Clin 31:102712. https://doi.org/10.1016/j.nicl.2021.102712
17. Bron EE, Klein S, Reinke A, Papma JM, Maier-Hein L, Alexander DC, Oxtoby NP (2021) Ten years of image analysis and machine learning competitions in dementia
18. Brown FS, Glasmacher SA, Kearns PKA, MacDougall N, Hunt D, Connick P, Chandran S (2020) Systematic review of prediction models in relapsing remitting multiple sclerosis. PLoS One 15(5):1–13. https://doi.org/10.1371/journal.pone.0233575
19. Buchlak QD, Esmaili N, Leveque JC, Bennett C, Farrokhi F, Piccardi M (2021) Machine learning applications to neuroimaging for glioma detection and classification: an artificial intelligence augmented systematic review. J Clin Neurosci Off J Neurosurg Soc Australas 89:177–198. https://doi.org/10.1016/j.jocn.2021.04.043
20. Chen C, Ou X, Wang J, Guo W, Ma X (2019) Radiomics-based machine learning in differentiation between glioblastoma and metastatic brain tumors. Front Oncol 9:806. https://doi.org/10.3389/fonc.2019.00806
21. Choi YS, Ahn SS, Chang JH, Kang SG, Kim EH, Kim SH, Jain R, Lee SK (2020) Machine learning and radiomic phenotyping of lower grade gliomas: improving survival prediction. Eur Radiol 30(7):3834–3842. https://doi.org/10.1007/s00330-020-06737-5
22. Crous-Bou M, Minguillón C, Gramunt N, Molinuevo JL (2017) Alzheimer's disease prevention: from risk factors to early intervention. Alzheimers Res Therapy 9(1):71. https://doi.org/10.1186/s13195-017-0297-z
23. Cui Y, Liu B, Luo S, Zhen X, Fan M, Liu T, Zhu W, Park M, Jiang T, Jin JS, the Alzheimer's Disease Neuroimaging Initiative (2011) Identification of conversion from mild cognitive impairment to Alzheimer's disease using multivariate predictors. PLoS One 6(7):1–10. https://doi.org/10.1371/journal.pone.0021896
24. Cuingnet R, Gerardin E, Tessieras J, Auzias G, Lehéricy S, Habert MO, Chupin M, Benali H, Colliot O (2011) Automatic classification of patients with Alzheimer's disease from structural MRI: a comparison of ten methods using the ADNI database. NeuroImage 56(2):766–781. https://doi.org/10.1016/j.neuroimage.2010.06.013
25. Dekker I, Schoonheim MM, Venkatraghavan V, Eijlers AJ, Brouwer I, Bron EE, Klein S, Wattjes MP, Wink AM, Geurts JJ, Uitdehaag BM, Oxtoby NP, Alexander DC, Vrenken H, Killestein J, Barkhof F,
Wottschel V (2021) The sequence of structural, functional and cognitive changes in multiple sclerosis. NeuroImage Clin 29:102550. https://doi.org/10.1016/j.nicl.2020.102550
26. Delfanti RL, Piccioni DE, Handwerker J, Bahrami N, Krishnan A, Karunamuni R, Hattangadi-Gluth JA, Seibert TM, Srikant A, Jones KA, Snyder VS, Dale AM, White NS, McDonald CR, Farid N (2017) Imaging correlates for the 2016 update on WHO classification of grade II/III gliomas: implications for IDH, 1p/19q and ATRX status. J Neurooncol 135(3):601–609. https://doi.org/10.1007/s11060-017-2613-7
27. Disease Modifying Therapies (2021) Disease modifying therapies for MS. National Multiple Sclerosis Society, New York, pp 3–21
28. Dubbink HJ, Atmodimedjo PN, Kros JM, French PJ, Sanson M, Idbaih A, Wesseling P, Enting R, Spliet W, Tijssen C, Dinjens WNM, Gorlia T, van den Bent MJ (2015) Molecular classification of anaplastic oligodendroglioma using next-generation sequencing: a report of the prospective randomized EORTC Brain Tumor Group 26951 phase III trial. Neuro-Oncology 18(3):388–400. https://doi.org/10.1093/neuonc/nov182
29. Dubois B, Feldman HH, Jacova C, Hampel H, Molinuevo JL et al (2014) Advancing research diagnostic criteria for Alzheimer's disease: the IWG-2 criteria. Lancet Neurol 13(6):614–629. https://doi.org/10.1016/s1474-4422(14)70090-0
30. Ducharme S, Price BH, Larvie M, Dougherty DD, Dickerson BC (2015) Clinical approach to the differential diagnosis between behavioral variant frontotemporal dementia and primary psychiatric disorders. Am J Psychiatry 172(9):827–837. https://doi.org/10.1176/appi.ajp.2015.14101248
31. Eckel-Passow JE, Lachance DH, Molinaro AM, Walsh KM, Decker PA et al (2015) Glioma groups based on 1p/19q, IDH, and TERT promoter mutations in tumors. N Engl J Med 372(26):2499–2508. https://doi.org/10.1056/NEJMoa1407279
32. Eijlers AJC, van Geest Q, Dekker I, Steenwijk MD, Meijer KA, Hulst HE, Barkhof F, Uitdehaag BMJ, Schoonheim MM, Geurts JJG (2018) Predicting cognitive decline in multiple sclerosis: a 5-year follow-up study. Brain 141(9):2605–2618. https://doi.org/10.1093/brain/awy202
33. El-Koussy M, Schroth G, Brekenfeld C, Arnold M (2014) Imaging of acute ischemic stroke. Eur Neurol 72(5–6):309–316. https://doi.org/10.1159/000362719
34. Eshaghi A, Young A, Wijeratne P, Prados F, Arnold D, Narayanan S, Guttmann C, Barkhof F, Alexander D, Thompson A, Chard D, Ciccarelli O (2021) Identifying multiple sclerosis subtypes using unsupervised machine learning and MRI data. Nat Commun 12(1). https://doi.org/10.1038/s41467-021-22265-2
35. Ezzati A, Lipton RB, Alzheimer's Disease Neuroimaging Initiative (2020) Machine learning predictive models can improve efficacy of clinical trials for Alzheimer's disease. J Alzheimers Dis 74(1):55–63. https://doi.org/10.3233/jad-190822
36. Falahati F, Westman E, Simmons A (2014) Multivariate data analysis and machine learning in Alzheimer's disease with a focus on structural magnetic resonance imaging. J Alzheimers Dis 41(3):685–708. https://doi.org/10.3233/jad-131928
37. Gasperini C, Prosperini L, Tintoré M, Sormani MP, Filippi M, Rio J, Palace J, Rocca MA, Ciccarelli O, Barkhof F, Sastre-Garriga J, Vrenken H, Frederiksen JL, Yousry TA, Enzinger C, Rovira A, Kappos L, Pozzilli C, Montalban X, De Stefano N, The MAGNIMS Study Group (2019) Unraveling treatment response in multiple sclerosis. Neurology 92(4):180–192. https://doi.org/10.1212/WNL.0000000000006810
38. Gessler F, Bernstock JD, Braczynski A, Lescher S, Baumgarten P, Harter PN, Mittelbronn M, Wu T, Seifert V, Senft C (2019) Surgery for glioblastoma in light of molecular markers: impact of resection and MGMT promoter methylation in newly diagnosed IDH-1 wild-type glioblastomas. Neurosurgery 84(1):190–197. https://doi.org/10.1093/neuros/nyy049
39. Goodkin O, Pemberton H, Vos SB, Prados F, Sudre CH, Moggridge J, Cardoso MJ, Ourselin S, Bisdas S, White M, Yousry T, Thornton J, Barkhof F (2019) The quantitative neuroradiology initiative framework: application to dementia. Br J Radiol 92(1101):20190365. https://doi.org/10.1259/bjr.20190365. PMID: 31368776
40. Gordon BA, Blazey TM, Su Y, Hari-Raj A, Dincer A et al (2018) Spatial patterns of neuroimaging biomarker change in individuals from families with autosomal dominant Alzheimer's disease: a longitudinal study. Lancet Neurol 17(3):241–250. https://doi.org/10.1016/S1474-4422(18)30028-0
41. Gore S, Chougule T, Jagtap J, Saini J, Ingalhalikar M (2021) A review of radiomics and deep predictive modeling in glioma characterization. Acad Radiol 28(11):1599–1621.
124. Signori A, Schiavetti I, Gallo F, Sormani MP (2015) Subgroups of multiple sclerosis patients with larger treatment benefits: a meta-analysis of randomized trials. Eur J Neurol 22(6):960–966. https://doi.org/10.1111/ene.12690
125. Silva S, Altmann A, Gutman B, Lorenzi M (2020) Fed-BioMed: a general open-source frontend framework for federated learning in healthcare. In: Albarqouni S, Bakas S, Kamnitsas K, Cardoso MJ, Landman B, Li W, Milletari F, Rieke N, Roth H, Xu D, Xu Z (eds) Domain adaptation and representation transfer, and distributed and collaborative learning. Springer International Publishing, Cham, pp 201–210
126. Simblett S, Matcham F, Curtis H, Greer B, Polhemus A, Novák J, Ferrao J, Gamble P, Hotopf M, Narayan V, Wykes T, Remote Assessment of Disease and Relapse – Central Nervous System (RADAR-CNS) Consortium (2020) Patients' measurement priorities for remote measurement technologies to aid chronic health conditions: qualitative analysis. JMIR Mhealth Uhealth 8(6):e15086. https://doi.org/10.2196/15086
127. Singh G, Manjila S, Sakla N, True A, Wardeh AH, Beig N, Vaysberg A, Matthews J, Prasanna P, Spektor V (2021) Radiomics and radiogenomics in gliomas: a contemporary update. Br J Cancer 125(5):641–657. https://doi.org/10.1038/s41416-021-01387-w
128. Smits M (2016) Imaging of oligodendroglioma. Br J Radiol 89(1060):20150857. https://doi.org/10.1259/bjr.20150857
129. Son HJ, Oh JS, Oh M, Kim SJ, Lee JH, Roh JH, Kim JS (2020) The clinical feasibility of deep learning-based classification of amyloid PET images in visually equivocal cases. Eur J Nucl Med Mol Imaging 47(2):332–341. https://doi.org/10.1007/s00259-019-04595-y
130. Sounderajah V, Ashrafian H, Golub RM, Shetty S, De Fauw J et al (2021) Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. BMJ Open 11(6). https://doi.org/10.1136/bmjopen-2020-047709
131. Stühler E, Braune S, Lionetto F, Heer Y, Jules E, Westermann C, Bergmann A, van Hövell P, Group NS, et al (2020) Framework for personalized prediction of treatment response in relapsing remitting multiple sclerosis. BMC Med Res Methodol 20. https://doi.org/10.1186/s12874-020-0906-6
132. Thompson AJ, Banwell BL, Barkhof F, Carroll WM, Coetzee T et al (2018) Diagnosis of multiple sclerosis: 2017 revisions of the McDonald criteria. Lancet Neurol 17(2):162–173. https://doi.org/10.1016/S1474-4422(17)30470-2
133. Tong T, Ledig C, Guerrero R, Schuh A, Koikkalainen J, Tolonen A, Rhodius H, Barkhof F, Tijms B, Lemstra AW, Soininen H, Remes AM, Waldemar G, Hasselbalch S, Mecocci P, Baroni M, Lötjönen J, van der Flier W, Rueckert D (2017) Five-class differential diagnostics of neurodegenerative diseases using random undersampling boosting. NeuroImage Clin 15:613–624. https://doi.org/10.1016/j.nicl.2017.06.012
134. van der Ende EL, Bron EE, Poos JM, Jiskoot LC, Panman JL et al (2021) A data-driven disease progression model of fluid biomarkers in genetic frontotemporal dementia. Brain, awab382. https://doi.org/10.1093/brain/awab382
135. van der Voort SR, Incekara F, Wijnenga MMJ, Kapsas G, Gahrmann R, Schouten JW, Tewarie RN, Lycklama GJ, Hamer PCDW, Eijgelaar RS, French PJ, Dubbink HJ, Vincent AJPE, Niessen WJ, van den Bent MJ, Smits M, Klein S (2020) WHO 2016 subtyping and automated segmentation of glioma using multi-task deep learning
136. van Kempen EJ, Post M, Mannil M, Kusters B, ter Laan M, Meijer FJA, Henssen DJHA (2021) Accuracy of machine learning algorithms for the classification of molecular features of gliomas on MRI: a systematic literature review and meta-analysis. Cancers 13(11). https://doi.org/10.3390/cancers13112606
137. van Leeuwen KG, Schalekamp S, Rutten MJ, van Ginneken B, de Rooij M (2021) Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur Radiol 31(6):3797–3804. https://doi.org/10.1007/s00330-021-07892-z
138. van Maurik IS, Vos SJ, Bos I, Bouwman FH, Teunissen CE et al (2019) Biomarker-based prognosis for people with mild cognitive impairment (ABIDE): a modelling study. Lancet Neurol 18(11):1034–1044. https://doi.org/10.1016/S1474-4422(19)30283-2
139. van Vliet D, de Vugt ME, Bakker C, Pijnenburg YAL, Vernooij-Dassen MJFJ, Koopmans RTCM, Verhey FRJ (2013) Time to diagnosis in young-onset dementia as compared with late-onset dementia. Psychol Med 43(2):
Chapter 16
Abstract
The imaging community has increasingly adopted machine learning (ML) methods to provide individua-
lized imaging signatures related to disease diagnosis, prognosis, and response to treatment. Clinical
neuroscience and cancer imaging have been two areas in which ML has offered particular promise.
However, many neurologic and neuropsychiatric diseases, as well as cancer, are often heterogeneous in
terms of their clinical manifestations, neuroanatomical patterns, or genetic underpinnings. Therefore, in
such cases, seeking a single disease signature might be ineffectual in delivering individualized precision
diagnostics. The current chapter focuses on ML methods, especially semi-supervised clustering, that seek
disease subtypes using imaging data. Work from Alzheimer’s disease and its prodromal stages, psychosis,
depression, autism, and brain cancer are discussed. Our goal is to provide the readers with a broad overview
in terms of methodology and clinical applications.
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_16, © The Author(s) 2023
1 Introduction
492 Junhao Wen et al.
2.1 Unsupervised Clustering
Many recent efforts to discover the heterogeneous nature of brain diseases have investigated different unsupervised clustering algorithms [10, 11, 20–32]. Among these approaches, the key clustering methods are often K-means, hierarchical clustering, and nonnegative matrix factorization (NMF) (Fig. 1). In this
Fig. 1 Schematic diagram of representative unsupervised clustering methods, K-means, NMF, and hierarchi-
cal clustering
2.1.1 K-Means Clustering
K-means clustering aims to directly partition the n patients into k clusters. Each patient belongs to the cluster with the nearest mean (i.e., cluster centroid), quantified by a distance metric of choice (e.g., Euclidean distance). Since searching for the global minimum in clustering is computationally difficult (NP-hard), the K-means algorithm searches for local minima via an iterative refinement approach. Given an initial set of k centroids, this usually involves two steps: (i) the assignment step, assigning each data point to the cluster whose centroid has the least squared Euclidean distance, and (ii) the update step, recalculating the means (centroids) of all data points assigned to each cluster. The two steps continue iteratively until convergence, i.e., until the assignments no longer change. More details regarding the K-means algorithm are provided in Chap. 2, Subheading 12.1. Please refer to [33–35] for representative studies using K-means for disease subtyping.
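The two-step refinement described above can be sketched in a few lines of NumPy. The two "patient groups" below are synthetic; this is an illustration of the algorithm, not code from the cited studies:

```python
# Minimal K-means sketch: (i) assignment step, (ii) update step, iterated
# until the assignments no longer change.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()  # init
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # (i) assignment: nearest centroid by squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # (ii) update: each centroid becomes the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated synthetic "patient groups" in a 2-D feature space
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
```

With well-separated groups the iteration recovers the two clusters regardless of which points are drawn as initial centroids; in degenerate cases the result depends on initialization, which is why multiple restarts are common in practice.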
2.1.2 NMF Clustering
Nonnegative matrix factorization (NMF) is a method that implicitly performs clustering by taking advantage of the fact that complex patterns can be construed as a sum of simple parts. In essence, the input data X ∈ ℝ^{p×m} is factorized into two nonnegative matrices C ∈ ℝ^{p×k} and L ∈ ℝ^{k×m}, which we refer to as the component matrix and the loading coefficient matrix, respectively. This method has been widely used as an effective dimensionality reduction technique in signal processing and image analysis [36]. By its nature, the L matrix can be directly used for clustering purposes, which is analogous to K-means if we impose an orthogonality constraint on the L matrix. Specifically, if L_kj > L_ij for all i ≠ k, the data point x_j is assigned to the k-th cluster. The column vectors of the C matrix indicate the cluster centroids. Please refer to [32] for a representative study using NMF for disease subtyping.
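A minimal sketch of NMF used this way follows. Note one assumption about conventions: scikit-learn stores samples as rows, so its W matrix plays the role of the loading matrix L described above (transposed), and its components play the role of C. The two "patterns" are invented for illustration:

```python
# NMF-based clustering sketch: factorize nonnegative data, then assign each
# sample to its dominant component (argmax over the loading coefficients).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
parts = np.array([[1.0, 0.0, 0.0, 1.0],    # hypothetical "pattern" 1
                  [0.0, 1.0, 1.0, 0.0]])   # hypothetical "pattern" 2
weights = 0.5 + np.abs(rng.normal(size=(100, 2)))
weights[:50, 1] = 0.0                      # first 50 samples express pattern 1
weights[50:, 0] = 0.0                      # last 50 samples express pattern 2
X = weights @ parts + 0.01 * np.abs(rng.normal(size=(100, 4)))  # nonnegative data

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)        # loading coefficients (one row per sample)
clusters = W.argmax(axis=1)       # dominant component = cluster assignment
```

The argmax rule is exactly the assignment criterion stated above: a sample joins the cluster whose loading coefficient dominates.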
2.1.4 Representative Unsupervised Clustering Methods
SuStaIn [10] is an unsupervised clustering method for subtype and stage inference. Specifically, SuStaIn models a subtype as a group of subjects sharing a particular biomarker progression pattern. The biomarker evolution of each subtype is modeled as a linear z-score model, a continuous generalization of the original event-based model [39]: each biomarker follows a piecewise linear trajectory over a common timeframe. The key advantage of this model is that it can work with purely cross-sectional data and derive imaging signatures of subtype and stage simultaneously.
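The piecewise linear z-score trajectory underlying this model can be sketched as follows. The event placements and z-score levels are illustrative assumptions, not values fitted by SuStaIn:

```python
# Hedged sketch of a linear z-score trajectory: a biomarker's z-score rises
# linearly between successive z-score "events" on a common pseudo-time axis.
import numpy as np

def zscore_trajectory(t, event_times, z_levels):
    """Z-score at normalized stage t in [0, 1]; hits z_levels[i] at event_times[i]."""
    return np.interp(t, [0.0] + list(event_times), [0.0] + list(z_levels))

# A biomarker reaching z = 1, 2, 3 at stages 0.25, 0.5, and 1.0 (assumed values)
stages = np.linspace(0.0, 1.0, 101)
traj = zscore_trajectory(stages,
                         event_times=[0.25, 0.5, 1.0],
                         z_levels=[1.0, 2.0, 3.0])
```

In the full model, each subtype has its own ordering of such z-score events across biomarkers, and a subject's stage locates them along this common timeline.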
A Bayesian latent Dirichlet allocation model [11] was proposed
to extract latent AD-related atrophy factors. This probabilistic
approach hypothesizes that each patient expresses one or more
latent factors, and each factor is associated with distinct but possibly
overlapping atrophy patterns. However, due to the nature of latent
Dirichlet allocation methods, the input images have to be discre-
tized. Moreover, this method exclusively models brain atrophy
while ignoring brain enlargement. For example, larger brain
volumes in basal ganglia have been associated with one subtype of
schizophrenia [12].
Fig. 2 Schematic diagram of semi-supervised clustering methods. Figure is adapted from [14]
group and the PT group, thereby teasing out clusters that are likely
driven by distinct pathological trajectories, instead of by global
similarity/dissimilarity in the data, which is the core principle underlying conventional unsupervised clustering methods.
In the following subsections, we briefly discuss four semi-
supervised clustering methods. These methods employ different
techniques to seek this “1-to-k” mapping. In particular, CHI-
MERA [16] and Smile-GAN [9] utilize generative models to
achieve this mapping, while HYDRA [6] and MAGIC [8] are
built on top of discriminative models.
where

L_GAN(D, f) = E_{y∼p_PT}[log(D(y))] + E_{z∼p_Sub, x∼p_CN}[1 − log(D(f(x, z)))]    (2)

+ μ Σ_{i,j : y_i = −1} s_{i,j} max{0, 1 + w_j^T x_i + b_j}    (5)
where w_j and b_j are the weight and bias of each hyperplane, respectively, μ is a penalty parameter on the training error, and S is the subtype membership matrix of dimension m × k deciding
Subtyping Brain Diseases 499
2.2.4 MAGIC
MAGIC was proposed to overcome one of the main limitations HYDRA faced: a single-scale set of features (e.g., atlas-based regions of interest) may not be sufficient to capture the subtle differences underlying disease heterogeneity beyond global demographic effects, since ample evidence has shown that the brain is fundamentally composed of multi-scale structural and functional entities. To this end, MAGIC extracts multi-scale features in a coarse-to-fine granular fashion via stochastic orthogonal projective nonnegative matrix factorization (opNMF) [42], a very effective, unbiased, data-driven method for extracting biologically interpretable and reproducible feature representations. Together with these multi-scale features, HYDRA is embedded into a double-cyclic optimization procedure to yield robust and scale-consistent cluster solutions.
MAGIC encapsulates the two previously proposed methods (i.e., opNMF and HYDRA) and optimizes the clustering objective for each single-scale feature set as a sub-optimization problem. To fuse the multi-scale clustering information and enforce scale-consistent clusters, it adopts a double-cyclic procedure that transfers and fine-tunes the clustering polytope. First, (i) the inner cyclic procedure: recall that HYDRA decides the clusters based on the subtype membership matrix (S). MAGIC first initializes the S matrix with a specific single-scale feature set, i.e., L_i, and the S matrix is then transferred to the next feature set, L_{i+1}, until a predefined stopping criterion is met (i.e., the clustering solution is stable across scales). Second, (ii) the outer cyclic procedure: the inner cyclic procedure is repeated, initializing with each single-scale feature set in turn. Finally, to determine the final subtype assignment, a consensus clustering is performed by computing a co-occurrence matrix from all the clustering results and then applying spectral clustering [43].
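The final consensus step can be sketched as follows: build a subject-by-subject co-occurrence matrix from the per-scale cluster labels, then cluster that matrix spectrally [43]. This is a simplified NumPy sketch on toy labels, with a hand-rolled k-means, not MAGIC's actual pipeline:

```python
import numpy as np

def consensus_spectral(label_sets, k, n_iter=50):
    """Fuse several clustering solutions (e.g., one per feature scale) via a
    co-occurrence matrix, then cluster that matrix spectrally (Ng et al.)."""
    labels = np.asarray(label_sets)              # (n_scales, n_subjects)
    n = labels.shape[1]
    # co-occurrence: fraction of scales in which subjects i and j co-cluster
    W = (labels[:, :, None] == labels[:, None, :]).mean(axis=0)
    # symmetric normalized graph Laplacian
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # spectral embedding: eigenvectors of the k smallest eigenvalues
    _, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :k]
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    # Lloyd's k-means with greedy farthest-point initialization
    centers = [U[0]]
    for _ in range(1, k):
        dists = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        assign = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = U[assign == j].mean(axis=0)
    return assign

# toy labels from three "scales" that agree up to label permutation
scales = [[0] * 5 + [1] * 5, [1] * 5 + [0] * 5, [0] * 5 + [1] * 5]
final = consensus_spectral(scales, k=2)
```

Note that the co-occurrence matrix is invariant to how each scale names its clusters, which is exactly why it is a convenient fusion device.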
Brain disorders and diseases affect the human brain across a wide
age range. Neurodevelopmental disorders, such as autism spectrum
disorders (ASD), are usually present from early childhood and
affect daily functioning [44]. Psychotic disorders, such as schizo-
phrenia, involve psychosis that is typically diagnosed for the first
time in late adolescence or early adulthood [45]. Dementia and
mild cognitive impairment (MCI) prevail both in late mid-life for
3.1 Autism Spectrum Disorder
ASD encompasses a broad spectrum of social deficits and atypical behaviors [48]. Heterogeneity of its clinical presentation has sparked massive research efforts to find subtypes that better delineate its diagnosis [49, 50]. Recent initiatives to aggregate neuroimaging data in ASD, such as ABIDE [51] and EU-AIMS [52], have also motivated large-scale subtyping projects using imaging signatures [53].
Different clustering methods have been applied to reveal struc-
tural brain-based subtypes, but primarily traditional techniques
such as the K-means [54] or hierarchical clustering [37]. Besides
structural MRI, functional MRI [55] and EEG [56] have also been
popular modalities. For reasons discussed earlier, normative clus-
tering and dimensional analyses are better suited to parse a patient
population that is highly heterogeneous [57]. However, efforts in this direction remain preliminary, with only a few recent publications using cortical thickness [58]. Taken together, although more validation and replication efforts are necessary to define any reliable neuroanatomical subtypes of ASD, some convergence in findings has been noted [53]. First, most sets of ASD neuroimaging subtypes indicate a combination of both increases and decreases in
imaging features compared to the CN group, instead of pointing
in a uniform direction. Second, most subtypes are characterized by
spatially distributed imaging patterns instead of isolated or focal
patterns. Both findings emphasize the significant heterogeneity in
ASD brains and the need for better stratification.
The search for subtypes in the ASD population has unique
challenges. First, the early onset of ASD implies that it is heavily
influenced by neurodevelopmental processes. Depending on the
selected age range, the results may significantly differ. Second,
ASD is more prevalent in males, with three to four male cases for
one female case [59], which adds a layer of potential bias. Third,
individuals with ASD often suffer from psychiatric comorbidities, such as
ADHD, anxiety disorders, and obsessive-compulsive disorder,
among many others [60], which, if not screened carefully, can
dilute or alter the true signal.
3.3 Major Depressive Disorder
MDD is a common, severe, and recurrent disorder affecting over 300 million people worldwide; it is characterized by low mood, apathy, and social withdrawal, with symptoms spanning multiple domains [69]. Its vast heterogeneity is exemplified by the fact that, according to DSM-5 criteria, at least 227 and up to 16,400 unique symptom presentations exist [70, 71]. The potential causes of this heterogeneity range from divergent clinical symptom profiles to genetic etiologies and individual differences in treatment outcomes.
Despite neurobiological findings in MDD spanning cortical
thickness, gray matter volume (GMV), and fractional anisotropy
(FA) measures, objective brain biomarkers that can be used to
diagnose and predict disease course and outcome remain elusive
[71–73]. Recently, there have been efforts to identify neurobiolo-
gically based subtypes of depression using a bottom-up approach,
mainly using data from resting-state fMRI [71]. Several studies [33–35] employed k-means clustering and group iterative multiple model estimation to identify two functional connectivity subtypes, while Tokuda et al. [74] and Drysdale et al. [75] identified three and four subtypes, respectively, using nonparametric Bayesian mixture models and hierarchical clustering. These
subtypes are characterized by reduced connectivity in different net-
works, including the default mode network (DMN), ventral atten-
tion network, and frontostriatal and limbic dysfunction. Regarding
structural neuroimaging, one study has used k-means clustering on
fractional anisotropy (FA) data to identify two depression subtypes.
The first subtype was characterized by decreased FA in the right
temporal lobe and the right middle frontal areas and was associated
with an older age at onset. In contrast, the second subtype was
characterized by increased FA in the left occipital lobe and was
associated with a younger age at onset [76].
Current research on identifying brain subtypes in MDD has produced promising results that are nonetheless limited by methodological and design shortcomings. While some studies have shown clinical promise, such as predicting higher depressive symptomatology and lower sustained positive mood [34, 35], depression duration [33], and TMS therapy response [75], they are constrained by relatively small sample sizes; nuisance variation such as age, gender, and common ancestry; lack of external validation; and lack of statistical significance testing of the identified clusters. Furthermore, there has been a lack of ambition in the use of novel clustering techniques. Clustering based on
3.4 MCI and AD
AD, along with its prodromal stage, MCI, is the most common neurodegenerative disease, affecting millions across the globe. Although a plethora of imaging studies have derived AD-related imaging signatures, most have ignored the heterogeneity within AD. Recently, a growing body of work has sought to derive imaging signatures of AD that are heterogeneity-aware (i.e., subtypes) [7–11].
Most previous studies leveraged unsupervised clustering meth-
ods such as Sustain [10], NMF [32], latent Dirichlet allocation
[11], and hierarchical clustering [24, 25, 30, 38]. Other papers
[6, 9, 20, 77, 78] utilized semi-supervised clustering methods. Due to variability in the choice of databases and methodologies, and the lack of ground truth in the context of clustering, the reported numbers of clusters and the subtypes' neuroanatomical patterns differ and cannot be directly compared. The targeted
heterogeneous population of study also varies across papers. For
instance, [6] focused on dissecting the neuroanatomical heteroge-
neity for AD patients, while [77] included AD plus MCI and [20]
studied MCI only. However, some common subtypes were found
in different studies. First, a subtype showing a typical diffuse atrophy pattern over the entire brain was observed in several studies [6, 8–10, 22, 27, 29, 30, 32, 38, 77]. Another subtype demonstrating nearly normal brain anatomy was robustly identified [8, 9, 16, 20, 22, 24, 25, 29, 30]. Moreover, studies [8, 9, 29, 30, 77]
also reported one subtype showing atypical AD patterns (i.e., hip-
pocampus or medial temporal lobe atrophy spared).
Though these methods enabled a better understanding of het-
erogeneity in AD, there are still limitations and challenges. First,
due to demographic variations and the existence of comorbidities,
it is not guaranteed that models cluster the data based on variations
of the pathology of interest. Semi-supervised methods might tackle
this problem to some extent, but more careful sample selection and
further study with longitudinal data may ensure disease specificity.
Second, spatial differences and temporal changes may simulta-
neously contribute to subtypes derived through clustering meth-
ods. Third, subtypes captured from neuroimaging data alone offer limited insight into disease treatment; thus, a joint study of neuroimaging and genetic heterogeneity may provide greater clinical value [14, 79].
3.5 Brain Cancer
Brain tumors, such as glioblastoma (GBM), exhibit extensive inter- and intra-tumor heterogeneity, diffuse infiltration, and invasiveness of various immune and stromal cell populations, which pose diagnostic and prognostic challenges and render standard therapies futile [80]. Deciphering the underlying heterogeneity of brain tumors, which arises from their genomic instability, plays a key role in understanding and predicting the course of tumor progression and its response to standard therapies, and thereby in designing effective therapies targeted at aberrant genetic alterations [81, 82]. Medical imaging noninvasively portrays the
phenotypic differences of brain tumors and their microenviron-
ment caused by molecular activities of tumors on a macroscopic
scale [83, 84]. It has the potential to provide readily accessible and
surrogate biomarkers of particular genomic alterations, predict
response to therapy, avoid risks of tumor biopsy or inaccurate
diagnosis due to sampling errors, and ultimately develop persona-
lized therapies to improve patient outcomes. An imaging subtype
of brain tumors may provide a wealth of information about the
tumor, including distinct molecular pathways [85, 86].
Recent studies on radiomic analysis of multiparametric MRI
(mpMRI) scans provide evidence of distinct phenotypic presenta-
tion of brain tumors associated with specific molecular character-
istics. These studies propose that quantification of tumor
morphology, texture, regional microvasculature, cellular density,
or microstructural properties can map to different imaging sub-
types. In particular, one study [87] discovered three distinct clus-
ters of GBM subtypes through unsupervised clustering of these
features, with significant differences in survival probabilities and
associations with specific molecular signaling pathways. These
imaging subtypes, namely solid, irregular, and rim-enhancing,
were significantly linked to different clinical outcomes and molecu-
lar characteristics, including isocitrate dehydrogenase-1,
O6-methylguanine-DNA methyltransferase, epidermal growth fac-
tor receptor variant III, and transcriptomic molecular subtype
composition.
These studies have offered new insights into the characteriza-
tion of tumor heterogeneity on both microscopic, i.e., histology
and molecular, and macroscopic, i.e., imaging levels, consequently
providing a more comprehensive understanding of the tumor
aggressiveness and patient prognosis, and ultimately, the develop-
ment of personalized treatments.
4 Conclusion
Acknowledgements
References
1. Murray ME, Graff-Radford NR, Ross OA, Petersen RC, Duara R, Dickson DW (2011) Neuropathologically defined subtypes of Alzheimer's disease with distinct clinical characteristics: a retrospective study. Lancet Neurol 10(9):785–796. https://doi.org/10.1016/S1474-4422(11)70156-9
2. Noh Y, Jeon S, Lee JM, Seo SW, Kim GH, Cho H, Ye BS, Yoon CW, Kim HJ, Chin J, Park KH, Heilman KM, Na DL (2014) Anatomical heterogeneity of Alzheimer disease: based on cortical thickness on MRIs. Neurology 83(21):1936–1944. https://doi.org/10.1212/WNL.0000000000001003
3. Whitwell JL, Petersen RC, Negash S, Weigand SD, Kantarci K, Ivnik RJ, Knopman DS, Boeve BF, Smith GE, Jack CR (2007) Patterns of atrophy differ among specific subtypes of mild cognitive impairment. Arch Neurol 64(8):1130–1138. https://doi.org/10.1001/archneur.64.8.1130
4. Miller KL, Alfaro-Almagro F, Bangerter NK, Thomas DL, Yacoub E, Xu J, Bartsch AJ, Jbabdi S, Sotiropoulos SN, Andersson JLR, Griffanti L, Douaud G, Okell TW, Weale P, Dragonu I, Garratt S, Hudson S, Collins R, Jenkinson M, Matthews PM, Smith SM (2016) Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nat Neurosci 19(11):1523–1536. https://doi.org/10.1038/nn.4393
5. Petersen RC, Aisen PS, Beckett LA, Donohue MC, Gamst AC, Harvey DJ, Jack CR, Jagust WJ, Shaw LM, Toga AW, Trojanowski JQ, Weiner MW (2010) Alzheimer's disease neuroimaging initiative (ADNI): clinical characterization. Neurology 74(3):201–209. https://doi.org/10.1212/WNL.0b013e3181cb3e25
6. Varol E, Sotiras A, Davatzikos C (2017) HYDRA: revealing heterogeneity of imaging and genetic patterns through a multiple max-margin discriminative analysis framework. NeuroImage 145:346–364. https://doi.org/10.1016/j.neuroimage.2016.02.041
7. Vogel JW, Young AL, Oxtoby NP, Smith R, Ossenkoppele R, Strandberg OT, La Joie R, Aksman LM, Grothe MJ, Iturria-Medina Y, Pontecorvo MJ, Devous MD, Rabinovici GD, Alexander DC, Lyoo CH, Evans AC, Hansson O (2021) Four distinct trajectories of tau deposition identified in Alzheimer's disease. Nat Med 27(5):871–881. https://doi.org/10.1038/s41591-021-01309-6
8. Wen J, Varol E, Sotiras A, Yang Z, Chand GB, Erus G, Shou H, Abdulkadir A, Hwang G, Dwyer DB, Pigoni A, Dazzan P, Kahn RS, Schnack HG, Zanetti MV, Meisenzahl E, Busatto GF, Crespo-Facorro B, Rafael RG, Pantelis C, Wood SJ, Zhuo C, Shinohara RT, Fan Y, Gur RC, Gur RE, Satterthwaite TD, Koutsouleris N, Wolf DH, Davatzikos C, Alzheimer's disease neuroimaging initiative
43. Ng A, Jordan M, Weiss Y (2002) On spectral clustering: analysis and an algorithm. In: Dietterich T, Becker S, Ghahramani Z (eds) Advances in neural information processing systems, vol 14. MIT Press, Cambridge, MA. https://proceedings.neurips.cc/paper/2001/file/801272ee79cfde7fa5960571fee36b9b-Paper.pdf
44. Faras H, Al AN, Tidmarsh L (2010) Autism spectrum disorders. Ann Saudi Med 30(4):295–300. https://doi.org/10.4103/0256-4947.65261
45. Gottesman II, Shields J, Hanson DR (1982) Schizophrenia. CUP Archive. Google-Books-ID: eoA6AAAAIAAJ
46. Mucke L (2009) Alzheimer's disease. Nature 461(7266):895–897. https://doi.org/10.1038/461895a
47. Ostrom QT, Adel Fahmideh M, Cote DJ, Muskens IS, Schraw JM, Scheurer ME, Bondy ML (2019) Risk factors for childhood and adult primary brain tumors. Neuro-Oncology 21(11):1357–1375. https://doi.org/10.1093/neuonc/noz123
48. Masi A, DeMayo MM, Glozier N, Guastella AJ (2017) An overview of autism spectrum disorder, heterogeneity and treatment options. Neurosci Bull 33(2):183–193. https://doi.org/10.1007/s12264-017-0100-y
49. Lord C, Jones RM (2012) Annual research review: re-thinking the classification of autism spectrum disorders. J Child Psychol Psychiatry Allied Disciplines 53(5):490–509. https://doi.org/10.1111/j.1469-7610.2012.02547.x
50. Wolfers T, Floris DL, Dinga R, van Rooij D, Isakoglou C, Kia SM, Zabihi M, Llera A, Chowdanayaka R, Kumar VJ, Peng H, Laidi C, Batalle D, Dimitrova R, Charman T, Loth E, Lai MC, Jones E, Baumeister S, Moessnang C, Banaschewski T, Ecker C, Dumas G, O'Muircheartaigh J, Murphy D, Buitelaar JK, Marquand AF, Beckmann CF (2019) From pattern classification to stratification: towards conceptualizing the heterogeneity of Autism Spectrum Disorder. Neurosci Biobehav Rev 104:240–254. https://doi.org/10.1016/j.neubiorev.2019.07.010
51. Di Martino A, O'Connor D, Chen B, Alaerts K, Anderson JS, Assaf M, Balsters JH, Baxter L, Beggiato A, Bernaerts S, Blanken LME, Bookheimer SY, Braden BB, Byrge L, Castellanos FX, Dapretto M, Delorme R, Fair DA, Fishman I, Fitzgerald J, Gallagher L, Keehn RJJ, Kennedy DP, Lainhart JE, Luna B, Mostofsky SH, Müller RA, Nebel MB, Nigg JT, O'Hearn K, Solomon M, Toro R, Vaidya CJ, Wenderoth N, White T, Craddock RC, Lord C, Leventhal B, Milham MP (2017) Enhancing studies of the connectome in autism using the autism brain imaging data exchange II. Sci Data 4:170010. https://doi.org/10.1038/sdata.2017.10
52. Loth E, Charman T, Mason L, Tillmann J, Jones EJH, Wooldridge C, Ahmad J, Auyeung B, Brogna C, Ambrosino S, Banaschewski T, Baron-Cohen S, Baumeister S, Beckmann C, Brammer M, Brandeis D, Bölte S, Bourgeron T, Bours C, de Bruijn Y, Chakrabarti B, Crawley D, Cornelissen I, Acqua FD, Dumas G, Durston S, Ecker C, Faulkner J, Frouin V, Garces P, Goyard D, Hayward H, Ham LM, Hipp J, Holt RJ, Johnson MH, Isaksson J, Kundu P, Lai MC, D'ardhuy XL, Lombardo MV, Lythgoe DJ, Mandl R, Meyer-Lindenberg A, Moessnang C, Mueller N, O'Dwyer L, Oldehinkel M, Oranje B, Pandina G, Persico AM, Ruigrok ANV, Ruggeri B, Sabet J, Sacco R, Cáceres ASJ, Simonoff E, Toro R, Tost H, Waldman J, Williams SCR, Zwiers MP, Spooren W, Murphy DGM, Buitelaar JK (2017) The EU-AIMS Longitudinal European Autism Project (LEAP): design and methodologies to identify and validate stratification biomarkers for autism spectrum disorders. Mol Autism 8:24. https://doi.org/10.1186/s13229-017-0146-8
53. Hong SJ, Vogelstein JT, Gozzi A, Bernhardt BC, Yeo BTT, Milham MP, Di Martino A (2020) Toward neurosubtypes in Autism. Biol Psychiatry 88(1):111–128. https://doi.org/10.1016/j.biopsych.2020.03.022
54. Chen H, Uddin LQ, Guo X, Wang J, Wang R, Wang X, Duan X, Chen H (2019) Parsing brain structural heterogeneity in males with autism spectrum disorder reveals distinct clinical subtypes. Hum Brain Mapp 40(2):628–637. https://doi.org/10.1002/hbm.24400
55. Easson AK, Fatima Z, McIntosh AR (2019) Functional connectivity-based subtypes of individuals with and without Autism Spectrum Disorder. Netw Neurosci 3(2):344–362. https://doi.org/10.1162/netn_a_00067
56. Duffy FH, Als H (2019) Autism, spectrum or clusters? An EEG coherence study. BMC Neurol 19(1):27. https://doi.org/10.1186/s12883-019-1254-1
57. Tang S, Sun N, Floris DL, Zhang X, Di Martino A, Yeo BTT (2020) Reconciling dimensional and categorical models of autism heterogeneity: a brain connectomics and behavioral study. Biol Psychiatry 87(12):1071–1082. https://doi.org/10.1016/j.biopsych.2019.11.009
58. Zabihi M, Floris DL, Kia SM, Wolfers T, Tillmann J, Arenas AL, Moessnang C, Banaschewski T, Holt R, Baron-Cohen S, Loth E, Charman T, Bourgeron T, Murphy D, Ecker C, Buitelaar JK, Beckmann CF, Marquand A, EU-AIMS LEAP Group (2020) Fractionating autism based on neuroanatomical normative modeling. Transl Psychiatry 10(1):384. https://doi.org/10.1038/s41398-020-01057-0
59. Loomes R, Hull L, Mandy WPL (2017) What is the male-to-female ratio in autism spectrum disorder? A systematic review and meta-analysis. J Am Acad Child Adolesc Psychiatry 56(6):466–474. https://doi.org/10.1016/j.jaac.2017.03.013
60. Gillberg C, Fernell E (2014) Autism plus versus autism pure. J Autism Dev Disord 44(12):3274–3276. https://doi.org/10.1007/s10803-014-2163-1
61. Haijma SV, Van Haren N, Cahn W, Koolschijn PCMP, Hulshoff Pol HE, Kahn RS (2013) Brain volumes in schizophrenia: a meta-analysis in over 18000 subjects. Schizophr Bull 39(5):1129–1138. https://doi.org/10.1093/schbul/sbs118
62. Pantelis C, Velakoulis D, McGorry PD, Wood SJ, Suckling J, Phillips LJ, Yung AR, Bullmore ET, Brewer W, Soulsby B, Desmond P, McGuire PK (2003) Neuroanatomical abnormalities before and after onset of psychosis: a cross-sectional and longitudinal MRI comparison. Lancet (London, England) 361(9354):281–288. https://doi.org/10.1016/S0140-6736(03)12323-9
63. Abi-Dargham A, Horga G (2016) The search for imaging biomarkers in psychiatric disorders. Nat Med 22(11):1248–1255. https://doi.org/10.1038/nm.4190
64. Insel T, Cuthbert B (2015) Medicine. Brain disorders? Precisely. Science (New York, NY) 348:499–500. https://doi.org/10.1126/science.aab2358
65. Kaczkurkin AN, Moore TM, Sotiras A, Xia CH, Shinohara RT, Satterthwaite TD (2020) Approaches to defining common and dissociable neurobiological deficits associated with psychopathology in youth. Biol Psychiatry 88(1):51–62. https://doi.org/10.1016/j.biopsych.2019.12.015
66. Yang Z, Xu Y, Xu T, Hoy CW, Handwerker DA, Chen G, Northoff G, Zuo XN, Bandettini PA (2014) Brain network informed subject community detection in early-onset schizophrenia. Sci Rep 4:5549. https://doi.org/10.1038/srep05549
67. Brodersen KH, Deserno L, Schlagenhauf F, Lin Z, Penny WD, Buhmann JM, Stephan KE (2014) Dissecting psychiatric spectrum disorders by generative embedding. NeuroImage Clin 4:98–111. https://doi.org/10.1016/j.nicl.2013.11.002
68. Yan W, Zhao M, Fu Z, Pearlson GD, Sui J, Calhoun VD (2022) Mapping relationships among schizophrenia, bipolar and schizoaffective disorders: a deep classification and clustering framework using fMRI time series. Schizophr Res 245:141–150. https://doi.org/10.1016/j.schres.2021.02.007
69. World Health Organization (1992) The ICD-10 classification of mental and behavioural disorders: clinical descriptions and diagnostic guidelines. Tech. rep., World Health Organization, p 362. https://apps.who.int/iris/handle/10665/37958. ISBN: 9787117019576
70. Fried EI, Nesse RM (2015) Depression is not a consistent syndrome: an investigation of unique symptom patterns in the STAR*D study. J Affective Disord 172:96–102. https://doi.org/10.1016/j.jad.2014.10.010
71. Lynch CJ, Gunning FM, Liston C (2020) Causes and consequences of diagnostic heterogeneity in depression: paths to discovering novel biological depression subtypes. Biol Psychiatry 88(1):83–94. https://doi.org/10.1016/j.biopsych.2020.01.012
72. Buch AM, Liston C (2021) Dissecting diagnostic heterogeneity in depression by integrating neuroimaging and genetics. Neuropsychopharmacology 46(1):156–175. https://doi.org/10.1038/s41386-020-00789-3
73. Rajkowska G, Miguel-Hidalgo JJ, Wei J, Dilley G, Pittman SD, Meltzer HY, Overholser JC, Roth BL, Stockmeier CA (1999) Morphometric evidence for neuronal and glial prefrontal cell pathology in major depression. Biol Psychiatry 45(9):1085–1098. https://doi.org/10.1016/s0006-3223(99)00041-4
74. Tokuda T, Yoshimoto J, Shimizu Y, Okada G, Takamura M, Okamoto Y, Yamawaki S, Doya K (2018) Identification of depression subtypes and relevant brain regions using a data-driven approach. Sci Rep 8:14082. https://doi.org/10.1038/s41598-018-32521-z
75. Drysdale AT, Grosenick L, Downar J, Dunlop K, Mansouri F, Meng Y, Fetcho RN, Zebley B, Oathes DJ, Etkin A, Schatzberg AF, Sudheimer K, Keller J, Mayberg HS, Gunning FM, Alexopoulos GS, Fox MD, Pascual-Leone A, Voss HU, Casey BJ, Dubin MJ, Liston C (2017) Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nat Med 23(1):28–38. https://doi.org/10.1038/nm.4246
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made. The images or other
third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Chapter 17
Abstract
Intense debate in the neurology community before 2010 culminated in hypothetical models of Alzheimer’s
disease progression: a pathophysiological cascade of biomarkers, each dynamic for only a segment of the full
disease timeline. Inspired by this, data-driven disease progression modeling emerged from the computer
science community with the aim to reconstruct neurodegenerative disease timelines using data from large
cohorts of patients, healthy controls, and prodromal/at-risk individuals. This chapter describes selected
highlights from the field, with a focus on utility for understanding and forecasting of disease progression.
Key words Disease progression, Disease understanding, Forecasting, Cross-sectional data, Longitu-
dinal data, Disease timelines, Disease trajectories, Subtyping, Biomarkers
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_17,
© The Author(s) 2023
512 Neil P. Oxtoby
Fig. 1 Quadrant matrix. D3PMs all estimate a disease timeline, with some capable of estimating multiple
subtype timelines, using either cross-sectional data (pseudo-timeline) or longitudinal data (time-shift).
Abbreviations: EBM, event-based model; DEBM, discriminative EBM; KDE-EBM, kernel density estimation
EBM; DPS, disease progression score; LTJMM, latent-time joint mixed model; GPPM, Gaussian process
progression model; SuStaIn, subtype and stage inference; SubLign, subtyping alignment
Disease Progression Modeling 513
2 Data Preprocessing
3.1 Single Timeline Estimation Using Cross-Sectional Data
There is only one framework for estimating disease timelines from cross-sectional data: event-based modeling.
3.1.1 Event-Based Model
The event-based model (EBM) emerged in 2011 [5, 6]. The concept is simple: in a progressive disease, biomarker measurements
only ever get worse, i.e., become increasingly and irreversibly
abnormal. Thus, among a cohort of individuals at different stages
of a single progressive disease, the cumulative sequence of bio-
marker abnormality events can be inferred from only a single visit
per individual. This requires making a few assumptions: measure-
ments from individuals are independent and represent samples
from a single sequence of cumulative abnormality, i.e., a single
timeline of disease progression. Such assumptions are common-
place in many statistical analyses of disease progression and are
reasonable approximations to make when analyzing data from
research studies that typically have strict inclusion and exclusion
criteria to focus on a single condition of interest. Unsurprisingly,
the event-based model has proven to be extremely powerful, pro-
ducing insight into many neurodegenerative diseases: sporadic
Alzheimer’s disease [7–10], familial Alzheimer’s disease [6, 11],
Huntington’s disease [6, 12], Parkinson’s disease [13], and others
[14, 15].
Fig. 2 Event-based models fit a mixture model to map biomarker values to abnormality probabilities. Left to
right shows the convergence of a kernel density estimate (KDE) mixture model. From Firth et al. [9] (CC BY 4.0)
EBM Fitting
The first step in fitting an event-based model maps biomarker
values to abnormality values, similar to the hypothetical curves of
biomarker abnormality proposed in 2010 [1, 2]. The EBM does
this probabilistically, using bivariate mixture modeling where indi-
viduals can be labeled either as pre-event/normal or post-event/
abnormal to allow for (later) events that are yet to occur in patients,
and similarly for the possibility of (earlier) events to have occurred
in asymptomatic individuals. Various distributions have been pro-
posed for this mixture modeling: combinations of uniform [5, 6],
Gaussian [5–7], and kernel density estimate (KDE) distributions
[9]. This is visualized in Fig. 2.
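As an illustration of this first step, a plain (unconstrained) two-component Gaussian mixture fitted by EM can convert biomarker values into abnormality probabilities. This is a simplified stand-in for the constrained mixtures used in practice [5–7, 9]:

```python
import numpy as np

def two_component_gmm(x, n_iter=200):
    """EM for a 1-D two-component Gaussian mixture; returns P(abnormal | x).
    A simplified stand-in for the constrained mixture models of EBM fitting."""
    mu = np.array([x.min(), x.max()])        # crude initialization
    sd = np.full(2, x.std())
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of component 0 (normal) and 1 (abnormal)
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) \
               / (sd * np.sqrt(2 * np.pi))
        r = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: update weights, means, and standard deviations
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return mu, sd, w, r[:, 1]

# synthetic biomarker: 200 normal values around 0, 200 abnormal around 4
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
mu, sd, w, p_abnormal = two_component_gmm(x)
```

The returned `p_abnormal` plays the role of the per-biomarker event probability used in the sequence-fitting step below.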
The second step in fitting an EBM over N events is to search
the space of N! possible sequences S to reveal the most likely
sequence (see refs. 6, 7, 9 for mathematical details). For small
N ≲ 10, it can be computationally feasible to perform an exhaustive
search over all possible N! sequences to find the maximum likeli-
hood/a posteriori solution. The EBM uses a combination of
multiply-initialized gradient ascent, followed by MCMC sampling
to estimate uncertainty in the sequence. This results in a model
posterior that is a collection of samples from the posterior proba-
bility density for each biomarker as a function of sequence position.
This is presented as a positional variance diagram [6], such as in
Fig. 3.
For further information and to try out EBM tutorials, the
reader is directed to the open-source kde_ebm package (github.
com/ucl-pond/kde_ebm) and disease-progression-modelling.
github.io.
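To make the two fitting steps concrete, here is a toy brute-force EBM in NumPy. The per-subject event/no-event likelihoods are assumed to be given directly (e.g., from the mixture-modeling step); this is a sketch, not the kde_ebm implementation:

```python
import numpy as np
from itertools import permutations

def ebm_loglik(seq, p_abn, p_nor):
    """EBM log-likelihood of one event sequence. p_abn[i, b] / p_nor[i, b]
    are P(subject i's biomarker b | event has / has not occurred), and
    stages are given a uniform prior."""
    n, N = p_abn.shape
    total = 0.0
    for i in range(n):
        # at stage k, the first k events in seq have occurred, the rest not
        stage_liks = [
            np.prod(p_abn[i, seq[:k]]) * np.prod(p_nor[i, seq[k:]])
            for k in range(N + 1)
        ]
        total += np.log(np.mean(stage_liks) + 1e-300)
    return total

def exhaustive_ebm(p_abn, p_nor):
    """Maximum-likelihood sequence by exhaustive search (feasible for N <~ 10)."""
    N = p_abn.shape[1]
    seqs = [np.array(s) for s in permutations(range(N))]
    return max(seqs, key=lambda s: ebm_loglik(s, p_abn, p_nor))

# three subjects at successive stages of the true sequence 0 -> 1 -> 2
p_abn = np.array([[0.9, 0.1, 0.1],
                  [0.9, 0.9, 0.1],
                  [0.9, 0.9, 0.9]])
p_nor = 1.0 - p_abn
best = exhaustive_ebm(p_abn, p_nor)
```

For larger N, the exhaustive search is replaced by the gradient ascent and MCMC sampling described above.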
Fig. 3 The event-based model posterior is a positional variance diagram showing uncertainty (left-to-right) in
the maximum likelihood sequence (top-to-bottom). Parkinson’s disease model from Oxtoby et al. [13] (CC BY
4.0)
DEBM Fitting
As with the EBM, DEBM model fitting starts with mixture modeling (see Subheading 3.1.1). Next, a sequence is estimated for each
individual by ranking the abnormality probabilities in descending
order. A group-level mean sequence (with variance) is estimated by
fitting the individual sequences to a Mallows model. For details,
see refs. 16, 17 and subsequent innovations to the DEBM. Notably,
DEBM is often quicker to fit than EBM, which makes it appealing
for high-dimensional extensions, e.g., aiming to estimate voxel-
wise atrophy signatures from cross-sectional brain imaging data.
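A minimal sketch of the individual-ranking step, with a Borda-style average rank as a crude stand-in for the Mallows model fit:

```python
import numpy as np

# toy abnormality probabilities P(event occurred) from mixture modeling:
# rows = individuals, columns = biomarkers
p_event = np.array([[0.9, 0.6, 0.2],
                    [0.8, 0.3, 0.1],
                    [0.7, 0.5, 0.4]])

# DEBM step: each individual's sequence ranks biomarkers by
# descending abnormality probability
individual_seqs = np.argsort(-p_event, axis=1)

# crude group-level "mean" sequence: order biomarkers by average rank
# (the actual DEBM fits a Mallows model with variance instead)
ranks = np.argsort(individual_seqs, axis=1)  # rank of each biomarker per individual
mean_seq = np.argsort(ranks.mean(axis=0))
```

Because no search over sequence space is needed, this per-individual ranking is what makes DEBM fast.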
Fig. 4 The concept of subtype and stage inference (SuStaIn). Reproduced from Young et al. [18] (CC BY 4.0)
4.1 Single Timeline Estimation Using Longitudinal Data
There are both parametric and nonparametric approaches to estimating disease timelines from longitudinal data. The common goal is to "stitch together" a full disease timeline (decades long) out of relatively short samples from individuals (a few years each) covering a range of severity in symptoms and biomarker abnormality. Some
of the earliest work emerged from the medical image registration
community, where “warping” images to a common template is one
of the first steps in group analyses [26].
Broadly speaking, there are two categories of D3PMs for
longitudinal data: time-shifting models and differential equation
models. Time-shifting models translate/deform the individual
data, metaphorically stitching them together into a quantitative
template of disease progression. Differential equation models esti-
mate a statistical model of biomarker dynamics in phase-plane space
(position vs velocity), which is subsequently inverted to produce
biomarker trajectories.
520 Neil P. Oxtoby
4.1.1 Explicit Models for Longitudinal Data: Latent-Time Models
Jedynak et al. [27] introduced the disease progression score (DPS)
model in 2012, which aligns biomarker data from individuals to a
group template model using a linear transformation of age into a
disease progression score s_i = α_i · age + β_i. Individuals have their own
rate of progression α_i (constant over the short observation time)
and disease onset β_i. Group-level biomarker dynamics are modeled
as sigmoid (“S”) curves. A Bayesian extension of the DPS approach
(BPS) appeared in 2019 [28]. Code for both the DPS and BPS was
released publicly: https://www.nitrc.org/projects/progscore;
https://hub.docker.com/r/bilgelm/bayesian-ps-adni/.
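The alignment idea can be illustrated in miniature: given a known group-level sigmoid template, one subject's α and β can be recovered by grid search over the affine transform s = α · age + β. This is a toy, not the published fitting algorithm (which estimates all subjects and the sigmoid jointly); all values below are made up:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_affine_shift(ages, values, alpha_grid, beta_grid):
    """Least-squares grid search for the DPS transform s = alpha*age + beta
    that best aligns one subject's observations to the group sigmoid."""
    best, best_err = (None, None), np.inf
    for alpha in alpha_grid:
        for beta in beta_grid:
            err = np.sum((sigmoid(alpha * ages + beta) - values) ** 2)
            if err < best_err:
                best, best_err = (alpha, beta), err
    return best

ages = np.array([70.0, 72.0, 74.0])            # one subject's visit ages
values = sigmoid(0.5 * ages - 36.0)            # simulated: alpha=0.5, beta=-36
alpha_hat, beta_hat = fit_affine_shift(ages, values,
                                       np.arange(0.1, 1.05, 0.1),
                                       np.arange(-40.0, -30.0, 0.5))
```

Three visits suffice here because the simulated data are noiseless; with real, noisy biomarkers, the joint estimation over all subjects is what makes the timeline identifiable.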
In 2014, Donohue et al. [29] introduced a self-modeling regression
approach similar to the DPS model. It was later generalized
into the more flexible latent-time joint mixed (effects) model
(LTJMM) [30], which can include covariates as fixed effects and
is a flexible Bayesian framework for inference. The LTJMM soft-
ware was released publicly: https://bitbucket.org/mdonohue/
ltjmm.
A nonparametric latent-time mixed model appeared in 2017:
the Gaussian process progression model (GPPM) of Lorenzi et al.
[31]. This is a flexible Bayesian approach akin to (parametric) self-
modeling regression that doesn’t impose a parametric form for
biomarker trajectories. More recent work supplemented the GPPM
with a dynamical systems model of molecular pathology spread
through the brain (GPPM-DS) [32], which can regularize the
GPPM fit to produce a more accurate disease timeline reconstruction
while also providing insight into neurodegenerative disease
mechanisms (a topic that could fill a standalone chapter of this
book). The GPPM and GPPM-DS source code was released publicly
via gitlab.inria.fr/epione, and tutorials are available at
disease-progression-modelling.github.io.
In 2015, Schiratti et al. [33–35] introduced a general frame-
work for estimating spatiotemporal trajectories for any type of
manifold-valued data. The framework is based on Riemannian
geometry and a mixed-effects model with time reparametrization.
It was subsequently extended by Koval et al. [36] to form the
disease course mapping approach (available in the leaspy software
package). Disease course mapping combines time warping (of age)
and inter-biomarker spacing translation. Time warping changes
disease progression dynamics—time shift/onset and acceleration/
progression speed—but not the trajectory. Inter-biomarker spa-
cings shift an individual’s trajectory to account for individual differ-
ences in the timing and ordering of biomarker trajectories.
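The time-warping component can be illustrated with a logistic group trajectory. The parameters below (α for speed, τ for onset shift, δ for per-biomarker spacing) are illustrative stand-ins for the model's parameters, not the leaspy API:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def individual_trajectory(t, alpha, tau, delta):
    """Sketch of a reparametrized trajectory: alpha rescales progression
    speed, tau shifts disease onset in time, and delta shifts one
    biomarker's curve along the common timeline."""
    return logistic(alpha * (t - tau) + delta)

# A subject progressing twice as fast as the group, onset delayed 3 years:
t = np.linspace(-10.0, 10.0, 201)
y = individual_trajectory(t, alpha=2.0, tau=3.0, delta=0.0)
# with delta = 0, the halfway point (y = 0.5) is reached exactly at t = tau
```

Note the separation the text describes: α and τ change the dynamics (speed and onset) without altering the shape of the curve, while δ translates one biomarker's curve relative to the others.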
Figures 5 and 6 show example outputs of these models when
trained on data from older people at risk of Alzheimer’s disease,
including those with diagnosed mild cognitive impairment and
dementia due to probable Alzheimer’s disease.
Fig. 5 Two examples of D3PMs fit to longitudinal data: disease progression score [27] and Gaussian process
progression model [31]. (a) Alzheimer’s disease progression score (2012) [27]. Reprinted from NeuroImage,
Vol 63, Jedynak et al., A computational neurodegenerative disease progression score: Method and results with
the Alzheimer’s Disease Neuroimaging Initiative cohort, 1478–1486, ©(2012), with permission from Elsevier.
(b) Gaussian process progression model (2017) [31]. Reprinted from NeuroImage, Vol 190, Lorenzi et al.,
Probabilistic disease progression modeling to characterize diagnostic uncertainty: Application to staging and
prediction in Alzheimer’s disease, 56–68, ©(2019), with permission from Elsevier
Fig. 6 Two additional examples of D3PMs fit to longitudinal data: latent-time joint mixed model [30] and
disease course mapping [36]. (a) Latent-time joint mixed model (2017) [30]. From [37] (CC BY 4.0). (b)
Alzheimer’s disease course map (2021) [36] (CC BY 4.0)
Fitting Longitudinal Latent-Time Models
Fitting D3PMs for longitudinal data is more complex than for
cross-sectional data, and the software packages discussed above
each expect the data in slightly different formats. One thing they
have in common is that renormalization (e.g., min-max or z-score)
and reorientation (e.g., to be increasing) are required to put biomar-
kers on a common scale and direction. In some cases, such pre-
processing is necessary to ensure/accelerate model convergence.
For example, the LTJMM used a quantile transformation followed
by an inverse Gaussian quantile function to put all biomarkers on a
Gaussian scale. For further detailed discussion, including model
identifiability, we refer the reader to the original publications
cited above and the didactic resources at disease-progression-
modelling.github.io.
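As an illustration of this preprocessing, a rank-based quantile transform onto a Gaussian scale (in the spirit of, though not identical to, the LTJMM's) needs only the standard library's normal inverse CDF:

```python
import numpy as np
from statistics import NormalDist

def min_max(x):
    """Rescale a biomarker to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def gaussianize(x):
    """Rank-based quantile transform onto a Gaussian scale: map ranks
    to quantiles in (0, 1), then through the standard normal inverse CDF."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x))        # 0 .. n-1
    quantiles = (ranks + 0.5) / len(x)       # strictly inside (0, 1)
    return np.array([NormalDist().inv_cdf(q) for q in quantiles])

raw = np.array([3.1, 0.2, 5.7, 2.4, 4.4])    # hypothetical biomarker values
z = gaussianize(raw)                         # centered, order preserved
```

Reorientation is then just a sign flip (multiply by -1) for biomarkers that decline with disease severity, so that every transformed biomarker increases with progression.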
4.1.2 Implicit Models for Longitudinal Data: Differential Equation Models
Parametric differential equation D3PMs emerged between 2011
and 2014 [38–41], receiving a more formal treatment in 2017
[42]. In a hat-tip to physics, these have also been dubbed “phase-
plane” models, which aids in their understanding as a model of
velocity (biomarker progression rate) as a function of position
(biomarker value). Model fitting is a two-step process whereby
the long-term biomarker trajectory is estimated by integrating the
phase-plane model estimated on observed data.
A nonparametric differential equation D3PM using Gaussian
processes (GP-DEM) was introduced in 2018 [11]. This added
flexibility to the preceding parametric approaches and produced
state-of-the-art results in predicting symptom onset in familial
Alzheimer’s disease.
Fitting Differential Equation Models
The concept is shown in Fig. 7: differential equation model fitting
is a three-step process. First, estimate a single value per individual of
biomarker “velocity” and “position”; for example, linear regression
can produce estimates of position (intercept) and velocity (gradient).
Second, estimate a group-level differential equation model of
velocity y as a function of position x. Third, integrate/invert this
model to produce a biomarker trajectory x(t). Differential equation
models can be univariate or multivariate and
can include covariates explicitly.
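The three steps can be sketched with numpy alone, using a hypothetical biomarker whose true long-term dynamics are logistic (so velocity is a quadratic function of position); the visit schedule and subject spread are invented for illustration:

```python
import numpy as np

def true_traj(t):
    """Hidden long-term trajectory (logistic), used only to simulate data."""
    return 1.0 / (1.0 + np.exp(-t))

# Step 1: short observation windows -> per-subject position and velocity
positions, velocities = [], []
for t0 in np.linspace(-3.0, 3.0, 25):            # subjects at different stages
    times = t0 + np.array([-1.0, 0.0, 1.0])      # three "visits" per subject
    slope, intercept = np.polyfit(times, true_traj(times), 1)
    positions.append(intercept + slope * times.mean())  # fitted mid-window value
    velocities.append(slope)                            # fitted gradient

# Step 2: group-level phase-plane model, velocity = f(position)
coeffs = np.polyfit(positions, velocities, 2)    # quadratic fit

# Step 3: integrate dx/dt = f(x) to reconstruct the trajectory x(t)
x, dt, traj = 0.05, 0.05, []
for _ in range(600):
    traj.append(x)
    x += np.polyval(coeffs, x) * dt              # forward Euler step
```

The fitted quadratic peaks near x = 0.5 and has roots near 0 and 1, so the integrated trajectory recovers the sigmoidal shape of the hidden dynamics despite each "subject" contributing only a short window.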
4.1.3 Hybrid Discrete-Continuous Models
Recent work introduced the temporal EBM (TEBM) [43, 44],
which augments event-based modeling with hidden Markov mod-
eling to produce a hybrid discrete-continuous D3PM. This is a
halfway house between discrete models (great for medical decision
making) and continuous models (great for detailed understanding
of disease progression). Trained on data from ADNI, the TEBM
revealed the full timeline of the pathophysiological cascade of Alz-
heimer’s disease, as shown in Fig. 8.
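The value added by the hidden Markov layer can be seen with a toy calculation: stage self-transition probabilities imply expected sojourn times, which convert the discrete event sequence into a timeline. The transition probabilities below are invented for illustration, not ADNI estimates:

```python
import numpy as np

# Hypothetical probability of remaining in the same stage between
# consecutive (annual) visits, for a three-stage model:
p_stay = np.array([0.9, 0.8, 0.95])

# Geometric sojourn: expected time (in visit intervals) spent per stage
sojourn_years = 1.0 / (1.0 - p_stay)             # [10, 5, 20] years

# Cumulative timeline: expected time offset at which each event occurs
timeline = np.concatenate([[0.0], np.cumsum(sojourn_years)])
```

A plain EBM would report only the stage ordering; the transition probabilities are what let the TEBM place the events on a timeline, as in Fig. 8.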
Fig. 7 Differential equation models, or phase-plane models, for biomarker dynamics involve a three-step
process: estimate individual-level position and velocity; fit a group-level model of velocity y vs position x; and
integrate to produce a trajectory x(t). Reprinted by permission from Springer Nature: Oxtoby, N.P. et al.,
Learning Imaging Biomarker Trajectories from Noisy Alzheimer’s Disease Data Using a Bayesian Multilevel
Model. In: Cardoso, M.J., Simpson, I., Arbel, T., Precup, D., Ribbens, A. (eds) Bayesian and grAphical Models
for Biomedical Imaging. Lecture Notes in Computer Science, vol 8677, pp. 85–94 ©(2014) [41]. (a) Data. (b)
Differential fit. (c) Est. trajectory. (d) Stochastic model
Fig. 8 Alzheimer’s disease sequence and timeline estimated by a hybrid discrete-continuous D3PM: the
temporal event-based model [43, 44]. Permission to reuse was kindly granted by the authors of [43]
4.2 Subtyping Using Longitudinal Data
Clustering longitudinal data without a well-defined time axis can be
extremely difficult. Jointly estimating latent time for multiple tra-
jectories is an identifiability challenge, i.e., multiple parameter
combinations can explain the same data. This is particularly chal-
lenging when observations span a relatively small fraction of the full
disease timeline, as in age-related neurodegenerative diseases.
Chen et al. [45] introduced SubLign for subtyping and align-
ing longitudinal disease data. The authors frame the challenge
eloquently as having misaligned, interval-censored data: left cen-
soring from patients being observed only after disease onset and
right censoring from patient dropout in more severe disease.
SubLign combines a deep generative model (based on a recurrent
neural network [46]), trained with a variational approach to learn
individual latent time shifts and parametric biomarker trajectories,
with a subsequent k-means clustering step. It was applied to data
from a Parkinson’s disease cohort, recovering some known clinical
phenotypes in new detail.
5 Conclusion
Acknowledgements
Appendix
Table A.1
A taxonomy and pedigree of D3PM papers. *Asterisks denote models for cross-sectional data
References
1. Jack CR, Knopman DS, Jagust WJ, Shaw LM, Aisen PS, Weiner MW, Petersen RC, Trojanowski JQ (2010) Hypothetical model of dynamic biomarkers of the Alzheimer’s pathological cascade. Lancet Neurol 9(1):119–128. https://doi.org/10.1016/S1474-4422(09)70299-6
2. Aisen PS, Petersen RC, Donohue MC, Gamst A, Raman R, Thomas RG, Walter S, Trojanowski JQ, Shaw LM, Beckett LA, Jack CR Jr, Jagust W, Toga AW, Saykin AJ, Morris JC, Green RC, Weiner MW (2010) Clinical core of the Alzheimer’s disease neuroimaging initiative: progress and plans. Alzheimer’s Dementia 6(3):239–246. https://doi.org/10.1016/j.jalz.2010.03.006
3. Oxtoby NP, Alexander DC, EuroPOND Consortium (2017) Imaging plus X: multimodal models of neurodegenerative disease. Curr Opin Neurol 30(4). https://doi.org/10.1097/WCO.0000000000000460
4. Young AL, Oxtoby NP, Ourselin S, Schott JM, Alexander DC (2015) A simulation system for biomarker evolution in neurodegenerative disease. Med Image Anal 26(1):47–56. https://doi.org/10.1016/j.media.2015.07.004
5. Fonteijn H, Clarkson M, Modat M, Barnes J, Lehmann M, Ourselin S, Fox N, Alexander D (2011) An event-based disease progression model and its application to familial Alzheimer’s disease. In: Székely G, Hahn H (eds) Information processing in medical imaging. Lecture notes in computer science, vol 6801. Springer, Berlin/Heidelberg, pp 748–759. https://doi.org/10.1007/978-3-642-22092-0_61
6. Fonteijn HM, Modat M, Clarkson MJ, Barnes J, Lehmann M, Hobbs NZ, Scahill RI, Tabrizi SJ, Ourselin S, Fox NC, Alexander DC (2012) An event-based model for disease progression and its application in familial Alzheimer’s disease and Huntington’s disease. NeuroImage 60(3):1880–1889. https://doi.org/10.1016/j.neuroimage.2012.01.062
7. Young AL, Oxtoby NP, Daga P, Cash DM, Fox NC, Ourselin S, Schott JM, Alexander DC (2014) A data-driven model of biomarker changes in sporadic Alzheimer’s disease. Brain 137(9):2564–2577. https://doi.org/10.1093/brain/awu176
8. Oxtoby NP, Garbarino S, Firth NC, Warren JD, Schott JM, Alexander DC, FtADNI (2017) Data-driven sequence of changes to anatomical brain connectivity in sporadic Alzheimer’s disease. Front Neurol 8:580. https://doi.org/10.3389/fneur.2017.00580
9. Firth NC, Primativo S, Brotherhood E, Young AL, Yong KX, Crutch SJ, Alexander DC, Oxtoby NP (2020) Sequences of cognitive decline in typical Alzheimer’s disease and posterior cortical atrophy estimated using a novel event-based model of disease progression. Alzheimer’s Dementia 16(7):965–973. https://doi.org/10.1002/alz.12083
10. Janelidze S, Berron D, Smith R, Strandberg O, Proctor NK, Dage JL, Stomrud E, Palmqvist S, Mattsson-Carlgren N, Hansson O (2021) Associations of plasma phospho-Tau217 levels with tau positron emission tomography in early Alzheimer disease. JAMA Neurol 78(2):149–156. https://doi.org/10.1001/jamaneurol.2020.4201
11. Oxtoby NP, Young AL, Cash DM, Benzinger TLS, Fagan AM, Morris JC, Bateman RJ, Fox NC, Schott JM, Alexander DC (2018) Data-driven models of dominantly-inherited Alzheimer’s disease progression. Brain 141(5):1529–1544. https://doi.org/10.1093/brain/awy050
12. Wijeratne PA, Young AL, Oxtoby NP, Marinescu RV, Firth NC, Johnson EB, Mohan A, Sampaio C, Scahill RI, Tabrizi SJ, Alexander DC (2018) An image-based model of brain volume biomarker changes in Huntington’s disease. Ann Clin Transl Neurol 5(5):570–582. https://doi.org/10.1002/acn3.558
13. Oxtoby NP, Leyland LA, Aksman LM, Thomas GEC, Bunting EL, Wijeratne PA, Young AL, Zarkali A, Tan MMX, Bremner FD, Keane PA, Morris HR, Schrag AE, Alexander DC, Weil RS (2021) Sequence of clinical and neurodegeneration events in Parkinson’s disease progression. Brain 144(3):975–988. https://doi.org/10.1093/brain/awaa461
14. Eshaghi A, Marinescu RV, Young AL, Firth NC, Prados F, Jorge Cardoso M, Tur C, De Angelis F, Cawley N, Brownlee WJ, De Stefano N, Laura Stromillo M, Battaglini M, Ruggieri S, Gasperini C, Filippi M, Rocca MA, Rovira A, Sastre-Garriga J, Geurts JJG, Vrenken H, Wottschel V, Leurs CE, Uitdehaag B, Pirpamer L, Enzinger C, Ourselin S, Gandini Wheeler-Kingshott CA, Chard D, Thompson AJ, Barkhof F, Alexander DC, Ciccarelli O (2018) Progression of regional grey matter atrophy in multiple sclerosis. Brain 141(6):1665–1677. https://doi.org/10.1093/brain/awy088
15. Firth NC, Startin CM, Hithersay R, Hamburg S, Wijeratne PA, Mok KY, Hardy J, Alexander DC, Consortium TL, Strydom A (2018) Aging related cognitive changes associated with Alzheimer’s disease in Down syndrome. Ann Clin Transl Neurol 5(6):741–751. https://doi.org/10.1002/acn3.571
16. Venkatraghavan V, Bron EE, Niessen WJ, Klein S (2017) A discriminative event based model for Alzheimer’s disease progression modeling. In: Niethammer M, Styner M, Aylward S, Zhu H, Oguz I, Yap PT, Shen D (eds) Information processing in medical imaging. Lecture notes in computer science. Springer, Cham, pp 121–133. https://doi.org/10.1007/978-3-319-59050-9_10
17. Venkatraghavan V, Bron EE, Niessen WJ, Klein S (2019) Disease progression timeline estimation for Alzheimer’s disease using discriminative event based modeling. NeuroImage 186:518–532. https://doi.org/10.1016/j.neuroimage.2018.11.024
18. Young AL et al (2018) Uncovering the heterogeneity and temporal complexity of neurodegenerative diseases with subtype and stage inference. Nat Commun 9(1):4273. https://doi.org/10.1038/s41467-018-05892-0
19. Vogel JW, Young AL, Oxtoby NP, Smith R, Ossenkoppele R, Strandberg OT, La Joie R, Aksman LM, Grothe MJ, Iturria-Medina Y, the Alzheimer’s Disease Neuroimaging Initiative, Pontecorvo MJ, Devous MD, Rabinovici GD, Alexander DC, Lyoo CH, Evans AC, Hansson O (2021) Four distinct trajectories of tau deposition identified in Alzheimer’s disease. Nat Med. https://doi.org/10.1038/s41591-021-01309-6
20. Eshaghi A, Young AL, Wijeratne PA, Prados F, Arnold DL, Narayanan S, Guttmann CRG, Barkhof F, Alexander DC, Thompson AJ, Chard D, Ciccarelli O (2021) Identifying multiple sclerosis subtypes using unsupervised machine learning and MRI data. Nat Commun 12(1):2078. https://doi.org/10.1038/s41467-021-22265-2
21. Collij LE, Salvadó G, Wottschel V, Mastenbroek SE, Schoenmakers P, Heeman F, Aksman L, Wink AM, Berckel BNM, van der Flier WM, Scheltens P, Visser PJ, Barkhof F, Haller S, Gispert JD, Alves IL, for the Alzheimer’s Disease Neuroimaging Initiative; for the ALFA Study (2022) Spatial-temporal patterns of β-amyloid accumulation: a subtype and stage inference model analysis. Neurology 98(17):e1692–e1703. https://doi.org/10.1212/WNL.0000000000200148
22. Young AL, Bragman FJS, Rangelov B, Han MK, Galbán CJ, Lynch DA, Hawkes DJ, Alexander DC, Hurst JR (2020) Disease progression modeling in chronic obstructive pulmonary disease. Am J Respir Crit Care Med 201(3):294–302. https://doi.org/10.1164/rccm.201908-1600OC
23. Li M, Lan L, Luo J, Peng L, Li X, Zhou X (2021) Identifying the phenotypic and temporal heterogeneity of knee osteoarthritis: data from the osteoarthritis initiative. Front Public Health 9. https://doi.org/10.3389/fpubh.2021.726140
24. Aksman LM, Wijeratne PA, Oxtoby NP, Eshaghi A, Shand C, Altmann A, Alexander DC, Young AL (2021) pySuStaIn: a Python implementation of the subtype and stage inference algorithm. SoftwareX 16:100811. https://doi.org/10.1016/j.softx.2021.100811
25. Young AL, Vogel JW, Aksman LM, Wijeratne PA, Eshaghi A, Oxtoby NP, Williams SCR, Alexander DC, ftADNI (2021) Ordinal SuStaIn: subtype and stage inference for clinical scores, visual ratings, and other ordinal data. Front Artif Intell 4:111. https://doi.org/10.3389/frai.2021.613261
26. Durrleman S, Pennec X, Trouvé A, Gerig G, Ayache N (2009) Spatiotemporal atlas estimation for developmental delay detection in longitudinal datasets. In: Yang GZ, Hawkes D, Rueckert D, Noble A, Taylor C (eds) Medical image computing and computer-assisted intervention—MICCAI 2009. Springer, Berlin, pp 297–304. https://doi.org/10.1007/978-3-642-04268-3_37
27. Jedynak BM, Lang A, Liu B, Katz E, Zhang Y, Wyman BT, Raunig D, Jedynak CP, Caffo B, Prince JL (2012) A computational neurodegenerative disease progression score: method and results with the Alzheimer’s disease neuroimaging initiative cohort. NeuroImage 63(3):1478–1486. https://doi.org/10.1016/j.neuroimage.2012.07.059
28. Bilgel M, Jedynak BM, Alzheimer’s Disease Neuroimaging Initiative (2019) Predicting time to dementia using a quantitative template of disease progression. Alzheimer’s Dementia: Diagn Assess Dis Monit 11(1):205–215. https://doi.org/10.1016/j.dadm.2019.01.005
29. Donohue MC, Jacqmin-Gadda H, Goff ML, Thomas RG, Raman R, Gamst AC, Beckett LA, Jack CR Jr, Weiner MW, Dartigues JF, Aisen PS (2014) Estimating long-term multivariate progression from short-term data. Alzheimer’s Dementia 10(5, Supplement):S400–S410. https://doi.org/10.1016/j.jalz.2013.10.003
30. Li D, Iddi S, Thompson WK, Donohue MC (2019) Bayesian latent time joint mixed effect models for multicohort longitudinal data. Stat Methods Med Res 28(3):835–845. https://doi.org/10.1177/0962280217737566
31. Lorenzi M, Filippone M, Frisoni GB, Alexander DC, Ourselin S (2019) Probabilistic disease progression modeling to characterize diagnostic uncertainty: application to staging and prediction in Alzheimer’s disease. NeuroImage 190:56–68. https://doi.org/10.1016/j.neuroimage.2017.08.059
32. Garbarino S, Lorenzi M (2021) Investigating hypotheses of neurodegeneration by learning dynamical systems of protein propagation in the brain. NeuroImage 235:117980. https://doi.org/10.1016/j.neuroimage.2021.117980
33. Schiratti JB, Allassonnière S, Colliot O, Durrleman S (2015) Learning spatiotemporal trajectories from manifold-valued longitudinal data. In: Advances in neural information processing systems, Curran Associates, vol 28. https://proceedings.neurips.cc/paper/2015/hash/186a157b2992e7daed3677ce8e9fe40f-Abstract.html
34. Schiratti JB, Allassonnière S, Routier A, Colliot O, Durrleman S (2015) A mixed-effects model with time reparametrization for longitudinal univariate manifold-valued data. In: Ourselin S, Alexander DC, Westin CF, Cardoso MJ (eds) Information processing in medical imaging. Lecture notes in computer science. Springer, Cham, pp 564–575. https://doi.org/10.1007/978-3-319-19992-4_44
35. Schiratti JB, Allassonnière S, Colliot O, Durrleman S (2017) A Bayesian mixed-effects model to learn trajectories of changes from repeated manifold-valued observations. J Mach Learn Res 3:1–48
36. Koval I, Bône A, Louis M, Lartigue T, Bottani S, Marcoux A, Samper-González J, Burgos N, Charlier B, Bertrand A, Epelbaum S, Colliot O, Allassonnière S, Durrleman S (2021) AD course map charts Alzheimer’s disease progression. Sci Rep 11(1):8020. https://doi.org/10.1038/s41598-021-87434-1
37. Li D, Iddi S, Thompson WK, Donohue MC (2017) Bayesian latent time joint mixed effect models for multicohort longitudinal data. arXiv 1703.10266v2. https://doi.org/10.48550/arXiv.1703.10266
38. Sabuncu M, Desikan R, Sepulcre J, Yeo B, Liu H, Schmansky N, Reuter M, Weiner M, Buckner R, Sperling R (2011) The dynamics of cortical and hippocampal atrophy in Alzheimer disease. Arch Neurol 68(8):1040. https://doi.org/10.1001/archneurol.2011.167
39. Samtani MN, Farnum M, Lobanov V, Yang E, Raghavan N, DiBernardo A, Narayan V, Alzheimer’s Disease Neuroimaging Initiative (2012) An improved model for disease progression in patients from the Alzheimer’s disease neuroimaging initiative. J Clin Pharmacol 52(5):629–644. https://doi.org/10.1177/0091270011405497
40. Villemagne VL, Burnham S, Bourgeat P, Brown B, Ellis KA, Salvado O, Szoeke C, Macaulay SL, Martins R, Maruff P, Ames D, Rowe CC, Masters CL (2013) Amyloid β deposition, neurodegeneration, and cognitive decline in sporadic Alzheimer’s disease: a prospective cohort study. Lancet Neurol 12(4):357–367. https://doi.org/10.1016/S1474-4422(13)70044-9
41. Oxtoby NP, Young AL, Fox NC, Daga P, Cash DM, Ourselin S, Schott JM, Alexander DC (2014) Learning imaging biomarker trajectories from noisy Alzheimer’s disease data using a Bayesian multilevel model. In: Cardoso MJ, Simpson I, Arbel T, Precup D, Ribbens A (eds) Bayesian and graphical models for biomedical imaging. Lecture notes in computer science, vol 8677. Springer, Berlin, pp 85–94. https://doi.org/10.1007/978-3-319-12289-2_8
42. Budgeon C, Murray K, Turlach B, Baker S, Villemagne V, Burnham S, for the Alzheimer’s Disease Neuroimaging Initiative (2017) Constructing longitudinal disease progression curves using sparse, short-term individual data with an application to Alzheimer’s disease. Stat Med 36(17):2720–2734. https://doi.org/10.1002/sim.7300
43. Wijeratne PA, Alexander DC (2020) Learning transition times in event sequences: the event-based hidden Markov model of disease progression. Machine Learning for Health (ML4H) 2020. https://doi.org/10.48550/arXiv.2011.01023
44. Wijeratne PA, Alexander DC (2021) Learning transition times in event sequences: the temporal event-based model of disease progression. In: Feragen A, Sommer S, Schnabel J, Nielsen M (eds) Information processing in medical imaging. Lecture notes in computer science. Springer, Cham, pp 583–595. https://doi.org/10.1007/978-3-030-78191-0_45
45. Chen IY, Krishnan RG, Sontag D (2022) Clustering interval-censored time-series for disease phenotyping. In: Proceedings of the AAAI conference on artificial intelligence, vol 36(6), pp 6211–6221. https://doi.org/10.1609/aaai.v36i6.20570
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made. The images or other
third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Chapter 18
Computational Pathology for Brain Disorders
Abstract
Noninvasive brain imaging techniques make it possible to follow behavioral and macroscopic changes in the
brain and thereby track the progress of a disease. Computational pathology, however, provides a deeper
understanding of brain disorders at the cellular level: it can consolidate a diagnosis and bridge the gap
between medical imaging and omics analyses. In traditional histopathology, histology slides are visually
inspected under the microscope by trained pathologists, a process that is time-consuming and labor-intensive;
the emergence of computational pathology has therefore raised great hope of easing this tedious task and
making it more robust. This chapter focuses on understanding the state-of-the-art machine learning
techniques used to analyze whole slide images within the context of brain disorders. We present a selective
set of remarkable machine learning algorithms providing discriminative approaches and quality results on
brain disorders.
These methodologies are applied to different tasks, such as monitoring mechanisms contributing to disease
progression and patient survival rates, analyzing morphological phenotypes for classification and quantita-
tive assessment of disease, improving clinical care, diagnosing tumor specimens, and intraoperative inter-
pretation. Thanks to the recent progress in machine learning algorithms for high-content image processing,
computational pathology marks the rise of a new generation of medical discoveries and clinical protocols,
including in brain disorders.
Key words Computational pathology, Digital pathology, Whole slide imaging, Machine learning,
Deep learning, Brain disorders
1 Introduction
1.1 What Are We Presenting?
This chapter aims to assist the reader in discovering and under-
standing state-of-the-art machine learning techniques used to ana-
lyze whole slide images (WSI), an essential data type used in
computational pathology (CP). We are restricting our review to
brain disorders, classified within four generally accepted groups:
• Brain injuries: caused by blunt trauma and can damage brain
tissue, neurons, and nerves.
• Brain tumors: can originate directly from the brain (and be
cancerous or benign) or be due to metastasis (cancer elsewhere
in the body and spreading to the brain).
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_18,
© The Author(s) 2023
534 Gabriel Jiménez and Daniel Racoceanu
1 https://www.fda.gov/news-events/press-announcements/fda-allows-marketing-first-whole-slide-imaging-system-digital-pathology
Computational Pathology for Brain Disorders 535
1.2 Why AI for Brain Disorders?
An important role of CP in brain disorders is related to the study
and assessment of brain tumors, as they cause significant morbidity
and mortality worldwide, and pathology data is often available. In
2022 [2], over 25,000 adults (14,170 men and 10,880 women) in the
United States will have been diagnosed with primary cancerous
tumors of the brain and spinal cord. 85% to 90% of all primary
central nervous system (CNS) tumors (benign and cancerous) are
located in the brain. Worldwide, over 300,000 people were diagnosed
with a primary brain or spinal cord tumor in 2020. This disorder
does not distinguish age: nearly 4,200 children under the age of
15 will also have been diagnosed with brain or CNS tumors in
2022 in the United States.
It is estimated that around one billion people have a mental or
substance use disorder [3]. Some other key figures related to men-
tal disorders worldwide are given by [4]. Globally, an estimated
264 million people are affected by depression. Bipolar disorder
affects about 45 million people worldwide. Schizophrenia affects
20 million people worldwide, and approximately 50 million have
dementia. In Europe, an estimated 10.5 million people have
dementia, and this number is expected to increase to 18.7 million
in 2050 [5].
In the neurodegenerative disease group, 50 million people
worldwide are living with Alzheimer’s and other types of dementia
[6], Alzheimer’s disease being the underlying cause in 70% of
people with dementia [5]. Parkinson’s disease affects approximately
6.2 million people worldwide [7] and represents the second most
common neurodegenerative disorder. As the incidence of
2.1 Formalin-Fixed Paraffin-Embedded Tissue
FFPE is a technique used for preserving biopsy specimens for
clinical examination, diagnostic, experimental research, and drug
development. A correct histological analysis of tissue morphology
and biomarker localization in tissue samples hinges on the ability
to achieve high-quality preparation of tissue samples, which usually
requires three critical stages: fixation, processing (also known as
pre-embedding), and embedding.
Fixation is the process that allows the preservation of the tissue
architecture (i.e., its cellular components, extracellular material,
and molecular elements). Histotechnologists perform this
2.2 Frozen Histological Tissue
Pathologists often use this tissue preservation method during sur-
gical procedures where a rapid diagnosis of a pathological process is
needed (extemporaneous preparation). In fact, frozen tissue pro-
duces the fastest stainable sections, although, compared to FFPE
tissue, its morphological properties are not as good.
Frozen tissue (technically referred to as cryosection) is created
by submerging the fresh tissue sample into cold liquid (e.g.,
pre-cooled isopentane in liquid nitrogen) or by applying a tech-
nique called flash freezing, which uses liquid nitrogen directly. As in
FFPE, the tissue needs to be embedded in a medium to fix it to a
chuck (i.e., specimen holder) in an optimal position for microscopic
analysis. However, unlike FFPE tissue, no fixation or
pre-embedding processes are needed for preservation.
For embedding, technicians use OCT (optimal cutting temper-
ature compound), a viscous aqueous solution of polyvinyl alcohol
and polyethylene glycol designed to freeze, providing the ideal
support for cutting the cryosections in the cryostat (microtome
under cold temperature). Different embedding approaches exist
depending on the tissue orientation, precision and speed of the
process, tissue wastage, and the presence of freeze artifacts in the
resulting image. Stephen R. Peters describes these procedures and
other important considerations needed to prepare tissue samples
using the frozen technique [10].
Frozen tissue preservation relies on storing the embeddings at
low temperatures. Therefore, the tissue will degrade if the cold
chain breaks due to tissue sample mishandling. However, as it
better preserves the tissue’s molecular genetic material, it is fre-
quently used in sequencing analysis and immunohistochemistry
(IHC).
Other factors that affect the tissue quality and, therefore, the
scanned images are the formation of ice crystals and the thickness of
the sections. Ice crystals form when the tissue is not frozen rapidly
enough, and it may negatively affect the tissue structure and,
therefore, its morphological characteristics. On the other hand,
frozen sections are often thicker than FFPE sections, increasing the
potential for lower resolution at higher magnifications and poorer
images.
2.3 Tissue Preparation
We described the main pipeline to extract and preserve tissue
samples for further analysis. Although the techniques described
above can also be used for molecular and protein analysis (especially
the frozen sections), we now focus only on the image pipeline by
540 Gabriel Jiménez and Daniel Racoceanu
Fig. 3 [Top left] ALZ50 antibody used to discover compacted structures (tau pathologies). Below the WSI is an
example of a neurofibrillary tangle (left) and a neuritic plaque (right) stained with ALZ50 antibody. [Top right]
AT8 antibody, the most widely used in clinics, helps to discover all structures in a WSI. Below the WSI, there is
an example of a neurofibrillary tangle (left) and a neuritic plaque stained with AT8 antibody (right).
Abbreviation. WSI: whole slide image
[13]. Also, the Golgi method, which uses a silver staining tech-
nique, is used for observing neurons under the microscope
[11]. Studies for Alzheimer’s disease also frequently use ALZ50
and AT8 antibodies to reveal phosphorylated tau pathology using a
standardized immunohistochemistry protocol [14–16]. Figure 3
shows the difference between ALZ50 and AT8 biomarkers and
tau pathologies found in the tissue.
Staining the slide is the final stage of preparation for studying the microscopic structures of diseased or abnormal tissues. Considering the number of people involved in these processes (pathologists, pathology assistants, histotechnologists, tissue technicians, and trained repository managing personnel) and the precision required at each stage, standardizing certain practices is needed to create valuable slides for further analysis.
Eiseman et al. [17] reported a list of best practices for biospeci-
men collection, processing, annotation, storage, and distribution.
The proposal aims to set guidelines for managing large biospecimen
banks containing the tissue sample embeddings excised from dif-
ferent organs with different pathologies and demographic
distributions.
More specific standardized procedures for tissue sampling and
processing have also been reported. For instance, in 2012, the
Society of Toxicologic Pathology charged a Nervous System Sam-
pling Working Group with devising recommended practices to
Computational Pathology for Brain Disorders 543
Fig. 4 [Top left] Folding artifact (floatation and mounting-related artifact), [Top right] Marking fixation process
(fixation artifact), [Bottom left] Breaking artifact (microtome-related artifact), [Bottom right] Overlaying tissue
(mounting artifact)
This section aims to provide a better understanding of the role that digital pathology plays in the analysis of the complex and large amounts of information obtained from tissue specimens. Whole slide image scanners, which make it possible to incorporate more images with higher throughput, are briefly discussed. We also discuss the DICOM standard used in medicine to digitally represent images and, in this case, tissue samples. We then focus on computational pathology, which is the analysis of the reconstructed whole slide images using different pattern recognition techniques such as machine learning (including deep learning) algorithms. This section contains some extractions from Jiménez's thesis work [20].
3.1 Digital Pathology

Digital systems were introduced to the histopathological examination in order to deal with complex and vast amounts of information
obtained from tissue specimens. Digital images were originally
generated by mounting a camera on the microscope. The static pictures captured only a small region of the glass slide, and reconstruction of the whole slide was rarely attempted because it was complex and time-consuming. However, advances in the precision of mechanical systems have made possible the construction of whole slide digital
scanners. Garcia et al. [21] reviewed a series of mechanical and
software systems used in the construction of such devices. The
stored high-resolution images allow pathologists to view, manage,
and analyze the digitized tissue on a computer monitor, much as they would under an optical microscope, but with additional digital tools that improve the diagnostic process.
WSI technology, also referred to as virtual microscopy, has
proven to be helpful in a wide variety of applications in pathology
(e.g., image archiving, telepathology, image analysis). In essence, a
WSI scanner operation principle consists of moving the glass slide a
small distance every time a picture is taken to capture the entire
tissue sample. Every WSI scanner has six components: (a) a micro-
scope with lens objectives, (b) a light source (bright field and/or
fluorescent), (c) robotics to load and move glass slides around,
(d) one or more digital cameras for capture, (e) a computer, and
(f) software to manipulate, manage, and view digital slides
[22]. The hardware and software used for these six components
will determine the key features to analyze when choosing a scanner.
Some research articles have compared the hardware and software
capabilities of different scanners in the market. For instance, in
[22], Farahani et al. compared 11 WSI scanners from different
manufacturers regarding imaging modality, slide capacity, scan
speed, image magnification, image resolution, digital slide format,
3.2 Whole Slide Image Structure

The Digital Imaging and Communications in Medicine (DICOM) standard was adopted to store WSI digital slides into commercially available PACS (picture archiving and communication system) and facilitate the transition to digital pathology in clinics and laboratories. Due to the WSI dimension and size, a new pyramidal approach for data organization and access was proposed by the DICOM Standards Committee in [23].
A typical digitization of a 20 mm × 15 mm sample at a resolution of 0.25 μm/pixel, also referred to as 40× magnification, will generate an image of approximately 80,000 × 60,000 pixels. Considering a 24-bit color resolution, the digitized image size is about 15 GB. The data size might even grow one order of magnitude larger if the scanner is configured to a higher resolution (e.g., 80×, 100×), Z planes are used, or additional spectral bands are also digitized. In any case, conventional storage and access to these
images will demand excessive computational resources to be imple-
mented into commercial systems. Figure 5 describes the traditional
approach (i.e., single frame organization), which stores the data in
rows that extend across the entire image. This row-major approach
has the disadvantage of loading unnecessary pixels into memory,
especially if we want to visualize a small region of interest.
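The storage figures above can be checked with a few lines of arithmetic. A minimal sketch (assuming 3 bytes per pixel for 24-bit color and ignoring compression):

```python
# Estimate the uncompressed size of a digitized slide.
def wsi_size(width_mm, height_mm, um_per_pixel, bytes_per_pixel=3):
    px_w = round(width_mm * 1000 / um_per_pixel)   # mm -> um -> pixels
    px_h = round(height_mm * 1000 / um_per_pixel)
    return px_w, px_h, px_w * px_h * bytes_per_pixel

w, h, size = wsi_size(20, 15, 0.25)  # 20 mm x 15 mm at 40x (0.25 um/pixel)
print(w, h, size / 1e9)  # 80000 60000 14.4 -> "about 15 GB"
```

Halving the pixel size quadruples the pixel count, consistent with the order-of-magnitude growth mentioned above for higher resolutions and additional Z planes or spectral bands.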
Other types of organizations have also been studied. Figure 6
describes the storage of pixels in tiles, which decreases the compu-
tational time for visualization and manipulation of WSI by loading
only the subset of pixels needed into memory. Although this
approach allows faster access and rapid visualization of the WSI, it
fails when dealing with different magnifications of the images, as is
the case in WSI scanners. Figure 7 depicts the issues with rapid
zooming of WSI. Besides loading a larger subset of pixels into
memory, algorithms to perform the down-sampling of the image
are time-consuming. At the limit, to render a low-resolution
thumbnail of the entire image, all the data scanned must be
accessed and processed [23]. Stacking precomputed
low-resolution versions of the original image was proposed in
order to overcome the zooming problem. Figure 8 describes the
pyramidal structure used to store different down-sampled versions
Fig. 6 Tiled image organization of whole slide images. Tiles’ size can range from 240 × 240 pixels up to 4096
× 4096 pixels
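The advantage of the tiled layout in Fig. 6 is that a viewer only needs to load the tiles overlapping the requested region of interest. A minimal sketch of that bookkeeping, with a hypothetical tile size of 256 × 256 pixels:

```python
def tiles_for_region(x, y, width, height, tile=256):
    """Indices (col, row) of the tiles overlapping a rectangular region."""
    first_col, last_col = x // tile, (x + width - 1) // tile
    first_row, last_row = y // tile, (y + height - 1) // tile
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]

# A 300 x 300 pixel region at the slide origin touches only 4 tiles,
# regardless of how large the whole slide image is.
print(len(tiles_for_region(0, 0, 300, 300)))  # 4
```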
Fig. 7 Rapid zooming issue when accessing lower-resolution images: a large amount of data needs to be loaded into memory. In this example, the image size at the highest resolution (221 nm/pixel) is 82,432 × 80,640 pixels
[Fig. 8 schematic: a single-frame thumbnail image (lowest resolution, 2.5×), a multi-frame image (intermediate resolution, 10×), and a multi-frame image (highest resolution, 40×)]
Fig. 8 Pyramidal organization of whole slide images. In this example, the image size at the highest resolution (221 nm/pixel) is 82,432 × 80,640 pixels. The compressed (JPEG) file size is 2.22 GB, whereas the uncompressed version is 18.57 GB
² GAN: generative adversarial networks.
³ FCNN: fully convolutional neural network.
⁴ https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/using-tcga/typesan.
5 Perspectives
This last section of the chapter deals with new techniques for the
explainability of artificial intelligence algorithms. It also describes
new ideas related to responsible artificial intelligence in the context
of medical applications, computational histopathology, and brain
disorders. In addition, it introduces new image acquisition technology combining bright-field imaging and chemistry to improve intraoperative applications. Finally, we highlight computational pathology's strategic role in spatial transcriptomics and refined personalized
medicine.
In [15, 16], we address the issue of accurate segmentation by
proposing a two-loop scheme as shown in Fig. 9. In our method, a
U-Net-based neural network is trained on several WSIs manually
annotated by expert pathologists. The structures we focus on are
neuritic plaques and tangles following the study in [14]. The net-
work’s predictions (in new WSIs) are then reviewed by an expert
who can refine the predictions by modifying the segmentation
outline or validating new structures found in the WSI. Additionally,
an attention-based architecture is used to create a visual explanation
and refine the hyperparameters of the initial architecture in charge
of the prediction proposal.
We tested the attention-based architecture with a dataset of
eight WSIs divided into patches following an ROI-guided sam-
pling. Figure 10 shows qualitatively that, through this visual explanation, the expert in the loop could define the border of the neuritic plaque (the object of interest) more accurately, so that the network can update its weights accordingly. Additionally, quantitative results
(Dice score of approximately 0.7) show great promise for this
attention U-Net architecture.
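For reference, the Dice score quoted above measures the overlap between a predicted segmentation mask and the ground truth; a standard implementation of the metric (not the authors' code) is:

```python
import numpy as np

def dice_score(pred, truth):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

pred = np.zeros((4, 4), dtype=int); pred[:2, :2] = 1   # 4-pixel prediction
truth = np.zeros((4, 4), dtype=int); truth[:2, :] = 1  # 8-pixel ground truth
print(dice_score(pred, truth))  # 2*4 / (4 + 8) = 0.666...
```

A score of 1 indicates perfect overlap; the 0.7 reported above therefore reflects substantial but imperfect agreement with the expert outlines.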
Our next step is to use a single architecture for explainability
and segmentation/classification. We believe our method will
improve the accuracy of the neuritic plaques and tangles outline
[Fig. 9 diagram: a two-loop scheme in which a pre-training engine, fed by an initial training database (sparse annotated patches + WSI patches), produces a baseline AI network, and an online training engine incrementally refines an AI network from its predictions; a clinical expert provides human input based on tauopathy features, while technical experts adjust the xAI hyperparameters]
Fig. 10 Attention U-Net results. The figure shows a patch of size 128 × 128 pixels, the ground-truth binary
mask, and the focus progression using successive activation layers of the network
Acknowledgements
⁵ Neuro-CEB brain bank: https://www.neuroceb.org/en/.
References
1. Serag A, Ion-Margineanu A, Qureshi H, McMillan R, Saint Martin MJ, Diamond J, O'Reilly P, Hamilton P (2019) Translational AI and deep learning in diagnostic pathology. Front Med 6:185. https://doi.org/10.3389/fmed.2019.00185
2. CancerNet Editorial Board (2012) Brain tumor—statistics. https://www.cancer.net/cancer-types/brain-tumor/statistics. Accessed 21 May 2022
3. Ritchie H (2019) Global mental health: five key insights which emerge from the data. https://ourworldindata.org/global-mental-health. Accessed 2 May 2022
4. GBD 2017 DALYs and HALE Collaborators (2018) Global, regional, and national disability-adjusted life-years (DALYs) for 359 diseases and injuries and healthy life expectancy (HALE) for 195 countries and territories, 1990–2017: a systematic analysis for the global burden of disease study 2017. Lancet 392(10159):1859–1922. https://doi.org/10.1016/S0140-6736(18)32335-3
5. European Brain Council (2019) Disease fact sheets. https://www.braincouncil.eu/disease-fact-sheets/. Accessed 2 May 2022
6. Alzheimer's Association (2022) Alzheimer's and dementia. https://www.alz.org/alzheimer_s_dementia. Accessed 2 May 2022
7. GBD 2015 Neurological Disorders Collaborator Group (2017) Global, regional, and national burden of neurological disorders during 1990–2015: a systematic analysis for the global burden of disease study 2015. Lancet Neurol 16(11):877–897. https://doi.org/10.1016/S1474-4422(17)30299-5
8. Stroke Alliance for Europe (2016) About stroke. https://www.safestroke.eu/about-stroke/. Accessed 21 May 2022
9. Pan American Health Organization (2021) Methodological notes. https://www.paho.org/en/enlace/technical-notes. Accessed 2 May 2022
10. Peters SR (2009) A practical guide to frozen section technique, 1st edn. Springer, New York. https://doi.org/10.1007/978-1-4419-1234-3
11. Yuan Y, Arikkath J (2014) Techniques in immunohistochemistry and immunocytochemistry. In: Xiong H, Gendelman HE (eds) Current laboratory methods in neuroscience research. Springer, New York, pp 387–396. https://doi.org/10.1007/978-1-4614-8794-4_27
12. Titford M (2009) Progress in the development of microscopical techniques for diagnostic pathology. J Histotechnol 32(1):9–19. https://doi.org/10.1179/his.2009.32.1.9
13. Kapelsohn K, Kapelsohn K (2015) Improved methods for cutting, mounting, and staining tissue for neural histology. Protocol Exchange, Springer Nature. https://doi.org/10.1038/protex.2015.022. Protocol (version 1)
14. Maňoušková K, Abadie V, Ounissi M, Jimenez G, Stimmer L, Delatour B, Durrleman S, Racoceanu D (2022) Tau protein discrete aggregates in Alzheimer's disease: neuritic plaques and tangles detection and segmentation using computational histopathology. In: Levenson RM, Tomaszewski JE, Ward AD (eds) Medical imaging 2022: digital and computational pathology, SPIE, vol 12039, pp 33–39. https://doi.org/10.1117/12.2613154
15. Jimenez G, Kar A, Ounissi M, Stimmer L, Delatour B, Racoceanu D (2022) Interpretable deep learning in computational histopathology for refined identification of Alzheimer's disease biomarkers. In: The Alzheimer's Association (ed) Alzheimer's & Dementia: Alzheimer's Association International Conference (AAIC). Wiley, forthcoming
16. Jimenez G, Kar A, Ounissi M, Ingrassia L, Boluda S, Delatour B, Stimmer L, Racoceanu D (2022) Visual deep learning-based explanation for neuritic plaques segmentation in Alzheimer's disease using weakly annotated whole slide histopathological images. In: Wang L, Dou Q, Fletcher PT, Speidel S, Li S
resolution convolutional neural networks for semantic segmentation in histopathology whole-slide images. Med Image Anal 68:101890. https://doi.org/10.1016/j.media.2020.101890
51. Schmitz R, Madesta F, Nielsen M, Krause J, Steurer S, Werner R, Rösch T (2021) Multi-scale fully convolutional neural networks for histopathology image segmentation: from nuclear aberrations to the global tissue architecture. Med Image Anal 70:101996. https://doi.org/10.1016/j.media.2021.101996
52. Janowczyk A, Madabhushi A (2016) Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J Pathol Inform 7:29. https://doi.org/10.4103/2153-3539.186902
53. Kong J, Cooper L, Wang F, Chisolm C, Moreno C, Kurc T, Widener P, Brat D, Saltz J (2011) A comprehensive framework for classification of nuclei in digital microscopy imaging: an application to diffuse gliomas. In: 2011 IEEE international symposium on biomedical imaging (ISBI): from nano to macro. IEEE, pp 2128–2131. https://doi.org/10.1109/ISBI.2011.5872833
54. Xing F, Xie Y, Yang L (2016) An automatic learning-based framework for robust nucleus segmentation. IEEE Trans Med Imaging 35(2):550–566. https://doi.org/10.1109/TMI.2015.2481436
55. Xu Y, Jia Z, Ai Y, Zhang F, Lai M, Chang EIC (2015) Deep convolutional activation features for large scale brain tumor histopathology image classification and segmentation. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 947–951. https://doi.org/10.1109/ICASSP.2015.7178109
56. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJ, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems (NIPS). Curran Associates, vol 25
57. Xu Y, Jia Z, Wang LB, Ai Y, Zhang F, Lai M, Chang EIC (2017) Large scale tissue histopathology image classification, segmentation, and visualization via deep convolutional activation features. BMC Bioinform 18(1):281. https://doi.org/10.1186/s12859-017-1685-x
58. Ker J, Bai Y, Lee HY, Rao J, Wang L (2019) Automated brain histology classification using machine learning. J Clin Neurosci 66:239–245. https://doi.org/10.1016/j.jocn.2019.05.019
59. Truong AH, Sharmanska V, Limbäck-Stanic, Grech-Sollars M (2020) Optimization of deep learning methods for visualization of tumor heterogeneity and brain tumor grading through digital pathology. Neuro-Oncol Adv 2(1):vdaa110. https://doi.org/10.1093/noajnl/vdaa110
60. Zadeh Shirazi A, Fornaciari E, McDonnell MD, Yaghoobi M, Cevallos Y, Tello-Oquendo L, Inca D, Gomez GA (2020) The application of deep convolutional neural networks to brain cancer images: a survey. J Pers Med 10(4):224. https://doi.org/10.3390/jpm10040224
61. Wurts A, Oakley DH, Hyman BT, Samsi S (2020) Segmentation of Tau stained Alzheimers brain tissue using convolutional neural networks. In: 2020 42nd annual international conference of the IEEE engineering in medicine & biology society (EMBC). IEEE, vol 2020, pp 1420–1423. https://doi.org/10.1109/EMBC44109.2020.9175832
62. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A (eds) Medical image computing and computer-assisted intervention (MICCAI). Lecture notes in computer science. Springer, Berlin, pp 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
63. Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495. https://doi.org/10.1109/TPAMI.2016.2644615
64. Signaevsky M, Prastawa M, Farrell K, Tabish N, Baldwin E, Han N, Iida MA, Koll J, Bryce C, Purohit D, Haroutunian V, McKee AC, Stein TD, White CL 3rd, Walker J, Richardson TE, Hanson R, Donovan MJ, Cordon-Cardo C, Zeineh J, Fernandez G, Crary JF (2019) Artificial intelligence in neuropathology: deep learning-based assessment of tauopathy. Lab Invest 99(7):1019–1029. https://doi.org/10.1038/s41374-019-0202-4
65. Vega AR, Chkheidze R, Jarmale V, Shang P, Foong C, Diamond MI, White CL 3rd, Rajaram S (2021) Deep learning reveals disease-specific signatures of white matter pathology in tauopathies. Acta Neuropathol Commun 9(1):170. https://doi.org/10.1186/s40478-021-01271-x
66. Border SP, Sarder P (2021) From what to why, the growing need for a focus shift toward explainability of AI in digital pathology. Front
medicine and biology society (EMBC). IEEE, pp 3851–3854. https://doi.org/10.1109/EMBC.2015.7319234
119. Bándi P, van de Loo R, Intezar M, Geijs D, Ciompi F, van Ginneken B, van der Laak J, Litjens G (2017) Comparison of different methods for tissue segmentation in histopathological whole-slide images. In: 2017 IEEE 14th international symposium on biomedical imaging (ISBI), pp 591–595. https://doi.org/10.1109/ISBI.2017.7950590
120. Xiao Y, Decencière E, Velasco-Forero S, Burdin H, Bornschlögl T, Bernerd F, Warrick E, Baldeweck T (2019) A new color augmentation method for deep learning segmentation of histological images. In: 2019 IEEE 16th international symposium on biomedical imaging (ISBI), pp 886–890. https://doi.org/10.1109/ISBI.2019.8759591
121. Li Y, Xu Z, Wang Y, Zhou H, Zhang Q (2020) SU-Net and DU-Net fusion for tumour segmentation in histopathology images. In: 2020 IEEE 17th international symposium on biomedical imaging (ISBI). IEEE, pp 461–465. https://doi.org/10.1109/ISBI45749.2020.9098678
122. Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359. https://doi.org/10.1109/TKDE.2009.191
123. Caie PD, Schuur K, Oniscu A, Mullen P, Reynolds PA, Harrison DJ (2013) Human tissue in systems medicine. FEBS J 280(23):5949–5956. https://doi.org/10.1111/febs.12550
124. Snead DRJ, Tsang YW, Meskiri A, Kimani PK, Crossman R, Rajpoot NM, Blessing E, Chen K, Gopalakrishnan K, Matthews P, Momtahan N, Read-Jones S, Sah S, Simmons E, Sinha B, Suortamo S, Yeo Y, El Daly H, Cree IA (2016) Validation of digital pathology imaging for primary histopathological diagnosis. Histopathology 68(7):1063–1072. https://doi.org/10.1111/his.12879
125. Williams JH, Mepham BL, Wright DH (1997) Tissue preparation for immunocytochemistry. J Clin Pathol 50(5):422–428. https://doi.org/10.1136/jcp.50.5.422
126. Qin P, Chen J, Zeng J, Chai R, Wang L (2018) Large-scale tissue histopathology image segmentation based on feature pyramid. EURASIP J Image Video Process 2018(1):1–9. https://doi.org/10.1186/s13640-018-0320-8
127. Wong DR, Tang Z, Mew NC, Das S, Athey J, McAleese KE, Kofler JK, Flanagan ME, Borys E, White CL 3rd, Butte AJ, Dugger BN, Keiser MJ (2022) Deep learning from multiple experts improves identification of amyloid neuropathologies. Acta Neuropathol Commun 10(1):66. https://doi.org/10.1186/s40478-022-01365-0
128. Willuweit A, Velden J, Godemann R, Manook A, Jetzek F, Tintrup H, Kauselmann G, Zevnik B, Henriksen G, Drzezga A, Pohlner J, Schoor M, Kemp JA, von der Kammer H (2009) Early-onset and robust amyloid pathology in a new homozygous mouse model of Alzheimer's disease. PloS One 4(11):e7931. https://doi.org/10.1371/journal.pone.0007931
129. Khalsa SSS, Hollon TC, Adapa A, Urias E, Srinivasan S, Jairath N, Szczepanski J, Ouillette P, Camelo-Piragua S, Orringer DA (2020) Automated histologic diagnosis of CNS tumors with machine learning. CNS Oncol 9(2):CNS56. https://doi.org/10.2217/cns-2020-0003
130. Tandel GS, Biswas M, Kakde OG, Tiwari A, Suri HS, Turk M, Laird JR, Asare CK, Ankrah AA, Khanna NN, Madhusudhan BK, Saba L, Suri JS (2019) A review on a deep learning perspective in brain cancer classification. Cancers (Basel) 11(1):111. https://doi.org/10.3390/cancers11010111
131. Jose L, Liu S, Russo C, Nadort A, Di Ieva A (2021) Generative adversarial networks in digital pathology and histopathological image processing: a review. J Pathol Inform 12:43. https://doi.org/10.4103/jpi.jpi_103_20
132. Herrmann MD, Clunie DA, Fedorov A, Doyle SW, Pieper S, Klepeis V, Le LP, Mutter GL, Milstone DS, Schultz TJ, Kikinis R, Kotecha GK, Hwang DH, Andriole KP, Iafrate AJ, Brink JA, Boland GW, Dreyer KJ, Michalski M, Golden JA, Louis DN, Lennerz JK (2018) Implementing the DICOM standard for digital pathology. J Pathol Inform 9:37. https://doi.org/10.4103/jpi.jpi_42_18
133. Aeffner F, Zarella MD, Buchbinder N, Bui MM, Goodman MR, Hartman DJ, Lujan GM, Molani MA, Parwani AV, Lillard K, Turner OC, Vemuri VNP, Yuil-Valdes AG, Bowman D (2019) Introduction to digital image analysis in whole-slide imaging: a white paper from the digital pathology association. J Pathol Inform 10:9. https://doi.org/10.4103/jpi.jpi_82_18
134. Farahani K, Kurc T, Bakas S, Bearce BA, Kalpathy-Cramer J, Freymann J, Saltz J,
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made. The images or other
third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Chapter 19
Abstract
This chapter focuses on the joint modeling of heterogeneous information, such as imaging, clinical, and
biological data. This kind of problem requires generalizing classical uni- and multivariate association models to account for complex data structures and interactions, as well as high data dimensionality. Typical approaches are essentially based on the identification of latent modes of maximal statistical association between different sets of features; they ultimately allow the identification of joint patterns of variation between different data modalities, as well as the prediction of a target modality conditioned on the available ones. This rationale can be extended to account for several data modalities jointly, defining multi-view, or multi-channel, representations of multiple modalities. This chapter covers both classical approaches such as partial
least squares (PLS) and canonical correlation analysis (CCA), along with most recent advances based on
multi-channel variational autoencoders. Specific attention is devoted here to the problem of interpretability
and generalization of such high-dimensional models. These methods are illustrated in different medical
imaging applications, and in the joint analysis of imaging and non-imaging information, such as -omics or
clinical data.
Key words Multivariate analysis, Latent variable models, Multimodal imaging, -Omics, Imaging-
genetics, Partial least squares, Canonical correlation analysis, Variational autoencoders, Sparsity,
Interpretability
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_19,
© The Author(s) 2023
573
574 Marco Lorenzi et al.
1.2 Motivation from Neuroimaging Applications

Multimodal analysis methods have been explored for their potential in automatic patient diagnosis and stratification, as well as for their ability to identify interpretable data patterns characterizing clinical conditions. In this section, we summarize state-of-the-art contributions to the field, along with the remaining challenges to improve our understanding and applications to complex brain disorders.
• Structural-structural combination. Methods combining sMRI
and dMRI imaging modalities are predominant in the field.
Such combined analysis has been proposed, for example, for
the detection of brain lesions (e.g., strokes [6, 7]) and to study
and improve the management of patients with brain
disorders [8].
• Functional-functional combination. Due to the complementary
nature of EEG and fMRI, research in brain connectivity analysis has focused on the fusion of these modalities, to optimally integrate the high temporal resolution of EEG with the high spatial
resolution of the fMRI signal. As a result, EEG-fMRI can pro-
vide simultaneous cortical and subcortical recording of brain
activity with high spatiotemporal resolution. For example, this
combination is increasingly used to provide clinical support for
the diagnosis and treatment of epilepsy, to accurately localize
seizure onset areas, as well as to map the surrounding functional
cortex in order to avoid disability [9–11].
• Structural-functional combination. The combined analysis of
sMRI, dMRI, and fMRI has been frequently proposed in neu-
ropsychiatric research due to the high clinical availability of these
imaging modalities and due to their potential to link brain
function, structure, and connectivity. A typical application is in
the study of autism spectrum disorder and attention-deficit
hyperactivity disorder (ADHD). The combined analysis of such
modalities has been proposed, for example, for the identification
of altered white matter connectivity patterns in children with
ADHD [12], highlighting association patterns between regional
brain structural and functional abnormalities [13].
• Imaging-genetics. The combination of imaging and genetics
data has been increasingly studied to identify genetic risk factors
(genetic variations) associated with functional or structural
abnormalities (quantitative traits, QTs) in complex brain disor-
ders [3]. Such multimodal analyses are key to identify the
Integration of Multimodal Data 577
2 Methodological Background
2.1 From Multivariate Regression to Latent Variable Models

The use of multivariate analysis methods for biomedical data analysis is widespread, for example, in neuroscience [17], genetics [18], and imaging-genetics studies [19, 20]. These approaches come with the potential of explicitly highlighting the underlying relationship between data modalities, by identifying sets of relevant features that are jointly associated to explain the observed data.
In what follows, we represent the multimodal information available for a given subject $k$ as a collection of arrays $\mathbf{x}_i^k$, $i = 1, \ldots, M$, where $M$ is the number of available modalities. Each array has dimension $\dim(\mathbf{x}_i^k) = D_i$. A multimodal data matrix for $N$ individuals is therefore represented by the collection of matrices $\mathbf{X}_i$, with $\dim(\mathbf{X}_i) = N \times D_i$. For the sake of simplicity, we assume that $\mathbf{x}_i^k \in \mathbb{R}^{D_i}$.
A first assumption that can be made for defining a multivariate analysis method is that a target modality, say $\mathbf{X}_j$, is generated by the combination of a set of given modalities $\{\mathbf{X}_i\}_{i \neq j}$. A typical example of this application concerns the prediction of certain clinical variables from the combination of imaging features. In this case, the underlying forward generative model for an observation $\mathbf{x}_j^k$ can be expressed as:

$$\mathbf{x}_j^k = g\left(\{\mathbf{x}_i^k\}_{i \neq j}\right) + \boldsymbol{\varepsilon}_j^k, \qquad (1)$$

where we assume that there exists an ideal mapping $g(\cdot)$ that transforms the ensemble of observed modalities for the individual $k$ to generate the target one $\mathbf{x}_j^k$. Note that we generally assume that the observations are corrupted by a certain noise $\boldsymbol{\varepsilon}_j^k$, whose nature depends on the data type. The standard choice for the noise is Gaussian, $\boldsymbol{\varepsilon}_j^k \sim \mathcal{N}(0, \sigma^2 \mathrm{Id})$.

Within this setting, a multimodal model is represented by a function $f(\{\mathbf{X}_i\}_{i=1}^{M}, \theta)$, with parameters $\theta$, taking as input the ensemble of modalities across subjects. The model $f$ is optimized
with respect to $\theta$ to solve a specific task. In our case, the set of input modalities can be used to predict a target modality $j$; in this case we have $f : \bigotimes_{i \neq j} \mathbb{R}^{D_i} \mapsto \mathbb{R}^{D_j}$.

In its basic form, this kind of formulation includes standard multivariate linear regression, where the relationship between two modalities $\mathbf{X}_1$ and $\mathbf{X}_2$ is modeled through a set of linear parameters $\theta = \mathbf{W} \in \mathbb{R}^{D_2 \times D_1}$ and $f(\mathbf{X}_2) = \mathbf{X}_2 \mathbf{W}$. Under the Gaussian noise assumption, the typical optimization task is formulated as the least squares problem:

$$\mathbf{W} = \operatorname*{argmin}_{\mathbf{W}} \, \| \mathbf{X}_1 - \mathbf{X}_2 \mathbf{W} \|^2 . \qquad (2)$$
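Equation 2 is an ordinary least squares problem with a closed-form solution; a small numpy sketch with synthetic data standing in for the two modalities (the dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
N, D1, D2 = 200, 3, 5
X2 = rng.standard_normal((N, D2))                       # input modality
W_true = rng.standard_normal((D2, D1))                  # ground-truth mapping
X1 = X2 @ W_true + 0.01 * rng.standard_normal((N, D1))  # target + Gaussian noise

# W = argmin_W || X1 - X2 W ||^2  (Eq. 2)
W_hat, *_ = np.linalg.lstsq(X2, X1, rcond=None)
print(np.abs(W_hat - W_true).max())  # small: the true mapping is recovered
```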
¹ Note that we could also consider an overcomplete basis for the latent space such that $D > \min\{D_i,\ i = 1, \ldots, M\}$. This choice may be motivated by the need to account for modalities with particularly low dimension. The study of overcomplete latent data representations is the focus of active research [21–23].
2.2 Classical Latent Variable Models: PLS and CCA

Classical latent variable models extend the standard linear regression to analyze the joint variability of different modalities. Typical formulations of latent variable models include partial least squares (PLS) and canonical correlation analysis (CCA) [24], which have successfully been applied in biomedical research [25], along with multimodal [26, 27] and nonlinear [28, 29] variants.
#######################################
# We fit PLS and CCA as provided by scikit-learn
2.3 Latent Variable Models Through Eigen-Decomposition

2.3.1 Partial Least Squares

For PLS, the problem of Eq. 6 requires the estimation of projections $u_1$ and $u_2$ maximizing the covariance between the latent representation of the two modalities $X_1$ and $X_2$:

$u_1, u_2 = \operatorname{argmax}_{u_1, u_2} \mathrm{Cov}(X_1 u_1, X_2 u_2),$    (8)

where

$\mathrm{Cov}(X_1 u_1, X_2 u_2) = \frac{u_1^T S u_2}{\sqrt{u_1^T u_1}\,\sqrt{u_2^T u_2}},$    (9)

where

$\mathrm{Corr}(X_1 u_1, X_2 u_2) = \frac{u_1^T S u_2}{\sqrt{u_1^T S_1 u_1}\,\sqrt{u_2^T S_2 u_2}}.$    (13)
2.4 Kernel Methods for Latent Variable Models

In order to capture nonlinear relationships, we may wish to project our input features into a high-dimensional space prior to performing CCA (or PLS):

$\phi : X = (x_1, \ldots, x_N) \mapsto (\phi(x_1), \ldots, \phi(x_N)),$    (16)

where $\phi$ is a nonlinear feature map. As derived by Bach et al. [33], the data matrices $X_1$ and $X_2$ can be replaced by the Gram matrices $K_1$ and $K_2$ such that we can achieve a nonlinear feature mapping via the kernel trick [34]:
$K_1(x_1^i, x_1^j) = \langle \phi(x_1^i), \phi(x_1^j) \rangle \quad \text{and} \quad K_2(x_2^i, x_2^j) = \langle \phi(x_2^i), \phi(x_2^j) \rangle,$    (17)

where $K_1 = [K_1(x_1^i, x_1^j)]_{N \times N}$ and $K_2 = [K_2(x_2^i, x_2^j)]_{N \times N}$. In this case, kernel CCA canonical directions correspond to the solutions of the updated generalized eigenvalue problem:

$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \lambda \begin{pmatrix} K_1^2 & 0 \\ 0 & K_2^2 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}.$    (18)
Similarly to the primal formulation of CCA, we can apply an $\ell_2$-norm regularization penalty on the weights $\alpha_1$ and $\alpha_2$ of Eq. 18, giving rise to regularized kernel CCA:

$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \lambda \begin{pmatrix} K_1^2 + \delta I & 0 \\ 0 & K_2^2 + \delta I \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}.$    (19)
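As a hedged sketch of Eq. 19, the regularized kernel CCA problem can be assembled and solved directly with a generalized symmetric eigensolver; linear kernels and all parameter choices below are our illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Synthetic paired views with a shared latent source z
N = 100
z = rng.standard_normal(N)
X1 = np.outer(z, rng.standard_normal(3)) + 0.3 * rng.standard_normal((N, 3))
X2 = np.outer(z, rng.standard_normal(5)) + 0.3 * rng.standard_normal((N, 5))

# Gram matrices (linear kernel here; any kernel of Eq. 17 would do)
K1, K2 = X1 @ X1.T, X2 @ X2.T

# Assemble the generalized eigenproblem of Eq. 19
delta = 1e-3 * N  # l2 regularization strength (arbitrary choice)
Z = np.zeros((N, N))
A = np.block([[Z, K1 @ K2], [K2 @ K1, Z]])
B = np.block([[K1 @ K1 + delta * np.eye(N), Z],
              [Z, K2 @ K2 + delta * np.eye(N)]])

# eigh solves the symmetric-definite problem A v = lambda B v;
# the largest eigenvalue is the (regularized) canonical correlation
vals, vecs = eigh(A, B)
rho = vals[-1]
alpha1, alpha2 = vecs[:N, -1], vecs[N:, -1]

# Kernel canonical scores of the two views are highly correlated
r = np.corrcoef(K1 @ alpha1, K2 @ alpha2)[0, 1]
print(round(rho, 2), round(abs(r), 2))
```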
2.5 Optimization of Latent Variable Models

The nonlinear iterative partial least squares (NIPALS) algorithm is a classical scheme proposed by H. Wold [35] for the optimization of latent variable models through the iterative computation of PLS and CCA projections. Within this method, the projections associated with the modalities $X_1$ and $X_2$ are obtained through the iterative solution of simple least squares problems.

The principle of NIPALS is to identify projection vectors $u_1, u_2$ and corresponding latent representations $z_1$ and $z_2$ to minimize the functionals

$\mathcal{L}_i = \| X_i - z_i u_i^T \|^2,$    (20)

subject to the constraint of maximal similarity between representations $z_1$ and $z_2$ (Fig. 3).
Following [37], the NIPALS method is optimized as follows (Algorithm 1). The latent projection for modality 1 is first initialized as $z_1^{(0)}$ from randomly chosen columns of the data matrix $X_1$. Subsequently, the linear regression function

$\mathcal{L}_2 = \| X_2 - z_1^{(0)} u_2^T \|^2$

is optimized with respect to $u_2$, to obtain the projection $u_2^{(0)}$. After unit scaling of the projection coefficients, the new latent representation is computed for modality 2 as $z_2^{(0)} = X_2 u_2^{(0)}$. At this point, the latent projection is used for a new optimization step of the linear regression problem

$\mathcal{L}_1 = \| X_1 - z_2^{(0)} u_1^T \|^2,$

this time with respect to $u_1$, to obtain the projection parameters $u_1^{(0)}$ relative to modality 1. After unit scaling of the coefficients, the new latent representation is computed for modality 1 as $z_1^{(1)} = X_1 u_1^{(0)}$. The whole procedure is then iterated.
$X_i \leftarrow X_i - z_i \frac{z_i^T X_i}{z_i^T z_i}$    (21)
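A minimal NumPy sketch of ours of one NIPALS component for PLS, following the iteration described above (initialization and convergence details are our simplifications; further components would require the deflation of Eq. 21):

```python
import numpy as np

def nipals_pls(X1, X2, n_iter=100, tol=1e-10):
    """One pair of PLS projections via NIPALS (sketch of Algorithm 1)."""
    z1 = X1[:, [0]]                       # init from a column of X1
    for _ in range(n_iter):
        u2 = X2.T @ z1 / (z1.T @ z1)      # regress X2 on z1
        u2 /= np.linalg.norm(u2)          # unit scaling
        z2 = X2 @ u2                      # latent scores, modality 2
        u1 = X1.T @ z2 / (z2.T @ z2)      # regress X1 on z2
        u1 /= np.linalg.norm(u1)
        z1_new = X1 @ u1                  # latent scores, modality 1
        if np.linalg.norm(z1_new - z1) < tol:
            z1 = z1_new
            break
        z1 = z1_new
    return u1, u2, z1, z2

rng = np.random.default_rng(0)
z = rng.standard_normal(200)
X1 = np.outer(z, rng.standard_normal(4)) + 0.2 * rng.standard_normal((200, 4))
X2 = np.outer(z, rng.standard_normal(3)) + 0.2 * rng.standard_normal((200, 3))
u1, u2, z1, z2 = nipals_pls(X1, X2)

# u1 maximizes covariance: compare with the leading SVD direction of X1.T @ X2
U, _, Vt = np.linalg.svd(X1.T @ X2)
align = abs(np.dot(u1.ravel(), U[:, 0]))
print(round(align, 3))  # ≈ 1: same direction up to sign
```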
3.1 Multi-view PPCA

Recently, the seminal work of Tipping and Bishop on probabilistic PCA (PPCA) [40] has been extended to allow the joint integration of multimodal data [41] (multi-view PPCA), under the assumption of a common latent space able to explain and generate all modalities.

Recalling the notation of Subheading 2.1, let $x = \{x_i^k\}_{i=1}^{M}$ be an observation of M modalities for subject $k$, where each $x_i^k$ is a vector
Fig. 4 (a) Graphical model of multi-view PPCA. The green node represents the
latent variable able to jointly describe all observed data explaining the patient
status. Gray nodes denote original multimodal data, and blue nodes the view-
specific parameters. (b) Hierarchical structure of multi-view PPCA: prior knowl-
edge on model’s parameters can be integrated in a natural way when the model
is embedded in a Bayesian framework
3.1.1 Optimization

In order to solve the inference problem and estimate the model's parameters $\theta$, the classical expectation-maximization (EM) scheme can be deployed. EM optimization consists of an iterative process where each iteration is composed of two steps:
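Concretely, the expectation (E) step computes the posterior moments of the latent variables under the current parameters, and the maximization (M) step updates the parameters. As an illustration, here is our sketch of the closed-form EM updates for classical single-view PPCA [40]; the multi-view case [41] adds view-specific loading matrices but follows the same pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from a 2-dimensional latent space
N, d, q = 500, 6, 2
Z = rng.standard_normal((N, q))
W_true = rng.standard_normal((d, q))
X = Z @ W_true.T + 0.1 * rng.standard_normal((N, d))
X -= X.mean(axis=0)

# EM for PPCA (Tipping & Bishop closed-form combined updates)
W = rng.standard_normal((d, q))
sigma2 = 1.0
S = X.T @ X / N  # sample covariance
for _ in range(200):
    # E-step quantities: posterior of z has precision-related matrix M
    M = W.T @ W + sigma2 * np.eye(q)
    Minv = np.linalg.inv(M)
    # M-step: update loadings, then the noise variance
    W_new = S @ W @ np.linalg.inv(sigma2 * np.eye(q) + Minv @ W.T @ S @ W)
    sigma2 = np.trace(S - S @ W @ Minv @ W_new.T) / d
    W = W_new

print(round(sigma2, 3))  # ≈ the true noise variance (0.1**2 = 0.01)
```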
3.2 Bayesian Latent Variable Models via Autoencoding

Autoencoders and variational autoencoders have become very popular approaches for the estimation of latent representations of complex data, which allow powerful extensions of the Bayesian models presented in Subheading 3.1 to account for nonlinear and deep data representations.
Autoencoders (AEs) extend classical latent variable models to
account for complex, potentially highly nonlinear, projections from
the data space to the latent space (encoding), along with recon-
struction functions (decoding) mapping the latent representation
back to the data space. Since typical encoding ( fe) and decoding
( fd) functions of AEs are parameterized by feedforward neural
networks, inference can be efficiently performed by means of sto-
chastic gradient descent through backpropagation. In this sense,
AEs can be seen as a powerful extension of classical PCA, where
encoding into the latent representations and decoding are jointly
optimized to minimize the reconstruction error of the data:
$\mathcal{L} = \| X - f_d(f_e(X)) \|_2^2.$    (25)
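As a toy illustration of the objective in Eq. 25, the sketch below trains a purely linear autoencoder by gradient descent in NumPy; real AEs use deep networks and stochastic gradients, so this is only meant to show the reconstruction loss being minimized:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data lying close to a 2-dimensional subspace of R^5
N, d, q = 400, 5, 2
X = rng.standard_normal((N, q)) @ rng.standard_normal((q, d))
X += 0.05 * rng.standard_normal((N, d))

# Linear encoder f_e(x) = x @ We and decoder f_d(z) = z @ Wd,
# trained jointly by gradient descent on the reconstruction loss (Eq. 25)
We = 0.1 * rng.standard_normal((d, q))
Wd = 0.1 * rng.standard_normal((q, d))
lr = 0.05
for _ in range(3000):
    Zc = X @ We                    # encode
    R = Zc @ Wd - X                # reconstruction residual
    # gradients of (1/N)||X - f_d(f_e(X))||^2 w.r.t. We and Wd
    gWe = X.T @ (R @ Wd.T) / N
    gWd = Zc.T @ R / N
    We -= lr * gWe
    Wd -= lr * gWd

loss = np.sum((X - (X @ We) @ Wd) ** 2) / N
print(round(loss, 3))  # small residual: the AE recovers the 2-D subspace
```

For a linear encoder/decoder this recovers the PCA subspace, which is the sense in which AEs generalize classical PCA.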
The variational autoencoder (VAE) [42, 43] introduces a Bayesian formulation of AEs, akin to PPCA, where the latent variables are inferred by estimating the associated posterior distributions. In this case, optimization can be efficiently performed by stochastic variational inference [44], where the moments of the variational posterior over the latent variables are parameterized by neural networks.
In the same way PLS and CCA extend PCA for multimodal analysis, research has been devoted to defining equivalent extensions of the VAE that identify common latent representations of multiple data modalities, such as the multi-channel VAE (mcVAE) [23] or deep CCA [29]. These approaches are based on a similar formulation, which is provided in the following section.
the latent space which can, on average, jointly reconstruct all the
other channels. This term thus enforces a coherent latent represen-
tation across different modalities and is balanced by the regulariza-
tion term R, which constrains the latent representation of each
modality to the common prior p(z). As for standard VAEs, encoding and decoding functions can be arbitrarily chosen to parameterize, respectively, the latent distributions and the data likelihoods. Typical choices for such functions are neural networks, which can provide extremely flexible and powerful data representations (Box 6). For
example, leveraging the modeling capabilities of deep convolu-
tional networks, mcVAE has been used in a recent cardiovascular
study for the prediction of cardiac MRI data from retinal fundus
images [46].
import torch
from mcvae.models import Mcvae
from mcvae.models.utils import DEVICE, load_or_fit
3.4 Deep CCA

The mcVAE uses neural network layers to learn nonlinear representations of multimodal data. Similarly, deep CCA [29] provides an alternative to kernel CCA to learn nonlinear mappings of multimodal information. Deep CCA computes representations by passing the two views through functions $f_1$ and $f_2$ with parameters $\theta_1$ and $\theta_2$, respectively, which can be learnt by multilayer neural networks.
The parameters are optimized by maximizing the correlation between the learned representations $f_1(X_1; \theta_1)$ and $f_2(X_2; \theta_2)$:

$\theta_1^{opt}, \theta_2^{opt} = \operatorname{argmax}_{\theta_1, \theta_2} \mathrm{Corr}(f_1(X_1; \theta_1), f_2(X_2; \theta_2)).$    (29)
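To make Eq. 29 concrete, here is a deliberately small sketch of ours: each "network" is a fixed random hidden layer with a trainable linear readout, and the negative correlation is minimized in full batch with SciPy's L-BFGS implementation. All data, architecture, and parameter choices are illustrative assumptions, not the original deep CCA implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Two views of a common latent signal z, nonlinearly mixed
N = 300
z = rng.standard_normal(N)
X1 = np.column_stack([np.tanh(z), z ** 2]) + 0.1 * rng.standard_normal((N, 2))
X2 = np.column_stack([z, np.cos(z)]) + 0.1 * rng.standard_normal((N, 2))

# Tiny "networks": fixed random hidden layer + trainable output weights
A1 = rng.standard_normal((2, 16))
A2 = rng.standard_normal((2, 16))
H1, H2 = np.tanh(X1 @ A1), np.tanh(X2 @ A2)

def neg_corr(theta):
    # negative of the correlation objective of Eq. 29
    t1, t2 = H1 @ theta[:16], H2 @ theta[16:]
    return -np.corrcoef(t1, t2)[0, 1]

# Full-batch optimization with L-BFGS, as the text prescribes
res = minimize(neg_corr, rng.standard_normal(32), method="L-BFGS-B")
print(round(-res.fun, 2))  # maximized correlation between the two views
```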
In its classical formulation, the correlation objective given in Eq. 29 is a function of the full training set, and as such, mini-batch optimization can lead to suboptimal results. Therefore, classical deep CCA must be trained with full-batch optimization, for example, through the L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) scheme [47]. For this reason, with this vanilla implementation, deep CCA is not computationally viable for large datasets. Furthermore, this approach does not provide a model for generating samples from the latent space. To address these issues,
Wang et al. [48] introduced deep variational CCA (VCCA) which
extends the probabilistic CCA framework introduced in Subhead-
ing 3 to a nonlinear generative model. In a similar approach to
VAEs and mcVAE, deep VCCA uses variational inference to
approximate the posterior distribution and derives the
following ELBO:
where $R(W_l) = \sum_{j=1}^{D_1} \sqrt{\sum_{s \in S_l} W[s, j]^2}$ is the penalization of the entries of $W$ associated with the features of $X_2$ indexed by $S_l$. The total penalty is obtained by summing across the $D_1$ columns.
Group-wise regularization is particularly effective in the following situations:
• To compensate for large data dimensionality, reducing the number of "free parameters" to be optimized by aggregating the available features [51].
2
https://www.genome.jp/kegg/pathway.html.
3
http://geneontology.org/.
Fig. 5 The multi-channel VAE (mcVAE) for the joint modeling of multimodal
medical imaging, clinical, and biological information. The mcVAE approximates
the latent posterior p(z|X1, X2, X3, X4) to maximize the likelihood of the data
reconstruction p(X1, X2, X3, X4|z) (plus a regularization term)
5 Conclusions
Acknowledgements
References
1. Civelek M, Lusis AJ (2014) Systems genetics approaches to understand complex traits. Nat Rev Gen 15(1):34–48. https://doi.org/10.1038/nrg3575
2. Liu S, Cai W, Liu S, Zhang F, Fulham M, Feng D, Pujol S, Kikinis R (2015) Multimodal neuroimaging computing: a review of the applications in neuropsychiatric disorders. Brain Inform 2(3):167–180. https://doi.org/10.1007/s40708-015-0019-x
3. Shen L, Thompson PM (2020) Brain imaging genomics: integrated analysis and machine learning. Proc IEEE Inst Electr Electron Eng 108(1):125–162. https://doi.org/10.1109/JPROC.2019.2947272
4. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J (2017) 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet 101(1):5–22. https://doi.org/10.1016/j.ajhg.2017.06.005
5. Lahat D, Adali T, Jutten C (2014) Challenges in multimodal data fusion. In: EUSIPCO 2014—22nd European signal processing conference, Lisbon, Portugal, pp 101–105. https://hal.archives-ouvertes.fr/hal-01062366
6. Menon BK, Campbell BC, Levi C, Goyal M (2015) Role of imaging in current acute ischemic stroke workflow for endovascular therapy. Stroke 46(6):1453–1461. https://doi.org/10.1161/STROKEAHA.115.009160
7. Zameer S, Siddiqui AS, Riaz R (2021) Multimodality imaging in acute ischemic stroke. Curr Med Imaging 17(5):567–577
8. Liu X, Lai Y, Wang X, Hao C, Chen L, Zhou Z, Yu X, Hong N (2013) A combined DTI and structural MRI study in medicated-naïve chronic schizophrenia. Magn Reson Imaging 32(1):1–8
9. Rastogi S, Lee C, Salamon N (2008) Neuroimaging in pediatric epilepsy: a multimodality approach. Radiographics 28(4):1079–1095
10. Abela E, Rummel C, Hauf M, Weisstanner C, Schindler K, Wiest R (2014) Neuroimaging of epilepsy: lesions, networks, oscillations. Clin Neuroradiol 24(1):5–15
11. Fernández S, Donaire A, Serès E, Setoain X, Bargalló N, Falcén C, Sanmartí F, Maestro I, Rumià J, Pintor L, Boget T, Aparicio J, Carreño M (2015) PET/MRI and PET/MRI/SISCOM coregistration in the presurgical evaluation of refractory focal epilepsy. Epilepsy Research 111:1–9. https://doi.org/10.1016/j.eplepsyres.2014.12.011
12. Hong SB, Zalesky A, Fornito A, Park S, Yang YH, Park MH, Song IC, Sohn CH, Shin MS, Kim BN, Cho SC, Han DH, Cheong JH, Kim JW (2014) Connectomic disturbances in attention-deficit/hyperactivity disorder: a whole-brain tractography analysis. Biol Psychiatry 76(8):656–663
13. Mueller S, Keeser D, Samson AC, Kirsch V, Blautzik J, Grothe M, Erat O, Hegenloh M, Coates U, Reiser MF, Hennig-Fast K, Meindl T (2013) Convergent findings of altered functional and structural brain connectivity in individuals with high functioning autism: a multimodal MRI study. PLOS ONE 8(6):1–11. https://doi.org/10.1371/journal.pone.0067329
14. Lorenzi M, Altmann A, Gutman B, Wray S, Arber C, Hibar DP, Jahanshad N, Schott JM, Alexander DC, Thompson PM, Ourselin S, et al (2018) Susceptibility of brain atrophy to TRIB3 in Alzheimer's disease, evidence from functional prioritization in imaging genetics. Proc Natl Acad Sci 115(12):3162–3167. https://doi.org/10.1073/pnas.1706100115
15. Kim M, Kim J, Lee SH, Park H (2017) Imaging genetics approach to Parkinson's disease and its correlation with clinical score. Sci Rep 7(1):46700. https://doi.org/10.1038/srep46700
16. Martins D, Giacomel A, Williams SC, Turkheimer F, Dipasquale O, Veronese M, Group PTW, et al (2021) Imaging transcriptomics: convergent cellular, transcriptomic, and molecular neuroimaging signatures in the healthy adult human brain. Cell Rep 37(13):110173
17. Schrouff J, Rosa MJ, Rondina JM, Marquand AF, Chu C, Ashburner J, Phillips C, Richiardi J, Mourão-Miranda J (2013) PRoNTo: pattern recognition for neuroimaging toolbox. Neuroinformatics 11(3):319–337
18. Szymczak S, Biernacka JM, Cordell HJ, González-Recio O, König IR, Zhang H, Sun YV (2009) Machine learning in genome-wide association studies. Genetic Epidemiol 33(S1):S51–S57
19. Liu J, Calhoun VD (2014) A review of multivariate analyses in imaging genetics. Front Neuroinform 8:29
20. Lorenzi M, Altmann A, Gutman B, Wray S, Arber C, Hibar DP, Jahanshad N, Schott JM, Alexander DC, Thompson PM, Ourselin S (2018) Susceptibility of brain atrophy to TRIB3 in Alzheimer's disease, evidence from functional prioritization in imaging genetics. Proc Natl Acad Sci 115(12):3162–3167. https://doi.org/10.1073/pnas.1706100115
21. Shashanka M, Raj B, Smaragdis P (2007) Sparse overcomplete latent variable decomposition of counts data. In: Advances in neural information processing systems, vol 20
22. Anandkumar A, Ge R, Janzamin M (2015) Learning overcomplete latent variable models through tensor methods. In: Conference on learning theory, PMLR, pp 36–112
23. Antelmi L, Ayache N, Robert P, Lorenzi M (2019) Sparse multi-channel variational autoencoder for the joint analysis of heterogeneous data. In: International conference on machine learning, PMLR, pp 302–311
24. Hotelling H (1936) Relations between two sets of variates. Biometrika 28(3/4):321
25. Liu J, Calhoun V (2014) A review of multivariate analyses in imaging genetics. Front Neuroinform 8:29. https://doi.org/10.3389/fninf.2014.00029
26. Kettenring JR (1971) Canonical analysis of several sets of variables. Biometrika 58(3):433–451. https://doi.org/10.1093/biomet/58.3.433
27. Luo Y, Tao D, Ramamohanarao K, Xu C, Wen Y (2015) Tensor canonical correlation analysis for multi-view dimension reduction. IEEE Trans Knowl Data Eng 27(11):3111–3124. https://doi.org/10.1109/TKDE.2015.2445757
28. Huang SY, Lee MH, Hsiao CK (2009) Nonlinear measures of association with kernel canonical correlation analysis and applications. J Stat Plan Inference 139(7):2162–2174. https://doi.org/10.1016/j.jspi.2008.10.011
29. Andrew G, Arora R, Bilmes J, Livescu K (2013) Deep canonical correlation analysis. In: Dasgupta S, McAllester D (eds) Proceedings of the 30th international conference on machine learning, PMLR, Atlanta, Georgia, USA, Proceedings of Machine Learning Research, vol 28, pp 1247–1255. https://proceedings.mlr.press/v28/andrew13.html
30. McIntosh A, Bookstein F, Haxby JV, Grady C (1996) Spatial pattern analysis of functional brain images using partial least squares. Neuroimage 3(3):143–157
31. Worsley KJ (1997) An overview and some new developments in the statistical analysis of PET and fMRI data. Hum Brain Mapp 5(4):254–258
32. De Bie T, Cristianini N, Rosipal R (2005) Eigenproblems in pattern recognition. In: Handbook of geometric computing, pp 129–167
33. Bach F, Jordan M (2003) Kernel independent component analysis. J Mach Learn Res 3:1–48. https://doi.org/10.1162/153244303768966085
34. Theodoridis S, Koutroumbas K (2008) Pattern recognition, 4th edn. Academic Press, New York
35. Wold H (1975) Path models with latent variables: the NIPALS approach. In: Quantitative sociology. Elsevier, Amsterdam, pp 307–357
36. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
37. Tenenhaus M (1999) L'approche PLS. Revue de statistique appliquée 47(2):5–40
38. Vidaurre D, van Gerven MA, Bielza C, Larrañaga P, Heskes T (2013) Bayesian sparse partial least squares. Neural Comput 25(12):3318–3339
39. Klami A, Virtanen S, Kaski S (2013) Bayesian canonical correlation analysis. J Mach Learn Res 14(4):965–1003
40. Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc Series B (Statistical Methodology) 61(3):611–622
41. Balelli I, Silva S, Lorenzi M (2021) A probabilistic framework for modeling the variability across federated datasets of heterogeneous multi-view observations. In: Information processing in medical imaging: proceedings of the . . . conference.
42. Kingma DP, Welling M (2014) Auto-encoding variational Bayes. In: Proc. 2nd Int. Conf. Learn. Represent. (ICLR 2014) 1312.6114
43. Rezende DJ, Mohamed S, Wierstra D (2014) Stochastic backpropagation and approximate inference in deep generative models. In: International conference on machine learning. PMLR, pp 1278–1286
44. Kim Y, Wiseman S, Miller A, Sontag D, Rush A (2018) Semi-amortized variational autoencoders. In: International conference on machine learning. PMLR, pp 2678–2687
45. Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J Am Stat Assoc 112(518):859–877
46. Diaz-Pinto A, Ravikumar N, Attar R, Suinesiaputra A, Zhao Y, Levelt E, Dall'Armellina E, Lorenzi M, Chen Q, Keenan TD et al (2022) Predicting myocardial infarction through retinal scans and minimal personal information. Nat Mach Intell 4:55–61
47. Nocedal J, Wright S (2006) Numerical optimization. Springer Nature, pp 1–664. Springer series in operations research and financial engineering
48. Wang W, Lee H, Livescu K (2016) Deep variational canonical correlation analysis. http://arxiv.org/abs/1610.03454
49. Hafkemeijer A, Altmann-Schneider I, Oleksik AM, van de Wiel L, Middelkoop HA, van Buchem MA, van der Grond J, Rombouts SA (2013) Increased functional connectivity and brain atrophy in elderly with subjective memory complaints. Brain Connectivity 3(4):353–362
50. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Series B (Statistical Methodology) 68(1):49–67
51. Zhang Y, Xu Z, Shen X, Pan W, Initiative ADN (2014) Testing for association with multiple traits in generalized estimation equations, with application to neuroimaging data. NeuroImage 96:309–325. https://doi.org/10.1016/j.neuroimage.2014.03.061
52. Hibar DP, Stein JL, Kohannim O, Jahanshad N, Saykin AJ, Shen L, Kim S, Pankratz N, Foroud T, Huentelman MJ, Potkin SG, Jack Jr CR, Weiner MW, Toga AW, Thompson PM, Initiative ADN (2011) Voxelwise gene-wide association study (vGeneWAS): multivariate gene-based association testing in 731 elderly subjects. NeuroImage 56(4):1875–1891. https://doi.org/10.1016/j.neuroimage.2011.03.077
53. Ge T, Feng J, Hibar DP, Thompson PM, Nichols TE (2012) Increasing power for voxel-wise genome-wide association studies: the random field theory, least square kernel machines and fast permutation procedures. NeuroImage 63:858–873
54. Schmidt W, Kraaijveld M, Duin R (1992) Feedforward neural networks with random weights. In: Proceedings of the 11th IAPR international conference on pattern recognition. Vol. II. Conference B: pattern recognition methodology and systems, pp 1–4. https://doi.org/10.1109/ICPR.1992.201708
55. Deprez M, Moreira J, Sermesant M, Lorenzi M (2022) Decoding genetic markers of multiple phenotypic layers through biologically constrained genome-to-phenome Bayesian sparse regression. Front Mol Med. https://doi.org/10.3389/fmmed.2022.830956
56. Molchanov D, Ashukha A, Vetrov D (2017) Variational dropout sparsifies deep neural networks. arXiv 1701.05369
57. Kingma DP, Welling M (2014) Auto-encoding variational Bayes. CoRR abs/1312.6114
58. Pearlson GD, Liu J, Calhoun VD (2015) An introductory review of parallel independent component analysis (p-ICA) and a guide to applying p-ICA to genetic data and imaging phenotypes to identify disease-associated biological pathways and systems in common complex disorders. Front Genetics 6:276
59. Le Floch É, Guillemot V, Frouin V, Pinel P, Lalanne C, Trinchera L, Tenenhaus A, Moreno A, Zilbovicius M, Bourgeron T et al (2012) Significant correlation between a set of genetic polymorphisms and a functional brain network revealed by feature selection and sparse partial least squares. Neuroimage 63(1):11–24
60. Rodin I, Fedulova I, Shelmanov A, Dylov DV (2019) Multitask and multimodal neural network model for interpretable analysis of x-ray images. In: 2019 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 1601–1604
61. Huang SC, Pareek A, Zamanian R, Banerjee I, Lungren MP (2020) Multimodal fusion with deep neural networks for leveraging CT imaging and electronic health record: a case-study in pulmonary embolism detection. Sci Rep 10(1):1–9
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made. The images or other
third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Part IV
Abstract
This chapter describes model validation, a crucial part of machine learning whether it is to select the best
model or to assess performance of a given model. We start by detailing the main performance metrics for
different tasks (classification, regression), and how they may be interpreted, including in the face of class
imbalance, varying prevalence, or asymmetric cost–benefit trade-offs. We then explain how to estimate
these metrics in an unbiased manner using training, validation, and test sets. We describe cross-validation
procedures—to use a larger part of the data for both training and testing—and the dangers of data
leakage—optimism bias due to training data contaminating the test set. Finally, we discuss how to obtain
confidence intervals of performance metrics, distinguishing two situations: internal validation or evaluation
of learning algorithms and external validation or evaluation of resulting prediction models.
Key words Validation, Performance metrics, Cross-validation, Data leakage, External validation
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_20,
© The Author(s) 2023
601
602 Gael Varoquaux and Olivier Colliot
2 Performance Metrics
2.1 Metrics for Classification

2.1.1 Binary Classification

For classification tasks, the results can be summarized in a matrix called the confusion matrix (Fig. 1). For binary classification, the confusion matrix divides the test samples into four categories, depending on their true and predicted labels:
• True Positives (TP): Samples for which the true and predicted
labels are both 1. Example: The patient has cancer (1), and the
model classifies this sample as cancer (1).
• True Negatives (TN): Samples for which the true and predicted
labels are both 0. Example: The patient does not have cancer (0),
and the model classifies this sample as non-cancer (0).
• False Positives (FP): Samples for which the true label is 0 and
the predicted label is 1. Example: The patient does not have
cancer (0), and the model classifies this sample as cancer (1).
• False Negatives (FN): Samples for which the true label is 1 and
the predicted label is 0. Example: The patient has cancer (1), and
the model classifies this sample as non-cancer (0).
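These four categories can be counted directly from paired label lists (toy labels of ours):

```python
# Counting the four confusion-matrix cells for a binary classifier
y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # 1 = cancer, 0 = non-cancer
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print(tp, tn, fp, fn)  # → 3 3 1 1
```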
Are false positives and false negatives equally problematic? This depends on the application. For instance, consider the case of detecting brain tumors. For a screening application, detected positive cases would subsequently be reviewed by a human expert, and one can thus consider that false negatives (missed brain tumors) lead to more dramatic consequences than false positives. Conversely, if a detected tumor leads the patient to be sent to brain surgery without a complementary exam, false positives are problematic, as brain surgery is not a benign operation. For automatic volumetry from magnetic resonance images (MRI), one could argue that false positives and false negatives are equally problematic.
Box 1 (continued)
Summary metrics

• Accuracy: Fraction of the samples correctly classified.
  Accuracy = (TP + TN) / (TP + FP + TN + FN).
• Balanced accuracy (BA): Accuracy metric that accounts for unbalanced samples.
  BA = (Sensitivity + Specificity) / 2.
• F1 score: Harmonic mean of PPV (precision) and sensitivity (recall).
  F1 = 2 / (1/PPV + 1/Sensitivity) = 2TP / (2TP + FP + FN).
• Markedness = TP/(TP + FP) − FN/(FN + TN) = PPV + NPV − 1.
• Area under the receiver operating characteristic curve (ROC AUC).
• Area under the precision–recall curve (PR AUC, also called average precision).
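The summary metrics of Box 1 can be computed directly from the four confusion-matrix counts; the counts below are arbitrary toy values of ours:

```python
# Summary metrics of Box 1 computed from confusion-matrix counts
tp, tn, fp, fn = 90, 50, 10, 30

sensitivity = tp / (tp + fn)            # recall
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                    # precision
npv = tn / (tn + fn)

accuracy = (tp + tn) / (tp + fp + tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2
f1 = 2 * tp / (2 * tp + fp + fn)        # harmonic mean of PPV and sensitivity
markedness = ppv + npv - 1

print(round(accuracy, 3), round(balanced_accuracy, 3),
      round(f1, 3), round(markedness, 3))
```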
Simple Summaries and Their Pitfalls

It is convenient to use summary metrics that provide a more global assessment of the performance, for instance, to select a "best" model. However, as we will see, summary metrics, when used in isolation, can lead to erroneous conclusions. The most widely used summary metric is arguably accuracy. Its main advantage is a natural interpretation: the proportion of correctly classified samples. However, it is misleading when the data are imbalanced. Let us for
Machine-Learning Evaluation 605
Bayes’ rule thus shows that we must account for the prevalence: the
proportion of the people with the disease in the target population,
the population in which the test is intended to be applied. The
target population can be the general population for a screening test.
It could be the population of people with memory complaints for a
test aiming to diagnose Alzheimer’s disease. Now, suppose that the
prevalence is low, which will often be the case for a screening test in
the general population. For instance, prevalence = 0.001. This
leads to P(D + jT +) = 0.0098 ≈ 1%. So, if the test is positive, there
is only 1% chance that the patient has the disease. Even though our
classifier has seemingly good sensitivity, specificity, and balanced
accuracy, it is not very informative on the general population. The
PPV and NPV readily give the information of interest: P(D + jT +)
and P(D -jT -). However, they are not natural metrics to report a
classifier’s performance because, unlike sensitivity and specificity,
they are not intrinsic to the test (in other words the trained ML
model) but also depend on the prevalence and thus on the target
population (Fig. 2).
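This computation is a direct application of Bayes' rule. In the sketch below, the sensitivity and specificity values are our assumptions (the chapter's worked example precedes this excerpt); with these values one recovers the roughly 1% figure quoted above:

```python
# Post-test probability P(D+|T+) from sensitivity, specificity, prevalence
# via Bayes' rule; sensitivity/specificity values are illustrative assumptions.
sensitivity, specificity = 0.99, 0.90
prevalence = 0.001

ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)
print(round(ppv, 4))  # → 0.0098: only ~1% disease probability given a positive test
```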
Fig. 2 NPV and PPV as functions of prevalence when the sensitivity and the specificity are fixed (image courtesy of Johann Faouzi)
Summary Metrics for Low Prevalence

The F1 score is another summary metric, built as the harmonic mean of the sensitivity (recall) and PPV (precision). It is popular in machine learning but, as we will see, it also has substantial drawbacks. Note that it is equal to the Dice coefficient used for segmentation. Given that it builds on the PPV rather than the specificity to characterize retrieval, it accounts slightly better for prevalence. In our example, the F1 score would have been low.

The F1 score can nevertheless be misleading if the prevalence is high. In such a case, one can have high values for sensitivity, specificity, PPV, and F1 score but a low NPV. A solution can be to exchange the two classes, after which the F1 score becomes informative again.
Those shortcomings are fundamental, as the F1 score is
completely blind to the number of true negatives, TNs. This is
probably one of the reasons why it is a popular metric for seg-
mentation (usually called Dice rather than F1) as in this task TN is
almost meaningless (TN can be made arbitrarily large by just
changing the field of view of the image). In addition, this metric
has no simple link to the probabilities of interest, even more so
after switching classes.
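This blindness is visible directly in the formula, which never references TN (toy counts of ours):

```python
# The F1 score ignores true negatives: TN does not even appear as an input,
# so scaling TN (e.g., enlarging the field of view in segmentation)
# leaves F1 unchanged.
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

tp, fp, fn = 80, 20, 10
print(round(f1(tp, fp, fn), 3))  # identical whether TN = 10 or TN = 10_000
```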
Another option is to use the Matthews correlation coefficient (MCC). The MCC makes full use of the confusion matrix and can remain informative even when prevalence is very low or very high. However, its interpretation may be less intuitive than that of the other metrics. Finally, markedness [2] is a little-known summary metric that deals well with low-prevalence situations, as it is built from the PPV and NPV (Box 1). Its drawback is that it is as much related to the population under study as to the classifier.
As we have seen, it is important to distinguish metrics that are intrinsic characteristics of the classifier (sensitivity, specificity, balanced accuracy) from those that depend on the target population and, in particular, on its prevalence (PPV, NPV, MCC, markedness). The former are independent of the situation in
which the model is going to be used. The latter inform on the
probability of the condition (the output label) given the output
of the classifier, but they depend on the operational situation
and, in particular, on the prevalence. The prevalence can be
variable (for instance, the prevalence of an infectious disease
will be variable across time, and the prevalence of a neurodegen-
erative disease will depend on the age of the target population),
and a given classifier may be intended to be applied in various
situations. This is why the intrinsic characteristics (sensitivity and
specificity) need to be judged according to the different
intended uses of the classifier (e.g., a specificity of 90% may
be considered excellent for some applications, while it would
be considered unacceptable if the intended use is in a
low-prevalence situation).
Metrics for Shifts in Odds enable designing metrics that characterize the classifier but
Prevalence are adapted to target populations with a low prevalence. Odds are
defined as the ratio between the probability that an event occurs
and the probability this event does not occur: OðaÞ = 1 -PðaÞ P ðaÞ. Ratios
between odds can be invariant to the sampling frequency
(or prevalence) of a—see Appendix “Odds Ratio and Diagnostic
Tests Evaluation” for an introduction to odds and their important
properties. For this reason, they are often used in epidemiology. A
classifier can be characterized by the ratio between the pre-test and
post-test odds, often called the positive likelihood ratio:
LR+ = O(D+ | T+) / O(D+) = sensitivity / (1 - specificity). This
quantity depends only on sensitivity and specificity, which are
properties of the classifier only, and not on the prevalence of the
study population. Yet, given a target population, post-test odds can
easily be obtained by multiplying LR+ by the pre-test odds, itself
given by the prevalence: O(D+) = prevalence / (1 - prevalence).
The larger the LR+, the more useful the classifier; a classifier with
LR+ = 1 or less brings no additional information on the likelihood
of the disease. An equivalent to LR+ characterizes the negative
class: conditioning on "T-" instead of "T+" gives the negative
likelihood ratio, LR- = (1 - sensitivity) / specificity, and low values
of LR- (below 1) denote more useful predictions. These metrics,
LR+ and LR-, are very useful in a situation common in biomedical
settings where the only data available to learn and evaluate a
classifier come from a study population with nearly balanced
classes, such as a case–control study, while the target application,
the general population, has a different prevalence (e.g., a very low
one), or when the intended use considers variable prevalences.
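The computation of post-test odds from LR+ can be sketched as follows (the sensitivity, specificity, and prevalence values are hypothetical):

```python
# Sketch: likelihood ratios depend only on the classifier, not on the
# prevalence; post-test odds then follow from the target prevalence.
# All numeric values below are hypothetical.

def likelihood_ratios(sensitivity, specificity):
    lr_plus = sensitivity / (1 - specificity)
    lr_minus = (1 - sensitivity) / specificity
    return lr_plus, lr_minus

lr_plus, lr_minus = likelihood_ratios(0.90, 0.95)

# Given a target population, post-test odds = LR+ x pre-test odds:
prevalence = 0.01
pretest_odds = prevalence / (1 - prevalence)
posttest_odds = lr_plus * pretest_odds
posttest_prob = posttest_odds / (1 + posttest_odds)
```

Here LR+ = 18, yet in a 1%-prevalence population the post-test probability of disease after a positive test is only about 15%, illustrating how the same classifier translates differently across operational situations.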
Fig. 3 ROC curve for different classifiers. AUC denotes the area under the curve,
typically used to extract a number summarizing the ROC curve
Fig. 4 Precision–recall curve for different classifiers. AUC denotes the area under
the curve, often called average precision here. Note that the chance level
depends on the class imbalance (or prevalence), here 0.57
Brier skill score = 1 - Brier(ŝ, y) / Brier(ȳ, y),
where the reference Brier(ȳ, y) is the Brier score of a constant
prediction at the class prevalence ȳ.
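A minimal sketch of this computation, assuming the reference prediction is the constant class prevalence (the labels and predicted probabilities below are made up):

```python
# Sketch: Brier skill score, comparing a probabilistic classifier's Brier
# score to that of a constant prediction at the class prevalence.
# Labels and probabilities are hypothetical.

def brier(probs, labels):
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def brier_skill_score(probs, labels):
    prevalence = sum(labels) / len(labels)
    reference = brier([prevalence] * len(labels), labels)
    return 1 - brier(probs, labels) / reference

labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.8, 0.7, 0.1]
bss = brier_skill_score(probs, labels)  # positive: beats the prevalence baseline
```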
2.1.2 Multi-class Classification
When there are multiple classes to distinguish, the main difference
from two-class classification is that the problem can no longer be
separated into a positive class (typically individuals with the medical
condition of interest) and a negative class (individuals without). As
a consequence, sensitivity and specificity no longer have a meaning.
Multilabel Classification
Multilabel settings arise when the multiple classes are not mutually
exclusive: for instance, if an individual can have multiple patholo-
gies. The problem is then to detect the presence or absence of each
label for an individual. In terms of evaluation, multilabel settings
can be understood as several binary classification problems, and
thus the corresponding metrics can be used on each label. As in
the multi-class settings, there are different ways to average the
results over labels (macro, micro) that put more or less emphasis
on the rare labels.
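A sketch of macro versus micro averaging in a multilabel setting, using scikit-learn's `f1_score` on made-up labels:

```python
# Sketch: per-label binary metrics aggregated with macro or micro
# averaging. The labels and predictions below are hypothetical.
import numpy as np
from sklearn.metrics import f1_score

# Rows: individuals; columns: presence/absence of each of 3 pathologies.
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 1]])

macro_f1 = f1_score(y_true, y_pred, average="macro")  # each label weighs the same
micro_f1 = f1_score(y_true, y_pred, average="micro")  # each decision weighs the same
```

Macro averaging gives the rare, poorly predicted second label as much weight as the others, so it comes out lower than the micro average here.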
Fig. 5 Multi-class confusion matrix, for a 3-class problem, C1, C2, C3. Each
entry gives the number of instances predicted of a given class, knowing the
actual class. A perfect prediction would give non-zero entries only on the
diagonal
612 Gael Varoquaux and Olivier Colliot
2.2 Metrics for Regression
In regression settings, the outcome to predict y is continuous, for
instance, an individual's age, cognitive scores, or glucose level.
Corresponding error metrics gauge how far the prediction ŷ is
from the observed y.
R2 Score. The go-to metric here is typically the R2 score, some-
times called explained variance; however, the term R2 score
should be preferred, as some authors define explained variance as
ignoring bias. Mathematically, the R2 score is the fraction of vari-
ance of the outcome y explained by the prediction ŷ, relative to the
variance explained by the mean ȳ on the test set:
R2 = 1 - SS(y - ŷ) / SS(y - ȳ),
where SS is the sum of squares on the test data. A strong benefit of
this metric is that it comes with a natural scale: an R2 of 1 implies
perfect prediction, while an R2 of zero implies a trivial and not very
useful prediction. Note that chance-level predictions (as obtained,
for instance, by learning on permuted y) yield slightly negative
scores: indeed, even when the data do not support a prediction
of y, as in chance settings, it is impossible to estimate the mean of
y perfectly, and predictions will be worse than the actual mean.
In this respect, the R2 score has a different behavior in machine
learning settings compared to inferential statistics settings not
focused on prediction: in-sample (for inferential statistics) versus
out-of-sample settings (for machine learning). Indeed, when the
mean of y is computed on the same data as the model, the R2
score is positive and is the square of the correlation between y and ^y .
This is not the case in predictive settings, and the correlation
between y and ^y should not be used to judge the quality of a
prediction [6], because it discards errors on the mean and the
scale of the prediction, which are important in practice.
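A small sketch of this point: a hypothetical prediction with a systematic bias correlates perfectly with y yet has a strongly negative out-of-sample R2 score:

```python
# Sketch: out of sample, R2 is not the squared correlation. A prediction
# with the right trend but wrong mean correlates perfectly with y while
# its R2 score is strongly negative. Data are hypothetical.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([60.0, 65.0, 70.0, 75.0, 80.0])  # e.g., ages in years
y_pred = y_true + 20.0                              # systematic 20-year bias

corr = np.corrcoef(y_true, y_pred)[0, 1]  # 1.0: correlation ignores the bias
r2 = r2_score(y_true, y_pred)             # negative: the bias is penalized
```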
Absolute Error Measures. Reporting only the R2 score is not
sufficient to characterize a predictive model well. Indeed, the R2
score depends on the variance of the outcome y in the study
population and thus does not enable comparing predictive models
on different samples. For this purpose, it is important to also report
an absolute error measure. The root mean square error (RMSE)
and the mean absolute error (MAE) are two such measures that
give an error in the scale of the outcome: if the outcome y is an age
in years, the error is also in years. The mean absolute error is easier
to interpret. Compared to the root mean square error, the mean
absolute error puts much less weight on rare large deviations. For
instance, consider the following prediction error (on 11
observations):
Note that if the error was uniformly equal to the same value (10, for
instance), both measures would give the same result.
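The contrast can be sketched on a hypothetical error vector of 11 observations, with ten small errors and one large deviation (the error values are illustrative):

```python
# Sketch: MAE vs RMSE on a hypothetical error vector where one of the
# 11 observations has a large deviation.
import math

errors = [1.0] * 10 + [100.0]

mae = sum(abs(e) for e in errors) / len(errors)              # 10.0
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))  # about 30.2

# With a uniform error of 10 on every observation, both coincide:
uniform = [10.0] * 11
mae_u = sum(abs(e) for e in uniform) / len(uniform)
rmse_u = math.sqrt(sum(e ** 2 for e in uniform) / len(uniform))
```

The single large deviation triples the RMSE relative to the MAE, while the uniform-error case gives identical values for both.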
Assessing the Distribution of Errors. The difference between the
mean absolute error and the root mean square error arises from the
fact that both measures account differently for the tails of the
distribution of errors. It is often useful to visualize these errors, to
understand how they are structured. Figure 6 shows such visualiza-
tion: predicted y as a function of observed y. It reveals that for large
values of y, the predictive model has a larger prediction error, but
also that it tends to undershoot: predict a value that underestimates
the observed value. This aspect of the prediction error is not well
captured by the summary metrics because there are comparatively
far fewer observations with large y.
Concluding Remarks on Performance Metrics. Whether it is in
regression or in classification, a single metric is not enough to
capture all aspects of prediction performance that are important
for applications. Heterogeneity of the error, as we have just seen in
our last example, can be present not only as a function of prediction
target, but of any aspect of the problem, for instance, the sex of the
individuals. Problems related to fairness, where some groups (e.g.,
demographic, geographic, socioeconomic groups) suffer more
errors than others, can lead to loss of trust or amplification of
inequalities [7]. For these reasons, it may be important to also
report error metrics on relevant subgroups, following common
medical research practice of stratification.
3 Evaluation Strategies
3.1.1 Cross-validation
The split between train and test sets is arbitrary. With the same
machine learning algorithm, two different data splits will lead to
two different observed performances, both of which are noisy
estimates of the expected generalization performance of prediction
models built with this learning procedure. A common strategy to
obtain better estimates consists in performing multiple splits of the
whole dataset into training and testing sets: a so-called cross-valida-
tion loop. For each split, a model is trained using the training set,
and the performances are computed using the testing set. The
performances over all the testing sets are then aggregated. Figure 7
displays different cross-validation methods. k-fold cross-validation
consists in splitting the data into k sets (called folds) of approxi-
mately equal size. It ensures that each sample in the dataset is used
exactly once for testing. For classification, sklearn.model_selection.
StratifiedKFold performs stratified k-fold cross-validation.
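The procedure can be sketched as follows, on synthetic data (the logistic-regression classifier is only illustrative):

```python
# Sketch of 5-fold stratified cross-validation with scikit-learn.
# Data are synthetic and the classifier choice is illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(balanced_accuracy_score(y[test_idx], model.predict(X[test_idx])))

mean_score = np.mean(scores)  # performance aggregated over the 5 folds
```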
In each split, ideally, one would want to have a large training
set, because it usually allows training better performing models,
and a large testing set, because it allows a more accurate estimation
of the performance. But the dataset size is not infinite. Splitting out
10–20% for the test set is a good trade-off [9], which amounts to
k = 5 or 10 in a k-fold. With small datasets, to maximize the amount
of train data, it may be tempting to leave out only one observation,
in a so-called leave-one-out cross-validation. However, such deple-
tion of the test set gives overall worse estimates of the generaliza-
tion performance. Increasing the number of splits is, however,
useful, and thus another strategy consists in performing a large
number of random splits of the data, breaking from the regularity
of the k-fold. If the number of splits is sufficiently large, all samples
will be approximately used the same number of times for training
and testing. This strategy can be done using sklearn.model_selection.
StratifiedShuffleSplit(n_splits) and is called "repeated
hold-out" or "Monte-Carlo cross-validation."
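Repeated hold-out can be sketched as follows (synthetic data; the 50 splits and 20% test fraction are illustrative choices):

```python
# Sketch of repeated hold-out (Monte-Carlo cross-validation) with
# StratifiedShuffleSplit, on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# 50 random stratified splits, each leaving out 20% of the data for testing:
cv = StratifiedShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)

mean_score, std_score = scores.mean(), scores.std()
```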
Fig. 7 Different validation methods, from top to bottom. The first method, called “hold-out,” involves a single
split of the dataset into training and testing sets. It is thus not a cross-validation method. Stratification is the
procedure that ensures that the output variable (for instance, disease vs. healthy) has approximately the same
distribution in the training and testing sets. k-fold cross-validation consists in splitting the data into k sets
(called folds) of approximately equal size. Repeated hold-out consists in performing a large number of random
splits of the data
3.1.2 The Need for an Additional Validation Set
Often, it is useful to make choices on the model to maximize
prediction performance: changes to the architecture, hyper-
parameter tuning, early stopping, etc. As the test-set performance
is our best estimate of prediction performance, it would be natural
to run cross-validation and pick the best model.
However, in such a situation, the performances reported on the
testing set will have an optimistic bias: a data-dependent choice has
been made on this test set. There are two main solutions to this
issue. The first one is usually applied when the model training is fast
and the dataset is of small size. It is called nested cross-validation. It
consists in running two loops of cross-validation, one nested into
the other. The inner loop serves for hyper-parameter tuning or
model selection, while the outer loop is used to evaluate the per-
formance. The second solution is to separate from the whole data-
set the test set, which will only be used to evaluate the
performances. Then, the remainder of the dataset can be further
split into training data and data used to make modeling choices,
called the validation set.1 Such a procedure is illustrated in Fig. 8.
Commonly, the training and validation sets will be used in a cross-
validation manner. They can then be used to experiment with
different models, tune parameters, etc. It is absolutely crucial that
the test set is isolated at the very beginning, before any experiment
is done. It should be left untouched and used only at the end of the
study to report the performances. As for the split between training
and validation sets, it is desirable that stratification is done when
isolating the test set.
If the dataset is very small, nested cross-validation should be
preferred as it gives better testing power than hold-out: all the data
are used alternatively for model testing. If the dataset feels too small
to split into train, validation, test, it may be too small to conduct a
trustworthy machine learning study [10].
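Nested cross-validation can be sketched with scikit-learn, where the inner loop tunes a hyper-parameter and the outer loop estimates performance (the model and grid below are illustrative):

```python
# Sketch of nested cross-validation: GridSearchCV is the inner loop
# (hyper-parameter tuning), cross_val_score the outer loop (evaluation).
# Data are synthetic; the SVC model and C grid are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: C is chosen on the training part of each outer split only.
tuned_svc = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=inner_cv)

# Outer loop: estimate of the performance of the whole tuning procedure.
nested_scores = cross_val_score(tuned_svc, X, y, cv=outer_cv)
```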
3.1.3 Various Sources of Data Leakage
Data leakage denotes cases where some information from the train-
ing set has "leaked" into the test set. As a consequence, the estima-
tion of the performances is likely to be optimistic. Data leakage can
be introduced in many ways, some of which are particularly insidious
and may not be obvious to a researcher who is not familiar with
a specific application field. Below, we describe some common causes
of data leakage. A summary can be found in Box 3.
1 In Chapter 23, the validation set is called the tuning set, as is the standard practice in regulatory science and
because it insists on the fact that it should not be used to evaluate the final performance, which should be done on
an independent test set. In the present chapter, we use the term validation set as it is the most common in the
academic setting.
Fig. 8 A standard approach consists in splitting the whole dataset into training, validation, and test sets. The
test set must be isolated from the very beginning, left untouched until the end of the study, and only be used to
evaluate the performance. The training and validation sets are often used in a cross-validation manner. They
can be used to experiment with different architectures and tune parameters
A first basic cause of data leakage is to use the whole dataset for
performing various operations on the data. A very common example
is to perform feature selection using the whole dataset and then to use
the selected features for model training. A similar situation is when
dimensionality reduction is performed on the whole dataset. If this is
done in an unsupervised manner (for example, using principal com-
ponent analysis), it is likely to introduce less bias in the performance
estimation because the target is not used. It nevertheless remains, in
principle, a bad practice. A common practice in deep learning is to
perform early stopping, i.e., use the validation set to determine when
to stop training. If this is the case, the validation performances can be
overoptimistic, and a separate test dataset should be used to report
performance. Another cause of data leakage is when there are multi-
ple longitudinal visits (i.e., the patient is evaluated at several time
points) or multiple modalities for a given patient. In such a case, one
should never put data from the same patient in both the training and
validation sets. For instance, one should not, for a given patient, put
the visit at month 0 in the training set and the visit at month 6 in the
validation set. Similarly, one should not use the magnetic resonance
imaging (MRI) data of a given patient for training and the positron
emission tomography (PET) image for validation. A similar situation
arises when dealing with 3D medical images. It is absolutely manda-
tory to avoid putting some of the 2D slices of a given patient in the
training set and the rest of the slices in the validation set. More
generally, in medical applications, the split between training and test
sets should always be done at the patient level. Unfortunately, data
leakage is still prevalent in many machine learning studies on brain
disorders. For instance, a literature review identified that up to 40% of
the studies on convolutional neural networks for automatic classifica-
tion of Alzheimer’s disease from T1-weighted MRI potentially suf-
fered from data leakage [11].
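The first leakage scenario above can be sketched on pure-noise data: selecting features on the whole dataset inflates the cross-validated accuracy, while refitting the selection inside each training fold does not (the data, selector, and classifier are illustrative):

```python
# Sketch: feature selection on the whole dataset leaks test information.
# With pure-noise features, the leaky protocol looks far better than the
# correct protocol, where selection is refit inside each training fold.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))   # pure noise: no real signal
y = rng.integers(0, 2, size=50)

# Wrong: select features using all the data, then cross-validate.
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky_score = cross_val_score(LogisticRegression(), X_selected, y, cv=5).mean()

# Right: feature selection happens inside each training fold.
pipeline = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest_score = cross_val_score(pipeline, X, y, cv=5).mean()
```

On this noise data the honest protocol stays near chance level while the leaky one appears to learn, even though there is nothing to learn.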
3.1.4 Statistical Testing
Train–test splits, cross-validation, and the like seek to estimate the
expected generalization performance of a learning procedure.
Sources of Variance
Keeping test data rigorously independent from algorithm develop-
ment minimizes the bias of this estimation. However, there are
multiple sources of arbitrary variations in these estimates. The most
obvious one is the intrinsic randomness of certain aspects of learning
procedures, such as the random initial weights in deep learning.
Indeed, while fixing the seed of the random number generator may
remove the randomness on given train data, this stability is mislead-
ing given this choice is arbitrary and not representative of the overall
behavior of the machine learning algorithm on the data distribution
of interest [12]. A systematic study of machine learning benchmarks
[13] shows that their most important sources of variance are:
A measure of performance is an imperfect estimate of the actual
expected performance. Subheading 3.2, below, gives the resulting
confidence intervals for a fixed test set. Using multiple splits, and
thus multiple test sets, improves the estimation [13], though it
makes computing confidence intervals hard [14].
Conclusions Must Account for Benchmarking Variance
With all these sources of arbitrary variance, the question is: given
benchmarks of a learning procedure's performance, or improvement,
is it likely to generalize reliably to new data, or rather to be due to
benchmarking fluctuations? Considering, for instance, the perfor-
mance metrics in Table 1, it seems a safe bet to say that the
convolutional neural network outperforms the two others, but
what about the difference between the two other models? From
an application perspective, the question is whether this observed
difference is likely to generalize to new data.
To answer this question, we must account for estimation error
for the expected generalization performance from the different
sources of uncontrolled variance in the benchmarks, as listed
above. The first source of error comes from the limited sample
size to test the predictions of the different learning procedures.
Indeed, suppose that the testing set was composed of 100 samples.
In that case, if only 3 more samples had been misclassified by the
Table 1
Accuracies obtained by different ML models on a binary classification
task. Which model performs best? While it is quite likely that the
convolutional neural network outperforms the two other models, it is less
clear for the other two. It seems that the support vector machine
results in a slightly higher accuracy, but is it due to random fluctuations in
the benchmarks? Will the difference carry over to new data?
Model Accuracy
Logistic regression 0.72
Support vector machine 0.75
Convolutional neural network 0.95
support vector machine, the two models would have had the same
performance. A difference of 3 out of 100 could be easily due to
having drawn 3 samples not representative of the population. Other
sources of variance are due to how stable the learning pipeline is:
sensitivity to hyper-parameters, random initialization, etc.
A Simple Statistical Testing Procedure
Training and testing a prediction pipeline multiple times is needed
to estimate the variability of the performance measure. The simplest
solution is to do this several times while varying the arbitrary
factors, such as the split between train and test or the random
initialization (see Box 4). The resulting set of performance measures
is similar to bootstrap samples and can be used to draw conclusions
on the distribution of performances in a test set. Confidence inter-
vals can be computed using percentiles of this distribution. Two
learning procedures can be compared by counting the number of
times that one outperforms the other: outperforming 75% of the
times is typically considered as a reliable improvement [13]. If the
available computing power enables training learning procedures
only a few times, empirical standard deviations should be used, as
they require less runs to estimate. The improvements brought by a
learning procedure can then be compared to these standard
deviations.
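The comparison-by-counting procedure can be sketched as follows (the dataset and the two learning procedures are illustrative):

```python
# Sketch: compare two learning procedures over many random splits and
# count how often one outperforms the other. Data and models are
# illustrative; the 75% threshold follows the heuristic cited in the text.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = make_classification(n_samples=300, n_informative=10, random_state=0)
cv = StratifiedShuffleSplit(n_splits=30, test_size=0.2, random_state=0)

scores_a = cross_val_score(LogisticRegression(), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

# Fraction of splits where procedure B beats procedure A;
# above ~75% is typically read as a reliable improvement.
win_rate = np.mean(scores_b > scores_a)
```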
Note these procedures do not perform classic null-hypothesis
significance testing, which is difficult here. In particular, the stan-
dard error across the various runs should not be used instead of the
standard deviation: the standard error is the standard deviation
divided by the square root of the number of runs. The number of
runs can be made
arbitrarily large given enough compute power, thus making the
standard error arbitrarily small. But in no way does the uncertainty
due to the limited test data vanish. This uncertainty can be quanti-
fied for a fixed test set—see Subheading 3.2, but in repeated splits or
cross-validation, it is difficult to derive confidence intervals because
the runs are not independent [14, 15]. In particular, it is invalid to
use a standard hypothesis test—such as a T-test—across the different
folds of a cross-validation. There are some valid options to perform
hypothesis testing in a cross-validation setting [14, 16], but they
must be implemented with care.
Another reason not to rely on null-hypothesis testing is that
statistical significance only asserts that the expected performance,
or improvement, is non-zero over a test population of infinite size.
From a practical perspective, we care about meaningful
improvements on test sets of finite size, which is related to
the notion of acceptance tests, as opposed to significance tests, in the
Neyman–Pearson framework of statistical testing [17]. Unlike null-
hypothesis significance testing, this requires choosing a non-zero
difference considered acceptable, for instance as implicitly set
by considering that a new learning procedure should improve upon
an existing one 75% of the time, far from chance, which lies at 50%.
Fig. 9 In order to assess the generalization ability of a model under different conditions (such as data coming
from different hospitals/countries, acquired with different devices and protocols. . .), a common practice is to
use one or several additional datasets that come from other studies than the one used for training
Table 2
Binomial confidence intervals on accuracy (95% CI) for different values of
ground-truth accuracy
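The kind of interval reported in Table 2 can be sketched with the normal (Wald) approximation; the accuracy values and test-set sizes below are illustrative:

```python
# Sketch: 95% binomial confidence interval for an observed accuracy,
# using the Wald (normal) approximation. Values are illustrative.

def wald_ci(accuracy, n_test, z=1.96):
    """95% Wald confidence interval for a proportion estimated on n_test samples."""
    half_width = z * (accuracy * (1 - accuracy) / n_test) ** 0.5
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

ci_small = wald_ci(0.80, n_test=100)    # wide: roughly 0.72 to 0.88
ci_large = wald_ci(0.80, n_test=10000)  # narrow: roughly 0.79 to 0.81
```

With only 100 test samples, an observed accuracy of 0.80 is compatible with a true accuracy anywhere between roughly 0.72 and 0.88, which is why small test sets cannot distinguish close contenders.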
4 Conclusion
Acknowledgements
Appendix
Odds Ratio and Diagnostic Tests Evaluation
Odds and odds ratio are frequently used in biostatistics and epide-
miology, but less in machine learning. Here we give a quick intro-
duction to these topics.
Odds Ratio and Invariance to Class Sampling
The odds ratio measures the association between two events, a and
b, which we can arbitrarily call respectively outcome and property.
The odds ratio is defined as the ratio of the odds of the outcome in
the group where the property holds to that in the group where the
property does not hold:

OR(a, b) = O(a | b = +) / O(a | b = -).   (2)
To compute the odds ratio, the problem is fully specified by the
counts in the following contingency table, where the first index
denotes a and the second b:

            b = +    b = -
   a = +    n++      n+-
   a = -    n-+      n--          (3)
This property is one reason why odds and odds ratio are so
central to biostatistics and epidemiology: sampling or recruitment
bias is an important concern in these fields. For instance, a case–
control study has a very different prevalence from the target popula-
tion, where the frequency of the disease is typically very low.
2 Indeed, thankfully, many diseases have a prevalence much lower than 50%, e.g., 1%, which is already considered a
frequent disease. Therefore, in order to have a sufficient number of diseased individuals in the sample without
dramatically increasing the cost of the study, diseased participants will be oversampled. One extreme example, but
very common in medical research, is a case–control study where the number of diseased and healthy individuals is
equal.
Likelihood Ratio of Diagnostic Tests or Classifiers
The likelihood ratio used to characterize diagnostic tests or classi-
fiers is strongly related to the odds ratio introduced above, though
it is not strictly speaking an odds ratio. It is defined as

LR+ = P(T+ | D+) / P(T+ | D-).   (5)
Using the expressions in Box 1 and the fact that
P(T+ | D+) = 1 - P(T- | D+), the LR+ can be written as

LR+ = Sensitivity / (1 - Specificity).   (6)
LR+ = n++ (n-+ + n--) / [n-+ (n++ + n+-)].   (8)
LR+ = [P(D+ | T+) / P(D- | T+)] · [P(D-) / P(D+)] = O(D+ | T+) / O(D+).   (9)

Indeed, O(D+ | T+) = P(D+ | T+) / (1 - P(D+ | T+)) = P(D+ | T+) / P(D- | T+),
and O(D+) = P(D+) / (1 - P(D+)) = P(D+) / P(D-).
O(D+) is called the pre-test odds (the odds of having the
disease in the absence of test information). O(D + |T+) is called
the post-test odds (the odds of having the disease once the test
result is known).
Equation 9 shows how the LR+ relates pre- and post-test odds,
an important aspect of its practical interpretation.
Invariance to Prevalence
If the prevalence of the population changes, the counts are changed
as follows: n++ → f n++, n+- → f n+-, n-+ → (1 - f) n-+,
n-- → (1 - f) n--, affecting LR+ as follows:

LR+ = f n++ [(1 - f) n-+ + (1 - f) n--] / {(1 - f) n-+ [f n++ + f n+-]}.   (10)
The factors f and (1 - f) cancel out, and thus the expression of LR+
is unchanged by a change of the pre-test frequency of the label
(prevalence of the test population). In this, LR+ behaves like an
odds ratio, though the likelihood ratio is not an odds ratio (and
does not share all its properties; for instance, it is not symmetric).
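This invariance can be checked numerically; the 2×2 counts below are hypothetical:

```python
# Sketch: LR+ computed from 2x2 counts is invariant when the diseased
# rows are resampled by a factor f. Counts are hypothetical.

def lr_plus(n_pp, n_pm, n_mp, n_mm):
    """LR+ from counts n_{D,T}: first index disease, second test result."""
    sensitivity = n_pp / (n_pp + n_pm)
    one_minus_specificity = n_mp / (n_mp + n_mm)
    return sensitivity / one_minus_specificity

# Case-control-like sample with balanced classes:
base = lr_plus(90, 10, 5, 95)

# Rescale toward a 1% prevalence: f on D+ rows, (1 - f) on D- rows.
f = 0.01
rescaled = lr_plus(90 * f, 10 * f, 5 * (1 - f), 95 * (1 - f))
```

The factors f and (1 − f) cancel, so `base` and `rescaled` are identical.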
References
1. Pedregosa F, et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12(85):2825–2830
2. Powers D (2011) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J Mach Learn Technol 2(1):37–63
3. Naeini MP, Cooper G, Hauskrecht M (2015) Obtaining well calibrated probabilities using Bayesian binning. In: Twenty-Ninth AAAI Conference on Artificial Intelligence
4. Vickers AJ, Van Calster B, Steyerberg EW (2016) Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 352:i6
5. Perez-Lebel A, Morvan ML, Varoquaux G (2023) Beyond calibration: estimating the grouping loss of modern neural networks. In: ICLR 2023 Conference
6. Poldrack RA, Huckins G, Varoquaux G (2020) Establishment of best practices for evidence for prediction: a review. JAMA Psychiatry 77(5):534–540
7. Barocas S, Hardt M, Narayanan A (2019) Fairness and machine learning. http://www.fairmlbook.org
8. Raschka S (2018) Model evaluation, model selection, and algorithm selection in machine learning. Preprint arXiv:181112808
9. Varoquaux G, Raamana PR, Engemann DA, Hoyos-Idrobo A, Schwartz Y, Thirion B (2017) Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage 145:166–179
10. Varoquaux G (2018) Cross-validation failure: small sample sizes lead to large error bars. NeuroImage 180:68–77
11. Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-González J, Routier A, Bottani S, Dormont D, Durrleman S, Burgos N, Colliot O, et al (2020) Convolutional neural networks for classification of Alzheimer's disease: overview and reproducible evaluation. Med Image Anal 63:101694
12. Bouthillier X, Laurent C, Vincent P (2019) Unreproducible research is reproducible. In: International Conference on Machine Learning, PMLR, pp 725–734
13. Bouthillier X, Delaunay P, Bronzi M, Trofimov A, Nichyporuk B, Szeto J, Mohammadi Sepahvand N, Raff E, Madan K, Voleti V, et al (2021) Accounting for variance in machine learning benchmarks. Proc Mach Learn Syst 3:747–769
14. Bates S, Hastie T, Tibshirani R (2021) Cross-validation: what does it estimate and how well does it do it? Preprint arXiv:210400673
15. Bengio Y, Grandvalet Y (2004) No unbiased estimator of the variance of k-fold cross-validation. J Mach Learn Res 5(Sep):1089–1105
16. Nadeau C, Bengio Y (2003) Inference for the generalization error. Mach Learn 52(3):239–281
17. Perezgonzalez JD (2015) Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Front Psychol 6:223
18. Moons KG, Altman DG, Reitsma JB, Ioannidis JP, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Int Med 162(1):W1–W73
19. Dockès J, Varoquaux G, Poline JB (2021) Preventing dataset shift from breaking machine-learning biomarkers. GigaScience 10(9):giab055
20. Shapiro DE (1999) The interpretation of diagnostic tests. Statist Methods Med Res 8(2):113–134
21. Leisenring W, Pepe MS, Longton G (1997) A marginal regression modelling framework for evaluating medical diagnostic tests. Statist Med 16(11):1263–1281
22. Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923
23. DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44:837–845
24. Bandos AI, Rockette HE, Gur D (2005) A permutation test sensitive to differences in areas for comparing ROC curves from a paired design. Statist Med 24(18):2873–2893
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 21
Abstract
Reproducibility is a cornerstone of science, as the replication of findings is the process through which they
become knowledge. It is widely considered that many fields of science are undergoing a reproducibility
crisis. This has led to the publications of various guidelines in order to improve research reproducibility.
This didactic chapter is intended as an introduction to reproducibility for researchers in the field of
machine learning for medical imaging. We first distinguish between different types of reproducibility. For
each of them, we aim to define it, describe the requirements to achieve it, and discuss its utility.
The chapter ends with a discussion on the benefits of reproducibility and with a plea for a nondogmatic
approach to this concept and its implementation in research practice.
Key words Reproducibility, Replicability, Reliability, Repeatability, Open science, Machine learning,
Artificial intelligence, Deep learning, Medical imaging
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_21,
© The Author(s) 2023
632 Olivier Colliot et al.
Box 1: Glossary
Here the reader will find definitions of the terms used in the
present document.
• Reproducibility, replicability, repeatability. In the present
document, these will be used as synonyms of reproducibility.
• Original study. Study that first showed a finding.
• Replication study. Study that subsequently aimed at replicat-
ing an original study, with the hope of supporting its findings.
• Research artifact. Any output of scientific research: papers,
code, data, protocols. . . . Not to be confused with imaging
artifacts which are defects of imaging data.
• Claims. The conclusions of a study. Basically a set of state-
ments describing the results and a set of limitations which
delineate the boundaries within which the claims are stated
(the term “claim” is here used in the broad scientific sense not
with the specific meaning it has in the context of regulation of
medical devices although the two may be related).
• Limitations. A set of restrictions under which the claims may
not hold (usually because the corresponding settings have not
been explored).
(continued)
Reproducibility 633
Box 1 (continued)
• Method. The ML approach described in the paper, indepen-
dently of its implementation.
• Code. The implementation of the method.
• Software dependencies. Other software packages that the
main code relies on and which are necessary for its execution.
• Public data. Data that can be accessed by anybody with no or
little restriction (for instance, the data hosted at https://
openneuro.org).
• Semi-public data. Data which requires approval of a research
project (for instance, the Alzheimer’s Disease Neuroimaging
Initiative [ADNI] http://www.adni-info.org). The research-
ers can then use the data only for the intended research
purpose and cannot redistribute it.
• Data split. Separation into training, validation, and test sets.
• Data leakage. Faulty procedure which has led information
from the training set to leak into the test set. See refs. 4, 5 for
details.
• Error margins. A general term for providing the precision of
the performance estimates (e.g., standard error or confidence
intervals).
• Researcher degrees of freedom. Number of different com-
ponents (e.g., different architectures, hyperparameter values,
subsamples...) which have been tried before arriving at the
final method [6]. Too many degrees of freedom tend to
produce methods that do not generalize.
• p-hacking. A bad practice that involves too many degrees of
freedom and which consists in trying many different statistical
procedures until a significant p-value is found.
• Acquisition settings. Factors that influence the scan of a
given patient (imaging device, acquisition parameters,
image quality).
• Image artifacts. Defects of a medical image; these may
include noise, field heterogeneity, motion artifacts, and
others.
• Preregistration. The deposit of the study protocol prior to
performing the study. It limits degrees of freedom and
increases the likelihood of robust findings.
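The data split and data leakage entries above are closely linked in medical imaging: when a subject has several scans, splitting at the scan level can send scans of the same subject to both sides of the split, contaminating the test set. A minimal sketch of a subject-level split, using only the standard library and hypothetical subject IDs:

```python
import random

def subject_level_split(scans, test_fraction=0.2, seed=0):
    """Split (subject_id, scan_path) pairs so that all scans of a
    given subject end up on the same side of the split."""
    subjects = sorted({subject for subject, _ in scans})
    rng = random.Random(seed)  # fixed seed -> reproducible split
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [s for s in scans if s[0] not in test_subjects]
    test = [s for s in scans if s[0] in test_subjects]
    return train, test

# Hypothetical example: "sub-01" has two scans; both must land in the
# same set, otherwise information leaks into the test set.
scans = [("sub-01", "scan_a"), ("sub-01", "scan_b"),
         ("sub-02", "scan_c"), ("sub-03", "scan_d"),
         ("sub-04", "scan_e")]
train, test = subject_level_split(scans)
assert {s for s, _ in train}.isdisjoint({s for s, _ in test})
```

Storing the resulting lists alongside the experiment, rather than regenerating them on the fly, is what makes the split itself reproducible.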
often used for the ability to exactly reproduce the results (i.e., the
exact numbers) in a given study. In sciences which aim at providing
measurements (as is often the case in medical imaging), the word
may be used to describe the variability of a given measurement tool
under different acquisition settings. We shall provide more details
on these different meanings in Subheading 2. Finally, the topics of
reproducibility and open science are obviously related since the
latter favors the former. However, open science encompasses a
broader objective, which is to make all research artifacts (code,
data, papers...) openly available for the benefit of the whole society.
Conversely, open research may still be unreproducible (e.g.,
because it has relied on faulty statistical procedures).
There has been increasing concern that science is undergoing a
reproducibility crisis [7–10]. This is present in various fields from
psychology [11] to preclinical oncology research [12]. ML [13–
17], digital medicine [18], ML for healthcare [19, 20], and ML for
medical imaging [21] are no exception. The concerns are multifac-
eted. In particular, they include two substantially different aspects:
the report of failures to reproduce previous studies and the obser-
vation that many papers do not provide sufficient information for
reproducing their results. It is important to keep in mind that, while
the two may be related, there is no direct relationship
between them: it may very well be that a paper seems to include
all the necessary information for reproduction and that reproduc-
tion attempts fail (for instance, because the original study had too
many degrees of freedom and led to a method that only works on a
single dataset, see Subheading 4).
Various guidelines have been proposed to improve research
reproducibility. Such guidelines may be general [10] or devoted
to specific fields including brain imaging [22–25] and ML for
healthcare and life sciences [26, 27]. Moreover, many other papers,
even though not strictly providing guidelines, provide very valuable
pieces of advice for making research more trustworthy and in
particular more reproducible (e.g., [14, 19, 28–32]).
This chapter is an introduction to the topic of reproducibility
for researchers in the field of ML for medical imaging. It is not
meant as a replacement for the aforementioned guidelines, which we
strongly encourage the reader to refer to.
The remainder of the chapter is organized as follows. We start
by introducing different types of reproducibility (Subheading 2).
For each of them, we attempt to clearly define it, describe the
requirements to achieve it, and present the benefits it can
provide (Subheadings 3, 4, 5, and 6). All this information is given
with the field of ML for medical imaging as its target, even
though part of it may apply to other fields. Finally, we conclude
with a discussion which describes the benefits of reproducibility
but also advocates for a nondogmatic point of view on the topic
(Subheading 7).
Fig. 1 Different types of reproducibility. Note that, in the case of “statistical” and “conceptual” reproducibility,
the terms come from [19], but the exact definition provided in each corresponding section may differ

3 Exact Reproducibility
Access to data is obviously necessary [19, 22, 27]. Open data has
been described (together with code and papers) as one of the pillars
of open science [22]. It is widely accepted that scientific data should
adhere to the FAIR (Findable, Accessible, Interoperable, Reusable)
principles (please refer to https://www.go-fair.org/fair-principles
1. See glossary Box 1.
2. https://zenodo.org/.
3. https://bids.neuroimaging.io/.
4. See BIDS extension proposal (BEP) number 25 (BEP025): https://bids.neuroimaging.io/get_involved.html#extending-the-bids-specification
Theoretically, it does not mean that the code must come with an
open software license. However, doing so has many additional
benefits such as allowing other researchers to use the code or
parts of it for different purposes. The code should be written
according to good coding practices, which include the use of a
versioning system and adequate documentation [20]. Furthermore,
although not strictly needed for reproducibility, the use of
continuous integration makes the code more robust and eases its
long-term maintenance. Besides, it is good to make the installation
of dependencies as easy as possible [27]. This can be done with pip5
when programming in Python. One can also use containers such
as Docker.6 One can find useful additional advice in the Tips for
Publishing Research Code.7 Note that we are not saying that all
these components should be present in any study or are prerequi-
sites for good research. They constitute an ideal
reproducibility goal.
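For instance, a snapshot of the installed dependency versions can be recorded programmatically at the end of an experiment. The sketch below uses only the standard library (`importlib.metadata`); the package list passed in is illustrative:

```python
import sys
from importlib import metadata

def freeze_environment(packages):
    """Record the Python version and the installed version of each
    dependency, in a pip-style 'name==version' format."""
    lines = [f"# python {sys.version.split()[0]}"]
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed")
    return "\n".join(lines)

# Illustrative package list; in practice, write the result to a file
# stored next to the experiment's results so it is archived with them.
print(freeze_environment(["pip", "numpy"]))
```

The pip-style output can later be fed back to pip to recreate a close approximation of the original environment, or to a container build for a stricter one.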
Sharing well-curated notebooks is also a way to ease reproduc-
ibility of results by other researchers. This can be done through
standard means, but dedicated servers also exist. One can, for
instance, cite an interesting initiative called NeuroLibre8 which
provides a preprint server for reproducible data analysis in neuro-
science, in particular providing curated and reviewed Jupyter
notebooks [43].
In ML, sharing the code itself is not enough for exact repro-
ducibility. First, every element of the training procedure should be
stored: this includes the data splits and the criteria for model
selection. Moreover, there are usually non-deterministic components,
so it is necessary to store random seeds [27]. Furthermore,
software/operating system versions, the GPU model/version, and
threading have been deemed necessary to obtain exact reproduc-
ibility [44]. The ClinicaDL software platform provides a framework
for easing exact reproducibility of deep learning for neuroimag-
ing [5].9 Although it is targeted at brain imaging, many of its
components and concepts are applicable to medical imaging in
general. Also in the field of brain imaging, NiLearn10 facilitates
the reproducibility of statistics and ML. One can also cite Pymia11
which provides data handling and validation functionalities for deep
learning in medical imaging [45].
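As an illustration of seed storage, seeding can be wrapped in a small helper. This is only a sketch, not the API of any of the tools above; the numpy/torch parts are guarded since those libraries may be absent, and full determinism on GPU requires the further settings discussed in [44]:

```python
import os
import random

def set_global_seeds(seed: int) -> None:
    """Fix the seeds of the usual random number generators.
    The numpy/torch parts apply only if those libraries are installed."""
    random.seed(seed)
    # Only affects Python processes launched afterwards, not this one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        # GPU determinism additionally needs cuDNN settings, see [44].
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

# Store the seed with the results: rerunning with the same seed
# regenerates the same pseudo-random draws.
set_global_seeds(42)
first = [random.random() for _ in range(3)]
set_global_seeds(42)
assert [random.random() for _ in range(3)] == first
```

The seed value itself should be logged as part of the experiment configuration, otherwise reseeding is of no help to a later replication attempt.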
5. https://pypi.org/project/pip/.
6. https://www.docker.com/.
7. https://github.com/paperswithcode/releasing-research-code.
8. https://neurolibre.org/.
9. https://clinicadl.readthedocs.io/en/latest/.
10. https://nilearn.github.io/stable/index.html.
11. https://github.com/rundherum/pymia.
Finally, it may seem obvious, but, even when the code is shared,
the underlying theory of the method, all its components, and
implementation details need to be present in the paper [14].
It is in principle possible to retrain models identically if the
above elements are provided. It nevertheless remains a good prac-
tice to share trained models, in order to allow other researchers to
check that retraining indeed led to the same results but also to save
computational resources. However, models can be attacked to
recover training data [27, 46]. This is not a problem when the
training data is public. When it is privacy-sensitive, methods to
preserve privacy exist [27, 47].
In medical imaging, preprocessing and feature extraction are
often critical steps that will subsequently influence the ML results.
It is thus necessary to also provide code for such parts. Several
software initiatives including BIDSApps [48]12 and Clinica [49]13
provide ready-to-use tools for preprocessing and feature extraction
for various brain imaging modalities. Applicable to many medical
imaging modalities, the ITK [50, 51]14 framework provides a wide
range of processing tools. It can ease the work of researchers who
do not want to spend time on preprocessing and feature extraction
pipelines and who prefer to focus on the ML part of their work.
12. https://bids-apps.neuroimaging.io/.
13. https://www.clinica.run/.
14. https://itk.org/.
4 Statistical Reproducibility
15. We use the term from [19], although with a slightly different (more extensive) meaning.
16. https://osf.io.
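Error margins (see Box 1) are one concrete ingredient of statistical reproducibility. A minimal sketch of a non-parametric bootstrap confidence interval for an accuracy estimate, with synthetic 0/1 outcomes and only the standard library:

```python
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for an accuracy estimate.
    `correct` holds one 0/1 value per test subject."""
    rng = random.Random(seed)
    n = len(correct)
    accs = []
    for _ in range(n_boot):
        # Resample the test subjects with replacement.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        accs.append(sum(resample) / n)
    accs.sort()
    lo = accs[int((alpha / 2) * n_boot)]
    hi = accs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic outcomes: 80 correct predictions out of 100.
correct = [1] * 80 + [0] * 20
lo, hi = bootstrap_ci(correct)
assert lo <= 0.8 <= hi  # the point estimate lies inside the interval
```

Reporting such an interval alongside the point estimate lets readers judge whether a replication that obtains a slightly different number is actually inconsistent with the original study.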
5 Conceptual Reproducibility
17. Again, we use the term from [19], although with a slightly different meaning.
6 Measurement Reproducibility
18. Note that the word is used to evaluate the reproducibility of automatic methods across different scans of the same subject, but also when a human rater is involved (manual or semi-automated measurements), including intra-rater (measurement performed twice by the same rater) and inter-rater (two different raters) reproducibility from a single scan.
19. It is of interest to compare which of them is the most accurate or robust with respect to a ground truth. However, as mentioned above, we do not believe this falls within the topic of reproducibility.
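To make the notion concrete at the individual level, here is a sketch of test-retest agreement for a volumetric measurement, quantified with the within-subject coefficient of variation; the volumes below are synthetic:

```python
import math

def within_subject_cv(pairs):
    """Test-retest within-subject coefficient of variation (CV).
    `pairs` holds (scan, rescan) measurements for each subject;
    a lower CV means a more reproducible measurement tool."""
    ratios = []
    for a, b in pairs:
        mean = (a + b) / 2
        # One-degree-of-freedom variance of the pair: (a-b)^2 / 2.
        var = (a - mean) ** 2 + (b - mean) ** 2
        ratios.append(var / mean ** 2)
    return math.sqrt(sum(ratios) / len(pairs))

# Synthetic hippocampal volumes in mm^3, measured twice per subject.
pairs = [(3100.0, 3150.0), (2900.0, 2880.0), (3300.0, 3290.0)]
cv = within_subject_cv(pairs)
assert 0 < cv < 0.05  # below 1% test-retest variability here
```

Intra-rater and inter-rater reproducibility (footnote 18) can be quantified with the same machinery by feeding it pairs of ratings instead of pairs of scans.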
7 Discussion
7.1 About the Different Types of Reproducibility
We have presented different types of reproducibility. Our taxonomy
is neither original nor meant to be universal. The boundaries between
types are partly fuzzy. For instance, to which degree replication
with a different but similar dataset should be considered statistical
or conceptual reproducibility? We do not believe such questions to
be of great importance. Rather, it is fruitful, following Gundersen
and Kjensmo [14] and Peng [76], to consider reproducibility as a
spectrum. In particular, one can consider that the first three types
provide increasing support for a finding: conceptual provides more
support than statistical, which in turn provides more support than
exact. The number of components necessary to achieve them is in
the reverse order: exact requires more than statistical, which requires
more than conceptual. Does it mean that only conceptual repro-
ducibility matters? Absolutely not. As we mentioned, other types of
reproducibility are necessary to dissect why a given replication has
failed as well as to better specify the bounds within which a scientific
claim is valid. Last but not least, exact reproducibility also helps
build trust in science.
We must admit that measurement reproducibility does not fit
very well in this landscape. Moreover, one could also argue that it is
a type of conceptual reproducibility, which is partly true as it aims at
studying the reproducibility when varying the input data. We nev-
ertheless believe it deserves a special treatment, for several reasons.
First, here reproducibility is studied at the individual (i.e., patient)
level and not at the population level. Also, the emphasis is on the
measurement rather than the finding. Even if it has its role in the
building of scientific knowledge, it has specific practical implica-
tions for the user. Moreover, as mentioned above, this is actually
the most widely used meaning of reproducibility in medical imag-
ing, and it seemed important that the reader be acquainted with it.
7.2 The Many Benefits of Reproducibility
“Der Weg ist das Ziel” is a German saying which can be roughly
translated as “the path is the goal.” Indeed, reproducibility allows
researchers to discover many new places down the road before
reaching the final destination. Even if this destination is never
reached, the benefits of the travel are of major importance. Let us
try to list some of them.
There are many individual benefits for researchers and labs. An
important one is that aiming at reproducible research results in
reusable research artifacts. How agreeable it is for a researcher to
easily reuse an old code for a new project! How useful it is for a
research lab to have data organized according to community stan-
dards, making it easier to reuse and share! Moreover, papers that
come with shared data [22, 77, 78] or code [79] attract, on
average, more citations. Thus, aiming at reproducibility is also in
the researchers’ self-interest.
There are also considerable benefits for the scientific commu-
nity as a whole. As mentioned before, reproducible research is often
associated with open code, open data, and available trained models.
This allows researchers not only to use them to perform replication
studies but also to use these research artifacts for completely differ-
ent purposes such as building new methods or conducting analysis
on pooled datasets. In the specific case of ML for medical imaging,
it also allows independently assessing the influence of preprocessing,
feature extraction, and the ML method. This is particularly important
when claims of superiority of a new ML method are made but
the original paper uses overly specific preprocessing steps.
Of course, at the end of the path, the goal itself brings many
benefits. These have already been largely described in the previous
sections, so we will just mention them briefly. Conceptual replication
studies are necessary for corroborating findings and thus building
new scientific knowledge. Statistical replication allows ensuring that
results are not due to cherry picking. Exact replication allows
detecting errors and increases trust in science in general.
and found that about 20% of papers came with a repository that was
deemed reproducible.
Various papers have been published providing advice and
guidelines [10, 17, 22–27]. Some of the guidelines include repro-
ducibility checklists. Some checklists are associated with a specific
journal or conference and are provided to the reviewers so that
they can take these aspects into consideration when evaluating papers.
One can cite, for example, the MICCAI (Medical Image Comput-
ing and Computer-Assisted Intervention conference) reproducibil-
ity checklist.20,21
Finally, it is very important that reproducibility studies, asses-
sing all aspects of reproducibility (exact, statistical, conceptual,
measurement), are performed, published, and widely read. Unfor-
tunately, it is still easier to publish in a high-impact journal a study
that is not reproducible but describes exciting results than a repli-
cation study. The good news is that this is starting to change.
Reproducibility challenges have been proposed in various fields
including machine learning22 and medical image computing
[80]. In the field of neuroimaging, the journal NeuroImage:
Reports publishes Open Data Replication Reports,23 the Organization
for Human Brain Mapping has a replication award,24 and the
MRITogether workshop25 emphasizes reproducibility.
7.4 One Size Does Not Fit All
We hope the reader is now convinced of the benefits of aiming
towards reproducible research. Does it mean that reproducibility
requirements should be the same for all studies? We strongly believe
the opposite. To take an extreme example, requiring all studies to
be exactly reproducible with minimal effort (such as by running a
single command) would be an awful idea. We believe, on the
contrary, that reproducibility efforts should vary according to
many factors, including the type of study and the context in which
it is performed. One would probably not have the same level of
requirement for a methodological paper as for an extensive medical
application with strong claims about clinical applicability. For
the former, one may be satisfied with an experiment on a single or a
few datasets. For the latter, one would expect the study to include
20. https://miccai2021.org/files/downloads/MICCAI2021-Reproducibility-Checklist.pdf.
21. https://github.com/JunMa11/MICCAI-Reproducibility-Checklist.
22. https://paperswithcode.com/rc2022.
23. https://www.journals.elsevier.com/neuroimage-reports/infographics/neuroimage-reports-presents-open-data-replication-reports
24. https://www.humanbrainmapping.org/i4a/pages/index.cfm?pageid=3731.
25. https://mritogether.esmrmb.org/.
Acknowledgements
References
1. Seab J, Jagust W, Wong S, Roos M, Reed BR, Budinger T (1988) Quantitative NMR measurements of hippocampal atrophy in Alzheimer’s disease. Magn Reson Med 8(2):200–208
2. Lehericy S, Baulac M, Chiras J, Pierot L, Martin N, Pillon B, Deweer B, Dubois B, Marsault C (1994) Amygdalohippocampal MR volume measurements in the early stages of Alzheimer disease. Am J Neuroradiol 15(5):929–937
3. Jack CR, Petersen RC, Xu YC, Waring SC, O’Brien PC, Tangalos EG, Smith GE, Ivnik RJ, Kokmen E (1997) Medial temporal atrophy on MRI in normal aging and very mild Alzheimer’s disease. Neurology 49(3):786–794
4. Varoquaux G, Colliot O (2022) Evaluating machine learning models and their diagnostic value. HAL preprint hal-03682454. https://hal.archives-ouvertes.fr/hal-03682454/
5. Thibeau-Sutre E, Diaz M, Hassanaly R, Routier A, Dormont D, Colliot O, Burgos N (2022) ClinicaDL: an open-source deep learning software for reproducible neuroimaging processing. Comput Methods Prog Biomed 220:106818
6. Simmons JP, Nelson LD, Simonsohn U (2011) False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci 22:1359–1366
7. Baker M (2016) 1,500 scientists lift the lid on reproducibility. Nature 533:452–454
8. Gundersen OE (2020) The reproducibility crisis is real. AI Mag 41(3):103–106
9. Ioannidis JP (2005) Why most published research findings are false. PLoS Med 2(8):e124
10. Begley CG, Ioannidis JP (2015) Reproducibility in science: improving the standard for basic and preclinical research. Circ Res 116(1):116–126
11. Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716
12. Begley CG (2013) An unappreciated challenge to oncology drug discovery: pitfalls in preclinical research. Am Soc Clin Oncol Educ Book 33(1):466–468
13. Sonnenburg S, Braun ML, Ong CS, Bengio S, Bottou L, Holmes G, LeCun Y, Müller KR, Pereira F, Rasmussen CE et al (2007) The need for open source software in machine learning. J Mach Learn Res 8(81):2443–2466
14. Gundersen OE, Kjensmo S (2018) State of the art: reproducibility in artificial intelligence. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 32
15. Hutson M (2018) Artificial intelligence faces reproducibility crisis. Science 359(6377):725–726
16. Haibe-Kains B, Adam GA, Hosny A, Khodakarami F, Waldron L, Wang B, McIntosh C, Goldenberg A, Kundaje A, Greene CS, et al (2020) Transparency and reproducibility in artificial intelligence. Nature 586(7829):E14–E16
17. Pineau J, Vincent-Lamarre P, Sinha K, Larivière V, Beygelzimer A, d’Alché-Buc F, Fox E, Larochelle H (2021) Improving reproducibility in machine learning research: a report from the NeurIPS 2019 reproducibility program. J Mach Learn Res 22:1–20
18. Stupple A, Singerman D, Celi LA (2019) The reproducibility crisis in the age of digital medicine. NPJ Digit Med 2(1):1–3
19. McDermott M, Wang S, Marinsek N, Ranganath R, Ghassemi M, Foschini L (2019) Reproducibility in machine learning for health. arXiv preprint arXiv:1907.01463
20. Beam AL, Manrai AK, Ghassemi M (2020) Challenges to the reproducibility of machine learning models in health care. JAMA 323(4):305–306
21. Simko A, Garpebring A, Jonsson J, Nyholm T, Löfstedt T (2022) Reproducibility of the methods in medical imaging with deep learning. arXiv preprint arXiv:2210.11146
22. Gorgolewski KJ, Poldrack RA (2016) A practical guide for improving transparency and reproducibility in neuroimaging research. PLoS Biol 14(7):e1002506
23. Nichols TE, Das S, Eickhoff SB, Evans AC, Glatard T, Hanke M, Kriegeskorte N, Milham MP, Poldrack RA, Poline JB et al (2017) Best practices in data analysis and sharing in neuroimaging using MRI. Nat Neurosci 20(3):299–303
24. Poldrack RA, Baker CI, Durnez J, Gorgolewski KJ, Matthews PM, Munafò MR, Nichols TE, Poline JB, Vul E, Yarkoni T (2017) Scanning the horizon: towards transparent and reproducible neuroimaging research. Nat Rev Neurosci 18(2):115–126
25. Niso G, Botvinik-Nezer R, Appelhoff S, De La Vega A, Esteban O, Etzel JA, Finc K, Ganz M, Gau R, Halchenko YO et al (2022) Open and reproducible neuroimaging: from study inception to publication. Neuroimage 263:119623
26. Turkyilmaz-van der Velden Y, Dintzner N, Teperek M (2020) Reproducibility starts from you today. Patterns 1(6):100099
27. Heil BJ, Hoffman MM, Markowetz F, Lee SI, Greene CS, Hicks SC (2021) Reproducibility standards for machine learning in the life sciences. Nat Methods 18(10):1132–1135
28. Varoquaux G (2018) Cross-validation failure: small sample sizes lead to large error bars. Neuroimage 180:68–77
29. Button KS, Ioannidis J, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14(5):365–376
30. Varoquaux G, Cheplygina V (2022) Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digit Med 5(1):1–8
31. Bouthillier X, Laurent C, Vincent P (2019) Unreproducible research is reproducible. In: International Conference on Machine Learning, PMLR, pp 725–734
32. Langer SG, Shih G, Nagy P, Landman BA (2018) Collaborative and reproducible research: goals, challenges, and strategies. J Digit Imaging 31(3):275–282
33. Goodman SN, Fanelli D, Ioannidis JP (2016) What does research reproducibility mean? Sci Transl Med 8(341):341ps12
34. Plesser HE (2018) Reproducibility vs. replicability: a brief history of a confused terminology. Front Neuroinform 11:76
35. McDermott MB, Wang S, Marinsek N, Ranganath R, Foschini L, Ghassemi M (2021) Reproducibility in machine learning for health research: still a ways to go. Sci Transl Med 13(586):eabb1655
36. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, et al (2016) The FAIR guiding principles for scientific data management and stewardship. Sci Data 3(1):1–9
37. Gabelica M, Bojčić R, Puljak L (2022) Many researchers were not compliant with their published data sharing statement: mixed-methods study. J Clin Epidemiol 150:33–41
38. Gorgolewski KJ, Auer T, Calhoun VD, Craddock RC, Das S, Duff EP, et al (2016) The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Sci Data 3(1):1–9
39. Bourget MH, Kamentsky L, Ghosh SS, Mazzamuto G, Lazari A, Markiewicz CJ, Oostenveld R, Niso G, Halchenko YO, Lipp I, et al (2022) Microscopy-BIDS: an extension to the Brain Imaging Data Structure for microscopy data. Front Neurosci 16:871228
40. Saborit-Torres J, Saenz-Gamboa J, Montell J, Salinas J, Gómez J, Stefan I, Caparrós M, García-García F, Domenech J, Manjón J, et al (2020) Medical imaging data structure extended to multiple modalities and anatomical regions. arXiv preprint arXiv:2010.00434
41. Cuingnet R, Gerardin E, Tessieras J, Auzias G, Lehéricy S, Habert MO, Chupin M, Benali H, Colliot O (2011) Automatic classification of patients with Alzheimer’s disease from structural MRI: a comparison of ten methods using the ADNI database. Neuroimage 56(2):766–781
42. Samper-González J, Burgos N, Bottani S, Fontanella S, Lu P, Marcoux A, Routier A, Guillon J, Bacci M, Wen J, et al (2018) Reproducible evaluation of classification methods in Alzheimer’s disease: framework and application to MRI and PET data. Neuroimage 183:504–521
43. Karakuzu A, DuPre E, Tetrel L, Bermudez P, Boudreau M, Chin M, Poline JB, Das S, Bellec P, Stikov N (2022) NeuroLibre: a preprint server for full-fledged reproducible neuroscience. OSF Preprints
44. Crane M (2018) Questionable answers in question answering research: reproducibility and variability of published results. Trans Assoc Comput Linguist 6:241–252
45. Jungo A, Scheidegger O, Reyes M, Balsiger F (2021) pymia: a Python package for data handling and evaluation in deep learning-based medical image analysis. Comput Methods Programs Biomed 198:105796. https://doi.org/10.1016/j.cmpb.2020.105796
46. Carlini N, Tramer F, Wallace E, Jagielski M, Herbert-Voss A, Lee K, Roberts A, Brown T, Song D, Erlingsson U, et al (2021) Extracting training data from large language models. In: 30th USENIX Security Symposium (USENIX Security 21), pp 2633–2650
47. Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, Zhang L (2016) Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp 308–318
48. Gorgolewski KJ, Alfaro-Almagro F, Auer T, Bellec P, Capotă M, Chakravarty MM, Churchill NW, Cohen AL, Craddock RC, Devenyi GA, et al (2017) BIDS apps: improving ease of use, accessibility, and reproducibility of neuroimaging data analysis methods. PLoS Comput Biol 13(3):e1005209
49. Routier A, Burgos N, Díaz M, Bacci M, Bottani S, El-Rifai O, Fontanella S, Gori P, Guillon J, Guyot A, et al (2021) Clinica: an open-source software platform for reproducible clinical neuroscience studies. Front Neuroinform 15:689675
50. McCormick M, Liu X, Jomier J, Marion C, Ibanez L (2014) ITK: enabling reproducible research and open science. Front Neuroinform 8:13
51. Yoo TS, Ackerman MJ, Lorensen WE, Schroeder W, Chalana V, Aylward S, Metaxas D, Whitaker R (2002) Engineering and algorithm design for an image processing API: a technical report on ITK, the Insight Toolkit. In: Medicine Meets Virtual Reality 02/10, IOS Press, pp 586–592
52. Drummond C (2009) Replicability is not reproducibility: nor is it good science. In: Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th ICML, vol 1
53. Bouthillier X, Delaunay P, Bronzi M, Trofimov A, Nichyporuk B, Szeto J, Mohammadi Sepahvand N, Raff E, Madan K, Voleti V, et al (2021) Accounting for variance in machine learning benchmarks. Proc Mach Learn Syst 3:747–769
54. Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-González J, Routier A, Bottani S, Dormont D, Durrleman S, Burgos N, Colliot O (2020) Convolutional neural networks for classification of Alzheimer’s disease: overview and reproducible evaluation. Med Image Anal 63:101694
55. Samala RK, Chan HP, Hadjiiski L, Koneru S (2020) Hazards of data leakage in machine learning: a study on classification of breast cancer using deep neural networks. In: Proceedings of SPIE Medical Imaging 2020: Computer-Aided Diagnosis, International Society for Optics and Photonics, vol 11314, p 1131416
56. Panwar H, Gupta PK, Siddiqui MK, Morales-Menendez R, Singh V (2020) Application of deep learning for fast detection of COVID-19 in X-rays using nCOVnet. Chaos Solitons Fractals 138:109944
57. Bussola N, Marcolini A, Maggio V, Jurman G, Furlanello C (2021) AI slipping on tiles: data leakage in digital pathology. In: Del Bimbo A, Cucchiara R, Sclaroff S, Farinella GM, Mei T, Bertini M, Escalante HJ, Vezzani R (eds) Pattern Recognition. ICPR International Workshops and Challenges. Lecture Notes in Computer Science, Springer International Publishing, Cham, pp 167–182. https://doi.org/10.1007/978-3-030-68763-2_13
58. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD (2015) The extent and consequences of p-hacking in science. PLoS Biol 13(3):e1002106
59. Henderson EL (2022) A guide to preregistration and registered reports. Preprint. https://osf.io/preprints/metaarxiv/x7aqr/download
60. Bottani S, Burgos N, Maire A, Wild A, Ströer S, Dormont D, Colliot O, Group AS et al (2022) Automatic quality control of brain T1-weighted magnetic resonance images for a clinical data warehouse. Med Image Anal 75:102219
61. Perkuhn M, Stavrinou P, Thiele F, Shakirin G, Mohan M, Garmpis D, Kabbasch C, Borggrefe J (2018) Clinical evaluation of a multiparametric deep learning model for glioblastoma segmentation using heterogeneous magnetic resonance imaging data from clinical routine. Investig Radiol 53(11):647
62. Lukas C, Hahn HK, Bellenberg B, Rexilius J, Schmid G, Schimrigk SK, Przuntek H, Köster O, Peitgen HO (2004) Sensitivity and reproducibility of a new fast 3D segmentation technique for clinical MR-based brain volumetry in multiple sclerosis. Neuroradiology 46(11):906–915
63. Borga M, Ahlgren A, Romu T, Widholm P, Dahlqvist Leinhard O, West J (2020) Reproducibility and repeatability of MRI-based body composition analysis. Magn Reson Med 84(6):3146–3156
64. Chard DT, Parker GJ, Griffin CM, Thompson AJ, Miller DH (2002) The reproducibility and sensitivity of brain tissue volume measurements derived from an SPM-based segmentation methodology. J Magn Reson Imaging 15(3):259–267
65. de Boer R, Vrooman HA, Ikram MA, Vernooij MW, Breteler MM, van der Lugt A, Niessen WJ (2010) Accuracy and reproducibility study of automatic MRI brain tissue segmentation methods. Neuroimage 51(3):1047–1056
66. Lemieux L, Hagemann G, Krakow K, Woermann FG (1999) Fast, accurate, and reproducible automatic segmentation of the brain in T1-weighted volume MRI data. Magn Reson Med 42(1):127–135
67. Tudorascu DL, Karim HT, Maronge JM, Alhilali L, Fakhran S, Aizenstein HJ, Muschelli J, Crainiceanu CM (2016) Reproducibility and bias in healthy brain segmentation: comparison of two popular neuroimaging platforms. Front Neurosci 10:503
68. Yamashita R, Perrin T, Chakraborty J, Chou JF, Horvat N, Koszalka MA, Midya A, Gonen M, Allen P, Jarnagin WR et al (2020) Radiomic feature reproducibility in contrast-enhanced CT of the pancreas is affected by variabilities in scan parameters and manual segmentation. Eur Radiol 30(1):195–205
69. Poldrack RA, Whitaker K, Kennedy DN (2019) Introduction to the special issue on reproducibility in neuroimaging. Neuroimage 218:116357
70. Palumbo L, Bosco P, Fantacci M, Ferrari E, Oliva P, Spera G, Retico A (2019) Evaluation of the intra- and inter-method agreement of brain MRI segmentation software packages: a comparison between SPM12 and FreeSurfer v6.0. Phys Med 64:261–272
71. Laurienti PJ, Field AS, Burdette JH, Maldjian JA, Yen YF, Moody DM (2002) Dietary caffeine consumption modulates fMRI measures. Neuroimage 17(2):751–757
72. Collins DL, Zijdenbos AP, Kollokian V, Sled JG, Kabani NJ, Holmes CJ, Evans AC (1998) Design and construction of a realistic digital brain phantom. IEEE Trans Med Imaging 17(3):463–468
73. Shaw R, Sudre C, Ourselin S, Cardoso MJ (2018) MRI k-space motion artefact augmentation: model robustness and task-specific uncertainty. In: Medical Imaging with Deep Learning (MIDL) 2018
74. Duffy BA, Zhang W, Tang H, Zhao L (2018) Retrospective correction of motion artifact affected structural MRI images using deep learning of simulated motion. In: Medical Imaging with Deep Learning (MIDL) 2018
75. Loizillon S, Bottani S, Maire A, Ströer S, Dormont D, Colliot O, Burgos N (2023) Transfer learning from synthetic to routine clinical data for motion artefact detection in brain T1-weighted MRI. In: SPIE Medical Imaging 2023: Image Processing
76. Peng RD (2011) Reproducible research in computational science. Science 334(6060):1226–1227
77. Piwowar HA, Day RS, Fridsma DB (2007) Sharing detailed research data is associated with increased citation rate. PLoS One 2(3):e308
78. Piwowar HA, Vision TJ (2013) Data reuse and the open data citation advantage. PeerJ 1:e175
79. Vandewalle P (2012) Code sharing is associated with research impact in image processing. Comput Sci Eng 14(4):42–47
80. Balsiger F, Jungo A, Chen J, Ezhov I, Liu S, Ma J, Paetzold JC, Sekuboyina A, Shit S, Suter Y et al (2021) The MICCAI hackathon on reproducibility, diversity, and selection of papers at the MICCAI conference. arXiv preprint arXiv:2103.05437
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution,
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other
third-party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Chapter 22
Abstract
Deep learning methods have become very popular for the processing of natural images and were then
successfully adapted to the neuroimaging field. As these methods are non-transparent, interpretability
methods are needed to validate them and ensure their reliability. Indeed, it has been shown that deep
learning models may obtain high performance even when using irrelevant features, by exploiting biases in
the training set. Such undesirable situations can potentially be detected by using interpretability methods.
Recently, many methods have been proposed to interpret neural networks. However, this domain is not
mature yet. Machine learning users face two major issues when aiming to interpret their models: which
method to choose and how to assess its reliability. Here, we aim at providing answers to these questions by
presenting the most common interpretability methods and metrics developed to assess their reliability, as
well as their applications and benchmarks in the neuroimaging context. Note that this is not an exhaustive
survey: we aimed to focus on the studies which we found to be the most representative and relevant.
Key words Interpretability, Saliency, Machine learning, Deep learning, Neuroimaging, Brain
disorders
1 Introduction
1.1 Need for Interpretability

Many metrics have been developed to evaluate the performance of machine learning (ML) systems. In the case of supervised systems,
these metrics compare the output of the algorithm to a ground
truth, in order to evaluate its ability to reproduce a label given by a
physician. However, the users (patients and clinicians) may want
more information before relying on such systems. On which fea-
tures is the model relying to compute the results? Are these features
close to the way a clinician thinks? If not, why? This questioning
coming from the actors of the medical field is justified, as errors in
real life may lead to dramatic consequences. Trust in ML systems
cannot be built only based on a set of metrics evaluating the
performance of the system. Indeed, various examples of machine
learning systems taking correct decisions for the wrong reasons
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_22,
© The Author(s) 2023
655
656 Elina Thibeau-Sutre et al.
Fig. 1 Example of an interpretability method highlighting why a network took the wrong decision. The
explained classifier was trained on the binary task “Husky” vs “Wolf.” The pixels used by the model are
actually in the background and highlight the snow. (Adapted from [1]. Permission to reuse was kindly granted
by the authors)
1.2 How to Interpret Models

According to [4], model interpretability can be broken down into two categories: transparency and post hoc explanations.
A model can be considered as transparent when it (or all parts
of it) can be fully understood as such, or when the learning process
is understandable. A natural and common candidate that fits, at first
sight, these criteria is the linear regression algorithm, where coeffi-
cients are usually seen as the individual contributions of the input
features. Another candidate is the decision tree approach where
model predictions can be broken down into a series of understand-
able operations. One can reasonably consider these models as
transparent: one can easily identify the features that were used to
take the decision. However, one should be cautious not to push the medical interpretation too far. Indeed, the fact that a
feature has not been used by the model does not mean that it is
not associated with the target. It just means that the model did not
need it to increase its performance. For instance, a classifier aiming
at diagnosing Alzheimer’s disease may need only a set of regions
(for instance, from the medial temporal lobe of the brain) to achieve
an optimal performance. This does not mean that other brain
regions are not affected by the disease, just that they were not
used by the model to take its decision. This is the case, for example,
for sparse models like LASSO, but also standard multiple linear
regressions. Moreover, features given as input to transparent mod-
els are often highly engineered, and choices made before the train-
ing step (preprocessing, feature selection) may also hurt the
transparency of the whole framework. Nevertheless, in spite of
these caveats, such models can reasonably be considered transpar-
ent, in particular when compared to deep neural networks which
are intrinsically black boxes.
The second category of interpretability methods, post hoc
interpretations, allows dealing with non-transparent models. Xie
et al. [5] proposed a taxonomy in three categories: visualization
methods consist in extracting an attribution map of the same size as
the input whose intensities allow knowing where the algorithm
focused its attention, distillation approaches consist in reproducing
the behavior of a black box model with a transparent one, and
intrinsic strategies include interpretability components within the
framework, which are trained along with the main task (e.g., a
classification). In the present work, we focus on this second category of methods (post hoc) and propose a new taxonomy including other methods of interpretation (see Fig. 2). Post hoc
interpretability is the most used category nowadays, as it allows
interpreting deep learning methods that became the state of the art
for many tasks in neuroimaging, as in other application fields.
2 Interpretability Methods
Table 1
Mathematical notations
X0 is the input tensor given to the network, and X refers to any input, sampled from the set X .
y is a vector of target classes corresponding to the input.
f is a network of L layers. The first layer is the closest to the input; the last layer is the closest to the output.
A layer is a function.
g is a transparent function which aims at reproducing the behavior of f.
w and b are the weights and the bias associated to a linear function (e.g., in a fully connected layer).
u and v are locations (set of coordinates) corresponding to a node in a feature map. They belong
respectively to the set U and V .
$A_k^{(l)}(u)$ is the value of the feature map computed by layer l, of K channels, at channel k and position u.
$R_k^{(l)}(u)$ is the value of a property back-propagated through layer l + 1, of K channels, at channel k and
position u. $R^{(l)}$ and $A^{(l)}$ have the same number of channels.
oc is the output node of interest (in a classification framework, it corresponds to the node of the class c).
Sc is an attribution map corresponding to the output node oc.
m is a mask of perturbations. It can be applied to X to compute its perturbed version Xm.
Φ is a function producing a perturbed version of an input X.
Γc is the function computing the attribution map Sc from the black-box function f and an input X0.
Table 2
Abbreviations
2.1 Weight Visualization

At first sight, one can be tempted to directly visualize the weights learned by the algorithm. This method is really simple, as it does not require further processing. However, even though it can make sense for linear models, it is not very informative for most networks unless they are specially designed for this interpretation.
This is the case for AlexNet [7], a convolutional neural network
(CNN) trained on natural images (ImageNet). In this network the
size of the kernels in the first layer is large enough (11 × 11) to
distinguish patterns of interest. Moreover, as the three channels in
the first layer correspond to the three color channels of the images
(red, green, and blue), the values of the kernels can also be repre-
sented in terms of colors (this is not the case for hidden layers, in
which the meaning of the channels is lost). The 96 kernels of the
first layer were illustrated in the original article as in Fig. 3.
Fig. 3 96 convolutional kernels of size 3@11 × 11 learned by the first convolutional layer on the 3@224 × 224
input images by AlexNet. (Adapted from [7]. Permission to reuse was kindly granted by the authors)
Fig. 4 The weights of small kernels in hidden layers (here 5 × 5) can be really difficult to interpret alone. Here
some context allows better understanding how it modulates the interaction between concepts conveyed by the
input and the output. (Adapted from [8] (CC BY 4.0))
2.2 Feature Map Visualization

Feature maps are the results of intermediate computations done from the input and resulting in the output value. Then, it seems natural to visualize them or link them to concepts to understand how the input is successively transformed into the output.
Methods described in this section aim at highlighting which concepts a feature map (or part of it) A conveys.
2.2.1 Direct Interpretation

The output of a convolution has the same shape as its input: a 2D image processed by a convolution will become another 2D image (the size may vary). Then, it is possible to directly visualize these
feature maps and compare them to the input to understand the
operations performed by the network. However, the number of
filters of convolutional layers (often a hundred) makes the interpre-
tation difficult as a high number of images must be interpreted for a
single input.
Instead of directly visualizing the feature map A, it is possible to
study the latent space including all the values of the samples of a
data set at the level of the feature map A. Then, it is possible to
study the deformations of the input by drawing trajectories
between samples in this latent space, or more simply to look at
the distribution of some label in a manifold learned from the latent
space. In such a way, it is possible to better understand which
patterns were detected, or at which layer in the network classes
begin to be separated (in the classification case). There is often no theoretical framework behind these techniques, so we refer to studies in the context of medical applications (see Subheading 3.2 for references).
2.2.2 Input Optimization

Olah et al. [9] proposed to compute an input that maximizes the
value of a feature map A (see Fig. 5). However, this technique leads
to unrealistic images that may be themselves difficult to interpret,
particularly for neuroimaging data. To have a better insight of the
behavior of layers or filters, another simple technique illustrated by
the same authors consists in isolating the inputs that led to the
highest activation of A. The combination of both methods, dis-
played in Fig. 6, allows a better understanding of the concepts
conveyed by A of a GoogLeNet trained on natural images.
2.3 Back-Propagation Methods

The goal of these interpretability methods is to link the value of an output node of interest oc to the image X0 given as input to a network. They do so by back-propagating a signal from oc to X0:
Fig. 5 Optimization of the input for different levels of feature maps. (Adapted from [9] (CC BY 4.0))
Fig. 6 Interpretation of a neuron of a feature map by optimizing the input associated with a bunch of training
examples maximizing this neuron. (Adapted from [9] (CC BY 4.0))
Fig. 7 Attribution map of an image found with gradients back-propagation. (Adapted from [10]. Permission to
reuse was kindly granted by the authors)
2.3.1 Gradient Back-Propagation

During network training, gradients corresponding to each layer are computed according to the loss to update the weights. Then, we
can see these gradients as the difference needed at the layer level to
improve the final result: by adding this difference to the weights,
the probability of the true class y increases.
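The same gradients, computed with respect to the input rather than the weights, give an attribution map. A minimal numpy sketch with a toy one-layer softmax classifier (all sizes and values are hypothetical; a real network would obtain this gradient by automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def saliency(W, x, c):
    """Gradient of the softmax output o_c with respect to the input x,
    written by hand for the one-layer model o = softmax(W x)."""
    p = softmax(W @ x)
    # do_c/dz_j = p_c * (delta_cj - p_j), then chain rule through z = W x
    grad_z = p[c] * (np.eye(len(p))[c] - p)
    return grad_z @ W

W = rng.normal(size=(3, 8))      # 3 classes, 8 input features (toy sizes)
x = rng.normal(size=8)
c = int(np.argmax(softmax(W @ x)))

S = np.abs(saliency(W, x, c))    # attribution: high |gradient| = influential input
```

For a deep network, the only change is that the gradient is obtained by back-propagation through all layers instead of this hand-written chain rule.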
In the same way, the gradients can be computed at the image level to find how the input should vary to change the value of oc (see the example in Fig. 7). This gradient computation was proposed by [10], in which the attribution map Sc corresponding to the input
$S_c = \sum_{k=1}^{N} w_k^c\, A_k \qquad (5)$
This map has the same size as Ak, which might be smaller than
the input if the convolutional part performs downsampling opera-
tions (which is very often the case). Then, the map is upsampled to
the size of the input to be overlaid on the input.
Selvaraju et al. [14] proposed an extension of CAM that can be
applied to any architecture: Grad-CAM (illustrated on Fig. 8). As in
CAM, the attribution map is a linear combination of the channels of
a feature map computed by a convolutional layer. But, in this case,
the weights of each channel are computed using gradient back-
propagation
$\alpha_k^c = \frac{1}{|U|} \sum_{u \in U} \frac{\partial o_c}{\partial A_k(u)} \qquad (6)$
Fig. 8 Grad-CAM explanations highlighting two different objects in an image. (a) the original image, (b) the
explanation based on the “dog” node, (c) the explanation based on the “cat” node. ©2017 IEEE. (Reprinted,
with permission, from [14])
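Equations (5) and (6) can be sketched in a few lines of numpy. The feature map A and the gradients ∂oc/∂A that automatic differentiation would provide are faked here with random arrays (sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

K, H, W = 4, 7, 7
A = rng.random((K, H, W))            # feature map of the last conv layer
grads = rng.normal(size=(K, H, W))   # stand-in for do_c/dA(u) from autodiff

# Eq. (6): alpha_k^c = gradient averaged over all spatial positions u
alpha = grads.mean(axis=(1, 2))      # shape (K,)

# Grad-CAM: ReLU of the alpha-weighted sum of channels; the resulting map
# is then upsampled to the input size to be overlaid on the image.
S_c = np.maximum(0.0, np.tensordot(alpha, A, axes=1))   # shape (H, W)
```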
2.3.2 Relevance Back-Propagation

Instead of back-propagating gradients to the level of the input or of the last convolutional layer, Bach et al. [15] proposed to back-propagate the score obtained by a class c, which is called the
relevance. This score corresponds to oc after some post-processing
(e.g., softmax), as its value must be positive if class c was identified
in the input. At the end of the back-propagation process, the goal is
to find the relevance Ru of each feature u of the input (e.g., of each
pixel of an image) such that $o_c = \sum_{u \in U} R_u$.
In their paper, Bach et al. [15] take the example of a fully
connected function defined by a matrix of weights w and a bias
Fig. 9 LRP attribution maps explaining the decision of a neural network trained on MNIST. ©2017 IEEE.
(Reprinted, with permission, from [16])
The main issue of the method comes from the fact that the
denominator may become (close to) zero, leading to the explosion
of the relevance back-propagated. Moreover, it was shown by [11] that when all activations are piece-wise linear (such as ReLU or leaky ReLU), the layer-wise relevance propagation (LRP) method reproduces the output of gradient⊙input, questioning the usefulness of the method.
This is why Samek et al. [16] proposed two variants of the standard LRP method [15]. Moreover, they describe the behavior of the back-propagation in layers other than the linear ones (the convolutional ones following the same formula as the linear ones). They illustrated their method with a neural network trained on MNIST (see Fig. 9). To simplify the equations in the following paragraphs, we now denote the weighted activations as $z_{uv} = A^{(l)}(u)\, w_{uv}$.
ε-rule  The ε-rule integrates a parameter ε > 0, used to avoid numerical instability. Though it avoids the case of a null denominator, this variant breaks the rule of relevance conservation across layers:
$R^{(l)}(u) = \sum_{v \in V} R^{(l+1)}(v)\, \frac{z_{uv}}{\sum_{u' \in U} z_{u'v} + \varepsilon \times \operatorname{sign}\!\left(\sum_{u' \in U} z_{u'v}\right)} \qquad (10)$
β-rule  The β-rule separates the positive and negative parts of the weighted activations, $z_{uv}^{+}$ and $z_{uv}^{-}$:

$R^{(l)}(u) = \sum_{v \in V} R^{(l+1)}(v) \left( (1+\beta)\, \frac{z_{uv}^{+}}{\sum_{u' \in U} z_{u'v}^{+}} - \beta\, \frac{z_{uv}^{-}}{\sum_{u' \in U} z_{u'v}^{-}} \right) \qquad (11)$
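A minimal numpy sketch of the ε-rule (Eq. 10) for one fully connected layer, with toy activations and weights (bias omitted for simplicity); the β-rule differs only in splitting z_uv into its positive and negative parts:

```python
import numpy as np

rng = np.random.default_rng(2)

def lrp_epsilon(A, w, R_next, eps=1e-2):
    """Eq. (10): epsilon-rule relevance back-propagation through a linear layer.
    A: activations A^(l), shape (U,); w: weights, shape (U, V);
    R_next: relevance R^(l+1) arriving from the layer above, shape (V,)."""
    z = A[:, None] * w                    # z_uv = A^(l)(u) * w_uv
    denom = z.sum(axis=0)
    denom = denom + eps * np.sign(denom)  # stabiliser: breaks exact conservation
    return (z / denom * R_next[None, :]).sum(axis=1)

U, V = 6, 4
A = rng.random(U)
w = rng.normal(size=(U, V))
R_next = rng.random(V)

R = lrp_epsilon(A, w, R_next)
```

As ε tends to 0, the total relevance is conserved (Σu R(u) = Σv R^(l+1)(v)); a larger ε trades conservation for numerical stability.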
2.4 Perturbation Methods

Instead of relying on a backward pass (from the output to the input) as in the previous section, perturbation methods rely on
the difference between the value of oc computed with the original
inputs and a locally perturbed input. This process is less abstract for humans than back-propagation methods, as we can reproduce it ourselves: if the part of the image that is needed to find the correct output is hidden, we too are unable to predict correctly. Moreover, it is model-agnostic and can be applied to any algorithm or deep learning architecture.
The main drawback of these techniques is that the nature of the
perturbation is crucial, leading to different attribution maps
depending on the perturbation function used. Moreover,
Montavon et al. [18] suggest that the perturbation rule should
Fig. 10 Attribution maps obtained with standard perturbation. Here the perturbation is a gray patch covering a
specific zone of the input as shown in the left column. The attribution maps (second row) display the
probability of the true label: the lower the value, the more important this zone is for the network to correctly identify
the label. This kind of perturbation takes the perturbed input out of the training distribution. (Reprinted by
permission from Springer Nature Customer Service Centre GmbH: Springer Nature, ECCV 2014: Visualizing
and Understanding Convolutional Networks, [19], 2014)
2.4.1 Standard Perturbation

Zeiler and Fergus [19] proposed the most intuitive method relying on perturbations. This standard perturbation procedure consists in
removing information locally in a specific zone of an input X0 and
evaluating if it modifies the output node oc. The more the pertur-
bation degrades the task performance, the more crucial this zone is
for the network to correctly perform the task. To obtain the final
attribution map, the input is perturbed according to all possible
locations. Examples of attribution maps obtained with this method
are displayed in Fig. 10.
As evaluating the impact of the perturbation at each pixel
location is computationally expensive, one can choose not to per-
turb the image at each pixel location but to skip some of them (i.e.,
scan the image with a stride > 1). This will lead to a smaller
attribution map, which needs to be upsampled to be compared to
the original input (in the same way as CAM & Grad-CAM).
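The procedure above can be sketched as follows (numpy, with a toy "model" and toy sizes; the fill value plays the role of the gray patch):

```python
import numpy as np

rng = np.random.default_rng(3)

def occlusion_map(f, x, patch=4, stride=2, fill=0.0):
    """Slide a patch of constant value over x and record the drop in the
    class score f(x). Returns one value per patch position; this coarse
    map is then upsampled to the input size."""
    H, W = x.shape
    base = f(x)
    n_r = (H - patch) // stride + 1
    n_c = (W - patch) // stride + 1
    S = np.empty((n_r, n_c))
    for i in range(n_r):
        for j in range(n_c):
            xp = x.copy()
            r, c = i * stride, j * stride
            xp[r:r + patch, c:c + patch] = fill   # perturb one zone
            S[i, j] = base - f(xp)                # large drop => important zone
    return S

# Toy model whose score only depends on the top-left corner of the image.
f = lambda img: img[:4, :4].sum()
x = rng.random((12, 12))
S = occlusion_map(f, x)
```

With stride 2 on a 12×12 input, the map has 5×5 entries: the zone the toy model actually uses receives the highest attribution, and untouched zones receive exactly zero.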
2.4.2 Optimized Perturbation

To deal with these two issues, Fong and Vedaldi [2] proposed to optimize a perturbation mask covering the whole input. This perturbation mask m has the same size as the input X0. Its application
is associated with a perturbation function Φ and leads to the computation of the perturbed input X0^m. Its value at a coordinate u reflects the quantity of information remaining in the perturbed image:
• If m(u) = 1, the pixel at location u is not perturbed and has the same value in the perturbed input as in the original input (X0^m(u) = X0(u)).
• If m(u) = 0, the pixel at location u is fully perturbed and the value in the perturbed image is the one given by the perturbation function only (X0^m(u) = Φ(X0)(u)).
This principle can be extended to any value between 0 and 1 with a linear interpolation.
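Concretely, the perturbed input is a pixel-wise blend of the original input and its fully perturbed version. A numpy sketch (Φ is a hypothetical perturbation function, here simply a blank image):

```python
import numpy as np

rng = np.random.default_rng(4)

x0 = rng.random((8, 8))              # original input X_0
m = rng.random((8, 8))               # mask m, one value in [0, 1] per pixel
phi = lambda x: np.zeros_like(x)     # perturbation function Phi (toy choice)

# m(u) = 1 keeps X_0(u), m(u) = 0 uses Phi(X_0)(u); values in between blend both.
x_m = m * x0 + (1.0 - m) * phi(x0)
```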
Fig. 11 In this example, the network learned to classify objects in natural images. Instead of masking the
maypole at the center of the image, it creates artifacts in the sky to degrade the performance of the network.
©2017 IEEE. (Reprinted, with permission, from [2])
2.5.1 Local Approximation

LIME. Ribeiro et al. [1] proposed local interpretable model-agnostic explanations (LIME). This approach is:
• Local, as the explanation is valid in the vicinity of a specific input
X0
• Interpretable, as an interpretable model g (linear model, deci-
sion tree. . .) is computed to reproduce the behavior of f on X0
• Model-agnostic, as it does not depend on the algorithm trained
This last property comes from the fact that the vicinity of X0 is
explored by sampling variations of X0 that are perturbed versions of
X0. Then LIME shares the advantage (model-agnostic) and draw-
back (perturbation function dependent) of perturbation methods
presented in Subheading 2.4. Moreover, the authors specify that, in
the case of images, they group pixels of the input in d super-pixels
(contiguous patches of similar pixels).
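The LIME procedure can be sketched in numpy, with the super-pixel step reduced to d binary features for brevity (the black box f, the kernel width, and the sample count are all toy choices):

```python
import numpy as np

rng = np.random.default_rng(5)

d = 5                                              # number of "super-pixels"
f = lambda z: 2.0 * z[:, 0] - 1.0 * z[:, 3] + 0.1  # hypothetical black box

# 1. Sample neighbours of X_0: binary masks saying which super-pixels are kept.
Z = rng.integers(0, 2, size=(200, d)).astype(float)
y = f(Z)

# 2. Weight each sample by its proximity to the unperturbed input (all ones).
dist = (d - Z.sum(axis=1)) / d
sw = np.sqrt(np.exp(-(dist ** 2) / 0.25))

# 3. Fit a weighted linear surrogate g; its coefficients are the explanation.
Zb = np.hstack([Z, np.ones((len(Z), 1))])          # add an intercept column
coef, *_ = np.linalg.lstsq(Zb * sw[:, None], y * sw, rcond=None)
```

Here the surrogate recovers that only super-pixels 0 and 3 matter to f; on an image, each coefficient would score one super-pixel of the input.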
Fig. 12 Visualization of a soft decision tree trained on MNIST. (Adapted from [21]. Permission to reuse was
kindly granted by the authors)
$f_1(X_0^m) - f_1(X_0^{m \setminus i}) \ge f_2(X_0^m) - f_2(X_0^{m \setminus i})$

for all m ∈ {0, 1}^N, then $\phi_i^1 \ge \phi_i^2$ ($\phi^k$ being the coefficients associated with model $f_k$).
Lundberg and Lee [20] show that only one expression is possible for the coefficients ϕ, which can be approximated with different algorithms:

$\phi_i = \sum_{m \in \{0,1\}^N} \frac{|m|!\,(N - |m| - 1)!}{N!} \left( f(X_0^m) - f(X_0^{m \setminus i}) \right) \qquad (17)$
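For small N, Eq. (17) can be evaluated exactly by enumerating all masks. A numpy sketch with a toy additive model (for which the Shapley coefficients are simply the individual feature contributions):

```python
import numpy as np
from itertools import combinations
from math import factorial

x0 = np.array([1.0, 2.0, 3.0, 4.0])
N = len(x0)
f = lambda kept: sum(x0[j] for j in kept)   # toy model on a set of kept features

def shapley(f, N):
    """Exact evaluation of eq. (17) by enumerating all masks m in {0,1}^N."""
    phi = np.zeros(N)
    for i in range(N):
        others = [j for j in range(N) if j != i]
        for size in range(N):
            for m in combinations(others, size):
                w = factorial(len(m)) * factorial(N - len(m) - 1) / factorial(N)
                phi[i] += w * (f(set(m) | {i}) - f(set(m)))
    return phi

phi = shapley(f, N)
```

The 2^N enumeration is only tractable for very small N, which is why [20] propose approximation algorithms.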
2.6.1 Attention Modules

Attention is a concept in machine learning that consists in producing an attribution map from a feature map and using it to improve learning of another task (such as classification, regression, or reconstruction) by making the algorithm focus on the part of the feature map highlighted by the attribution map.
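A minimal numpy sketch of a soft attention module: the attention map is computed from the feature map itself (here by a 1×1-convolution-like per-position combination of channels, a toy choice) and re-weights the features passed to the next layer:

```python
import numpy as np

rng = np.random.default_rng(8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

K, H, W = 3, 5, 5
A = rng.random((K, H, W))         # feature map entering the attention module
w_att = rng.normal(size=K)        # weights of the 1x1 "convolution" (toy values)

# Attention map: one value in (0, 1) per spatial position.
alpha = sigmoid(np.tensordot(w_att, A, axes=1))   # shape (H, W)

# The next layer sees the re-weighted features; once training is done,
# alpha itself can be read directly as an attribution map.
A_att = A * alpha[None, :, :]
```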
In the deep learning domain, we take as reference [22], in
which a network is trained to produce a descriptive caption of
natural images. This network is composed of three parts:
1. A convolutional encoder that reduces the dimension of the
input image to the size of the feature maps A
Fig. 13 Examples of images correctly captioned by the network. The focus of the attribution map is highlighted
in white and the associated word in the caption is underlined. (Adapted from [22]. Permission to reuse was
kindly granted by the authors)
2.6.2 Modular Transparency

Contrary to the studies of the previous sections, the frameworks of this category are composed of several networks (modules) that
interact with each other. Each module is a black box, but the
transparency of the function, or the nature of the interaction
Fig. 14 Framework with modular transparency browsing an image to compute the output at the global scale.
(Adapted from [24]. Permission to reuse was kindly granted by the authors)
2.7 Interpretability Metrics

To evaluate the reliability of the methods presented in the previous sections, one cannot only rely on qualitative evaluation. This is why
interpretability metrics that evaluate attribution maps were pro-
posed. These metrics may evaluate different properties of attribu-
tion maps.
• Fidelity evaluates if the zones highlighted by the map influence
the decision of the network.
• Sensitivity evaluates how the attribution map changes accord-
ing to small changes in the input X0.
$\operatorname{INFD}(\Gamma, f, X_0) = \mathbb{E}_m\!\left[ \left( \sum_{i} \sum_{j} m_{ij}\, \Gamma(f, X_0)_{ij} - \left( f(X_0) - f(X_0^m) \right) \right)^{2} \right] \qquad (18)$
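A Monte-Carlo estimate of Eq. (18) in numpy, on a toy linear model for which the gradient is a perfect explanation (the perturbation distribution and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)

def infidelity(attribution, f, x0, n_samples=500, sigma=0.3):
    """Eq. (18): squared error between the attribution-predicted score drop
    and the actual score drop, averaged over random perturbations m."""
    err = 0.0
    for _ in range(n_samples):
        m = rng.normal(scale=sigma, size=x0.shape)
        pred_drop = (m * attribution).sum()    # sum_ij m_ij * Gamma(f, X_0)_ij
        true_drop = f(x0) - f(x0 - m)          # f(X_0) - f(X_0^m)
        err += (pred_drop - true_drop) ** 2
    return err / n_samples

a = rng.normal(size=(5, 5))
f = lambda x: (a * x).sum()                    # linear model: its gradient is a
x0 = rng.random((5, 5))

infd_good = infidelity(a, f, x0)               # gradient as attribution map
infd_bad = infidelity(np.zeros((5, 5)), f, x0) # uninformative attribution map
```

The perfect explanation yields (numerically) zero infidelity, while the uninformative one does not, which is exactly the behavior the metric is meant to capture.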
2.7.2 Sensitivity

Yeh et al. [25] also gave a measure of sensitivity. As suggested by the definition, it relies on the construction of attribution maps for inputs similar to X0, denoted X̃0. As changes are small, sensitivity depends on a scalar ε set by the user, which corresponds to the maximum difference allowed between X0 and X̃0. Then sensitivity corresponds to the following formula:
$\operatorname{SENS}_{\max}(\Gamma, f, X_0, \varepsilon) = \max_{\|\tilde{X}_0 - X_0\| \le \varepsilon} \left\| \Gamma(f, \tilde{X}_0) - \Gamma(f, X_0) \right\| \qquad (19)$
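Since the maximum in Eq. (19) cannot be computed exactly, it is usually estimated by sampling within the ε-ball. A numpy sketch on a toy linear model, whose gradient explanation is constant and therefore has zero sensitivity:

```python
import numpy as np

rng = np.random.default_rng(7)

def sens_max(explainer, f, x0, eps=0.1, n_samples=200):
    """Monte-Carlo estimate of eq. (19): largest change of the attribution
    map over perturbed inputs within an eps-ball around X_0."""
    ref = explainer(f, x0)
    worst = 0.0
    for _ in range(n_samples):
        delta = rng.uniform(-1, 1, size=x0.shape)
        delta *= eps / max(np.linalg.norm(delta), 1e-12)  # ||X~_0 - X_0|| <= eps
        cand = explainer(f, x0 + delta)
        worst = max(worst, np.linalg.norm(cand - ref))
    return worst

a = rng.normal(size=6)
f = lambda x: a @ x
grad_explainer = lambda f, x: a.copy()   # gradient of a linear model: constant
x0 = rng.random(6)

s = sens_max(grad_explainer, f, x0)
```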
Table 3
Summary of the studies applying interpretability methods to neuroimaging data which are presented in Subheading 3

Study                   | Data set                                                               | Modality  | Task                                                                  | Interpretability method                                                                | Section
------------------------|------------------------------------------------------------------------|-----------|------------------------------------------------------------------------|----------------------------------------------------------------------------------------|---------
Ball et al. [33]        | PING                                                                   | T1w       | Age prediction                                                        | Weight visualization, SHAP                                                             | 3.1, 3.5
Böhle et al. [34]       | ADNI                                                                   | T1w       | AD classification                                                     | LRP, Guided back-propagation                                                           | 3.3
Cecotti and Gräser [26] | in-house                                                               | EEG       | P300 signals detection                                                | Weight visualization                                                                   | 3.1
Eitel and Ritter [37]   | ADNI                                                                   | T1w       | AD classification                                                     | Gradient⊙Input, Guided back-propagation, LRP, Perturbation                             | 3.3, 3.4
Eitel et al. [38]       | ADNI, in-house                                                         | T1w       | Multiple sclerosis detection                                          | Gradient⊙Input, LRP                                                                    | 3.3
Leming et al. [31]      | OpenFMRI, ADNI, ABIDE, ABIDE II, ABCD, NDAR, ICBM, UK Biobank, 1000FC  | fMRI      | Autism classification, Sex classification, Task vs rest classification | FM visualization, Grad-CAM                                                             | 3.2, 3.3
Rieke et al. [48]       | ADNI                                                                   | T1w       | AD classification                                                     | Standard back-propagation, Guided back-propagation, Perturbation, Brain area occlusion | 3.3, 3.4
Tang et al. [49]        | UCD-ADC Brain Bank                                                     | Histology | Detection of amyloid-β pathology                                      | Guided back-propagation, Perturbation                                                  | 3.3, 3.4
Fig. 15 Relative importance of the electrodes for signal detection in EEG using
two different architectures (CNN-1 and CNN-3) and two subjects (A and B) using
CNN weight visualization. Dark values correspond to weights with a high
absolute value while white values correspond to weights close to 0. ©2011
IEEE. (Reprinted, with permission, from [26])
3.2 Feature Map Visualization Applied to Neuroimaging

Contrary to the limited application of weight visualization, there is an extensive literature about leveraging individual feature maps and latent spaces to better understand how models work. This goes
from the visualization of these maps or their projections [27–29],
to the analysis of neuron behavior [30, 31], through sampling in
latent spaces [29].
Oh et al. [27] displayed the feature maps associated with the
convolutional layers of CNNs trained for various Alzheimer’s dis-
ease status classification tasks (Fig. 16). In the first two layers, the
extracted features were similar to white matter, cerebrospinal fluid,
and skull segmentations, while the last layer showcased sparse,
global, and nearly binary patterns. They used this example to
emphasize the advantage of using CNNs to extract very abstract
and complex features rather than using custom algorithms for
feature extraction [27].
Another way to visualize a feature map is to project it in a two-
or three-dimensional space to understand how it is positioned with
respect to other feature maps. Abrol et al. [28] projected the
features obtained after the first dense layer of a ResNet architecture
onto a two-dimensional space using the classical t-distributed sto-
chastic neighbor embedding (t-SNE) dimensionality reduction
technique. For the classification task of Alzheimer’s disease
Fig. 16 Representation of a selection of feature maps (outputs of 4 filters out of 10 for each layer) obtained for a single individual. (Adapted from [27] (CC BY 4.0))
Fig. 17 Difference in neuroimaging space between groups defined thanks to t-SNE projection. Voxels showing
significant differences post false discovery rate (FDR) correction ( p < 0.05) are highlighted. (Reprinted from
Journal of Neuroscience Methods, 339, [28], 2020, with permission from Elsevier)
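The projection step can be sketched in a few lines. PCA via the SVD is used here as a simple, dependency-free stand-in for t-SNE (the feature vectors and class structure are fabricated for illustration); the inspection logic is the same:

```python
import numpy as np

rng = np.random.default_rng(9)

# Two hypothetical classes of feature vectors (e.g. outputs of a dense layer).
n, d = 100, 32
features = np.vstack([
    rng.normal(0.0, 1.0, size=(n, d)),   # class 0
    rng.normal(3.0, 1.0, size=(n, d)),   # class 1
])
labels = np.repeat([0, 1], n)

# Project to 2-D: principal directions from the SVD of the centred matrix.
X = features - features.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
proj = X @ Vt[:2].T                      # 2-D embedding, shape (2n, 2)
```

Plotting `proj` colored by `labels` shows at which layer the classes start to separate; groups identified in the embedding can then be mapped back to image space, as done in [28].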
3.3 Back-Propagation Methods Applied to Neuroimaging

Back-propagation methods are the most popular methods to interpret models, and a wide range of these algorithms have been used to study brain disorders: standard and guided back-propagation [27, 34, 37, 41, 48], gradient⊙input [36–38], Grad-CAM [35, 36], guided Grad-CAM [49], LRP [34, 36–38], DeconvNet [36], and deep Taylor decomposition [36].
Fig. 18 Distribution of discriminant regions obtained with gradient back-propagation in the classification of
demented patients and cognitively normal participants (top part, AD vs CN) and the classification of stable and
progressive mild cognitive impairment (bottom part, sMCI vs pMCI). (Adapted from [27] (CC BY 4.0))
3.3.2 Comparison of Several Interpretability Methods

Papers described in this section used several interpretability methods and compared them in their particular context. However, as the benchmarking of interpretability methods is the focus of Subheading 4.3, which also includes other types of interpretability than back-propagation, we will only focus here on the conclusions drawn from the attribution maps.
Dyrba et al. [36] compared DeconvNet, guided back-
propagation, deep Taylor decomposition, gradient input, LRP
(with various rules), and Grad-CAM methods for classification of
Alzheimer’s disease, mild cognitive impairment, and normal cognition statuses. In accordance with the literature, the highest attention was given to the hippocampus for both prodromal and demented patients.
Böhle et al. [34] compared two methods, LRP with β-rule and
guided back-propagation for Alzheimer’s disease status classifica-
tion. They found that LRP attribution maps highlight individual differences between patients and could thus be used as a tool for clinical guidance.
3.4 Perturbation Methods Applied to Neuroimaging

The standard perturbation method has been widely used in the study of Alzheimer’s disease [32, 37, 45, 48] and related symptoms (amyloid-β pathology) [49]. However, most of the time, authors
do not train their model with perturbed images. Hence, to generate
explanation maps, the perturbation method uses images outside the
distribution of the training set, which may call into question the relevance of the predictions and thus the reliability of the attribution maps.
3.4.1 Variants of the Perturbation Method Tailored to Neuroimaging

Several variations of the perturbation method have been developed to adapt to neuroimaging data. The most common variation in brain imaging is the brain area perturbation method, which consists
in perturbing entire brain regions according to a given brain atlas,
as done in [27, 28, 48]. In their study of Alzheimer’s disease, Abrol
et al. [28] obtained high values in their attribution maps for the
usually discriminant brain regions, such as the hippocampus, the amygdala, the inferior and superior temporal gyri, and the fusiform gyrus. Rieke et al. [48] also obtained results in accordance
with the medical literature and noted that the brain area perturba-
tion method led to a less scattered attribution map than the stan-
dard method (Fig. 19). Oh et al. [27] used the method to compare
the attribution maps of two different tasks: (1) demented patients
vs cognitively normal participants and (2) stable vs progressive mild
cognitively impaired patients, and noted that the regions targeted
for the first task were shared with the second one (medial temporal
lobe) but that some regions were specific to the second task (parts
of the parietal lobe).

Fig. 19 Mean attribution maps obtained on demented patients. The first row corresponds to the standard and the second one to the brain area perturbation method. (Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Nature, MLCN 2018, DLF 2018, IMIMIC 2018: Understanding and Interpreting Machine Learning in Medical Image Computing Applications, [48], 2018)

688 Elina Thibeau-Sutre et al.
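Assuming a labeled atlas aligned with the image (here a toy 2-D label map; real studies use 3-D brain atlases), the brain area perturbation method can be sketched as:

```python
import numpy as np

def region_perturbation_map(model, image, atlas, baseline=0.0):
    """Attribution per atlas region: score drop when the whole region is masked."""
    ref = model(image)
    attribution = np.zeros_like(image, dtype=float)
    for label in np.unique(atlas):
        if label == 0:            # 0 = background, left untouched
            continue
        perturbed = image.copy()
        perturbed[atlas == label] = baseline   # mask the entire region at once
        attribution[atlas == label] = ref - model(perturbed)
    return attribution

# Toy atlas with two "regions"; the toy score only uses the top rows.
atlas = np.zeros((8, 8), dtype=int)
atlas[:4, :4] = 1
atlas[4:, 4:] = 2
model = lambda img: float(img[:4].sum())
amap = region_perturbation_map(model, np.ones((8, 8)), atlas)
```

Perturbing whole anatomical regions rather than arbitrary patches is what yields the less scattered maps reported by Rieke et al. [48].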
Gutiérrez-Becker and Wachinger [40] adapted the standard
perturbation method to a network that classified clouds of points
extracted from neuroanatomical shapes of brain regions (e.g., left
hippocampus) between different states of Alzheimer’s disease. For
the perturbation step, the authors set to 0 the coordinates of a
given point x and those of its neighbors, and then assessed the
relevance of the point x. This method makes it easy to generate and
visualize a 3D attribution map of the shapes under study.
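The point-cloud variant can be sketched as follows; the toy "shape descriptor" model and the choice of k are hypothetical:

```python
import numpy as np

def point_relevance(model, cloud, k=3):
    """Relevance of each point: prediction change when the point and its
    k nearest neighbors are set to the origin."""
    ref = model(cloud)
    relevance = np.zeros(len(cloud))
    for idx, point in enumerate(cloud):
        dists = np.linalg.norm(cloud - point, axis=1)
        neighbors = np.argsort(dists)[:k + 1]   # the point itself + k neighbors
        perturbed = cloud.copy()
        perturbed[neighbors] = 0.0              # zero out their coordinates
        relevance[idx] = ref - model(perturbed)
    return relevance

cloud = np.random.RandomState(0).rand(20, 3)    # toy cloud of 20 3-D points
model = lambda pts: float(pts[:, 0].mean())     # toy shape descriptor
rel = point_relevance(model, cloud)
```

The per-point relevance values can then be rendered directly on the 3D shape, as in [40].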
3.4.2 Advanced Perturbation Methods

More advanced perturbation-based methods have also been used in
the literature. Nigri et al. [45] compared a classical perturbation
method to a swap test. The swap test replaces the classical pertur-
bation step by a swapping step where patches are exchanged
between the input brain image and a reference image chosen
according to the model prediction. This exchange is possible as
brain images were registered and thus brain regions are positioned
in roughly the same location in each image.
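A minimal sketch of such a swap test, assuming registered images of the same size and a toy model:

```python
import numpy as np

def swap_map(model, image, reference, patch=4):
    """Swap-test attribution: each patch of `image` is replaced in turn by the
    corresponding patch of a registered `reference` image."""
    ref_score = model(image)
    attribution = np.zeros_like(image, dtype=float)
    h, w = image.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            swapped = image.copy()
            swapped[i:i + patch, j:j + patch] = reference[i:i + patch, j:j + patch]
            attribution[i:i + patch, j:j + patch] = ref_score - model(swapped)
    return attribution

# Toy example: swapping patches of a uniform image with a blank reference.
model = lambda img: float(img.mean())
amap = swap_map(model, np.ones((8, 8)), np.zeros((8, 8)))
```

Because the swapped-in content comes from a real (registered) brain image rather than an artificial baseline, the perturbed inputs stay closer to the training distribution than in the standard method.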
Finally, Thibeau-Sutre et al. [51] used the optimized version of
the perturbation method to assess the robustness of CNNs in
identifying regions of interest for Alzheimer’s disease detection.
They applied optimized perturbations on gray matter maps
extracted from T1w MR images, and the perturbation method
consisted in increasing the value of the voxels to transform patients
into controls. This process aimed at stimulating gray matter recon-
struction to identify the most important regions that needed to be
“de-atrophied” to be considered again as normal. However, they
unveiled a lack of robustness of the CNN: different retrainings led
to different attribution maps (shown in Fig. 20) even though the
performance did not change.
3.5 Distillation Methods Applied to Neuroimaging

Distillation methods are less commonly used, but some very
interesting use cases can be found in the literature on brain
disorders, with methods such as LIME [44] or SHAP [33].
Fig. 20 Coronal view of the mean attribution masks on demented patients obtained for five reruns of the same
network with the optimized perturbation method. (Adapted with permission from Medical Imaging 2020:
Image Processing, [51].)
Fig. 21 Mean absolute feature importance (SHAP values) averaged across all subjects for XGBoost on regional
thicknesses (red) and areas (green). (Adapted from [33] (CC BY 4.0))
3.6 Intrinsic Methods Applied to Neuroimaging

3.6.1 Attention Modules

Attention modules have been increasingly used in the past couple of
years, as they often allow a boost in performance while being rather
easy to implement and interpret. To diagnose various brain diseases
from brain CT images, Fu et al. [39] built a model integrating a
"two-step attention" mechanism that selects both the most
important slices and the most important pixels in each slice. The
authors then leveraged these attention modules to retrieve the five
most suspicious slices and highlight the areas with the most
significant attention.

Fig. 22 Attribution maps (left, in-house database; right, ADNI database) generated by an attention mechanism module, indicating the discriminant power of various brain regions for Alzheimer's disease diagnosis. (Adapted from [42] (CC BY 4.0))
In their study of Alzheimer’s disease, Jin et al. [42] used a 3D
attention module to capture the most discriminant brain regions
used for Alzheimer's disease diagnosis. As shown in Fig. 22, they
obtained significant correlations between the attention patterns, as
well as between the regional attention scores, of two independent
databases, which indicated a strong reproducibility of the results.
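The slice-level step of such an attention module can be sketched with a simple dot-product scoring followed by a softmax; the feature dimensions and scoring vector below are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slice_attention(features, w):
    """Score each slice, normalize the scores into attention weights, and pool.
    features: (n_slices, d) slice embeddings; w: (d,) learned scoring vector."""
    scores = features @ w          # one scalar score per slice
    weights = softmax(scores)      # attention weights, summing to 1
    pooled = weights @ features    # attention-weighted pooling of the slices
    return pooled, weights

rng = np.random.RandomState(0)
features = rng.rand(5, 8)                          # 5 toy slice embeddings
pooled, weights = slice_attention(features, rng.rand(8))
```

The attention weights themselves are the interpretable by-product: the highest-weighted slices are the "most suspicious" ones, and the same scheme applied at the pixel level yields spatial maps such as Fig. 22.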
3.6.2 Modular Transparency

Modular transparency has often been used in brain imaging analysis.
A possible practice consists in first generating a target probability
map with a black-box model, before feeding this map to a classifier
to generate a final prediction, as done in [43, 46].
Qiu et al. [46] used a convolutional network to generate an
attribution map from patches of the brain, highlighting brain
regions associated with Alzheimer’s disease diagnosis (see Fig. 23).
Lee et al. [43] first parcellated gray matter density maps into
93 regions. For each of these regions, several deep neural networks
were trained on randomly selected voxels, and their outputs were
averaged to obtain a mean regional disease probability. Then, by
concatenating these regional probabilities, they generated a region-
wise disease probability map of the brain, which was further used to
perform Alzheimer’s disease detection.
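The aggregation step of Lee et al. [43] can be sketched as follows; the toy regions and toy per-region "networks" are stand-ins for the gray matter parcels and deep models of the original study:

```python
import numpy as np

def regional_probability_map(region_voxel_sets, regional_models):
    """For each brain region, average the disease probabilities predicted by
    several models trained on that region, then concatenate the averages
    into a region-wise disease probability map."""
    probs = []
    for voxels, models in zip(region_voxel_sets, regional_models):
        probs.append(np.mean([m(voxels) for m in models]))
    return np.array(probs)

rng = np.random.RandomState(1)
regions = [rng.rand(50) for _ in range(3)]             # 3 toy regions of 50 voxels
models = [[lambda v: float(v.mean()) for _ in range(4)]] * 3  # 4 toy "networks" per region
pmap = regional_probability_map(regions, models)
```

The resulting vector of regional probabilities is itself interpretable (one score per anatomical region) and serves as input to the final disease classifier.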
Fig. 23 Randomly selected samples of T1-weighted full MRI volumes are used as input to learn the Alzheimer's disease status at the individual level (Step 1). The application of the model to whole images leads to the generation of participant-specific disease probability maps of the brain (Step 2). (Adapted from Brain: A Journal of Neurology, 143, [46], 2020, with permission of Oxford University Press)

The approach of Ba et al. [24] was also applied to Alzheimer's
disease detection [50]. Though that work is still a preprint, the
idea is interesting as it aims at reproducing the way a radiologist
looks at an MR image. The main difference with [24] is
the initialization, as the context network does not take as input the
whole image but clinical data of the participant. Then the frame-
work browses the image in the same way as in the original paper: a
patch is processed by a recurrent neural network and from its
internal state the glimpse network learns which patch should be
looked at next. After a fixed number of iterations, the internal state
of the recurrent neural network is processed by a classification
network that gives the final outcome. The whole system is inter-
pretable as the trajectory of the locations (illustrated in Fig. 24)
processed by the framework allows understanding which regions
are more important for the diagnosis. However, this framework
may have a high dependency on clinical data: as the initialization
depends on scores used to diagnose Alzheimer’s disease, the classi-
fication network may learn to classify based on the initialization
only, and most of the trajectory may be negligible to assess the
correct label.
Another framework, the DaniNet, proposed by Ravi et al. [47],
is composed of multiple networks, each with a defined function, as
illustrated in Fig. 25.
• The conditional deep autoencoder (in orange) learns to reduce
the size of the slice x to a latent variable Z (encoder part) and
then to reconstruct the slice from Z (decoder part).
Fig. 24 Trajectory taken by the framework for a participant from the ADNI test set. A bounding box around the
first location attended to is included to indicate the approximate size of the glimpse that the recurrent neural
network receives; this is the same for all subsequent locations. (Adapted from [50]. Permission to reuse was
kindly granted by the authors)
Fig. 25 Pipeline used for training the proposed DaniNet framework that aims to learn a longitudinal model of
the progression of Alzheimer’s disease. (Adapted from [47] (CC BY 4.0))
3.7 Benchmarks Conducted in the Literature

This section describes studies that compared several interpretability
methods. We separated evaluations based on metrics from those
which are purely qualitative. Indeed, even if the interpretability
metrics are not mature yet, it is essential to try to measure quanti-
tatively the difference between methods rather than to only rely on
human perception, which may be biased.
3.7.1 Quantitative Evaluations

Eitel and Ritter [37] tested the robustness of four methods: standard
perturbation, gradient ⊙ input, guided back-propagation, and
LRP. To evaluate these methods, the authors trained the same
model ten times with random initializations and generated attribution
maps for each of the ten runs. For each method, they exhibited
significant differences between the averaged true positives/nega-
tives attribution maps of the ten runs. To quantify this variance,
they computed the ℓ2-norm between the attribution maps and
determined for each model the brain regions with the highest
attribution. They concluded that LRP and guided back-
propagation were the most consistent methods, both in terms of
distance between attribution maps and most relevant brain regions.
However, this study makes a strong assumption: to draw these
conclusions, the network should provide stable interpretations
across retrainings. Unfortunately, Thibeau-Sutre et al. [51] showed
that the study of the robustness of the interpretability method and
of the network should be done separately, as their network retrain-
ing was not robust. Indeed, they first showed that the interpretabil-
ity method they chose (optimized perturbation) was robust
according to different criteria, and then they observed that network
retraining led to different attribution maps. The robustness of an
interpretability method thus cannot be assessed from the protocol
described in [37]. Moreover, the fact that guided back-propagation
is one of the most stable methods is consistent with the results of [6], who
observed that guided back-propagation always gave the same result
independently from the weights learned by a network (see
Subheading 4.1).
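The retraining-robustness comparison used in [37] and [51] reduces to computing ℓ2 distances between attribution maps obtained from different runs, which can be sketched as:

```python
import numpy as np

def pairwise_l2(maps):
    """Pairwise ℓ2 distances between attribution maps from different retrainings."""
    n = len(maps)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist[i, j] = np.linalg.norm(maps[i] - maps[j])
    return dist

# Toy stand-ins for the attribution maps of three retrainings.
rng = np.random.RandomState(0)
maps = [rng.rand(16, 16) for _ in range(3)]
dist = pairwise_l2(maps)
```

Large off-diagonal distances indicate that either the interpretability method or the network retraining (or both) is unstable; as argued above, disentangling the two requires evaluating them separately.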
Böhle et al. [34] measured the benefit of LRP with β-rule
compared to guided back-propagation by comparing the intensities
of the mean attribution map of demented patients and the one of
cognitively normal controls. They concluded that LRP allowed a
stronger distinction between these two classes than guided back-
propagation, as there was a greater difference between the mean
maps for LRP. Moreover, they found a stronger correlation
between the intensities of the LRP attribution map in the hippo-
campus and the hippocampal volume than for guided back-
propagation. However, as [6] demonstrated that guided back-
propagation has serious flaws, this comparison does not allow
drawing strong conclusions.
Nigri et al. [45] compared the standard perturbation method
to a swap test (see Subheading 3.4) using two properties: the
continuity and the sensitivity. The continuity property is verified if
two similar input images have similar explanations. The sensitivity
property affirms that the most salient areas in an explanation map
should have the greatest impact on the prediction when removed.
The authors carried out experiments with several types of models,
and both properties were consistently verified for the swap test,
while the standard perturbation method showed a significant
absence of continuity and no conclusive fidelity values [45].
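The two properties can be sketched as simple checks; the toy linear model and its "explanation" below are hypothetical, and a real test would use the trained CNN and its attribution maps:

```python
import numpy as np

def continuity_gap(explain, image, noise=0.01, seed=0):
    """Continuity: a slightly perturbed image should get a similar explanation."""
    rng = np.random.RandomState(seed)
    neighbor = image + noise * rng.randn(*image.shape)
    return float(np.linalg.norm(explain(image) - explain(neighbor)))

def sensitivity_drops(model, explain, image, k=10, seed=0):
    """Sensitivity: removing the k most salient pixels should lower the
    prediction more than removing k random pixels."""
    amap = explain(image)
    top = np.unravel_index(np.argsort(amap, axis=None)[-k:], amap.shape)
    salient = image.copy()
    salient[top] = 0.0
    rng = np.random.RandomState(seed)
    rand_idx = np.unravel_index(rng.choice(image.size, k, replace=False), image.shape)
    random_removed = image.copy()
    random_removed[rand_idx] = 0.0
    return model(image) - model(salient), model(image) - model(random_removed)

# Toy linear model whose true saliency is a known weight map.
saliency = np.zeros((8, 8))
saliency[:4, :4] = 1.0
model = lambda img: float((img * saliency).sum())
explain = lambda img: saliency
gap = continuity_gap(explain, np.ones((8, 8)))
drop_salient, drop_random = sensitivity_drops(model, explain, np.ones((8, 8)))
```

A continuous explainer yields a small gap, and a sensitive one yields a larger score drop for salient pixels than for random ones.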
Finally, Rieke et al. [48] compared four visualization methods:
standard back-propagation, guided back-propagation, standard
perturbation, and brain area perturbation. They computed the
Euclidean distance between the mean attribution maps of the
same class for two different methods and observed that both gradi-
ent methods were close, whereas brain area perturbation was dif-
ferent from all others. They concluded that as interpretability
methods lead to different attribution maps, one should compare
the results of available methods and not trust only one
attribution map.
3.7.3 Conclusions from the Benchmarks

The most extensively compared method is LRP, and each time it has
been shown to be the best of the compared methods. However,
its equivalence with gradient ⊙ input for networks using ReLU
activations still calls into question the usefulness of the method, as
gradient ⊙ input is much easier to implement. Moreover, the studies
reaching this conclusion are not very insightful: [37] may suffer
from methodological biases; [34] compared LRP only to guided
back-propagation, which was shown to be irrelevant [6]; and [36]
only performed a qualitative assessment.
As proposed in conclusion by Rieke et al. [48], a good way to
assess the quality of interpretability methods could be to produce
some form of ground truth for the attribution maps, for example,
by implementing simulation models that control for the level of
separability or location of differences.
4.1 Limitations of the Methods

It is not always clear whether interpretability methods really
highlight features relevant to the algorithm they interpret. Indeed,
Adebayo et al. [6] showed that the attribution maps produced
by some interpretability methods (guided back-propagation and
guided Grad-CAM) may not be correlated at all with the weights
learned by the network during its training procedure. They proved
this with a simple test called "cascading randomization." In this test,
the weights of a network trained on natural images are randomized
layer by layer, until the network is fully randomized. At each step,
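The cascading randomization test can be sketched as follows, with a toy two-layer "network" and a weight-dependent attribution function standing in for a real CNN and saliency method:

```python
import numpy as np

def cascading_randomization(weights, attribution_fn, seed=0):
    """Randomize the weights layer by layer, starting from the top, and
    recompute the attribution map at each step (the sanity check of [6])."""
    rng = np.random.RandomState(seed)
    maps = [attribution_fn(weights)]           # map for the trained network
    randomized = dict(weights)
    for name in reversed(list(weights)):       # top layer first
        randomized[name] = rng.randn(*weights[name].shape)
        maps.append(attribution_fn(randomized))
    return maps

# Toy two-layer "network" with a weight-dependent attribution.
weights = {"conv": np.ones((4, 4)), "fc": np.ones((4, 4))}
attribution = lambda w: w["conv"] * w["fc"].sum()
maps = cascading_randomization(weights, attribution)
```

A sound interpretability method should produce maps that degrade as more layers are randomized; a method whose maps stay unchanged (as observed for guided back-propagation) cannot be reflecting what the network learned.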
4.2 Methodological Advice

Using interpretability methods is increasingly common in medical
research. Even though this field is not yet mature and the
methods have limitations, we believe that using an interpretability
method is usually a good thing, because it may spot cases where the
model based its decisions on irrelevant features. However, there are
methodological pitfalls to avoid and good practices to adopt to
make a fair and sound analysis of your results.
You should first clearly state in your paper which interpretability
method you use, as several variants exist for most of the
methods (see Subheading 2), and its parameters should be clearly
specified. Implementation details may also be important: for the
Grad-CAM method, attribution maps can be computed at various
levels in the network; for a perturbation method, the size and the
nature of the perturbation greatly influence the result. The data on
which methods are applied should also be made explicit: for a
classification task, results may be completely different if samples
are true positives or true negatives, or if they are taken from the
train or test sets.
Taking a step back from the interpretability method and espe-
cially attribution maps is fundamental as they present several limita-
tions [34]. First, there is no ground truth for such maps, which are
usually visually assessed by authors. Comparing obtained results
5 Conclusion
Acknowledgements
The research leading to these results has received funding from the
French government under management of Agence Nationale de la
Recherche as part of the “Investissements d’avenir” program, ref-
erence ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and refer-
ence ANR-10-IAIHU-06 (Agence Nationale de la Recherche-10-
IA Institut Hospitalo-Universitaire-6).
Appendices
A Short Reminder on Network Training Procedure

During the training phase, a neural network updates its weights to
make a series of inputs match their corresponding target labels:
1. Forward pass The network processes the input image to com-
pute the output value.
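The full loop (forward pass, backward pass, weight update) can be sketched for a one-layer linear network; the data, loss, and learning rate here are hypothetical:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(8, 3)                    # a batch of 8 inputs with 3 features
y = rng.rand(8, 1)                    # their corresponding target labels
w = np.zeros((3, 1))                  # weights to be learned
lr = 0.1                              # learning rate

for _ in range(200):
    pred = x @ w                      # 1. forward pass: compute the output value
    grad = x.T @ (pred - y) / len(x)  # 2. backward pass: gradient of the MSE loss
    w -= lr * grad                    # 3. update: adjust weights to reduce the loss

mse = float(np.mean((x @ w - y) ** 2))
```

Each iteration repeats these three steps until the loss stops decreasing; deep networks differ only in how the gradient is propagated through many layers.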
B Description of the Main Brain Disorders Mentioned in the Reviewed Studies

This appendix aims at briefly presenting the diseases considered by
the studies reviewed in Subheading 3.
The majority of the studies focused on the classification of
Alzheimer's disease (AD), a neurodegenerative disease of the
elderly. Its pathological hallmarks are senile plaques formed by
amyloid-β protein and neurofibrillary tangles that are tau protein
aggregates. Both can be measured in vivo using either PET imaging
or CSF biomarkers. Several other biomarkers of the disease exist. In
particular, atrophy of gray and white matter measured from T1w
MRI is often used, even though it is not specific to AD. There is
strong and early atrophy in the hippocampi that can be linked to the
memory loss, even though other clinical signs are found and other
brain areas are altered. The following diagnosis statuses are
often used:
• AD refers to demented patients.
• CN refers to cognitively normal participants.
• MCI refers to patients with mild cognitive impairment (they
have an objective cognitive decline, but it is not sufficient yet to
cause a loss of autonomy).
• Stable MCI refers to MCI patients who stayed stable during a
defined period (often three years).
• Progressive MCI refers to MCI patients who progressed to
Alzheimer’s disease during a defined period (often three years).
Most of the studies analyzed T1w MRI data, except [49] where the
patterns of amyloid-β in the brain are studied.
Frontotemporal dementia is another neurodegenerative disease
in which the neuronal loss dominates in the frontal and temporal
lobes. Behavior and language are the most affected cognitive
functions.
Interpretability 701
References
1. Ribeiro MT, Singh S, Guestrin C (2016) Why should I trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining – KDD '16. ACM Press, San Francisco, pp 1135–1144. https://doi.org/10.1145/2939672.2939778
2. Fong RC, Vedaldi A (2017) Interpretable explanations of black boxes by meaningful perturbation. In: 2017 IEEE international conference on computer vision (ICCV), pp 3449–3457. https://doi.org/10.1109/ICCV.2017.371
3. DeGrave AJ, Janizek JD, Lee SI (2021) AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell 3(7):610–619. https://doi.org/10.1038/s42256-021-00338-7
4. Lipton ZC (2018) The mythos of model interpretability. Commun ACM 61(10):36–43. https://doi.org/10.1145/3233231
5. Xie N, Ras G, van Gerven M, Doran D (2020) Explainable deep learning: a field guide for the uninitiated. arXiv:2004.14545 [cs, stat]
6. Adebayo J, Gilmer J, Muelly M, Goodfellow I, Hardt M, Kim B (2018) Sanity checks for saliency maps. In: Advances in neural information processing systems, pp 9505–9515
7. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25. Curran Associates, pp 1097–1105
8. Voss C, Cammarata N, Goh G, Petrov M, Schubert L, Egan B, Lim SK, Olah C (2021) Visualizing weights. Distill 6(2):e00024.007. https://doi.org/10.23915/distill.00024.007
9. Olah C, Mordvintsev A, Schubert L (2017) Feature visualization. Distill 2(11):e7. https://doi.org/10.23915/distill.00007
10. Simonyan K, Vedaldi A, Zisserman A (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv:1312.6034 [cs]
11. Shrikumar A, Greenside P, Shcherbina A, Kundaje A (2017) Not just a black box: learning important features through propagating activation differences. arXiv:1605.01713 [cs]
12. Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M (2014) Striving for simplicity: the all convolutional net. arXiv:1412.6806 [cs]
13. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2015) Learning deep features for discriminative localization. arXiv:1512.04150 [cs]
14. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In: 2017 IEEE international conference on computer vision (ICCV), pp 618–626. https://doi.org/10.1109/ICCV.2017.74
15. Bach S, Binder A, Montavon G, Klauschen F, Müller KR, Samek W (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS One 10(7):e0130140. https://doi.org/10.1371/journal.pone.0130140
16. Samek W, Binder A, Montavon G, Lapuschkin S, Müller KR (2017) Evaluating the visualization of what a deep neural network has learned. IEEE Trans Neural Netw Learn Syst 28(11):2660–2673. https://doi.org/10.1109/TNNLS.2016.2599820
17. Montavon G, Lapuschkin S, Binder A, Samek W, Müller KR (2017) Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recogn 65:211–222. https://doi.org/10.1016/j.patcog.2016.11.008
18. Montavon G, Samek W, Müller KR (2018) Methods for interpreting and understanding deep neural networks. Digit Signal Process 73:1–15. https://doi.org/10.1016/j.dsp.2017.10.011
19. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T (eds) Computer vision – ECCV 2014. Lecture notes in computer science. Springer, Berlin, pp 818–833
20. Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Proceedings of the 31st international conference on neural information processing systems, NIPS'17. Curran Associates, Red Hook, pp 4768–4777
21. Frosst N, Hinton G (2017) Distilling a neural network into a soft decision tree. arXiv:1711.09784 [cs, stat]
22. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R, Bengio Y (2016) Show, attend and tell: neural image caption generation with visual attention. arXiv:1502.03044 [cs]
23. Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention network for image classification. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, Honolulu, pp 6450–6458. https://doi.org/10.1109/CVPR.2017.683
24. Ba J, Mnih V, Kavukcuoglu K (2015) Multiple object recognition with visual attention. arXiv:1412.7755 [cs]
25. Yeh CK, Hsieh CY, Suggala A, Inouye DI, Ravikumar PK (2019) On the (in)fidelity and sensitivity of explanations. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems, vol 32. Curran Associates, pp 10967–10978
26. Cecotti H, Gräser A (2011) Convolutional neural networks for P300 detection with application to brain-computer interfaces. IEEE Trans Pattern Anal Mach Intell 33(3):433–445. https://doi.org/10.1109/TPAMI.2010.125
27. Oh K, Chung YC, Kim KW, Kim WS, Oh IS (2019) Classification and visualization of Alzheimer's disease using volumetric convolutional neural network and transfer learning. Sci Rep 9(1):1–16. https://doi.org/10.1038/s41598-019-54548-6
28. Abrol A, Bhattarai M, Fedorov A, Du Y, Plis S, Calhoun V (2020) Deep residual learning for neuroimaging: an application to predict progression to Alzheimer's disease. J Neurosci Methods 339:108701. https://doi.org/10.1016/j.jneumeth.2020.108701
29. Biffi C, Cerrolaza J, Tarroni G, Bai W, De Marvao A, Oktay O, Ledig C, Le Folgoc L, Kamnitsas K, Doumou G, Duan J, Prasad S, Cook S, O'Regan D, Rueckert D (2020) Explainable anatomical shape analysis through deep hierarchical generative models. IEEE Trans Med Imaging 39(6):2088–2099. https://doi.org/10.1109/TMI.2020.2964499
30. Martinez-Murcia FJ, Ortiz A, Gorriz JM, Ramirez J, Castillo-Barnes D (2020) Studying the manifold structure of Alzheimer's disease: a deep learning approach using convolutional autoencoders. IEEE J Biomed Health Inf 24(1):17–26. https://doi.org/10.1109/JBHI.2019.2914970
31. Leming M, Górriz JM, Suckling J (2020) Ensemble deep learning on large, mixed-site fMRI datasets in autism and other tasks. Int J Neural Syst 2050012. https://doi.org/10.1142/S0129065720500124
32. Bae J, Stocks J, Heywood A, Jung Y, Jenkins L, Katsaggelos A, Popuri K, Beg MF, Wang L (2019) Transfer learning for predicting conversion from mild cognitive impairment to dementia of Alzheimer's type based on 3D-convolutional neural network. bioRxiv. https://doi.org/10.1101/2019.12.20.884932
33. Ball G, Kelly CE, Beare R, Seal ML (2021) Individual variation underlying brain age estimates in typical development. Neuroimage 235:118036. https://doi.org/10.1016/j.neuroimage.2021.118036
34. Böhle M, Eitel F, Weygandt M, Ritter K, on behalf of the ADNI (2019) Layer-wise relevance propagation for explaining deep neural network decisions in MRI-based Alzheimer's disease classification. Front Aging Neurosci 10(JUL). https://doi.org/10.3389/fnagi.2019.00194
35. Burduja M, Ionescu RT, Verga N (2020) Accurate and efficient intracranial hemorrhage detection and subtype classification in 3D CT scans with convolutional and long short-term memory neural networks. Sensors 20(19):5611. https://doi.org/10.3390/s20195611
36. Dyrba M, Pallath AH, Marzban EN (2020) Comparison of CNN visualization methods to aid model interpretability for detecting Alzheimer's disease. In: Tolxdorff T, Deserno TM, Handels H, Maier A, Maier-Hein KH, Palm C (eds) Bildverarbeitung für die Medizin 2020. Informatik aktuell. Springer Fachmedien, Wiesbaden, pp 307–312. https://doi.org/10.1007/978-3-658-29267-6_68
37. Eitel F, Ritter K (2019) Testing the robustness of attribution methods for convolutional neural networks in MRI-based Alzheimer's disease classification. In: Interpretability of machine intelligence in medical image computing and multimodal learning for clinical decision support. Lecture notes in computer science. Springer, Cham, pp 3–11. https://doi.org/10.1007/978-3-030-33850-3_1
38. Eitel F, Soehler E, Bellmann-Strobl J, Brandt AU, Ruprecht K, Giess RM, Kuchling J, Asseyer S, Weygandt M, Haynes JD, Scheel M, Paul F, Ritter K (2019) Uncovering convolutional neural network decisions for diagnosing multiple sclerosis on conventional MRI using layer-wise relevance propagation. Neuroimage: Clinical 24:102003. https://doi.org/10.1016/j.nicl.2019.102003
39. Fu G, Li J, Wang R, Ma Y, Chen Y (2021) Attention-based full slice brain CT image diagnosis with explanations. Neurocomputing 452:263–274. https://doi.org/10.1016/j.neucom.2021.04.044
40. Gutiérrez-Becker B, Wachinger C (2018) Deep multi-structural shape analysis: application to neuroanatomy. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), LNCS, vol 11072, pp 523–531. https://doi.org/10.1007/978-3-030-00931-1_60
41. Hu J, Qing Z, Liu R, Zhang X, Lv P, Wang M, Wang Y, He K, Gao Y, Zhang B (2021) Deep learning-based classification and voxel-based visualization of frontotemporal dementia and Alzheimer's disease. Front Neurosci 14. https://doi.org/10.3389/fnins.2020.626154
42. Jin D, Zhou B, Han Y, Ren J, Han T, Liu B, Lu J, Song C, Wang P, Wang D, Xu J, Yang Z, Yao H, Yu C, Zhao K, Wintermark M, Zuo N, Zhang X, Zhou Y, Zhang X, Jiang T, Wang Q, Liu Y (2020) Generalizable, reproducible, and neuroscientifically interpretable imaging biomarkers for Alzheimer's disease. Adv Sci 7(14):2000675. https://doi.org/10.1002/advs.202000675
43. Lee E, Choi JS, Kim M, Suk HI, Alzheimer's Disease Neuroimaging Initiative (2019) Toward an interpretable Alzheimer's disease diagnostic model with regional abnormality representation via deep learning. NeuroImage 202:116113. https://doi.org/10.1016/j.neuroimage.2019.116113
44. Magesh PR, Myloth RD, Tom RJ (2020) An explainable machine learning model for early detection of Parkinson's disease using LIME on DaTSCAN imagery. Comput Biol Med 126:104041. https://doi.org/10.1016/j.compbiomed.2020.104041
45. Nigri E, Ziviani N, Cappabianco F, Antunes A, Veloso A (2020) Explainable deep CNNs for MRI-based diagnosis of Alzheimer's disease. In: Proceedings of the international joint conference on neural networks. https://doi.org/10.1109/IJCNN48605.2020.9206837
46. Qiu S, Joshi PS, Miller MI, Xue C, Zhou X, Karjadi C, Chang GH, Joshi AS, Dwyer B, Zhu S, Kaku M, Zhou Y, Alderazi YJ, Swaminathan A, Kedar S, Saint-Hilaire MH, Auerbach SH, Yuan J, Sartor EA, Au R, Kolachalama VB (2020) Development and validation of an interpretable deep learning framework for Alzheimer's disease classification. Brain: J Neurol 143(6):1920–1933. https://doi.org/10.1093/brain/awaa137
47. Ravi D, Blumberg SB, Ingala S, Barkhof F, Alexander DC, Oxtoby NP (2022) Degenerative adversarial neuroimage nets for brain scan simulations: application in ageing and dementia. Med Image Anal 75:102257. https://doi.org/10.1016/j.media.2021.102257
48. Rieke J, Eitel F, Weygandt M, Haynes JD, Ritter K (2018) Visualizing convolutional networks for MRI-based diagnosis of Alzheimer's disease. In: Understanding and interpreting machine learning in medical image computing applications. Lecture notes in computer science. Springer, Cham, pp 24–31. https://doi.org/10.1007/978-3-030-02628-8_3
49. Tang Z, Chuang KV, DeCarli C, Jin LW, Beckett L, Keiser MJ, Dugger BN (2019) Interpretable classification of Alzheimer's disease pathologies with a convolutional neural network pipeline. Nat Commun 10(1):1–14. https://doi.org/10.1038/s41467-019-10212-1
50. Wood D, Cole J, Booth T (2019) NEURO-DRAM: a 3D recurrent visual attention model for interpretable neuroimaging classification. arXiv:1910.04721 [cs, stat]
51. Thibeau-Sutre E, Colliot O, Dormont D, Burgos N (2020) Visualization approach to assess the robustness of neural networks for medical image classification. In: Medical imaging 2020: image processing. International Society for Optics and Photonics, vol 11313, p 113131J. https://doi.org/10.1117/12.2548952
52. Tomsett R, Harborne D, Chakraborty S, Gurram P, Preece A (2020) Sanity checks for saliency metrics. Proc AAAI Conf Artif Intell 34(04):6021–6029. https://doi.org/10.1609/aaai.v34i04.6064
53. Lapuschkin S, Binder A, Montavon G, Muller KR, Samek W (2016) Analyzing classifiers: Fisher vectors and deep neural networks. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, Las Vegas, pp 2912–2920. https://doi.org/10.1109/CVPR.2016.318
54. Ozbulak U (2019) PyTorch CNN visualizations. https://github.com/utkuozbulak/pytorch-cnn-visualizations
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution,
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other
third-party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Chapter 23
Abstract
This chapter presents a regulatory science perspective on the assessment of machine learning algorithms in
diagnostic imaging applications. Most of the topics are generally applicable to many medical imaging
applications, while brain disease-specific examples are provided when possible. The chapter begins with
an overview of US FDA’s regulatory framework followed by assessment methodologies related to ML
devices in medical imaging. Rationale, methods, and issues are discussed for the study design and data
collection, the algorithm documentation, and the reference standard. Finally, study design and statistical
analysis methods are overviewed for the assessment of standalone performance of ML algorithms as well as
their impact on clinicians (i.e., reader studies). We believe that assessment methodologies and regulatory
science play a critical role in fully realizing the great potential of ML in medical imaging, in facilitating ML
device innovation, and in accelerating the translation of these technologies from bench to bedside to the
benefit of patients.
Key words Machine learning, Performance assessment, Standalone performance, Reader study, Sta-
tistical analysis plan, Regulatory science
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_23,
© The Author(s) 2023
705
706 Weijie Chen et al.
1. https://www.fda.gov/about-fda/center-devices-and-radiological-health/cdrh-mission-vision-and-shared-values
2. https://www.fda.gov/science-research/science-and-research-special-topics/advancing-regulatory-science
Regulatory Science Perspective on Performance Assessment 707
Table 1
Summary characteristics of exemplar FDA-cleared ML devices for brain disorders
2 Regulatory Framework
2.1 Overview
The US FDA classifies medical devices into three classes: Classes I, II, and III. The classification determines the extent of regulatory controls necessary to provide reasonable assurance of safety and effectiveness.
3. https://www.fda.gov/training-and-continuing-education/cdrh-learn
4. https://www.fda.gov/medical-devices/regulatory-controls/general-controls-medical-devices
2.2 Imaging Device Regulation
A majority of medical image processing devices have been classified as Class II devices. Most software-only devices, or software as a medical device, intended for image processing have been classified under 21 CFR 892.2050 as picture archiving and communications systems. On April 19, 2021, FDA updated the name of
the regulation 21 CFR 892.2050 to “medical image management
and processing system.” There are no published, mandatory spe-
cific special controls related to software-only devices classified
under 21 CFR 892.2050, and therefore, the primary resource to
understand the legal requirements for performance data associated
with these devices is the comparative standard of substantial equiv-
alence as described in detail in the guidance document on the 510
(k) Program [4]. In contrast, several devices more recently classified
under the De Novo pathway have specific special controls that
manufacturers marketing such devices must adhere to.
Devices originally classified via the De Novo pathway often
include special controls defined in the CFR describing require-
ments for manufacturers of these devices. Devices that may imple-
ment machine learning that include software or software-only
devices must adhere to the special controls defined in the specific
regulations associated with the appropriate device class. The classi-
fication with the associated special controls is published with a
Federal Register notice and appears in the Electronic Code of
Federal Regulations (eCFR).5 A De Novo classification, including
any special control, is effective on the date the order letter is issued
granting the De Novo request [3]. For the specific examples cited
below, the De Novo submission (DEN number) is cited for classi-
fications that have not been published in CFR at the time of
writing, and the associated order with special controls may be
found by searching FDA’s De Novo database.6 Examples include:
• 21 CFR 870.2785 (DEN200019): Software for optical camera-
based measurement of pulse rate, heart rate, breathing rate,
and/or respiratory rate
• 21 CFR 870.2790 (DEN200038): Hardware and software for
optical camera-based measurement of pulse rate, heart rate,
breathing rate, and/or respiratory rate
5. https://www.ecfr.gov/cgi-bin/ECFR?page=browse
6. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/denovo.cfm
3.1 Study Objectives
The first consideration in study design is the objective of the study.
A general principle is that the study design should aim at generating
data to support what the ML algorithm claims to accomplish. The
required data are closely related to the intended use of the device,
including the target patient population. Important considerations
include the significance of information provided by an ML algo-
rithm to a healthcare decision, the state of the healthcare situation
or condition that the algorithm addresses, and how the ML algo-
rithm is intended to be integrated into the current standard of care.
Examples of study objectives for ML algorithms include standalone
performance characterization, standalone performance comparison
with another algorithm or device, performance characterization of
human users when equipped with the algorithm, and performance
comparison of human users with and without the algorithm.
3.2 Pilot and Pivotal Studies
For the purpose of this chapter, a pivotal study is defined as a definitive study in which evidence is gathered to support the safety and effectiveness evaluation of a medical device for its intended use.
A pivotal study is the key formal performance assessment of ML
devices in medical imaging, and the design of a pivotal study is often
the culmination of a significant amount of previous work. An often
overlooked, important step toward the design of a pivotal study is a
pilot (or exploratory) study. Pilot studies may include different
phases, including those that demonstrate the engineering proof of
concept, those that lead to a better understanding of the mechan-
isms involved, those that may lead to iterative improvements in
performance, and those that yield essential information for design-
ing a pivotal study. When a pilot study involves patients, sample size
is typically small, and data are often conveniently acquired rather
than representative of an intended population [8]. Such pilot stud-
ies provide information about the estimates of the effect size and
variance components that are critical for estimating the sample size
for a pivotal study. In addition, a pilot study can uncover basic issues
in data collection, including issues about missing or incomplete
data and poor imaging protocols. For pivotal studies that include
clinicians (typically radiologists or pathologists who interpret
images when equipped with the ML algorithm), a pilot study can
reveal poor reading protocols and poor reader training [8]. Run-
ning one or more pilot studies is therefore highly advisable prior to
the design of a pivotal study.
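Pilot estimates of effect size and variance feed directly into the sample-size calculation for a pivotal study. As a minimal sketch (not FDA guidance), the textbook normal-approximation formula for comparing two independent proportions, for example sensitivity with versus without an ML aid, can be coded as follows; the pilot values 0.80 and 0.88 are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided z-test comparing two
    independent proportions (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, e.g. 1.96
    z_b = NormalDist().inv_cdf(power)          # power term, e.g. 0.84
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Hypothetical pilot estimates: sensitivity 0.80 without the ML aid,
# 0.88 with it; roughly this many cases per arm would be needed,
# before any adjustment for pairing, clustering, or multiplicity.
n_per_group = sample_size_two_proportions(0.80, 0.88)
```

Note that a paired reader-study design would use a different variance term; this unpaired formula is shown only to illustrate how pilot estimates drive the pivotal sample size.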
3.3 Data Collection
An important prerequisite for a study that supports the claims of an ML algorithm is that the data collection process should allow independent future studies to replicate the conclusions drawn from this particular study. In this regard, the composition
and independence of training and test datasets and dataset repre-
sentativeness are central issues.
3.3.1 Training and Test Datasets
Training data are defined as the set of patient-related attributes (raw data, images, and other associated information) used for inferring a function between these attributes and the desired output for the
ML algorithm. During training, investigators may explore different
algorithm architectures for this function and fine-tune the para-
meters of a selected architecture. The algorithm designer can also
partition this data into different sets for preliminary
(or exploratory) performance analysis, utilizing, for example,
cross-validation techniques [9]. Typically, these cross-validation
results are used for further model development, model selection,
and hyperparameter tuning. In other words, cross-validation is
typically used as an informative step before the ML algorithm is
finalized. In many machine learning texts, a subset of data left out
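The workflow described above can be sketched in a toy form; the fold count and the index bookkeeping below are illustrative only (no actual model is fit), but the key point stands: cross-validation partitions the training data for model selection, while the test set stays untouched until the algorithm is finalized.

```python
import random

def kfold_indices(n_cases, k, seed=0):
    """Split case indices 0..n_cases-1 into k disjoint, shuffled folds."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Cross-validation happens entirely inside the training set; the
# sequestered test set is scored once, after model selection is done.
folds = kfold_indices(n_cases=100, k=5)
for val_fold in folds:
    train_idx = [i for fold in folds if fold is not val_fold for i in fold]
    # fit each candidate architecture / hyperparameter setting on
    # train_idx and score it on val_fold; pick the best on average
```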
3.3.2 Independence
A central principle in performance assessment is that the test dataset is required to be independent of the training dataset, meaning that
the data for the cases in the test set do not depend on the data for
the cases in the training set. It is well-known that the violation of
this principle results in optimistically biased performance estimates
[12]. To avoid this bias, developers typically set aside a dedicated
test data for performance estimation aimed to be independent of
the training dataset. There are subtle ways in which the indepen-
dence principle can be violated if the test dataset is not carefully
7. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/cfrsearch.cfm?fr=820.3
3.3.3 Representativeness
ML algorithms are data-driven, and the distributions of the training and test data have direct implications for algorithm performance
and its measurement. Ideally, training and test sets should be large
and representative enough so that the collected data provides a
good approximation to the true distribution in the target popula-
tion. As discussed above, well-characterized and annotated medical
imaging datasets are typically limited in size. When the dataset size
is a constraint, informativeness of a case to be selected for training
3.3.4 Dataset Mismatch
Dataset mismatch is defined as a condition where training and test
data follow different distributions, which is popularly known as
“dataset shift” in the ML literature [18]. We prefer using “mis-
match” because “shift” specifically refers to adding a constant value
to each member of a dataset in probability distribution theory,
which does not convey all types of mismatches that the term is
intended for. Dataset mismatch can also be between test data and
real-world deployment data (rather than test and training) or cur-
rent real-world data vs. future real-world data (e.g., due to changes
in clinical practice). There may be many potential reasons for data-
set mismatch, with sample selection bias and non-stationary envir-
onments cited as the most important ones [19]. Storkey [20]
grouped these mismatches into six main categories, including sam-
ple selection bias, imbalanced data, simple covariate shift, prior
probability shift, domain shift, and source component shift. Dataset
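One simple way to screen for covariate-type mismatch between two datasets is to compare the empirical distributions of a scalar acquisition or image feature (slice thickness, mean intensity, and so on). A minimal sketch using the two-sample Kolmogorov–Smirnov statistic follows; the feature values are hypothetical, and any decision threshold on the statistic would need justification.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical
    gap between the two empirical CDFs of a scalar feature."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# E.g., slice thickness (mm) in the training set vs. a deployment site;
# a large statistic flags a mismatch worth investigating.
d = ks_statistic([1.0, 1.0, 1.25, 1.25, 1.5], [2.5, 3.0, 3.0, 3.0, 5.0])
```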
3.4 Bias
Bias is a critical factor to consider in study design and analysis for
ML assessment, and here we intend to give an overview of sources
of bias in ML development and assessment. Note that the general
artificial intelligence and machine learning literature currently lacks
a consensus on the terminology regarding bias. We consider that
performance assessment of an ML system from a finite sample can
be cast as a statistical estimation problem. In statistics, a biased
estimator is one that provides estimates which are systematically
too high or too low [25]. Paralleling this definition, we define
statistical bias as a systematic difference between the average per-
formance estimate of an ML system tested in a specified manner and
its true performance on the intended population. This systematic
difference may result from flaws in any of the components of the
assessment framework shown in Fig. 1: collection of patient data
and the definition of a reference standard (for both algorithm
design and testing stages), algorithm training, analysis methods,
and algorithm deployment in the clinic.
Note that the definition of statistical bias above includes sys-
tematically different results for different subgroups. ISO/IEC
Draft International Standard 22989 (artificial intelligence con-
cepts and terminology) defines bias as systematic difference in
treatment of certain objects, people, or groups in comparison to
others, where treatment is any kind of action, including perception,
observation, representation, prediction, or decision. As such, sta-
tistical bias may result in the type of bias defined in the ISO/IEC
Draft International Standard.
We start our discussion of bias with the effect of the dataset
representativeness, which has direct implications for ML algorithm
performance and its measurement, as described above. When the
dataset is not representative of the target population, this can lead
to selection bias. For example, if all the images in the training or test
datasets are acquired with a particular type of scanner while the
target patient population may be scanned by many types of scan-
ners, this may lead to an ML algorithm performance estimate that is
systematically different from that on the intended population or
lead to different results for different subgroups. Spectrum bias,
4 Algorithm Documentation
4.1 Why Algorithm Documentation Is Important
Machine learning (ML) algorithms have been evolving from traditional techniques with hand-crafted features and interpretable statistical learning models to the more recent deep learning-based neural network models with drastically increased complexity.
Appropriate documentation of ML algorithms is critically impor-
tant for reproducibility and transparency from a scientific point of
view. Algorithm description with sufficient details is particularly
important in a regulatory setting reviewed by regulators for the
assessment of technical quality, for comparing with a legally mar-
keted device, and for the assessment of changes of the algorithm in
future versions.
Reproducibility is a well-known cornerstone of science; for
scientific findings to be valid and reliable, it is fundamentally impor-
tant that the experimental procedure is reproducible, whether the
experiments are conducted physically or in silico. ML studies for
detection, diagnosis, or other means of characterization of brain
disorders or other diseases are in silico experiments involving com-
plex algorithms and big data. As such, we adopt the definition of
reproducibility from a National Academies of Sciences, Engineer-
ing, and Medicine report [30] as “obtaining consistent results
using the same input data; computational steps, methods, and
code; and conditions of analysis.” It has been widely recognized
that poor documentation such as incomplete data annotation or
specification of data processing and analysis is a primary culprit for
poor reproducibility in many biomedical studies [31]. Lack of
reproducibility may result in not only inconvenience or inferior
quality but sometimes a flawed model that can bring real danger
to patients when such models are used to tailor treatments in drug
clinical trials, as reported by Baggerly and Coombes [32] in their
forensic bioinformatics study on a model of gene expression signa-
tures to predict patient response to multidrug regimens.
Appropriate documentation of algorithm design and develop-
ment is essential for the assessment of technical quality. Identifica-
tion of the various sources of bias discussed in Subheading 3.4 may
not be possible without appropriate algorithm documentation.
Furthermore, while there is currently no principled guidance on
the design of deep neural network architectures, consensus on
good practices and empirical evidence do provide basis for the
assessment of technical soundness of an ML algorithm. For exam-
ple, the choice of loss function is closely related to the clinical task:
mean squared error is appropriate for quantification tasks, cross-
entropy is often used for classification tasks, and so on. Moreover,
the design and optimization of algorithms involve trial-and-error
and ad hoc procedures to tune parameters; as such, a developer may
introduce bias even unconsciously if the use of patient data and
truth labels is not properly documented.
4.2 Essential Elements in Algorithm Description
Many efforts in academia have been devoted to developing checklists for ML algorithm development and reporting to enhance transparency, improve quality, and facilitate reproducibility. A
report from the NeurIPS 2019 Reproducibility Program [33]
provided a checklist for general machine learning research. Norgeot
et al. [34] presented the minimum information about clinical arti-
ficial intelligence modeling (MI-CLAIM) checklist as a tool to
improve transparency reporting of AI algorithms in medicine.
The journal Radiology published an editorial with a checklist for
artificial intelligence in medical imaging (CLAIM) [35] as a guide
for authors and reviewers. Similarly, an editorial of the journal
Medical Physics introduced a required checklist to ensure rigorous
and reproducible research of AI/ML in the field of medical physics
[36]. Consensus groups also published the SPIRIT-AI (Standard
Protocol Items: Recommendations for Interventional Trials–Artifi-
cial Intelligence) as guidelines for clinical trial protocols for inter-
ventions involving artificial intelligence [37]. Also, there are
ongoing efforts on guidelines for diagnostic and predictive AI
models such as the TRIPOD-ML (Transparent Reporting of a
Multivariable Prediction Model for Individual Prognosis or
Diagnosis–Machine Learning) [38] and STARD-AI (Standards
for Reporting of Diagnostic Accuracy Studies–Artificial
Intelligence).
Besides the abovementioned references, the FDA has published
a guidance document for premarket notification [510(k)] submis-
sions on computer-assisted detection devices applied to radiology
images and radiology device data [39]. Here we provide a list of key
elements in describing ML algorithms for medical imaging applica-
tions, which we believe are essential (but not necessarily complete)
for understanding and technical assessment of an ML algorithm.
• Input.
The types of data the algorithm takes as input may include
images and possibly non-imaging data. For input data, essential
information includes modality (e.g., CT, MRI, clinical data),
compatible acquisition systems (e.g., image scanner manufac-
turer and model), acquisition parameter ranges (e.g., kVp range,
slice thickness in CT imaging), and clinical data collection pro-
tocols (e.g., use of contrast agent, MRI sequence).
• Preprocessing.
Input images are often preprocessed such that they are in a
suitable form or orientation for further processing. Preproces-
sing often includes data normalization, which refers to calibra-
tion or transformation of image data to that of a reference image
(e.g., warping to the reference frame) or to certain numerical
range, e.g., slice thickness normalization. Other examples of
preprocessing include elimination of irrelevant structures such
as a head holder, image size normalization, image orientation
normalization, and so on. Sometimes an image quality checker is
applied to exclude data with severe artifacts or insufficient qual-
ity from further processing and analyses. It is important to
describe the specific techniques for normalization and image
quality checking. Furthermore, it is critical to make clear how
cases failing the quality check are handled clinically (e.g.,
re-imaging or reviewed by a physician) and account for the
excluded cases in the performance assessment.
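The normalization and quality-check steps above can be sketched in a toy form. The z-score scheme and the dynamic-range threshold below are illustrative assumptions, not a recommended protocol; as the text stresses, excluded cases must still be tracked and accounted for in the performance analysis.

```python
from statistics import mean, stdev

def zscore_normalize(intensities):
    """Map voxel intensities to zero mean and unit standard deviation."""
    m, s = mean(intensities), stdev(intensities)
    return [(v - m) / s for v in intensities]

def passes_quality_check(intensities, min_dynamic_range=10.0):
    """Toy quality checker: flag images with too little contrast.
    Cases that fail must be logged, not silently discarded."""
    return (max(intensities) - min(intensities)) >= min_dynamic_range

vox = [10.0, 12.0, 30.0, 48.0, 50.0]
normalized = zscore_normalize(vox) if passes_quality_check(vox) else None
```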
• Algorithm architecture.
Algorithm architecture is the core module of a machine learning
algorithm. In traditional ML techniques, hand-crafted features
that are often motivated by physician’s experiences are first
derived from medical images. A feature selection procedure can
be applied to the initially extracted features to select the most
useful features for the clinical task of interest. The selected
features are then merged by a classifier into a decision variable.
There are many choices of the classifier depending on the nature
of the data and the purpose of the classifier: linear or quadratic
discriminant analysis, k nearest neighbor (kNN) classifiers, arti-
ficial neural networks (ANNs), support vector machines, ran-
dom forests, etc. As such, the algorithm description typically
includes the definition of features, the feature selection meth-
ods, and the specific classification model. Moreover, it is impor-
tant to document hyperparameters and the method with which
these hyperparameters are determined, for example, the number
of neighbors in the kNN method, the number of layers in
ANNs, etc.
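As a concrete illustration of the kNN hyperparameter mentioned above, a bare-bones classifier makes the role of k explicit; the two-dimensional feature vectors and labels are toy data, and Euclidean distance is an assumption.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points nearest to `query`
    (squared Euclidean distance); k is the hyperparameter to document."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train_x[i], query)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two hand-crafted features per case, binary truth label:
features = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = [0, 0, 1, 1]
pred = knn_predict(features, labels, (5.5, 5.0), k=3)
```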
5 Reference Standard
8. https://www.accessdata.fda.gov/cdrh_docs/reviews/DEN170022.pdf
• Voxel true-positive rate (TPR), sensitivity: proportion of correctly segmented object voxels.
TPR = TP / (TP + FN)
• Voxel true-negative rate (TNR), specificity: proportion of cor-
rectly segmented background voxels.
TNR = TN / (TN + FP)
Fig. 2 Diagram of a segmented object (blue solid line) overlaid by the reference segmentation (red dashed
line). The false-positive (FP) voxels (green hashed region), true-positive (TP) voxels (yellow hashed region),
false-negative (FN) voxels (purple wave region), and true-negative (TN) voxels (gray region) are shown in the
figure as well
• Accuracy: proportion of correctly classified voxels.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Dice similarity coefficient (DSC), F1 metric [48]
DSC = 2TP / (2TP + FP + FN) = 2JI / (1 + JI)
• Jaccard index (JI), intersection over union (IoU) metric [49]
JI = TP / (TP + FP + FN) = DSC / (2 - DSC)
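The overlap metrics above can be computed directly from two flattened binary masks; a minimal sketch in pure Python (no imaging library assumed):

```python
def overlap_metrics(pred_mask, ref_mask):
    """Voxel counts and overlap indices for two equal-length binary
    masks (1 = object voxel, 0 = background voxel)."""
    tp = sum(1 for p, r in zip(pred_mask, ref_mask) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(pred_mask, ref_mask) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(pred_mask, ref_mask) if p == 0 and r == 1)
    dsc = 2 * tp / (2 * tp + fp + fn)   # Dice similarity coefficient
    ji = tp / (tp + fp + fn)            # Jaccard index (IoU)
    return dsc, ji

dsc, ji = overlap_metrics([1, 1, 0, 0], [1, 0, 1, 0])
# The identities above hold: DSC = 2*JI/(1+JI) and JI = DSC/(2-DSC).
```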
• True-positive rate (TPR), sensitivity
TPR = Se = TP / (TP + FN)
• True-negative rate (TNR), specificity
TNR = Sp = TN / (TN + FP)
• Positive likelihood ratio (LR+)
LR+ = TPR / (1 - TNR)
• Negative likelihood ratio (LR-)
LR- = (1 - TPR) / TNR
• Positive predictive value (PPV), precision
PPV = Precision = TP / (TP + FP)
• Negative predictive value (NPV)
NPV = TN / (TN + FN)
• Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
• F1 score
F1 = 2 × PPV × TPR / (PPV + TPR)
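All of the per-patient metrics above derive from the four confusion-matrix counts; a compact sketch (the counts themselves are hypothetical):

```python
def classification_metrics(tp, fp, fn, tn):
    """Summary metrics from confusion-matrix counts, following the
    definitions in the text."""
    se = tp / (tp + fn)                    # TPR, sensitivity
    sp = tn / (tn + fp)                    # TNR, specificity
    return {
        "Se": se,
        "Sp": sp,
        "LR+": se / (1 - sp),
        "LR-": (1 - se) / sp,
        "PPV": tp / (tp + fp),             # precision
        "NPV": tn / (tn + fn),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn), # equals 2*PPV*TPR/(PPV+TPR)
    }

m = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```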
Fig. 3 Plot of an ROC curve for an ML algorithm in a binary classification task. The ROC curve (blue line) shows the trade-off between sensitivity (TPR) and 1 - specificity (FPR) for all possible operating points. Both the AUC (includes both shaded regions) and the PAUC for sensitivity ≥0.8 are shown in the figure
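The AUC summarized in Fig. 3 has an equivalent rank interpretation: the probability that a randomly chosen diseased case receives a higher score than a randomly chosen non-diseased case. A minimal O(n²) sketch with hypothetical algorithm outputs (a partial-area analog would simply restrict the operating range being integrated):

```python
def empirical_auc(pos_scores, neg_scores):
    """Empirical AUC via the Mann-Whitney statistic; ties count 1/2."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores for diseased vs. non-diseased cases:
auc = empirical_auc([0.9, 0.8, 0.7, 0.4], [0.5, 0.3, 0.2, 0.1])
```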
Fig. 4 Plot of an FROC curve for an ML detection algorithm. The FROC curve (blue line) shows the trade-off
between object detection sensitivity (TPR) and the number of FPs per patient for all possible operating points.
The full area under the FROC curve is not well defined, but a partial area may facilitate comparisons across
algorithms. The figure shows a PAUC (shaded region) for ≤3.0 FPs/patient. However, AFROC-based summary
metrics are more commonly used for characterizing/comparing FROC performance
Fig. 5 Plot of an AFROC curve for the same ML detection algorithm given in Fig. 4. The AFROC curve (blue line)
shows the trade-off between sensitivity (TPR) and the false patient fraction (fraction of patients with at least
one FP) for all possible operating points. The area under the AFROC curve (shaded region) is often used to
facilitate comparisons across object detection algorithms
Fig. 6 Plot of a P–R curve for the same ML object detection algorithm as in Figs. 4 and 5. The P–R curve shows the trade-off in precision (PPV) as a function of recall (TPF). The area under the P–R curve (AUCPR) is an aggregate summary metric for characterizing and comparing P–R curves across object detection algorithms
P_interp(R_n) = max_{R ≥ R_(n+1)} P(R)
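The formula above can be read as: the interpolated precision at a given recall level is the maximum precision attained at that recall or any higher recall. A sketch over a precision list ordered by increasing recall (the values are illustrative):

```python
def interpolate_precision(precisions):
    """Make a precision sequence (ordered by increasing recall)
    monotonically non-increasing by sweeping from high recall to low
    and keeping the running maximum."""
    interpolated = []
    best = 0.0
    for p in reversed(precisions):
        best = max(best, p)
        interpolated.append(best)
    interpolated.reverse()
    return interpolated

smoothed = interpolate_precision([0.9, 0.7, 0.8, 0.5])
```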
6.6 Modifications and Continuous Learning
One of the potential advantages of ML is its ability to quickly learn from new data such that it can remain current with changing patient demographics, clinical practice, and image acquisition technologies. This ability may result in large numbers of updates to an ML
algorithm after it becomes available for clinical use. However, each
update requires a systematic assessment. Modifications can range
from infrequent algorithm updates all the way to continuously
learning ML that adapts or learns from real-world experience/
data on a continuous basis. This presents a challenge to both ML
developers and regulatory bodies such as the FDA.
FDA’s traditional paradigm of regulating ML devices is not
designed for adaptive technologies, which adapt and optimize per-
formance on a rapid timescale. With this in mind, the FDA is
exploring a new, total product life cycle (TPLC) regulatory
9. https://www.fda.gov/media/106331/download
10. Software, Medical Image Perception Laboratory, Department of Radiology, University of Iowa: https://perception.lab.uiowa.edu/software-0
11. iMRMC: software to do multi-reader multi-case analysis of reader studies: https://github.com/DIDSR/iMRMC
Fig. 7 Illustration of the fully crossed design and the paired split-plot design. The gridded squares can be understood as the data matrix collected in the reader study, with each row representing a case, each column representing a reader, and each element representing the rating of the case by the reader
number of readers and cases, but the PSP design can be more
powerful given the same number of image interpretations, at the price of collecting extra cases [83].
The design of an MRMC reader study involves many considerations, including patient data collection (see Subheading 3),
establishment of a reference standard (see Subheading 5), and many
other aspects such as the recruitment and training of readers and
8 Statistical Analysis
H0: AUC(with ML) = AUC(without ML);  H1: AUC(with ML) > AUC(without ML).
And the secondary hypotheses may be stated as
H0: Sp(with ML) = Sp(without ML);  H1: Sp(with ML) > Sp(without ML).
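For the standalone (single-reading) version of the primary hypothesis, a case-level percentile bootstrap gives a simple interval for the AUC difference. This sketch uses synthetic scores and a fixed seed, and it is not a substitute for the MRMC variance analysis that reader studies require (see, e.g., the iMRMC software cited above).

```python
import random

def empirical_auc(scores, labels):
    """Mann-Whitney AUC from case scores and binary truth labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_diff_ci(with_ml, without_ml, labels,
                          n_boot=2000, seed=0):
    """Percentile CI for AUC(with ML) - AUC(without ML), resampling
    whole cases so the pairing between conditions is preserved."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:          # need both classes to define AUC
            continue
        diffs.append(empirical_auc([with_ml[i] for i in idx], y)
                     - empirical_auc([without_ml[i] for i in idx], y))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Synthetic example: the with-ML scores separate the classes better.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
with_ml = [0.9, 0.8, 0.85, 0.7, 0.6, 0.2, 0.3, 0.1, 0.4, 0.35]
without_ml = [0.7, 0.4, 0.6, 0.5, 0.3, 0.45, 0.35, 0.2, 0.55, 0.5]
lo, hi = bootstrap_auc_diff_ci(with_ml, without_ml, labels, n_boot=500)
```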
9 Summary Remarks
Acknowledgments
The authors thank Drs. Robert Ochs, Alexej Gossmann, and Aldo
Badano for carefully reviewing the manuscript and providing help-
ful comments.
References
1. Sahiner B, Pezeshk A, Hadjiiski LM, Wang X, Drukker K, Cha KH, Summers RM, Giger ML (2019) Deep learning in medical imaging and radiation therapy. Med Phys 46(1):e1–e36. https://doi.org/10.1002/mp.13264
2. Lui YW, Chang PD, Zaharchuk G, Barboriak DP, Flanders AE, Wintermark M, Hess CP, Filippi CG (2020) Artificial intelligence in neuroradiology: current status and future directions. Am J Neuroradiol 41(8):E52–E59. https://doi.org/10.3174/ajnr.A6681
3. U.S. Food and Drug Administration (2017) De Novo classification process (evaluation of automatic Class III designation). Guidance for Industry and Food and Drug Administration Staff
4. U.S. Food and Drug Administration (2014) The 510(k) program: evaluating substantial equivalence in premarket notifications [510(k)]. Guidance for Industry and Food and Drug Administration Staff
5. U.S. Food and Drug Administration (2012) Factors to consider when making benefit-risk determinations in medical device premarket approval and De Novo classifications. Guidance for Industry and Food and Drug Administration Staff
6. U.S. Food and Drug Administration (2018) Benefit-risk factors to consider when determining substantial equivalence in premarket notifications (510(k)) with different technological characteristics. Guidance for Industry and Food and Drug Administration Staff
7. U.S. Food and Drug Administration (2021) Requests for feedback and meetings for medical device submissions: the Q-submission program. Guidance for Industry and Food and Drug Administration Staff
8. Gallas BD, Chan HP, D'Orsi CJ, Dodd LE, Giger ML, Gur D, Krupinski EA, Metz CE, Myers KJ, Obuchowski NA, Sahiner B, Toledano AY, Zuley ML (2012) Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Acad Radiol 19(4):463–477. https://doi.org/10.1016/j.acra.2011.12.016
9. Hastie T, Tibshirani R, Friedman J (2017) The elements of statistical learning. Springer series in statistics, 2nd edn (corrected 12th printing). Springer, New York
10. Chan H-P, Sahiner B, Wagner RF, Petrick N (1999) Classifier design for computer-aided diagnosis: effects of finite sample size on the mean performance of classical and neural network classifiers. Med Phys 26(12):2654–2668
11. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, Venugopalan S, Widner K, Madams T, Cuadros J, Kim R, Raman R, Nelson PC, Mega JL, Webster R (2016) Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316(22):2402–2410. https://doi.org/10.1001/jama.2016.17216
12. Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic Press, New York
13. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 162(1):W1–W73. https://doi.org/10.7326/m14-0698
14. Du B, Wang Z, Zhang L, Zhang L, Liu W, Shen J, Tao D (2019) Exploring representativeness and informativeness for active learning. arXiv:1904.06685
15. Huang S, Jin R, Zhou Z (2014) Active learning by querying informative and representative examples. IEEE Trans Pattern Anal Mach Intell 36:1936–1949
16. Sharma D, Shanis Z, Reddy CK, Gerber S, Enquobahrie A (2019) Active learning technique for multimodal brain tumor segmentation using limited labeled images. In: Wang Q, Milletari F, Nguyen HV et al (eds) Domain adaptation and representation transfer and medical image learning with less labels and imperfect data. Springer International Publishing, Cham, pp 148–156
17. Hao R, Namdar K, Liu L, Khalvati F (2021) A transfer learning–based active learning framework for brain tumor classification. Front Artif Intell 4(61):635766. https://doi.org/10.3389/frai.2021.635766
18. Quionero-Candela J, Sugiyama M, Schwaighofer A, Lawrence N (2009) Dataset shift in machine learning. MIT Press, Cambridge, MA
19. Moreno-Torres JG, Raeder T, Alaiz-Rodriguez R, Chawla NV, Herrera F (2012) A unifying view on dataset shift in classification. Pattern Recogn 45(1):521–530. https://doi.org/10.1016/j.patcog.2011.06.019
20. Storkey A (2009) When training and test sets are different: characterizing learning transfer. In: Quionero-Candela J, Sugiyama M, Schwaighofer A, Lawrence N (eds) Dataset shift in machine learning. MIT Press, Cambridge, MA, pp 3–28
21. Goldenberg I, Webb G (2019) Survey of distance measures for quantifying concept drift and shift in numeric data. Knowl Inf Syst 60:591–615. https://doi.org/10.1007/s10115-018-1257-z
22. Rabanser S, Günnemann S, Lipton ZC (2018) Failing loudly: an empirical study of methods for detecting dataset shift. arXiv:1810.11953
23. Dockès J, Varoquaux G, Poline J-B (2021) Preventing dataset shift from breaking machine-learning biomarkers. arXiv:2107.09947
24. Turhan B (2012) On the dataset shift problem in software engineering prediction models. Empir Softw Eng 17(1):62–74. https://doi.org/10.1007/s10664-011-9182-8
25. U.S. Food and Drug Administration (2007) Guidance for industry and FDA staff: statistical guidance on reporting results from studies evaluating diagnostic tests. U.S. Food and Drug Administration, Silver Spring
26. Zhou XH, Obuchowski NA, McClish DK (2002) Statistical methods in diagnostic medicine. Wiley
27. Suresh H, Guttag JV (2021) A framework for understanding sources of harm throughout the machine learning life cycle. arXiv:1901.10002 [cs, stat]
28. Hooker S (2021) Moving beyond "algorithmic bias is a data problem". Patterns 2(4):100241. https://doi.org/10.1016/j.patter.2021.100241
29. Guo LL, Pfohl SR, Fries J, Posada J, Fleming SL, Aftandilian C, Shah N, Sung L (2021) Systematic review of approaches to preserve machine learning performance in the presence of temporal dataset shift in clinical medicine. Appl Clin Inform 12(4):808–815. https://doi.org/10.1055/s-0041-1735184
30. National Academies of Sciences, Engineering, and Medicine (2019) Reproducibility and replicability in science. The National Academies Press, Washington, DC. https://doi.org/10.17226/25303
31. Ioannidis JPA, Allison DB, Ball CA, Coulibaly I, Cui X, Culhane AC, Falchi M, Furlanello C, Game L, Jurman G, Mangion J, Mehta T, Nitzberg M, Page GP, Petretto E, van Noort V (2009) Repeatability of published microarray gene expression analyses. Nat Genet 41(2):149–155. https://doi.org/10.1038/ng.295
32. Baggerly KA, Coombes KR (2009) Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology. Ann Appl Stat 3(4):1309–1334
33. Pineau J, Vincent-Lamarre P, Sinha K, Larivière V, Beygelzimer A, d'Alché-Buc F, Fox E, Larochelle H (2020) Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). arXiv:2003.12206
34. Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, Arnaout R, Kohane IS, Saria S, Topol E, Obermeyer Z, Yu B, Butte AJ (2020) Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med 26(9):1320–1324. https://doi.org/10.1038/s41591-020-1041-y
35. Mongan J, Moy L, Charles E, Kahn J (2020) Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell 2(2):e200029. https://doi.org/10.1148/ryai.2020200029
36. El Naqa I, Boone JM, Benedict SH, Goodsitt MM, Chan HP, Drukker K, Hadjiiski L, Ruan D, Sahiner B (2021) AI in medical physics: guidelines for publication. Med Phys 48(9):4711–4714. https://doi.org/10.1002/mp.15170
37. Cruz Rivera S, Liu X, Chan A-W, Denniston AK, Calvert MJ, Darzi A, Holmes C, Yau C, Moher D, Ashrafian H, Deeks JJ, Ferrante di Ruffano L, Faes L, Keane PA, Vollmer SJ, Lee AY, Jonas A, Esteva A, Beam AL, Panico MB, Lee CS, Haug C, Kelly CJ, Yau C, Mulrow C, Espinoza C, Fletcher J, Moher D, Paltoo D, Manna E, Price G, Collins GS, Harvey H, Matcham J, Monteiro J, ElZarrad MK, Ferrante di Ruffano L, Oakden-Rayner L, McCradden M, Keane PA, Savage R, Golub R, Sarkar R, Rowley S, the SPIRIT-AI and CONSORT-AI Working Group
750 Weijie Chen et al.
Regulatory Science Perspective on Performance Assessment 751
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 24
Abstract
Recent advances in technology have made it possible to quantify fine-grained individual differences at many levels, such as the genetic, genomic, organ, behavioral, and clinical levels. The wealth of data becoming available holds great promise for research on brain disorders as well as on normal brain function. To name a few opportunities: the systematic and agnostic study of disease risk factors (e.g., genetic variants, brain regions), the use of natural experiments (e.g., evaluating the effect of a genetic variant in a human population), and the unveiling of disease mechanisms across several biological levels (e.g., genetics, cellular gene expression, organ structure and function). However, this data revolution also raises many challenges, such as data sharing and management, the need for novel analysis methods and software, and storage and computing.
Here, we sought to provide an overview of some of the main existing human datasets, all accessible to researchers. Our list is far from exhaustive; our objective is to publicize data sharing initiatives and to help researchers find new data sources.
Key words Genetic, Methylation, Gene expression, Brain MRI, PET, EEG/MEG, Omics, Electronic
health records, Wearables
1 Aims
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_24,
© The Author(s) 2023
753
754 Baptiste Couvy-Duchesne et al.
2 Introduction
Table 1

| Data type | File format | Content | Approx. size per sample |
|---|---|---|---|
| Whole genome sequencing | .cram, .bcf, .vcf | Curated using the TOPMed variant calling algorithm; all freezes are fully processed | ~5 MB |
| Methylation | .idat | Raw binary intensity files (two per sample, one each for the green and red channels) | ~16 MB (450K); ~27 MB (EPIC) |
| Methylation | .txt or .csv | Processed, normalized, and filtered beta matrix | ~8 MB (450K); ~14 MB (EPIC) |
| Gene expression | .bed (.bim + .bed + .fam) | Fully processed, filtered, and normalized gene expression matrices (in BED format) for each tissue | ~1 MB per individual |
| Gene expression | text | Covariates used in eQTL analysis, including genotyping principal components and PEER factors | ~6 KB |
| Actigraphy (smartphone and sensors) | Varies with product (e.g., GENEActiv uses .bin) | Raw accelerometer data | ~500 MB (GENEActiv, 14 days of recording) |

The size of one sample may vary based on the technology used to generate the data (e.g., MRI resolution, genotyping chip).
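The per-sample sizes in the table above lend themselves to quick back-of-the-envelope storage planning. A minimal sketch in Python: the sizes are taken from the table, while the cohort size, modality names, and helper function are made-up illustrations, not an official schema.

```python
# Approximate per-sample sizes from the table above, in megabytes.
# (Keys are illustrative names chosen for this sketch.)
SIZE_MB = {
    "wgs_processed": 5,            # .cram/.bcf/.vcf, fully processed freeze
    "methylation_idat_450k": 16,   # raw green/red intensity files, 450K chip
    "methylation_beta_450k": 8,    # normalized, filtered beta matrix
    "expression_bed": 1,           # processed expression matrix, per tissue
    "actigraphy_geneactiv": 500,   # raw accelerometer data, 14 days
}

def cohort_storage_gb(n_samples, modalities):
    """Rough total storage (in GB) for a cohort, using the table's estimates."""
    total_mb = n_samples * sum(SIZE_MB[m] for m in modalities)
    return total_mb / 1024  # 1 GB = 1024 MB

# Example: 1000 participants with raw methylation and actigraphy data.
print(round(cohort_storage_gb(1000, ["methylation_idat_450k",
                                     "actigraphy_geneactiv"]), 1))
```

At these rates, raw actigraphy dominates: a single GENEActiv recording is two orders of magnitude larger than a processed methylation matrix, which is worth knowing before requesting raw rather than processed files.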
Main Existing Datasets 757
Table 2
For each dataset: sample size and age range; country; study design; clinical, imaging, electrophysiology (EEG/MEG), genetic, epigenetic, wearable, and other omics data; access; and notes.

Alzheimer's Disease Neuroimaging Initiative (ADNI) [11, 191]: N = 2215; age 55–90; USA and Canada; longitudinal (up to 9 years of follow-up, ongoing). Clinical: neurological (Alzheimer's), cognition, lumbar puncture. Imaging: MRI (T1w, T2, DWI, rsfMRI), PET (18F-FDG, FBB, AV45, PiB). Genetics: genotyping, WGS (subset of 653 individuals, 3 time points). Epigenetics: methylation. Other omics: transcriptomics, CSF proteomics, metabolomics, lipidomics. Access: http://adni.loni.usc.edu/data-samples/access-data/. Notes: four waves (ADNI1, 2, GO, 3) with different inclusion protocols.

Australian Imaging Biomarkers and Lifestyle Study of Aging (AIBL) [12]: N = 726; age 60+; Australia; longitudinal (up to 6 years of follow-up). Clinical: neurological (Alzheimer's), cognition, lumbar puncture. Imaging: MRI (T1w, T2, DWI, rsfMRI), PET (PiB, AV45, Flute). Genetics: genotyping. Epigenetics: methylation (10% of sample). Wearables: ActiGraph activity. Access: https://ida.loni.usc.edu/collaboration/access/appLicense.jsp. Notes: neuroimaging and selected clinical data available via the LONI platform; full sample (N ~ 1200), including extended clinical, genotyping, or methylation data, available via application to CSIRO (https://aibl.csiro.au/).

Open Access Series of Imaging Studies v3 (OASIS3) [13]: N = 1096; age 42–95; USA; longitudinal (up to 12 years of follow-up). Clinical: neurological (Alzheimer's), cognition. Imaging: MRI (T1w, T2w, FLAIR, ASL, SWI, time of flight, rsfMRI, DWI), PET (PIB, AV45, FDG). Access: https://www.oasis-brains.org and https://www.oasis-brains.org/#access. Notes: retrospective dataset from imaging projects collected by WUSTL Knight ADRC over 30 years; two other (non-independent) datasets available: OASIS1 and OASIS2.

Adolescent Brain Cognitive Development (ABCD) [16, 17]: N ~ 11,878; age 9–12; USA; longitudinal (up to 2 years of follow-up, ongoing; objective of 10 years of follow-up). Clinical: self- and parental rating; substance use, mental health (psychiatry), cognition, physical health. Imaging: MRI (T1w, T2, rsfMRI, tfMRI). Genetics: genotyping, pedigree (twinning). Other: iPAD tasks and testing. Access: https://abcdstudy.org and https://nda.nih.gov/abcd/request-access.

Enhancing NeuroImaging Genetics through Meta-analysis (ENIGMA) [27, 112, 114]: N > 50,000 individuals, incl. 9572 (SCZ), 6503 (BD), 10,105 (MDD), 1868 (PTSD), 3240 (SUD), 3665 (OCD), 4180 (ADHD), 18,605 (life span); age 3–90 (no restriction); worldwide (43+ countries); cross-sectional and longitudinal. Clinical: psychiatry, neurology, addiction, suicidality, brain injury, HIV, antisocial behavior. Imaging: T1w, DWI, rsfMRI, tfMRI. EEG: resting-state EEG. Genetics: genotyping, CNVs, pedigree (twinning). Epigenetics: methylation. Access: https://enigma.ini.usc.edu/join/ and http://enigma.ini.usc.edu/research/download-enigma-gwas-results/. Notes: consortium organized around genetics and/or disease/trait working groups, as well as non-clinical working groups with a focus on sex, healthy aging, plasticity, etc.; imaging and genetic protocols and genome-wide association statistics are available for download on the ENIGMA website.

ChiNese Brain PET Template (CNPET) [39]: N = 116; age 27–81; China; cross-sectional; healthy subjects. Imaging: PET (18F-FDG). Access: https://www.nitrc.org/doi/landing_page.php?table=groups&id=1486&doi= and https://www.nitrc.org/projects/cnpet. Notes: dataset made to build a Chinese-specific SPM PET brain template.

Centre for Integrated Molecular Brain Imaging (CIMBI) [38]: N ~ 2000; age 17–93; Denmark; cross-sectional. Clinical: mental and physical state, personality, and background; neuropsychological measures (memory, language, etc.). Imaging: MRI (T1w, T2w, DWI, fMRI), PET (11C-5-HT). Genetics: genetic polymorphisms relevant for the 5-HT system. Access: https://www.cimbi.dk/index.php. Notes: data sharing only within the European Union (EU); visiting access for non-EU researchers.

Motor Imagery dataset from Cho et al. [192]: N = 52; age 24.8 (SD = 3.86) years; South Korea; cross-sectional; healthy subjects. EEG (64 channels); real hand movements, motor imagination hand movement tasks. Access: http://moabb.neurotechx.com/docs/generated/moabb.datasets.Cho2017.html#moabb.datasets.Cho2017. Notes: BCI experiments.

EEG Alpha Waves dataset [193]: N = 20; age 19–44 years; France; cross-sectional; healthy subjects. EEG (16 channels); resting state, eyes open/closed. Access: https://zenodo.org/record/2605110#.YTeT5Z4zZhE. Notes: alpha waves dataset; BCI experiments.

Multitaper spectra recorded during GABAergic anesthetic unconsciousness [194]: N = 55; USA; cross-sectional; healthy participants and patients receiving anesthesia care in an operating room context. EEG (64 channels for healthy volunteers, 6 for patients under anesthesia). Access: https://physionet.org/content/eeg-power-anesthesia/1.0.0/. Notes: patients under anesthesia during surgery.

A multi-subject, multi-modal human neuroimaging dataset [195]: N = 19 (16 for MEG and EEG, 3 for fMRI); age 23–37 years; UK; longitudinal (2 visits over 3 months); healthy subjects. Imaging: fMRI. MEG, EEG (70 channels); resting state, sensory–motor tasks. Access: https://legacy.openfmri.org/dataset/ds000117/. Notes: OpenfMRI database, accession number ds000117.

Motor imagery, uncued classifier application [196]: N = 10; age 26–46 years; Germany; cross-sectional; healthy subjects. EEG (59 channels); hands, feet, and tongue motor imagination tasks. Access: http://www.bbci.de/competition/iv/desc_1.html. Notes: data used for a BCI competition; also includes simulated/synthetic data.

BCI Competition 2008—Graz data set A [197]: N = 9; Austria; cross-sectional; healthy subjects. EEG (22 channels); hands, feet, and tongue motor imagination tasks. Access: http://www.bbci.de/competition/iv/desc_2a.pdf. Notes: data used for a BCI competition.

BCI Competition 2008—Graz data set B [197]: N = 9; Austria; cross-sectional; healthy subjects. EEG (3 channels); hands motor imagination tasks. Access: http://www.bbci.de/competition/iv/desc_2b.pdf. Notes: data used for a BCI competition.

EEG dataset from Mumtaz et al. [198]: N = 64 (34 MDD, 30 healthy); age 40.3 ± 12.9 years; Malaysia; longitudinal (multiple visits to the clinic); case control for major depressive disorder. EEG (19 channels). Access: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0171409. Notes: EEG (resting task); MDD based on medical history of patients.

EEG data for ADHD/control children [199]: N = 121; age 7–12 years; South Korea; cross-sectional; ADHD (61 children) and healthy controls. EEG (19 channels); visual attention tasks. Access: https://ieee-dataport.org/open-access/eeg-data-adhd-control-children. Notes: visual cognitive tests on children with ADHD, using videos.

EEG and EOG data from Jaramillo-Gonzalez et al. [200]: N = 4; Germany; longitudinal (2 to 10 visits); amyotrophic lateral sclerosis, locked-in state. EEG (116 channels), electrooculography (EOG). Access: https://doi.org/10.6084/m9.figshare.13148762. Notes: spelling task with eye movement-based answers.

Temple University Hospital (TUH) EEG Corpus [201]: N = 10,874; age 1–90; USA; cross-sectional; epilepsy, stroke, concussion, healthy. EEG (20–41 channels). Access: https://isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml. Notes: extensive dataset of 30K clinical EEG recordings collected at TUH from 2002 to today; epilepsy subset: TUEP dataset.

Bern–Barcelona EEG database [202]: N = 5; Spain; longitudinal; epilepsy. EEG (64 channels); rest. Access: https://repositori.upf.edu/handle/10230/42829. Notes: intracranial EEG samples before and after surgery.

CHB-MIT Scalp EEG Database [203]: N = 22; age 1.5–22; USA; longitudinal; seizures and intractable seizures. EEG (21 channels); resting state. Access: https://www.physionet.org/content/chbmit/1.0.0/. Notes: pediatric subjects with intractable seizures.

Brain/Neural–Computer Interaction (BNCI) Horizon [204]: N = 2; Austria; longitudinal; chronic stroke. EEG (2 channels); eye staring task. Access: http://bnci-horizon-2020.eu/database/data-sets. Notes: the BNCI is an open-source project with several datasets; stroke is "6. SCP training in stroke (006-2014)".

Queensland Twin Adolescent Brain (QTAB) [205]: N = 422 (baseline); age 9–14 (baseline); Australia; longitudinal; population sample. Clinical: parental and/or self-report mental health, cognition, and social behavior measures. Imaging: MRI (T1w, T2w, FLAIR, DWI, rsfMRI, tfMRI, ASL). Genetics: pedigree (twin/sibling), genotyping. Wearables: wrist-worn accelerometer. Other omics: gut microbiome. Access: https://imaginggenomics.net.au/. Notes: includes participants from the Queensland Twin Registry and Twins Research Australia.

Queensland Twin Imaging Study (QTIM) [206]: N = 1200+; age 12–30; Australia; cross-sectional; population sample (as part of BATS). Clinical: self-report mental health, cognition, substance use, and personality measures. Imaging: MRI (T1w, DWI, rsfMRI, tfMRI). Genetics: pedigree (twin/sibling), genotyping. Access: https://imaginggenomics.net.au/. Notes: includes participants from the Brisbane adolescent twin study (BATS).

Human Connectome Project, young adults (HCP-YA) [29]: N ~ 1200; age 22–35; USA; cross-sectional; population sample. Clinical: self-report mental health, cognition, personality, and substance use measures. Imaging: MRI (T1w, T2w, DWI, rsfMRI, tfMRI), MEG (n = 95). Genetics: pedigree (twin/sibling), genotyping. Access: https://db.humanconnectome.org. Notes: expanded to development and aging projects (similar imaging protocols, but the extensions do not include twins).

Vietnam Era Twin Study of Aging (VETSA) [83]: N = 1237 (baseline; 500+ with MRI); age 51–60 (baseline); USA; longitudinal; US veterans, males only. Imaging: MRI (T1w, DWI, ASL (subset of sample)). Genetics: pedigree (twin), genotyping. Access: https://medschool.ucsd.edu/som/psychiatry/research/VETSA/Researchers/Pages/default.aspx. Notes: subset of the Vietnam Era Twin Registry, males only.

Older Australian Twins Study (OATS) [84]: N = 623 (baseline); age >65; Australia; longitudinal; population sample. Clinical: self-report medical and mental health and neuropsychological measures. Imaging: MRI (T1w, DWI, tfMRI); PET. Genetics: pedigree (twin/sibling), genotyping. Access: https://cheba.unsw.edu.au/research-projects/older-australian-twins-study. Notes: recruited through the Australian Twin Registry.

Swedish Twin Registry (STR) [74]: N = 87,000 twin pairs; all ages; Sweden; longitudinal (since the late 1950s). Genetics: genome-wide single nucleotide polymorphism array genotyping. Epigenetics: methylation is not yet available in the STR but is available in the Swedish adoption/twin study of aging (SATSA), a sub-study of the STR. Access: https://ki.se/en/research/the-swedish-twin-registry. Notes: twin registry.

UK Biobank (UKB) [14, 101, 186]: N ~ 502,000 (age 37–73); imaging subset ~50,000 (age 44–82); actigraphy subset ~100,000 (age 40–69); UK; longitudinal. Clinical: self-reported and EHR medical history (incl. cancer, neurology, COVID-19). Imaging: MRI (T1w, T2w, FLAIR, DWI, SWI, rsfMRI, tfMRI). Genetics: genotyping, exome, WGS, pedigree. Wearables: actigraphy. Other omics: metabolomics (future release). Access: https://bbams.ndph.ox.ac.uk/ams/resApplications. Notes: population-based sample (volunteers) with healthy bias; ongoing resource and data collection; target MRI sample 100K, retest 10K. Also available: cardiac and abdominal MRI, whole-body DXA (dual-energy X-ray absorptiometry), carotid ultrasound, electrocardiogram, hematological assays, serological antibody response assay, telomere length.

Avon Longitudinal Study of Parents and Children (ALSPAC): Accessible Resource for Integrated Epigenomic Studies (ARIES) [207]: N = 2044; average age: mothers 28.7 (antenatal) and 47.5 (follow-up), offspring 40 weeks (birth), 7.5 (childhood), and 17.1 (adolescence); UK; longitudinal (2 time points for mothers, 3 for offspring). Clinical: clinical evaluations, obstetric data, cognition, questionnaires. Imaging: MRI (T1w, DWI, mcDESPOT (subset of offspring cohort)). Genetics: genotyping. Epigenetics: methylation. Other omics: transcriptomics. Access: http://www.ariesepigenomics.org.uk/ and https://github.com/MRCIEU/aries. Notes: general population study (health and development) following 1022 mother–offspring pairs.

Biobank-based Integrative Omics Studies (the BIOS Consortium) [208]: N ~ 4000; age 18–87; Netherlands; longitudinal. Clinical: clinical information and bioassays (depending on the sub-cohort). Genetics: genotyping, pedigree. Epigenetics: methylation. Other omics: transcriptomics, metabolomics. Access: https://www.bbmri.nl/node/24. Notes: population study including various sub-cohorts with differing research designs (LifeLines, Leiden longevity study, Netherlands twin registry, Rotterdam study, CODAM, and the prospective ALS study Netherlands); access via the European genome-phenome archive (EGA) and the SURFsara high-performance computing cloud.

Framingham Heart Study [209, 210]: N = 4241; offspring cohort mean age 66, third-generation cohort 45; USA; cross-sectional (multi-generational). Clinical: extensive clinical evaluations and bioassays, original focus on cardiovascular diseases. Imaging: MRI (T2w). Genetics: genotyping, WGS. Epigenetics: methylation. Other omics: transcriptomics, metabolomics. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000724.v9.p13. Notes: data collected over two generations of individuals; cardiovascular MRI; a subset of individuals is used to identify rare variants influencing brain imaging phenotypes; data included in NHLBI TOPMed.

Women's Health Initiative [211]: N = 2129; age 50–79; USA; cross-sectional. Clinical: clinical evaluations and bioassays; cardiovascular diseases. Genetics: genotyping. Epigenetics: methylation. Other omics: transcriptomics, metabolomics, miRNA. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001335.v2.p3. Notes: women only.

Sporadic ALS Australia Systems Genomics Consortium (SALSA-SGC) [212]: N = 1395; predicted mean age (from DNA methylation) 60.4 (controls), 62.9 (cases); Australia; cross-sectional; case control of amyotrophic lateral sclerosis. Genetics: genotyping. Epigenetics: methylation. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002068.v1.p1.

System Genomics of Parkinson's Disease (SGPD) [213]: N = 2333; predicted mean age (from DNA methylation) 70.05, range 26–93; Australia and New Zealand; cross-sectional; case control of Parkinson's disease. Genetics: genotyping. Epigenetics: methylation. Access: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145361.

Genetics of DNA Methylation Consortium (goDMC) [79]: N = 32,851; age 0–91; worldwide; cross-sectional; multiple diseases, but also healthy aging cohorts. Genetics: genotyping. Epigenetics: methylation. Access: http://www.godmc.org.uk/cohorts.html. Notes: consortium gathering 38 independent studies; data access to be obtained from each subsample.

Psychiatric Genomics Consortium (PGC) [214, 215]: by 2025, ~2.5 million cases of psychiatric disorders; all ages; worldwide; cross-sectional. Clinical: psychiatric disorders, substance use disorders, and neurology: major depressive disorder, cannabis use disorder, alcohol use disorder, schizophrenia, anorexia, bipolar disorder, ADHD, Alzheimer's. Genetics: genotyping and sequencing data available in some working groups. Other omics: expression and methylation. Access: https://www.med.unc.edu/pgc/shared-methods/open-source-philosophy/. Notes: consortium organized around disease/trait working groups.

The genetic epidemiology of asthma in Costa Rica [216]: N = 4347; age 6–12; Costa Rica; cross-sectional; asthma cases and controls. Genetics: WGS. Access: https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA295246. Notes: a subset of individuals from this sample/accession is used in pharmacogenetic drug response in racially diverse children with asthma; data included in NHLBI TOPMed.

Coronary Artery Risk Development in Young Adults (CARDIA) [217]: N = 3425; age 18–30; USA; longitudinal; coronary artery risk. Genetics: WGS. Access: https://anvilproject.org/ncpi/data/studies/phs001612. Notes: a subset of 3087 self-identified Black and White participants from the longitudinal CARDIA study was used to study a multi-ethnic polygenic risk score associated with hypertension prevalence and progression.

Genetic Epidemiology Network of Arteriopathy (GENOA) [217]: N = 1854; age >60; USA; longitudinal; elucidates the genetics of target organ complications of hypertension. Genetics: WGS. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001345.v3.p1. Notes: used to study a multi-ethnic polygenic risk score associated with hypertension prevalence and progression.

Hispanic Community Health Study/Study of Latinos (HCHS/SOL) [217]: N = 8093; age 18–74; USA; longitudinal; a multicenter prospective cohort study of asthma patients. Genetics: WGS. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001395.v1.p1. Notes: used to study a multi-ethnic polygenic risk score associated with hypertension prevalence and progression.

Women's Health Initiative (WHI) [217]: N = 11,357; age >65; USA; longitudinal; Women's Health Initiative cohort study on ischemic stroke, with 900 cases of hemorrhagic stroke. Genetics: WGS. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000200.v12.p3. Notes: used to study a multi-ethnic polygenic risk score associated with hypertension prevalence and progression.

Atherosclerosis Risk in Communities (ARIC) [218]: N = 8975; age 45–64; USA; cross-sectional/longitudinal; red blood cell phenotypes. Genetics: WGS. Access: https://anvilproject.org/data/studies/phs001211. Notes: WGS association study of red blood cell phenotypes, GWAS statistics available; data included in NHLBI TOPMed.

Rare Variants for Hypertension in Taiwan Chinese (THRV) [219]: N = 2159; age >35; Taiwan/China, Japan; longitudinal; insulin-resistant cases and controls. Genetics: WGS. Access: https://anvilproject.org/ncpi/data/studies/phs001387. Notes: clustering and heritability of insulin resistance in Chinese and Japanese hypertensive families; data included in NHLBI TOPMed.

My Life, Our Future initiative (MLOF) [220]: N = 7482; age >18; USA; cross-sectional; hemophilia cases and controls. Genetics: WGS. Access: https://athn.org/what-we-do/national-projects/mlof-research-repository.html. Notes: summary statistics; different types of DNA variants detected in hemophilia; data included in NHLBI TOPMed.

Genetic Epidemiology of COPD (COPDGene) [221]: N = 19,996; age 45–80; USA; cross-sectional; pulmonary functions. Genetics: WGS. Other omics: gene expression (eQTL) and methylation (mQTL); eQTLs in 48 tissues from GTEx v7. Access: https://anvilproject.org/data/studies/phs001211. Notes: multi-omic data from GTEx and TOPMed identify potential molecular mechanisms underlying 4 of the 22 novel loci; data included in NHLBI TOPMed.

Cardiovascular Health Study (CHS) [221]: N = 4877; age >65; USA; longitudinal; cardiovascular health. Genetics: WGS. Access: https://ega-archive.org/studies/phs001368 (DOI: https://doi.org/10.1038/s41467-020-18334-7). Notes: potential molecular mechanisms underlying 4 of the 22 novel loci; data included in NHLBI TOPMed.

Cleveland Family Study (CFS) [221]: N = 3576; age >65; USA; longitudinal; epidemiological data on genetic and non-genetic risk factors for sleep disordered breathing. Genetics: WGS. Access: https://ega-archive.org/studies/phs000954 (DOI: https://doi.org/10.1038/s41467-020-18334-7). Notes: potential molecular mechanisms underlying 4 of the 22 novel loci; data included in NHLBI TOPMed.

Framingham Heart Study (FHS) [221]: N = 4241; age <65; USA; longitudinal; assesses risk of cardiovascular disease. Genetics: WGS. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000724.v9.p13. Notes: potential molecular mechanisms underlying 4 of the 22 novel loci; data included in NHLBI TOPMed.

Jackson Heart Study (JHS) [221]: N = 3596; age >55; USA; longitudinal; assesses cardiovascular disease. Genetics: WGS. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000964.v5.p1. Notes: potential molecular mechanisms underlying 4 of the 22 novel loci; data included in NHLBI TOPMed.

Multi-Ethnic Study of Atherosclerosis (MESA) [221]: N = 6814; age 45–84; USA; longitudinal; assesses cardiovascular disease. Genetics: WGS. Access: https://www.omicsdi.org/dataset/dbgap/phs001416. Notes: potential molecular mechanisms underlying 4 of the 22 novel loci; data included in NHLBI TOPMed.

Boston Early-Onset COPD (EOCOPD) [221]: N = 80; age <53; USA; longitudinal; chronic obstructive pulmonary disease (COPD). Genetics: WGS. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000946.v5.p1. Notes: potential molecular mechanisms underlying 4 of the 22 novel loci; data included in NHLBI TOPMed.

Genetic Epidemiology of COPD (COPDGene) [221]: N = 10,647; age 45–80; USA; longitudinal; chronic obstructive pulmonary disease (COPD). Genetics: WGS. Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000951.v5.p5. Notes: potential molecular mechanisms underlying 4 of the 22 novel loci; data included in NHLBI TOPMed.

Genotype-Tissue Expression (GTEx) [222]: N = 948 individuals, 17,382 samples (across 54 tissues); age 20–70; USA; cross-sectional. Clinical: general medical history, HIV status, potential exposures, medical history at the time of death, death circumstances, serology results, smoking status. Genetics: WGS. Other omics: gene expression (bulk RNA-seq). Access: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/dataset.cgi?study_id=phs000424.v8.p2&phv=169091&phd=3910&pha=&pht=2742&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1. Notes: postmortem samples.

Developmental Genotype-Tissue Expression (dGTEx): N = 120 individuals, 3600 samples (across 30 tissues); age 0–18; USA; cross-sectional; general medical history. Genetics: WGS. Other omics: gene expression (bulk RNA-seq). Access: https://dgtex.org/. Notes: expected data release by late 2022; four developmental time points: postnatal (0–2 years), early childhood (2–8 years), pre-pubertal (8–12.5 years), and post-pubertal (12.5–18 years).

Greater-Paris Area Hospital (AP-HP) [155]: several million patients, ~130K with brain MRI (~200K images); all ages; France; longitudinal. Clinical: ICD10 codes for diagnoses, imaging medical reports, and a wide range of physiological variables. Imaging: all medical imaging modalities and all organs: MRI, PET, SPECT, CT, ultrasounds. Access: https://eds.aphp.fr/. Notes: all patients visiting the greater-Paris area hospital (AP-HP); the figures correspond to a specific project approved on the clinical data warehouse (CDW).

Swedish Total Population Register (TPR) [158]: nationwide coverage (N = 9,775,572 on April 30, 2015); all ages; Sweden; longitudinal (since 1968). Notes: demographic information. Access: www.scb.se/vara-tjanster/bestall-data-och-statistik/bestalla-mikrodata/vilka-mikrodata-finns/individregister/registret-over-totalbefolkningen-rtb/.

Swedish Longitudinal Integrated Database for Health Insurance and Labour Market Studies (LISA) [159]: all individuals aged ≥16 years in Sweden (≥15 since 2010); Sweden; longitudinal (since 1990). Notes: demographic and socioeconomic information. Access: www.scb.se/lisa.

Swedish Multi-Generation Register (MGR) [160]: over 11 million; all ages; Sweden; longitudinal (since 1961). Genetics: pedigree. Notes: family relations. Access: www.scb.se/en/finding-statistics/statistics-by-subject-area/other/other/other-publications-non-statistical/pong/publications/multi-generation-register-2016/.

Swedish National Patient Register (NPR) [161]: nationwide coverage; all ages; Sweden; longitudinal (since 1964); nationwide inpatient care since 1987 and outpatient care since 2001. Notes: clinical diagnoses from inpatient and outpatient care. Access: www.socialstyrelsen.se/en/statistics-and-data/registers/register-information/the-national-patient-register/.

Swedish Cancer Register (SCR) [165]: around 60,000; all ages; Sweden; longitudinal (since 1958); clinical diagnoses of cancer. Access: www.socialstyrelsen.se/en/statistics-and-
cases are data/registers/
included register-
annually information/
(statistics in swedish-cancer-
2020) register/
Swedish Medical Around Infants at birth Sweden Longitudinal Yes NA NA NA NA NA NA https://www. [167] Register of medical
Birth Register 80,000– (since 1973) socialstyrelsen.se/ birth
(MBR) 120,000 en/statistics-and-
deliveries data/registers/
annually register-
information/the-
swedish-medical-
birth-register/
Swedish Causes of Nearly 100,000 All ages Sweden Longitudinal Cause of death NA NA NA NA NA NA https://www. [170] Register of cause of
Death Register deaths (since 1952) socialstyrelsen.se/ death
(CDR) annually statistik-och-data/
register/alla-
register/
dodsorsa
ksregistret/
Swedish Prescribed More than 100 All ages Sweden Longitudinal Prescribed drug(s) NA NA NA NA NA NA https://www. [172] Register of prescribed
Drug Register million (since July socialstyrelsen.se/ drug
(PDR) records 2005) en/statistics-and-
annually data/registers/
register-
information/the-
swedish-prescribed-
drug-register/
(continued)
Table 2
(continued)
Swedish Dementia More than 27–103 Sweden Longitudinal Dementia MRI, PET, CT EEG NA NA NA NA www.svedem.se [176] Register of dementia
Registry (SDR) 100,000 (between (since 2007)
2007 and
2012)
Swedish Neuro- Around 16,000 All age Sweden Longitudinal Motor neuron MRI, PET, and MEG Genotyping NA NA NA www.neuroreg.se [177] Register of MND
Register (SNR) multiple (since 2004) disease, MS CT (especially MS)
sclerosis,
over 1500
Parkinson’s
disease, and
over
600
myasthenia
gravis
(numbers
from 2015)
Swedish Stroke Around 29, 000 Mean age 75 for Sweden Longitudinal Stroke and transient MRI and CT scan NA NA NA NA NA www.riksstroke.org/ [179] Register of stroke and
Register (Riks- cases stroke; (since 1994) ischemic attack TIA; includes
Stroke) (around mean age 74 (TIA) electrocardiograms
21,000 for TIA
stroke and
8000 TIA)
annually
(statistics in
2020)
Brisbane Adolescent 4000 >11 Australia Cross-sectional Population sample. NA EEG (15 Pedigree (twin/ NA Wrist-worn NA https:// [79, 223–225] Also known as the
Twin Sample and Self-report channels, sibling), accelerometer imaginggenomics. Brisbane
(BATS) longitudinal mental health, eyes closed genotyping (N ~ 130) net.au/ Longitudinal Twin
cognition, resting; N ~ Study (BLTS).
substance use, 1000) Includes
and personality participants from
measures the Queensland
Twin Registry
mPower ~8000 >18 USA Longitudinal Parkinson’s disease NA NA NA NA iPhone NA https://parkinsonm [226] Sample size varies
(subsample self- application power.org/team across surveys and
identified https://www.synapse. tasks completed
professional org/#!Synapse:syn4
diagnosis) 993293/wiki/24
7860
Main Existing Datasets 769
3 Neuroimaging
3.1 Magnetic Resonance Imaging (MRI)

Brain magnetic resonance images are 3D images that measure brain
structure (T1w, T2w, FLAIR, DWI, SWI) or function (fMRI). The
different MRI sequences (or modalities) can characterize different
aspects of the brain. For example, T1w and T2w offer the maximal
contrast between tissue types (white matter, gray matter, and cere-
brospinal fluid), which can yield structural/shape/volume mea-
surements. They can also be used in conjunction with an injection
of a contrast agent (e.g., gadolinium) for detecting and character-
izing various types of lesions. FLAIR is also useful for detecting a
wide range of lesions (e.g., multiple sclerosis, leukoaraiosis, etc.).
SWI focuses on the neurovascular system, while DWI allows mea-
suring the integrity of the white matter tracts. Functional MRI
measures BOLD (blood oxygen level dependent) signal, which is
thought to measure dynamic oxygen consumption in the different
brain regions. Of note, fMRI consists of a series of 3D images
acquired over time (typically 5–10 min).
Brain MRI is available as a series of DICOM files (brain slices, the
traditional format of MRI machines) or as a single NIfTI file (a single
3D image) (see Table 1). The two formats are roughly equivalent, and
most image processing pipelines accept both as input. MR images are
composed of voxels (3D pixels), and their size (e.g., 1 × 1 × 1 mm)
corresponds to the image resolution.
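As an illustration, the voxel size can be recovered from the affine matrix stored in a NIfTI header, as the norm of each column of its 3 × 3 rotation/scaling block (a minimal NumPy sketch; the affine values below are a made-up example, not taken from any particular dataset):

```python
import numpy as np

# Hypothetical NIfTI affine: maps voxel indices (i, j, k) to scanner
# coordinates (x, y, z) in millimeters. Rotation and scaling sit in the
# top-left 3x3 block, the translation in the last column.
affine = np.array([
    [-1.0, 0.0, 0.0,   90.0],
    [ 0.0, 1.0, 0.0, -126.0],
    [ 0.0, 0.0, 1.0,  -72.0],
    [ 0.0, 0.0, 0.0,    1.0],
])

# Voxel dimensions are the norms of the first three columns of the 3x3 block
voxel_size = np.linalg.norm(affine[:3, :3], axis=0)
print(voxel_size)  # -> [1. 1. 1.], i.e., a 1 x 1 x 1 mm resolution
```

Libraries such as nibabel expose the same information directly from a loaded image header.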
In practice, most MR images are archived and shared via
web-based applications and more rarely using specific software
(e.g., UKB). The two major web platforms are XNAT (eXtensible
Neuroimaging Archive Toolkit) [9], an open-source platform
developed by the Neuroinformatics Research Group of the
Washington University School of Medicine (Missouri, USA), and
IDA (Image and Data Archive), created by the Laboratory of
NeuroImaging (LONI) of the University of Southern California
(https://loni.usc.edu/). Of note, XNAT can also perform some
image processing [9].
The neuroimaging community has developed BIDS (Brain
Imaging Data Structure), a standard for MR image organization
to accommodate multimodal acquisitions and facilitate processing.
770 Baptiste Couvy-Duchesne et al.
In practice, few datasets come in BIDS format, and tools have been
developed to assist with download and conversion (e.g., https://
clinica.run) [10].
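For orientation, BIDS prescribes a fixed folder hierarchy and file naming scheme; the sketch below builds the expected path of a T1w scan (the subject and session labels are invented for illustration):

```python
from pathlib import Path

def t1w_path(root: str, subject: str, session: str) -> Path:
    """Path of a T1w image under the BIDS sub-*/ses-*/anat hierarchy."""
    return (Path(root) / f"sub-{subject}" / f"ses-{session}" / "anat"
            / f"sub-{subject}_ses-{session}_T1w.nii.gz")

# Hypothetical dataset root and labels
print(t1w_path("my_bids_dataset", "01", "M000"))
# on POSIX: my_bids_dataset/sub-01/ses-M000/anat/sub-01_ses-M000_T1w.nii.gz
```

Conversion tools such as Clinica produce and consume exactly this kind of layout.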
We have listed a handful of datasets (see Table 1), which is far
from being exhaustive but aims at summarizing some of the largest
and/or most used samples. Our selection aims at presenting diverse
and complementary samples in terms of age range, populations,
and country of origin.
First, we have described three clinical elderly samples from the
USA and Australia, with a focus on Alzheimer’s disease and cogni-
tive disorders. The Alzheimer’s Disease Neuroimaging Initiative
(ADNI) was launched in 2004 and funded by a partnership
between private companies, foundations, the National Institutes of
Health, and the National Institute on Aging. ADNI is a longitudinal
study, with data collected across 63 sites in the USA and
Canada. To date, four phases of the study have been funded,
which makes ADNI one of the largest clinical neuroimaging sam-
ples to study Alzheimer’s disease and cognitive impairment in
aging. ADNI collected a wide range of clinical, neuropsychological,
cognitive scales as well as biomarkers, in addition to multimodal
imaging and genotyping data [11]. Sites contribute data to the
LONI platform, where it is automatically shared with approved researchers
without embargo. The breadth of data available and its accessibility
have made ADNI one of the most used neuroimaging samples, with
more than 1000 scientific articles published to date.
The Australian Imaging Biomarkers and Lifestyle Study of
Ageing (AIBL) started in 2006 and has since recruited about
1100 participants over 60 years of age, who have been followed
over several years (see Table 1) [12]. AIBL collected data across the
different Australian states and, similar to ADNI, consisted of an
in-depth assessment of individual cognition, clinical status, genet-
ics, genomics, as well as multimodal brain imaging [12]. In 2010,
AIBL partnered with ADNI to release the AIBL imaging subset and
selected clinical data via the LONI platform. Having the same MRI
protocols and similar fields collected, AIBL represents a great addi-
tion to the ADNI study, by boosting statistical power or allowing
for replication. The full clinical information as well as genetics,
genomics, and wearable data (actigraphy watches) are not available
via the LONI and require a direct application to the Common-
wealth Scientific and Industrial Research Organisation (CSIRO)
(see Table 1).
The Open Access Series of Imaging Studies v3 (OASIS3) is
another longitudinal sample comprising almost 1100 adult partici-
pants (see Table 1) [13]. Its main focus is around aging and neuro-
logical disorders, and the application/approval process is extremely
fast (typically a couple of days). OASIS3 is hosted on XNAT and is
the third dataset to be made available by the Washington University
in Saint Louis (WUSTL) Knight Alzheimer’s Disease Research
3.2 Positron Emission Tomography (PET)

Positron emission tomography (PET) images are 3D images that
highlight the concentration of a radioactive tracer administered to
the patient. Here, we will focus on brain PET images, although
other parts of the body may also be imaged. The different tracers
allow measuring several aspects of brain metabolism (e.g., glucose)
or the spatial distribution of a molecule of interest (e.g., amyloid).
PET relies on the nuclear properties of radioactive materials
that are injected into the patient intravenously. When the radioactive
isotope disintegrates, it emits a positron, whose annihilation with an
electron produces photons that are detected by the scanner. This
signal is used to find the position of the emitted positrons, which
allows us to reconstruct the concentration map of the molecule we
are tracing [35].
As for MRI, PET images are available as a series of DICOM files
or as a single NIfTI file. They are composed of voxels (3D pixels),
and their size (e.g., 1 × 1 × 1 mm) corresponds to the image
resolution. A BIDS extension has also been developed for positron
emission tomography, in order to standardize data organization for
research purposes.
PET is considered invasive due to the injection of the tracer,
which results in a very small risk of tissue damage. Overall,
the quantity of radioactive isotope remains small enough to make it
safe for most people, but this limits its widespread acquisition in
research, especially in healthy subjects or in children. Moreover,
PET requires a high-cost cyclotron nearby to produce the
radiotracers, because the half-life of the radioisotopes is typically
short (from a few minutes to a few hours).
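To make this constraint concrete, the fraction of activity remaining after a delay t follows exponential decay, 2^(-t/T½) (a back-of-the-envelope sketch; 110 min is the approximate half-life of fluorine-18, the isotope used in 18F-FDG):

```python
def remaining_fraction(t_min: float, half_life_min: float) -> float:
    """Fraction of radioactivity left after t_min minutes of decay."""
    return 2.0 ** (-t_min / half_life_min)

# Fluorine-18 has a half-life of roughly 110 minutes: after a 2-hour
# transport delay, only about half of the produced activity remains.
print(round(remaining_fraction(120, 110), 2))  # -> 0.47
```

This is why shorter-lived isotopes (e.g., those with half-lives of a few minutes) effectively require on-site production.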
Several tracers are used for brain PET imaging, one of the most
common ones being the 18F-fluorodeoxyglucose (18F-FDG).
18F-FDG concentrates in areas that consume a lot of glucose and
will thus highlight brain metabolism. In practice, 18F-FDG PET
4 EEG/MEG
with specific sensory stimuli (e.g., P300 wave) or task, like motor
preparation and execution, covert mental states, or other cognitive
processes. The amplitude, latency, and spatial location of the
resulting waveform activity reveal the underlying mental state [56].
On the other hand, spectral analysis refers to the computation of the
energy distribution of the signals in the frequency domain. Most
spectral estimates are based on the Fourier transform; this is the case
for non-parametric methods, such as the Welch periodogram estimate,
which base their computation on data windowing [57].
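The windowing idea can be sketched in a few lines (a simplified Welch estimate written with plain NumPy for illustration; in practice one would typically rely on a library routine such as scipy.signal.welch):

```python
import numpy as np

def welch_psd(x, fs, nperseg=256):
    """Simplified Welch estimate: average windowed periodograms (50% overlap)."""
    step = nperseg // 2
    window = np.hanning(nperseg)
    scale = fs * np.sum(window ** 2)
    # Cut the signal into overlapping segments, window and Fourier-transform
    # each one, then average the resulting periodograms
    segments = [x[i:i + nperseg] for i in range(0, len(x) - nperseg + 1, step)]
    psds = [np.abs(np.fft.rfft(window * s)) ** 2 / scale for s in segments]
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, np.mean(psds, axis=0)

# Toy example: a 10 Hz sinusoid sampled at 250 Hz (a typical EEG rate)
fs = 250
t = np.arange(0, 4, 1 / fs)
x = np.sin(2 * np.pi * 10 * t)
freqs, psd = welch_psd(x, fs)
print(freqs[np.argmax(psd)])  # spectral peak close to 10 Hz
```

Averaging over segments trades frequency resolution for a lower-variance spectral estimate, which is the point of Welch's method.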
Another approach is to study the interactions across sources
(inferring connection between two electrodes by means of tempo-
ral dependency between the registered signals), which is known as
functional connectivity. Multiple connectivity estimators have been
developed to quantify this interaction [58]. Through these func-
tional interactions, complex network analysis can also be implemen-
ted, where sensors are modeled as nodes and connectivity
interactions as links [59–61].
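To make this concrete, here is a minimal sketch of one simple estimator, the absolute Pearson correlation between sensor time series (only one of the many estimators reviewed in [58]; the random signals below stand in for real recordings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake recording: 4 sensors x 1000 time points; sensor 1 is a noisy copy
# of sensor 0, so their functional connectivity should be high
signals = rng.standard_normal((4, 1000))
signals[1] = signals[0] + 0.1 * rng.standard_normal(1000)

# Functional connectivity as the absolute Pearson correlation matrix
connectivity = np.abs(np.corrcoef(signals))

# Network view: binarize into an adjacency matrix (nodes = sensors,
# links = connectivity above an arbitrary threshold), ignoring self-loops
adjacency = (connectivity > 0.5) & ~np.eye(4, dtype=bool)
print(adjacency[0, 1])  # -> True: sensors 0 and 1 are linked
```

Graph metrics (degree, clustering, path length, etc.) can then be computed on the resulting adjacency matrix.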
EEG and MEG are essential to evaluate several types of brain
disorders. One of the most documented is epilepsy, based on
seizure detection and prediction [62–64]. Other neurological
conditions can also be characterized, such as Alzheimer's disease,
which is associated with changes in signal synchrony [65, 66]. Furthermore, motor
task decoding in brain–computer interfaces (BCI) offers a
promising tool in rehabilitation [67]. This type of data, from
healthy to clinical cases, can be found on multiple open-access
repositories, such as Zenodo (https://zenodo.org) or PhysioNet
(https://physionet.org), as well as via collaborative projects such as the BNCI Horizon
2020 (http://bnci-horizon-2020.eu), which gathers a collection of
BCI datasets (see Table 2). These repositories are also valuable in
that they contribute to establishing harmonization procedures in
processing and recording. All datasets collected informed consent,
and data were anonymized to protect the participants' privacy.
Moreover, regulations vary from one country to another and may
require, for example, studies to be approved by ethics committees.
Additionally, licensing (which defines the copyright of the dataset)
must be considered depending on the intended use of the
open-access datasets.
Data come in different formats according to the acquisition
system or the preprocessing software. The most common formats
for EEG are .edf, .gdf, .eeg, .csv, or .mat files. For MEG, it is very
often .fif and .bin (see Table 1). The different formats can create
challenges when working with multiple datasets. Luckily, some
tools have been developed to handle this problem, for example,
FieldTrip [68] or Brainstorm [69] implemented in MATLAB, or
the Python modules mne [70] and moabb [71]. Of note, these
tools also contain sets of algorithms and utility functions for analysis
and visualization.
5 Genetics
5.1 Twin Samples

Twins provide a powerful method to estimate the importance of
genetic and environmental influences on variation in complex traits.
Monozygotic (MZ, aka identical) twins develop from a single
zygote and are (nearly) genetically identical. In contrast, dizygotic
(DZ, aka fraternal) twins develop from two zygotes and are, on
average, no more genetically related than non-twin siblings. In the
classical twin design, the degree of similarity between MZ and DZ
twin pairs on a measured trait reveals the importance of genetic or
environmental influences on variation in the trait. Twin studies
often collect several different data types, including brain MRI
scans, assessments of cognition and behavior, self-reported mea-
sures of mental health and wellbeing, as well as biological samples
(e.g., saliva, blood, hair, urine). Datasets derived from twin studies
are text-based and include phenotypic data and background vari-
ables (e.g., individual and family IDs, sex, zygosity, age). Notably,
the correlated nature of twin data (i.e., the non-independence of
participants) should be considered during analysis as it may violate
statistical test assumptions [72, 73].
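As an illustration of the classical twin design, Falconer's approximation contrasts MZ and DZ twin-pair correlations to obtain rough additive genetic (A), shared environmental (C), and unique environmental (E) variance estimates (a simplified sketch; the correlation values below are invented, and modern analyses fit structural equation models instead):

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Falconer's rough ACE decomposition from twin-pair correlations."""
    a = 2 * (r_mz - r_dz)   # additive genetics: MZ pairs share ~2x DZ pairs
    c = r_mz - a            # shared environment: MZ similarity beyond genes
    e = 1 - r_mz            # unique environment (plus measurement error)
    return {"A": round(a, 3), "C": round(c, 3), "E": round(e, 3)}

# Hypothetical trait correlations: 0.8 in MZ pairs vs. 0.5 in DZ pairs
print(falconer_ace(0.8, 0.5))  # -> {'A': 0.6, 'C': 0.2, 'E': 0.2}
```

The estimate rests on the assumptions of the classical twin design (equal environments, additivity), which is why formal analyses use maximum-likelihood modeling rather than this arithmetic shortcut.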
Raw data is typically stored locally by the data owner, with
de-identified data available upon request. In larger studies, data is
stored and distributed through online repositories. Recently, the
sharing of publicly available de-identified data with accompanying
publications has become commonplace.
Several extensive twin studies combine imaging, behavioral, or
biological data (see Table 2). These studies cover the whole life span
(STR) as well as specific age periods, for example, children/adoles-
cents (QTAB), young (QTIM, HCP-YA), middle-aged (VETSA),
or older (OATS) adults.
The Swedish Twin Registry (STR) was established in the late
1950s with the primary aim to explore the effect of environmental
factors (e.g., smoking and alcohol) on disorders [74]. Data were
first collected through questionnaires and interviews with the twins
and their parents. Later, the STR incorporated data from biobanks,
clinical blood chemistry assessments, genotyping, health checkups,
and linkages to various Swedish national population and health
registers [74]. The STR is now one of the largest twin registers in
the world [75] with information on more than 87,000 twin pairs
(https://ki.se/en/research/the-swedish-twin-registry). It has
been used extensively in research on health and illness, including
various neurological disorders such as dementia [76], Parkinson's
disease [77], and motor neuron disease [78].
The Queensland Twin Adolescent Brain (QTAB, 2015–pres-
ent) was enabled through funding from the Australian National
Health and Medical Research Council (NHMRC). It focuses on
the period of late childhood/early adolescence, with brain imaging,
5.3.2 The Psychiatric Genomics Consortium (PGC)

The Psychiatric Genomics Consortium (PGC) began in 2007. The
central idea of the PGC is to use a global cooperative network to
advance genetic discovery in psychiatric disorders in order to identify
biologically, clinically, and therapeutically meaningful insights.
To date, the PGC is one of the largest, most innovative, and
productive consortia in the history of psychiatry. The Consortium
now consists of workgroups on 11 major psychiatric disorders, a
Cross-Disorder Workgroup, and a Copy-Number Variant Work-
group. In addition, the PGC provides centralized support to the
PGC researchers with a Statistical Analysis Group, Data Access
5.4 Exome and Whole Genome Sequencing: Trans-Omics for Precision Medicine (TOPMed)

The Trans-Omics for Precision Medicine (TOPMed) program,
sponsored by the National Institutes of Health (NIH) National
Heart, Lung, and Blood Institute (https://topmed.nhlbi.nih.gov),
is part of a broader Precision Medicine Initiative, which
aims to provide disease treatments tailored to an individual's
unique genes and environment. TOPMed contributes to this Ini-
tiative through the integration of whole genome sequencing
(WGS) and other omics data. The initial phases of the program
focused on whole genome sequencing of individuals with rich
phenotypic data and diverse backgrounds. The WGS of the
TOPMed samples was performed over multiple studies, years, and
sequencing centers [118, 119]. Available data are processed peri-
odically to produce genotype data “freezes.” Individual-level data is
accessible to researchers with an approved dbGaP data access
request (https://topmed.nhlbi.nih.gov/data-sets), via Google
and Amazon cloud services. More information about data availabil-
ity and how to access it can be found on the dataset page (https://
topmed.nhlbi.nih.gov/data-sets).
As of September 2021, TOPMed consists of ~180 K partici-
pants from >85 different studies with varying designs. Prospective
cohorts provide large numbers of disease risk factors, subclinical
disease measures, and incident disease cases; case-control studies
provide improved power to detect rare variant effects. Most of the
TOPMed studies focus on HLBS (heart, lung, blood, and sleep)
phenotypes, which leads to 62 K (~35%) participants with heart
phenotype, 50 K (~28%) with lung data, 19 K (~11%) with blood,
4 K (~2%) with sleep, and 43 K (~24%) for multi-phenotype cohort
studies. TOPMed participants’ diversity is assessed using a combi-
nation of self-identified or ascriptive race/ethnicity categories and
observed genetics. Currently, 60% of the 180 K sequenced
6 Genomics
6.2 Gene Expression Data: GTEx

Launched in 2010, the Genotype-Tissue Expression (GTEx) project
is an ongoing effort that aims to characterize the genetic
determinants of tissue-specific gene expression [144]. It is a resource
database available to the scientific community, which comprises
multi-tissue RNA sequencing (RNA-seq: gene expression) and
whole genome sequence (WGS) data collected in 17,382 samples
across 54 tissue types from 948 postmortem donors (version
8 release). Sample size per tissue ranges from n = 4 in kidney
(medulla) to n = 803 in skeletal muscle. The majority of donors
are of European ancestry (84.6%) and male (67.1%), with ages
ranging from 20 to 70 years. The primary cause of death for
donors 20–39 years old was traumatic injury (46.4%) and heart
disease for donors 60–70 years (40.9%).
Data is constantly being added to the database using sample
data from the GTEx Biobank. For example, recent efforts have
focused on gene expression profiling at the single-cell level to
achieve a higher resolution understanding of tissue-specific gene
expression and within tissue heterogeneity. As a result, single-cell
RNA-seq (scRNA-seq) data was generated in 8 tissues from
25 archived, frozen tissue samples collected on 16 donors. Further,
the Developmental Genotype-Tissue Expression (dGTEx) project
(https://dgtex.org/) is a relatively new extension of GTEx that was
7.1 Clinical Data Warehouse: Example from the Parisian Hospitals (AP-HP)

Clinical data warehouses (CDW) gather electronic health records
(EHR), which can include demographic data, results from biological
tests, prescribed medications, and images acquired in clinical
routine, sometimes for millions of patients from multiple sites. CDW
can allow for large-scale epidemiological studies, but they may also
be used to train and/or validate machine learning (ML) and deep
learning (DL) algorithms in a clinical context. For example, several
computer-aided diagnosis tools have been developed for the classi-
fication of neurodegenerative diseases. One of their main limita-
tions is that they are typically trained and validated using research
data or on a limited number of clinical images [149–154]. It is still
unclear how these algorithms would perform on large clinical
datasets, which include participants with multiple diagnoses
and more generally heterogeneous data (e.g., multiple scanners,
hospitals, populations).
One of the first CDW in France was launched in 2017 by the
AP-HP (Assistance Publique – Hôpitaux de Paris), which gathers
most of the Parisian hospitals [155]. They obtained the
7.2 Swedish National Registries

In Sweden, a unique 10-digit personal identification number has
been assigned to each individual at birth or migration since 1947,
which allows linkages across different Swedish population and
health registers with almost 100% coverage [157]. The Swedish
Total Population Register (TPR) was established in 1968 and is
maintained by Statistics Sweden to obtain data on major life events,
such as birth, vital status, migration, and civil status [158]. TPR is a
References
1. UNESCO Recommendation on Open Science - UNESCO Digital Library. https://unesdoc.unesco.org/ark:/48223/pf0000379949.locale=en
2. Fry A, Littlejohns TJ, Sudlow C et al (2017) Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am J Epidemiol 186:1026–1034
3. Ben-Eghan C, Sun R, Hleap JS et al (2020) Don't ignore genetic data from minority populations. Nature 585:184–186
4. Yang H-C, Chen C-W, Lin Y-T et al (2021) Genetic ancestry plays a central role in population pharmacogenomics. Commun Biol 4:1–14
5. Turner Lee N, Resnick P, Barton G (2019) Algorithmic bias detection and mitigation: best practices and policies to reduce consumer harms. https://www.brookings.edu/research/algorithmic-bias-detection-and-mitigation-best-practices-and-policies-to-reduce-consumer-harms/
6. Courtland R (2018) Bias detectives: the researchers striving to make algorithms fair. Nature 558:357–360
7. Bustamante CD, De La Vega FM, Burchard EG (2011) Genomics for the world. Nature 475:163–165
8. Sirugo G, Williams SM, Tishkoff SA (2019) The missing diversity in human genetic studies. Cell 177:26–31
9. Herrick R, Horton W, Olsen T et al (2016) XNAT central: open sourcing imaging research data. NeuroImage 124:1093–1096
10. Routier A, Burgos N, Díaz M et al (2021) Clinica: an open source software platform for reproducible clinical neuroscience studies. Front Neuroinform 15:689675
11. Petersen RC, Aisen PS, Beckett LA et al (2010) Alzheimer's Disease Neuroimaging Initiative (ADNI): clinical characterization. Neurology 74:201–209
12. Ellis KA, Bush AI, Darby D et al (2009) The Australian Imaging, Biomarkers and Lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of Alzheimer's disease. Int Psychogeriatr 21:672–687
13. LaMontagne PJ, Keefe S, Lauren W et al (2018) OASIS-3: longitudinal neuroimaging, clinical and cognitive dataset for normal aging and Alzheimer's disease. Alzheimers Dement J Alzheimers Assoc 14:P1097
14. Miller KL, Alfaro-Almagro F, Bangerter NK et al (2016) Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nat Neurosci 19:1523–1536
15. Alfaro-Almagro F, Jenkinson M, Bangerter NK et al (2018) Image processing and quality control for the first 10,000 brain imaging datasets from UK Biobank. NeuroImage 166:400–424
16. Casey BJ, Cannonier T, Conley MI et al (2018) The adolescent brain cognitive development (ABCD) study: imaging acquisition across 21 sites. Dev Cogn Neurosci 32:43–54
17. Barch DM, Albaugh MD, Avenevoli S et al (2018) Demographic, physical and mental health assessments in the adolescent brain and cognitive development study: rationale and description. Dev Cogn Neurosci 32:55–66
18. Schmaal L, Hibar DP, Sämann PG et al (2017) Cortical abnormalities in adults and adolescents with major depression based on brain scans from 20 cohorts worldwide in the ENIGMA Major Depressive Disorder Working Group. Mol Psychiatry 22:900–909
19. Hoogman M, Muetzel R, Guimaraes JP et al (2019) Brain imaging of the cortex in ADHD: a coordinated analysis of large-scale clinical and population-based samples. Am J Psychiatry 176:531–542
20. van Rooij D, Anagnostou E, Arango C et al (2018) Cortical and subcortical brain morphometry differences between patients with autism spectrum disorder and healthy individuals across the lifespan: results from the ENIGMA ASD Working Group. Am J Psychiatry 175:359–369
21. Logue MW, van Rooij SJH, Dennis EL et al (2018) Smaller hippocampal volume in posttraumatic stress disorder: a multisite ENIGMA-PGC study: subcortical volumetry results from posttraumatic stress disorder consortia. Biol Psychiatry 83:244–253
22. Boedhoe PSW, Schmaal L, Abe Y et al (2018) Cortical abnormalities associated with pediatric and adult obsessive-compulsive disorder: findings from the ENIGMA Obsessive-Compulsive Disorder Working Group. Am J Psychiatry 175:453–462
23. Mackey S, Allgaier N, Chaarani B et al (2019) Mega-analysis of gray matter volume in substance dependence: general and substance-specific regional effects. Am J Psychiatry 176:119–128
24. van Erp TGM, Walton E, Hibar DP et al (2018) Cortical brain abnormalities in 4474 individuals with schizophrenia and 5098 control subjects via the Enhancing Neuroimaging Genetics Through Meta-analysis (ENIGMA) Consortium. Biol Psychiatry 84:644–654
25. Hibar DP, Westlye LT, Doan NT et al (2018) Cortical abnormalities in bipolar disorder: an MRI analysis of 6503 individuals from the ENIGMA Bipolar Disorder Working Group. Mol Psychiatry 23:932–942
26. Dima D, Modabbernia A, Papachristou E et al (2021) Subcortical volumes across the lifespan: data from 18,605 healthy individuals aged 3–90 years. Hum Brain Mapp
27. Thompson PM, Jahanshad N, Ching CRK et al (2020) ENIGMA and global neuroscience: a decade of large-scale studies of the brain in health and disease across more than 40 countries. Transl Psychiatry 10:1–28
28. Kremen WS, Franz CE, Lyons MJ (2013) VETSA: the Vietnam era twin study of aging. Twin Res Hum Genet 16:399–402
29. Van Essen DC, Smith SM, Barch DM et al (2013) The WU-Minn Human Connectome Project: an overview. NeuroImage 80:62–79
30. Marek K, Chowdhury S, Siderowf A et al (2018) The Parkinson's progression markers initiative (PPMI) - establishing a PD biomarker cohort. Ann Clin Transl Neurol 5:1460–1477
31. Dufouil C, Dubois B, Vellas B et al (2017) Cognitive and imaging markers in non-demented subjects attending a memory clinic: study design and baseline findings of the MEMENTO cohort. Alzheimers Res Ther 9:67
59. Bullmore E, Sporns O (2009) Complex brain 73. Sainani K (2010) The importance of account-
networks: graph theoretical analysis of struc- ing for correlated observations. PM&R 2:
tural and functional systems. Nat Rev Neu- 858–861
rosci 10:186–198 74. Zagai U, Lichtenstein P, Pedersen NL et al
60. De Vico Fallani F, Richiardi J, Chavez M et al (2019) The Swedish twin registry: content
(2014) Graph analysis of functional brain net- and management as a research infrastructure.
works: practical issues in translational neuro- Twin Res Hum Genet 22:672–680
Main Existing Datasets 799
800 Baptiste Couvy-Duchesne et al.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Part V
Disorders
Chapter 25
Abstract
Dementia denotes a condition in which people suffer cognitive and behavioral impairments
due to brain damage. Common causes of dementia include Alzheimer's disease, vascular dementia, and
frontotemporal dementia, among others. The onset of these pathologies often occurs at least a decade
before any clinical symptoms are perceived. Several biomarkers have been developed to gain a better insight
into disease progression, both in the prodromal and the symptomatic phases. Those markers are commonly
derived from genetic information, biofluids, medical images, or clinical and cognitive assessments. Information is nowadays also captured using smart devices to further understand how patients are affected. In the
last two to three decades, the research community has made a great effort to capture and share large
amounts of data from many sources for research. As a result, many machine learning approaches have been
proposed in the scientific literature. These include dedicated tools for data harmonization, extraction of
biomarkers that act as proxies for disease progression, classification tools, and modeling tools
that mimic and help predict disease progression. To date, however, very few methods have been translated
to clinical care, and many challenges still need addressing.
Key words Dementia, Alzheimer’s disease, Cognitive impairment, Machine learning, Data harmoni-
zation, Biomarkers, Imaging, Classification, Disease progression modeling
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_25,
© The Author(s) 2023
808 Marc Modat et al.
1.1 Alzheimer’s AD is the most common form of dementia, accounting for 60–65%
Disease (AD) of all cases. It typically presents in individuals aged 65 or older, with
the initial and most prominent cognitive deficits being memory
loss, with additional cognitive impairments in the language, visuo-
spatial, and executive functions [3]. The distinguishing feature of
AD is the buildup of amyloid-β plaques and neurofibrillary tangles
of tau proteins. The amyloid plaques tend to be diffuse throughout
the brain, while tau pathology tends to start in the mediotemporal
lobe, and in particular in the hippocampus and entorhinal cortex,
and spread to prefrontal and temporoparietal cortex in the moder-
ate stages of the disease. There are numerous genetic factors that
have different levels of risk and prevalence in the population. The
greatest risk comes from the nearly fully penetrant autosomal dom-
inant mutations in the amyloid precursor protein (APP), presenilin
1 (PSEN1), or presenilin 2 (PSEN2) genes. However, the preva-
lence of these mutations is extremely low, comprising less than 0.5%
of all AD cases. The age at onset of autosomal dominant AD is
relatively similar between generations [5] and within individual
mutations [6], typically resulting in an early-onset form of AD
(below the age of 60 years). The most prominent risk factor gene
in terms of both hazard and prevalence is apolipoprotein E
(APOE). Carriers of a single copy (roughly 25% of the population)
of the E4 allele are roughly two to three times more likely to develop
AD [7], and they tend to have an earlier age of disease onset.
Homozygotic E4 carriers represent 2–3% of the general population,
with a dose-dependent increase in risk. There have been some
suggestions that carrying an E2 form of APOE can confer some
protection to individuals compared to the most common E3 allele
[7, 8].
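The dose-dependent APOE risk described above can be sketched as a simple genotype lookup. The numeric odds values below are illustrative placeholders drawn from the rough ranges discussed in this section (a single E4 copy roughly two to three times the risk, E4 homozygotes higher, E2 possibly protective); they are not validated clinical estimates.

```python
# Illustrative mapping from APOE genotype to an approximate relative
# risk of AD. Values are placeholders for illustration only, NOT
# validated clinical estimates.
APPROX_RELATIVE_RISK = {
    ("e3", "e3"): 1.0,   # most common genotype, used as the reference
    ("e3", "e4"): 2.5,   # single E4 copy: roughly two to three times the risk
    ("e4", "e4"): 10.0,  # two E4 copies: dose-dependent increase (placeholder)
    ("e2", "e3"): 0.6,   # E2 may confer some protection (placeholder)
}

def apoe_risk(allele1: str, allele2: str) -> float:
    """Return an approximate relative risk for an APOE genotype."""
    key = tuple(sorted((allele1.lower(), allele2.lower())))
    # Unlisted genotypes default to the reference risk.
    return APPROX_RELATIVE_RISK.get(key, 1.0)

print(apoe_risk("e4", "e3"))  # single E4 copy -> 2.5
```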
Besides the typical presentations of AD in which episodic mem-
ory deficits are prominent, there are other variants with atypical
presentations. Posterior cortical atrophy (PCA) [9] is characterized
by visual and spatial impairments, but memory and language abil-
ities are preserved in the early stages, with atrophy localized in the
parietal and occipital lobe. Logopenic variant of the primary pro-
gressive aphasia (lvPPA), also called logopenic progressive aphasia
(LPA), is characterized by impairments in the language domain
(i.e., word-finding difficulty, impaired repetition of sentences and
phrases) and atrophy in the left temporoparietal junction
[10]. Despite presenting with different symptoms and neuroanatomical features, both PCA and lvPPA share the same forms of pathology, amyloid plaques and neurofibrillary tangles, with typical AD. Besides these pathological hallmarks,
accumulation of the TAR DNA-binding protein 43 (TDP-43) [11]
is another form of pathology often observed in AD, particularly in
cases with older onset of symptoms, resulting in increased rates of
atrophy. The limbic-predominant age-related TDP-43 encephalop-
athy dementia (LATE) is a related condition found in older adults (above 80 years of age), presenting with a slow progression
of amnestic symptoms and hippocampal sclerosis.
1.2 Vascular Dementia (VaD)
As the second most common form of dementia (accounting for 10–15% of all dementia cases), VaD is an umbrella term for a number of syndromes with a common primary cause: decreased blood flow due to damage to the blood supply (large or small vessels), which leads to brain tissue damage. The vascular origin is
vessels), which leads to brain tissue damage. The vascular origin is
clearly seen on magnetic resonance imaging (MRI) as the presence
of extensive periventricular white matter lesions, or multiple
lacunes in the basal ganglia and/or white matter [12]. Symptoms
tend to accumulate in a step-wise fashion, rather than gradually
1.4 Dementia with Lewy Bodies (DLB)
Around 10–15% of dementia cases have a diagnosis of DLB [31]. Symptoms tend to have an insidious onset, usually at the
age of 65 years or older, and disease duration has an average of
5 to 8 years from diagnosis, but it can range from 2 to 20 years.
Symptoms change greatly from person to person but typically
include fluctuating cognition, pronounced alterations in attention,
alertness and executive functions, visual hallucinations, and motor
features of parkinsonism. Early signs also include rapid eye move-
ment (REM) sleep behavior disorder, while memory and hippo-
campal volume are relatively preserved in the initial stages, but they
become impaired later during the course of the disease. Alongside
relatively preserved mediotemporal lobe volumes, typical biomar-
kers are reduced dopamine transporter (DAT) uptake in the basal
ganglia on single-photon emission computed tomography
(SPECT) or positron emission tomography (PET) imaging, and
polysomnographic recordings, showing REM sleep without atonia.
2.1 Genetic Markers
Despite the numerous forms of pathology that can ultimately lead
to dementia, many genetic risk factors have been identified for AD
and related disorders; the overall heritability of AD has been estimated to be between 60 and 80% [33]. This heritability, however, is spread across a wide range of genetic loci that vary in prevalence and impact. Identifying genetic risk factors, and the pathways in which the associated genes are involved, has led to a better understanding of various forms of dementia [34].
As mentioned in Subheading 1, the genetic variants with the
strongest penetrance are the autosomal dominant forms of demen-
tia. These rare autosomal dominant forms provide an opportunity to study a "purer" form of dementia, as the age of disease onset tends to be in the 30s through 50s, when there should be a far lower likelihood of comorbidities. They also provide a chance to study
pre-symptomatic changes in individuals who are nearly certain to
become affected by the disease. Thus, these cohorts are an ideal population for clinical trials of new therapies, both to prove that target engagement is successful and to test whether the results support the underlying hypotheses about how the disease starts and spreads.
Outside of the autosomal dominant mutations, the gene most
linked with risk for AD is APOE. No gene equivalent to APOE in terms of risk and prevalence has yet been discovered for other forms of dementia, in part because these forms are rarer and it is thus more difficult to recruit the number of subjects needed for a well-powered genome-wide association study (GWAS). However, there are some candidates, such as the TREM2
variant in FTD [35].
Rather than trying to identify single target genes and their
associated risks, many researchers have looked to generate a poly-
genic risk score, i.e., a sum of the risks conferred by each associated
variant across the genome. Polygenic risk scores (PRS) have been
developed for multiple diseases to better account for the amalga-
mated risk that the entire genetic profile provides [36]. For AD,
however, APOE confers far greater risk to individuals, and PRS can only slightly improve predictive accuracy and explain additional risk beyond APOE [37, 38].
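At its core, the polygenic risk score described above is a weighted sum over genotypes. A minimal sketch follows; the variant IDs and effect sizes are made up for illustration, and real PRS pipelines additionally handle strand alignment, linkage-disequilibrium clumping, and missing genotypes.

```python
# Hypothetical per-variant effect sizes (e.g., log odds ratios from a
# GWAS). The rsIDs and values are invented for illustration.
effect_sizes = {"rs0001": 0.15, "rs0002": -0.08, "rs0003": 0.22}

def polygenic_risk_score(dosages: dict) -> float:
    """PRS = sum over variants of (effect size x risk-allele dosage).

    `dosages` maps variant ID to the number of risk-allele copies
    (0, 1, or 2); missing variants contribute zero.
    """
    return sum(effect_sizes[snp] * dosages.get(snp, 0.0)
               for snp in effect_sizes)

person = {"rs0001": 2, "rs0002": 1, "rs0003": 0}
print(round(polygenic_risk_score(person), 3))  # 0.15*2 - 0.08*1 = 0.22
```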
2.2 Clinical and Cognitive Assessment
Given that various forms of dementia have historically been defined by their clinical phenotype, and that clinical and cognitive assessments tend to be the cheapest and most widely available, they are often paramount in the initial diagnostic workup of an individual, as well as in subsequent patient management.
Clinical vital signs, such as blood pressure [39, 40] and body
mass index (BMI) [41], may suggest causes of cognitive
impairment other than a neurodegenerative disorder or indications
of an at-risk profile that would result in a more aggressive disease.
The Clinical Dementia Rating (CDR) [42] is a semi-structured
interview that examines several aspects of physical and mental
well-being which is summarized under six subdomains. For each
subdomain, a score of 0 (no impairment), 0.5 (mild/questionable
impairment), 1, 2, or 3 is given. Both the sum of these subdomains,
referred to as the CDR Sum of Boxes (CDR-SB), and a global
summary score are often used, with CDR-SB now commonly
used as a primary endpoint in trials. Other clinical workups may
help identify non-memory symptoms that would not be picked up
via cognitive assessments, such as anxiety, depression, and quality of
daily activities.
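As a concrete illustration of the CDR scoring described above, the Sum of Boxes is simply the total of the six subdomain scores. The subdomain names follow the standard CDR; the patient scores below are hypothetical.

```python
# Each CDR subdomain is rated 0 (none), 0.5 (questionable), 1, 2, or 3.
VALID_SCORES = {0, 0.5, 1, 2, 3}

def cdr_sum_of_boxes(scores: dict) -> float:
    """Return the CDR-SB: the sum of the six subdomain scores."""
    assert len(scores) == 6, "the CDR has six subdomains"
    assert all(s in VALID_SCORES for s in scores.values())
    return sum(scores.values())

# Hypothetical patient with mild impairment.
patient = {
    "memory": 1, "orientation": 0.5, "judgment": 0.5,
    "community_affairs": 0.5, "home_hobbies": 0.5, "personal_care": 0,
}
print(cdr_sum_of_boxes(patient))  # 3.0
```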
Cognitive assessments look at numerous domains of brain
function, including executive function, language, visuospatial func-
tions, and behavior. However, given that memory is the most
common primary complaint from individuals with AD, assessments of various aspects of memory are among the most important and are typically included in both clinical and research settings. Numerous tests have been developed and validated for use in
the clinic as well, and they often serve as a primary outcome
measure in clinical trials of subjects with mild to moderate
AD. Standard clinical and cognitive assessments include the Mini
Mental State Exam (MMSE) [43], the Alzheimer’s Disease Assess-
ment Scale-Cognitive Subscale (ADAS-COG) [44], and the Mon-
treal Cognitive Assessment (MoCA) [45].
While cheap and readily available, these assessments do come
with some disadvantages. These are often pencil and paper tests
which are administered and scored by a trained rater. As such, there
is a level of subjectivity in many of these assessments that tends to result in high variability. Often these tests repeat the same questions
and tasks over and over again, which leads to practice effects. It also
is often difficult to build these assessments such that their dynamic
range can simultaneously cover both the early subtle signs of
dementia pre-symptomatically and the full decline once the indivi-
duals have experienced symptoms. This results in some tests having
substantial ceiling effects (i.e., being easy enough that there is
limited distinction between healthy individuals and those experien-
cing the subtle initial symptoms) and floor effects (i.e., the tests are
so difficult that many with a cognitive impairment cannot perform
them). There are also cultural and linguistic artifacts that may introduce bias when translating one of these tests from one language to another. As a result, there is a trend to formulate
cognitive assessments in a more objective, computational format
to reduce issues around subjectivity, language differences, and
learning effects. They may reduce the variability compared to
Machine Learning in ADRD 815
2.3 Biofluid Assessments
The most widely studied fluid-based biomarkers in AD and related disorders come from samples of cerebrospinal fluid extracted from
an individual’s spine. Measures of primary AD-related pathology
(Aβ1-42, tau, p-tau) can be obtained from these samples, as well as
peripheral information on downstream mechanisms, such as neu-
roinflammation, synaptic dysfunction, and neuronal injury
[48]. Fluid-based biomarkers are very effective in terms of being a
“state” biomarker, i.e., whether an individual has a normal or
abnormal level. They are often much cheaper than imaging assess-
ments in providing this status and thus are more likely to be used
for screening of individuals at risk for dementia. At the same time,
their ability to track change in the disease over time is currently
limited. They are in general noisier measurements, likely due to
a number of factors including consistency of extraction, storage,
and analysis methods [49]. Even when these have been held
extremely consistent, their variability is still much higher in terms
of measurement of change over time compared to cognitive and
imaging measures [50]. Despite the procedure being very safe and
continuing to improve, there is still a set of individuals who will not
wish to participate in studies involving these assessments. A far less
invasive and cheaper procedure is to extract similar measures from
the plasma. While plasma-based biomarkers have been actively
pursued for a long time, it is only very recently that they have achieved the accuracy and precision needed to compete with other established measurements
[51, 52]. There have been plasma-based assessments of amyloid-
β, different tau isoforms, and nonspecific markers of neurodegen-
eration (such as levels of the neurofilament light chain, NfL) which
show promise for detecting changes in the preclinical stage of
AD [53].
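The "state" use of fluid biomarkers described above amounts to thresholding a measured concentration against a cutoff. A minimal sketch follows; the cutoff value is a made-up example, since real cutoffs must be calibrated per assay, laboratory, and cohort.

```python
# Dichotomizing a CSF amyloid-beta 1-42 concentration into a
# normal/abnormal "state". The cutoff is hypothetical, for
# illustration only.
AB42_CUTOFF_PG_ML = 600.0  # made-up cutoff, in pg/mL

def amyloid_state(ab42_pg_ml: float) -> str:
    """Lower CSF amyloid-beta 1-42 reflects plaque deposition in the brain."""
    return "abnormal" if ab42_pg_ml < AB42_CUTOFF_PG_ML else "normal"

print(amyloid_state(450.0))  # below the cutoff -> "abnormal"
print(amyloid_state(900.0))  # above the cutoff -> "normal"
```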
2.4 Imaging Dementia
The primary use of brain imaging in clinical settings is to exclude non-neurodegenerative causes, such as normal pressure hydrocephalus, tumors, and chronic hemorrhages, and to assess the absence of atrophy, all features that can be visualized on T1-weighted MRI or computed tomography (CT) scans. Nevertheless, three-
dimensional tomographic medical imaging modalities, particularly
PET and MR imaging, provide high-precision measurements of
spatiotemporal patterns of disease burden that have proven
extremely valuable for research and also currently contribute to
the positive diagnosis.
816 Marc Modat et al.
2.4.1 Imaging Primary Pathology

Radiotracers that preferentially bind to the primary pathologies associated with AD allow the detection and tracking of the slow progressive buildup of amyloid plaques and neurofibrillary tangles, which can occur decades before the onset of symptoms [54, 55]. The original amyloid tracer was 11C Pittsburgh Compound B (PIB), and it identified individuals who were amyloid positive but showed no symptoms [56]. These individuals tended to progress to mild cognitive impairment and subsequently AD at much higher rates than those who were amyloid negative [57]. Since the introduction of PIB, numerous 18F tracers have been developed and approved for use in humans [58, 59]. As 18F-based tracers have a longer half-life than 11C, a much larger group of research centers has gained access to this technology. Tracers specifically related to tau-based pathology
have come much later. The most widely used has been flortaucipir,
with second-generation tracers now available that have overcome
some of the challenges of imaging with the early tau tracers
[60]. Findings from tau PET studies suggest that the landmark
postmortem staging of tau pathology seeding and spread according
to Braak [61] is the most common spatiotemporal pattern observed
in individuals [62, 63]. However, other subtypes of different dis-
tributions have been observed [64, 65]. Elevated tau PET uptake
often happens much later than elevation of amyloid PET [66],
especially in autosomal dominant cases of AD [67, 68]. Tau PET
is also far more strongly linked regionally with subsequent evidence
of neurodegeneration, while amyloid PET tends to elevate in a
similar manner across multiple regions at the same time [69]. Exam-
ples of amyloid and tau PET images from both patients with various
forms of AD and controls can be seen in Fig. 1. Despite many forms
of FTD being some form of tauopathy, the available PET tracers
have been primarily optimized to the specific form of tau pathology
that is primarily observed in AD, namely, the mix of 3-Repeat/4-
Repeat species observed in neurofibrillary tangles. Since there are
many different forms of tau pathology within FTD, the level of tau
PET uptake in these individuals is varied [70–72]. In other forms of
dementia, amyloid PET can be used to rule out AD pathology if an
individual with symptoms has an amyloid negative scan2 and tau
2. https://www.accessdata.fda.gov/drugsatfda_docs/nda/2012/202008_Florbetapir_Orig1s000TOC.cfm
Machine Learning in ADRD 817
Fig. 1 Example amyloid (left column) and tau (right column) PET scans from controls and patients with different
forms of dementia (AD and PCA). PET images are presented as standardized uptake value ratio (SUVR) images,
where the amyloid PET images have been normalized to subcortical WM, an area of high nonspecific binding (mainly in
myelin) for both healthy controls and patients with Alzheimer’s disease. The tau PET images have been
normalized to inferior cerebellar gray matter. While amyloid PET tends to show diffuse cortical uptake across
the brain, tau PET tends to be more focal in the areas where neurodegeneration is occurring
PET has now been approved to estimate the density and distribu-
tion of neurofibrillary tangles in individuals.3
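The SUVR images described in the Fig. 1 caption are simple ratio images: uptake normalized by mean uptake in a reference region (subcortical WM for amyloid, inferior cerebellar GM for tau). A minimal sketch of a regional SUVR computation, using NumPy with toy arrays standing in for a real PET volume and its region masks:

```python
import numpy as np

def suvr(pet: np.ndarray, target_mask: np.ndarray,
         reference_mask: np.ndarray) -> float:
    """Standardized uptake value ratio: mean uptake in a target region
    divided by mean uptake in a reference region (e.g., subcortical WM
    for amyloid PET, inferior cerebellar GM for tau PET)."""
    target = pet[target_mask].mean()
    reference = pet[reference_mask].mean()
    return float(target / reference)

# Toy 3D "image": target voxels with uptake 1.8, reference voxels 1.2
pet = np.full((4, 4, 4), 1.2)
target_mask = np.zeros_like(pet, dtype=bool)
target_mask[:2] = True          # mark half the volume as the target region
pet[target_mask] = 1.8
reference_mask = ~target_mask   # remaining voxels act as the reference

print(round(suvr(pet, target_mask, reference_mask), 2))  # 1.5
```

In practice the masks come from a segmentation or atlas parcellation registered to the PET image; the ratio itself is as simple as shown.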
2.4.2 Imaging Neurodegeneration

As the pathology continues to build over time during the pre-symptomatic period, it often leads to an insidious process of
dementia. This is evidenced by atrophy visible in the structural
T1-weighted MRI scans (Fig. 2) and decreased metabolism on
fluorodeoxyglucose (FDG)-PET (Fig. 3). These forms of imaging
start to be altered around the time when tau pathology is present
and then provide close tracking with disease severity as symptoms
become apparent. These modalities tend to be the most widely available imaging techniques within research settings, with MRI tending to be less costly than PET. Structural imaging,
due to its high resolution (1 mm), signal-to-noise ratio, and con-
trast between tissues, lends itself to high-precision measurements of
change over time. The spatial pattern of the neurodegeneration,
whether it is hypometabolism or atrophy, can provide useful infor-
mation for differential diagnosis between different dementias [73–
75]. Parallel to neurodegeneration, changes in the white matter of
individuals with dementia also show evidence of disease-related
3. https://www.accessdata.fda.gov/drugsatfda_docs/label/2020/212123s000lbl.pdf
Fig. 2 Example T1-weighted MRI scans from a healthy control and individuals
with different variants of Alzheimer’s disease. For each variant, atrophy can be
observed in areas of cortical GM which are known to cause the cognitive deficits
typically linked to the clinical phenotype (see white arrows)
Fig. 3 Example FDG PET scans from a healthy control and multiple variants of
Alzheimer’s disease. For each variant, hypometabolism (denoted by cooler
colors) is observed in areas of cortical GM which are known to cause the
cognitive deficits typically linked to the clinical phenotype. Red arrows have
been added to highlight focal areas of hypometabolism in each variant
Fig. 4 Example T1-weighted (left column) and FLAIR (right column) images of minimal, moderate, and
prevalent lesions in the white matter, often referred to as white matter hyperintensities (WMH), as they
appear bright on FLAIR acquisitions. These lesions also often show up hypointense on T1-weighted scans, but
FLAIR tends to be more sensitive and provide more contrast, particularly around deep gray matter areas
2.5 Advances in Novel Biomarkers

Novel assessments and biomarkers for all forms of dementia are highly active areas of research, from new fluid-based biomarkers to
better computerized psychometric batteries to new imaging tracers
and MR sequences to track additional aspects of the disease. While
big data for machine learning in dementia has often meant assessing
a large number of individuals, each with a small handful of mea-
sures, there are new forms of data collection that provide a rich set
of data on single individuals. This could include not only the new
epigenetic markers like single-cell RNA sequencing [86, 87] but
also wearable devices that produce large amounts of data about individuals'
daily activities and spatial navigation.
4. https://dian.wustl.edu/
5. https://www.genfi.org/
6. https://adni.loni.usc.edu
7. https://www.ukbiobank.ac.uk
as it requires using the most relevant marker for the correct stage. In practice, this often leads to the use of complex models to handle large amounts of multimodal data (imaging, clinical, genetic, demographic, ...). Additionally, each marker often suffers from its own
variability, which can be intra-patient, inter-patient, and inter-
center. For example, clinical assessments potentially differ based
on the rater. MRI acquisitions will differ due to pulse sequence
properties or scanner characteristics, as well as normal physiological
variance such as hydration and caffeine intake, among others.
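One way to see the inter-center component of this variability is a crude variance decomposition, comparing the spread of center means with the pooled within-center spread. The measurements below are invented for illustration:

```python
import numpy as np

# Hypothetical cognitive-score measurements grouped by acquisition center.
# Compare the standard deviation of the center means (inter-center spread)
# with the pooled within-center standard deviation (intra-center spread).
measurements = {
    "center_a": np.array([26.0, 27.0, 25.0, 26.5]),
    "center_b": np.array([23.0, 24.0, 22.5, 23.5]),
    "center_c": np.array([28.0, 27.5, 29.0, 28.5]),
}

center_means = np.array([v.mean() for v in measurements.values()])
between_sd = center_means.std(ddof=1)  # spread across center means
within_sd = np.sqrt(
    np.mean([v.var(ddof=1) for v in measurements.values()])
)  # pooled spread inside each center

print(f"between-center SD: {between_sd:.2f}")
print(f"within-center SD:  {within_sd:.2f}")
```

When the between-center spread dominates, as in these invented numbers, pooling raw scores across centers without harmonization would confound center with disease effects.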
4.1 Machine Learning-Derived Biomarkers

The largest area of machine learning research in relation to Alzheimer's disease and related disorders is the extraction of measurements from the different datasets. These biomarkers tend to reflect an aspect of function or integrity of the individual that provides a better understanding of the disease. Changes in these biomarkers from normal values to abnormal provide a proxy for disease progression. Note that much of this research is also useful for other brain disorders.
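One common way to quantify such a shift from normal to abnormal is to express an individual's measurement as a z-score against a control distribution and track it across visits. A sketch with invented control and patient values:

```python
import numpy as np

# Express a biomarker as a z-score relative to a control distribution, so
# that drift from normal toward abnormal values can be tracked over visits.
# All values below are invented for illustration.
controls = np.array([3.1, 2.8, 3.3, 3.0, 2.9, 3.2, 3.1, 3.0])  # e.g., volumes (mL)
mu, sigma = controls.mean(), controls.std(ddof=1)

visits = [3.0, 2.8, 2.5, 2.2]  # the same patient at successive visits
z_scores = [(v - mu) / sigma for v in visits]
for visit, z in zip(visits, z_scores):
    print(f"measurement {visit:.1f} -> z = {z:+.1f}")
```

In real pipelines the control distribution would additionally be adjusted for covariates such as age and head size, but the principle of referencing an individual against normal values is the same.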
8. https://fsl.fmrib.ox.ac.uk/fsl/fslwiki
9. https://www.fil.ion.ucl.ac.uk/spm/
10. https://www.nitrc.org/projects/hammer/
11. https://surfer.nmr.mgh.harvard.edu/
interval, typically 3 years,12 from those who are going to stay stable.
Several literature reviews on dementia classification and prediction
have been published [125–130]. In particular, recent reviews by Jo
et al. [126] and Ansart et al. [130] have covered the topic of
prognosis.
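The task described above — separating individuals who will progress within a fixed interval from those who will remain stable — reduces to binary classification on extracted features. The sketch below uses a minimal nearest-centroid linear classifier on synthetic data as a stand-in for the classifiers used in the literature:

```python
import numpy as np

# Synthetic stand-in for regional atrophy features: "progressors" (label 1)
# get slightly lower simulated volumes than "stable" individuals (label 0).
rng = np.random.default_rng(0)
n_regions = 10
stable = rng.normal(0.0, 1.0, size=(100, n_regions))        # label 0
progressors = rng.normal(-0.8, 1.0, size=(100, n_regions))  # label 1

X = np.vstack([stable, progressors])
y = np.repeat([0, 1], 100)

# Split once into train/test halves, keeping the test half untouched during
# training — evaluating on data seen during training is the kind of leakage
# reviews of this literature warn about.
idx = rng.permutation(len(y))
train, test = idx[:100], idx[100:]

# Nearest-centroid rule: assign each test sample to the closest class mean.
centroids = np.stack([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == y[test]).mean()
print(f"test accuracy: {accuracy:.2f}")
```

Real studies replace the synthetic features with measurements from segmented MRI or PET and the toy classifier with SVMs, random forests, or deep networks, but the structure of the problem — features, labels, and a leakage-free evaluation split — is the same.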
The proposed methodologies have been extremely varied not
only in terms of ML algorithms but also of input modalities and
extracted features. Initial ML methods often used support vector
machines (e.g., [112, 114]), while subsequent works used more
recent techniques such as random forest [117] and Gaussian pro-
cesses [121, 131]. Many recent works have used deep learning
classification techniques [127, 129]. However, so far, deep learning
has not outperformed classical ML for AD classification and pre-
diction [129, 130, 132]. Furthermore, a review on convolutional
neural networks for AD diagnosis from T1-weighted MRI [129]
has identified that more than half of these deep learning studies may
have been contaminated by data leakage, which is particularly wor-
risome. In terms of features, some studies use the whole brain as
input, either using the raw image directly or computing voxel-wise
(or vertex-wise when considering the cortical surface) measures
[125]. Others parcellate the brain into regions of interest, within
which features are computed. In particular, researchers have com-
bined segmentation approaches with disease classification techni-
ques. This has the advantage of limiting the search space of the
machine learning approach via the use of prior knowledge. For
example, Coupé et al. used a patch-based approach to classify voxels
from the hippocampus as it is known to be a vulnerable structure in
patients with dementia [116]. The input modalities have also been
extremely varied. While earlier works often focused on
T1-weighted MRI only [112–114], subsequent studies have
included other imaging modalities, in particular FDG-PET
[121, 123]. Other researchers have combined tailored features extracted from images and non-imaging features such as fluid biomarkers [121], cognitive tests [124, 133], APOE genotype [121],
or genome-wide genotyping data [134, 135]. Through deep learning, researchers avoid the need to craft features and can directly infer disease status from raw data, imaging or non-imaging. However, to date, there has been less interest in this area than in biomarker extraction, and thus fewer innovative solutions have been proposed. Popular architectures include convolutional neural networks, autoencoders, and recurrent neural networks, among others [127]. Training strategies have mostly relied on supervised approaches, where some groups
have relied on pre-trained networks to compensate for relatively
12. While 3 years is certainly a relevant time frame to provide useful information to patients and relatives, it is likely that the focus of research on such a time frame was largely driven by the typical follow-up available for most patients in large publicly available databases such as ADNI.
4.4 Data Harmonization

Data heterogeneity inducing bias arises from many sources, including differences in acquisition protocols, acquisition devices, or populations under study. This is true in large multicenter studies such as ADNI, DIAN, and GENFI. However, such research studies use harmonized protocols for data acquisition and perform strict
5 What Is Next?
13. https://edon-initiative.org
14. https://www.ftdtalk.org/research/our-projects/digital/
6 Conclusion
Acknowledgements
References
1. Gauthier S, Rosa-Neto P, Morais JA, Webster 6. Ryman DC, Acosta-Baena N, Aisen PS,
C (2021) World Alzheimer report 2021: jour- Bird T, Danek A, Fox NC, Goate A,
ney through the diagnosis of dementia. Alz- Frommelt P, Ghetti B, Langbaum JBS,
heimer’s Disease International. https://www. Lopera F, Martins R, Masters CL, Mayeux
alzint.org/resource/world-alzheimer-report- RP, McDade E, Moreno S, Reiman EM,
2021/ Ringman JM, Salloway S, Schofield PR,
2. McKhann G, Drachman D, Folstein M, Sperling R, Tariot PN, Xiong C, Morris JC,
Katzman R, Price D, Stadlan EM (1984) Bateman RJ (2014) Symptom onset in auto-
Clinical diagnosis of Alzheimer’s disease: somal dominant Alzheimer disease: a system-
report of the NINCDS-ADRDA work group atic review and meta-analysis. Neurology 83:
under the auspices of department of health 253–260. https://doi.org/10.1212/WNL.
and human services task force on Alzheimer’s 0000000000000596
disease. Neurology 34:939–944. https:// 7. Farrer LA, Cupples LA, Haines JL, Hyman B,
pubmed.ncbi.nlm.nih.gov/6610841/ Kukull WA, Mayeux R, Myers RH, Pericak-
3. McKhann GM, Knopman DS, Chertkow H, Vance MA, Risch N, van Duijn CM (1997)
Hyman BT, Clifford RJ Jr, Kawas CH, Klunk Effects of age, sex, and ethnicity on the asso-
WE, Koroshetz WJ, Manly JJ, Mayeux R, ciation between apolipoprotein E genotype
Mohs RC, Morris JC, Rossor MN, and Alzheimer disease: a meta-analysis.
Scheltens P, Carrillo MC, Thies B, JAMA 278(16):1349–1356. https://doi.
Weintraub S, Phelps CH (2011) The diagno- org/10.1001/jama.1997.03550160069041
sis of dementia due to Alzheimer’s disease: 8. Genin E, Hannequin D, Wallon D,
recommendations from the National Institute Sleegers K, Hiltunen M, Combarros O, Bul-
on Aging-Alzheimer’s Association work- lido MJ, Engelborghs S, Deyn PD, Berr C,
groups on diagnostic guidelines for Alzhei- Pasquier F, Dubois B, Tognoni G, Fiévet N,
mer’s disease. Alzheimer’s Dementia 7:263– Brouwers N, Bettens K, Arosio B, Coto E,
2 6 9 . h t t p s : // d o i . o r g / 1 0 . 1 0 1 6 / J . Zompo MD, Mateo I, Epelbaum J, Frank-
JALZ.2011.03.005 Garcia A, Helisalmi S, Porcellini E,
4. Jack CR, Albert MS, Knopman DS, McKhann Pilotto A, Forti P, Ferri R, Scarpini E,
GM, Sperling RA, Carrillo MC, Thies B, Siciliano G, Solfrizzi V, Sorbi S, Spalletta G,
Phelps CH (2011) Introduction to the Valdivieso F, Veps€al€ainen S, Alvarez V,
recommendations from the National Institute Bosco P, Mancuso M, Panza F, Nacmias B,
on Aging-Alzheimer’s Association work- Boss P, Hanon O, Piccardi P, Annoni G,
groups on diagnostic guidelines for Alzhei- Seripa D, Galimberti D, Licastro F,
mer’s disease. Alzheimer’s Dementia 7:257– Soininen H, Dartigues JF, Kamboh MI,
2 6 2 . h t t p s : // d o i . o r g / 1 0 . 1 0 1 6 / J . Broeckhoven CV, Lambert JC, Amouyel P,
JALZ.2011.03.004 Campion D (2011) APOE and Alzheimer
5. Ryan NS, Nicholas JM, Weston PSJ, Liang Y, disease: a major gene with semi-dominant
Lashley T, Guerreiro R, Adamson G, Kenny J, inheritance. Mol Psychiatry 16:903–907.
Beck J, Chavez-Gutierrez L, de Strooper B, https://doi.org/10.1038/mp.2011.52
Revesz T, Holton J, Mead S, Rossor MN, Fox 9. Crutch SJ, Schott JM, Rabinovici GD,
NC (2016) Clinical phenotype and genetic Murray M, Snowden JS, Flier WM, Dickerson
associations in autosomal dominant familial BC, Vandenberghe R, Ahmed S, Bak TH,
Alzheimer’s disease: a case series. Lancet Boeve BF, Butler C, Cappa SF, Ceccaldi M,
Neuro l 15 :132 6–1 335. h tt p s : //d oi. Souza LC, Dubois B, Felician O, Galasko D,
org/10.1016/S1474-4422(16)30193-4 Graff-Radford J, Graff-Radford NR, Hof PR,
Machine Learning in ADRD 835
Krolak-Salmon P, Lehmann M, Magnin E, 14. Warren JD, Rohrer JD, Rossor MN (2013)
Mendez MF, Nestor PJ, Onyike CU, Pelak Frontotemporal dementia. BMJ 347. https://
VS, Pijnenburg Y, Primativo S, Rossor MN, doi.org/10.1136/BMJ.F4827
Ryan NS, Scheltens P, Shakespeare TJ, Gon- 15. Moore KM, Nicholas J, Grossman M, McMil-
zález AS, Tang-Wai DF, Yong KX, Carrillo M, lan CT, Irwin DJ, Massimo L, Deerlin VMV,
Fox NC (2017) Consensus classification of Warren JD, Fox NC et al (2020) Age at symp-
posterior cortical atrophy. Alzheimer’s tom onset and death and disease duration in
Dementia 13 :87 0–884. https://doi. genetic frontotemporal dementia: an interna-
org/10.1016/j.jalz.2017.01.014 tional retrospective cohort study. Lancet
10. Gorno-Tempini ML, Hillis AE, Weintraub S, N e u r o l 1 9 : 1 4 5 – 1 5 6 . h t t p s : // d o i .
Kertesz A, Mendez M, Cappa SF, Ogar JM, org/10.1016/S1474-4422(19)30394-1
Rohrer JD, Black S, Boeve BF, Manes F, 16. Rascovsky K, Hodges JR, Knopman D, Men-
Dronkers NF, Vandenberghe R, dez MF, Kramer JH, Neuhaus J, Swieten JCV,
Rascovsky K, Patterson K, Miller BL, Knop- Seelaar H, Dopper EG, Onyike CU, Hillis
man DS, Hodges JR, Mesulam MM, Gross- AE, Josephs KA et al (2011) Sensitivity of
man M (2011) Classification of primary revised diagnostic criteria for the behavioural
progressive aphasia and its variants. Neurol- variant of frontotemporal dementia. Brain
ogy 76:1006–1014. https://doi.org/10. 134:2456–2477. https://doi.org/10.1093/
1212/WNL.0B013E31821103E6 BRAIN/AWR179
11. Nelson PT, Dickson DW, Trojanowski JQ, 17. Mackenzie IR, Neumann M (2016) Molecu-
Jack CR, Boyle PA, Arfanakis K, lar neuropathology of frontotemporal demen-
Rademakers R, Alafuzoff I, Attems J, tia: insights into disease mechanisms from
Brayne C, Coyle-Gilchrist ITS, Chui HC, postmortem studies. J Neurochem 138:54–
Fardo DW, Flanagan ME, Halliday G, Hok- 70. https://doi.org/10.1111/JNC.13588
kanen SRK, Hunter S, Jicha GA, Katsumata Y, 18. Lashley T, Rohrer JD, Mead S, Revesz T
Kawas CH, Keene CD, Kovacs GG, Kukull (2015) Review: an update on clinical, genetic
WA, Levey AI, Makkinejad N, Montine TJ, and pathological aspects of frontotemporal
Murayama S, Murray ME, Nag S, Rissman lobar degenerations. Neuropathol Appl Neu-
RA, Seeley WW, Sperling RA, III CLW, robiol 41:858–881. https://doi.org/10.
Yu L, Schneider JA (2019) Limbic- 1111/NAN.12250
predominant age-related TDP-43 encepha-
lopathy (LATE): consensus working group 19. Whitwell JL, Przybelski SA, Weigand SD,
report. Brain 142:1503–1527. https://doi. Ivnik RJ, Vemuri P, Gunter JL, Senjem ML,
org/10.1093/brain/awz099 Shiung MM, Boeve BF, Knopman DS, Parisi
JE, Dickson DW, Petersen RC, Jack CR,
12. Román GC, Tatemichi TK, Erkinjuntti T, Josephs KA (2009) Distinct anatomical sub-
Cummings JL, Masdeu JC, Garcia JH, types of the behavioural variant of frontotem-
Amaducci L, Orgogozo JM, Brun A, poral dementia: a cluster analysis study. Brain
Hofman A, Moody DM, O’Brien MD, 132:2932–2946. https://doi.org/10.1093/
Yamaguchi T, Grafman J, Drayer BP, Bennett BRAIN/AWP232
DA, Fisher M, Ogata J, Kokmen E,
Bermejo F, Wolf PA, Gorelick PB, Bick KL, 20. Gordon E, Rohrer JD, Fox NC (2016)
Pajeau AK, Bell MA, Decarli C, Culebras A, Advances in neuroimaging in frontotemporal
Korczyn AD, Bogousslavsky J, Hartmann A, dementia. J Neurochem 138:193–210.
Scheinberg P (1993) Vascular dementia: diag- https://doi.org/10.1111/JNC.13656
nostic criteria for research studies. report of 21. Bocchetta M, Malpetti M, Todd EG, Rowe
the NINDS-AIREN international workshop. JB, Rohrer JD (2021) Looking beneath the
Neurology 43:250–260. https://doi.org/10. surface: the importance of subcortical struc-
1212/WNL.43.2.250 tures in frontotemporal dementia. Brain
13. Joutel A, Vahedi K, Corpechot C, Troesch A, Commun 3. https://doi.org/10.1093/
Chabriat H, Vayssière C, Cruaud C, BRAINCOMMS/FCAB158
Maciazek J, Weissenbach J, Bousser MG, 22. Rohrer JD, Warren JD, Modat M, Ridgway
Bach JF, Tournier-Lasserve E (1997) Strong GR, Douiri A, Rossor MN, Ourselin S, Fox
clustering and stereotyped nature of Notch3 NC (2009) Patterns of cortical thinning in the
mutations in CADASIL patients. Lancet 350: language variants of frontotemporal lobar
1511–1515. https://doi.org/10.1016/ degeneration. Neurology 72:1562–1569.
S0140-6736(97)08083-5 h t t p s : // d o i . o r g / 1 0 . 1 2 1 2 / W N L .
0B013E3181A4124E
836 Marc Modat et al.
23. Rohrer JD, Lashley T, Schott JM, Warren JE, 28. Sha SJ, Takada LT, Rankin KP, Yokoyama JS,
Mead S, Isaacs AM, Beck J, Hardy J, Silva RD, Rutherford NJ, Fong JC, Khan B, Karydas A,
Warrington E, Troakes C, Al-Sarraj S, King A, Baker MC, DeJesus-Hernandez M, Sha SJ,
Borroni B, Clarkson MJ, Ourselin S, Holton Takada LT, Rankin KP, Yokoyama JS, Ruther-
JL, Fox NC, Revesz T, Rossor MN, Warren ford NJ, Fong JC, Khan B, Karydas A, Baker
JD (2011) Clinical and neuroanatomical sig- MC, DeJesus-Hernandez M, Pribadi M,
natures of tissue pathology in frontotemporal Coppola G, Geschwind DH, Rademakers R,
lobar degeneration. Brain 134:2565–2581. Lee SE, Seeley W, Miller BL, Boxer AL
h t t p s : // d o i . o r g / 1 0 . 1 0 9 3 / B R A I N / (2012) Frontotemporal dementia due to
AWR198 c9orf72 mutations: clinical and imaging fea-
24. Woollacott IO, Rohrer JD (2016) The clinical tures. Neurology 79:1002–1011. https://
spectrum of sporadic and familial forms of d o i . o r g / 1 0 . 1 2 1 2 / W N L .
frontotemporal dementia. J Neurochem 138: 0b013e318268452e
6–31. https://doi.org/10.1111/JNC.13654 29. Young AL, Marinescu RV, Oxtoby NP,
25. Rohrer JD, Nicholas JM, Cash DM, van Bocchetta M, Yong K, Firth NC, Cash DM,
Swieten J, Dopper E, Jiskoot L, van Thomas DL, Dick KM, Cardoso J (2018)
Minkelen R, a Rombouts S, Cardoso MJ, Uncovering the heterogeneity and temporal
Clegg S, Espak M, Mead S, Thomas DL, complexity of neurodegenerative diseases
Vita ED, Masellis M, Black SE, with subtype and stage inference. Nat Com-
Freedman M, Keren R, MacIntosh BJ, mun 9:4273
Rogaeva E, Tang-Wai D, Tartaglia MC, 30. Young AL, Bocchetta M, Russell LL, Convery
Laforce R, Tagliavini F, Tiraboschi P, RS, Peakman G, Todd E, Cash DM, Greaves
Redaelli V, Prioni S, Grisoli M, Borroni B, CV, van Swieten J, Jiskoot L, Seelaar H,
Padovani A, Galimberti D, Scarpini E, Moreno F, Sanchez-Valle R, Borroni B,
Arighi A, Fumagalli G, Rowe JB, Coyle- Laforce R, Masellis M, Tartaglia MC,
Gilchrist I, Graff C, Fallström M, Jelic V, Graff C, Galimberti D, Rowe JB, Finger E,
Ståhlbom AK, Andersson C, Thonberg H, Synofzik M, Vandenberghe R, de
Lilius L, Frisoni GB, Binetti G, Pievani M, Mendonça A, Tagliavini F, Santana I,
Bocchetta M, Benussi L, Ghidoni R, Ducharme S, Butler C, Gerhard A, Levin J,
Finger E, Sorbi S, Nacmias B, Lombardi G, Danek A, Otto M, Sorbi S, Williams SC, Alex-
Polito C, Warren JD, Ourselin S, Fox NC, ander DC, Rohrer JD (2021) Characterizing
Rossor MN (2015) Presymptomatic cognitive the clinical features and atrophy patterns of
and neuroanatomical changes in genetic fron- mapt-related frontotemporal dementia with
totemporal dementia in the Genetic Fronto- disease progression modeling. Neurology 97:
temporal dementia Initiative (GENFI) study: e941–e952. https://doi.org/10.1212/
a cross-sectional analysis. Lancet Neurol 14: WNL.0000000000012410
253–262. https://doi.org/10.1016/S1474- 31. McKeith IG, Boeve BF, DIckson DW,
4422(14)70324-2 Halliday G, Taylor JP, Weintraub D,
26. Cash DM, Bocchetta M, Thomas DL, Dick Aarsland D, Galvin J, Attems J, Ballard CG,
KM, van Swieten JC, Borroni B, Bayston A, Beach TG, Blanc F, Bohnen N,
Galimberti D, Masellis M, Tartaglia MC, Bonanni L, Bras J, Brundin P, Burn D,
Rowe JB, Graff C, Tagliavini F, Frisoni GB, Chen-Plotkin A, Duda JE, El-Agnaf O,
Laforce RJ, Finger E, de Mendonca A, Feldman H, Ferman TJ, Ffytche D,
Sorbi S, Rossor MN, Ourselin S, Rohrer JD Fujishiro H, Galasko D, Goldman JG, Gom-
(2018) Patterns of grey matter atrophy in perts SN, Graff-Radford NR, Honig LS,
genetic frontotemporal dementia: results Iranzo A, Kantarci K, Kaufer D, Kukull W,
from the genfi study. Neurobiol Aging 62: Lee VM, Leverenz JB, Lewis S, Lippa C,
191–196. https://doi.org/10.1016/j. Lunde A, Masellis M, Masliah E, McLean P,
neurobiolaging.2017.10.008 Mollenhauer B, Montine TJ, Moreno E,
27. Whitwell JL, Weigand SD, Boeve BF, Senjem Mori E, Murray M, O’Brien JT, Orimo S,
ML, Gunter JL, Dejesus-Hernandez M, Postuma RB, Ramaswamy S, Ross OA,
Rutherford NJ, Baker M, Knopman DS, Salmon DP, Singleton A, Taylor A,
Wszolek ZK, Parisi JE, Dickson DW, Petersen Thomas A, Tiraboschi P, Toledo JB, Troja-
RC, Rademakers R, Jack CR, Josephs KA nowski JQ, Tsuang D, Walker Z, Yamada M,
(2012) Neuroimaging signatures of fronto- Kosaka K (2017) Diagnosis and management
temporal dementia genetics: C9orf72, tau, of dementia with lewy bodies. Neurology 89:
progranulin and sporadics. Brain 135:794. 88–100. https://doi.org/10.1212/WNL.
https://doi.org/10.1093/BRAIN/AWS001 0000000000004058
Machine Learning in ADRD 837
32. Rongve A, Aarsland D (2020) The Lewy body 41. Emmerzaal TL, Kiliaan AJ, Gustafson DR
dementias: dementia with Lewy bodies and (2015) 2003-2013: a decade of body mass
Parkinson’s disease dementia. In: Oxford index, Alzheimer’s disease, and dementia. J
textbook of old age psychiatry, pp 495–512. Alzheimer’s Dis 43:739–755. https://doi.
h t t p s : // d o i . o r g / 1 0 . 1 0 9 3 / M E D / org/10.3233/JAD-141086
9780198807292.003.0032 42. Morris JC (1993) The Clinical Dementia
33. Gatz M, Reynolds CA, Fratiglioni L, Rating (CDR): current version and scoring
Johansson B, Mortimer JA, Berg S, Fiske A, rules. Neurology 43:2412–2414. http://
Pedersen NL (2006) Role of genes and envir- www.ncbi.nlm.nih.gov/pubmed/8232972
onments for explaining Alzheimer disease. 43. Folstein MF, Folstein SE, McHugh PR
Arch Gen Psychiatry 63:168–174. https:// (1975) Mini-mental state. A practical method
doi.org/10.1001/ARCHPSYC.63.2.168 for grading the cognitive state of patients for
34. Selkoe DJ, Hardy J (2016) The amyloid the clinician. J Psychiatric Research 12:189–
hypothesis of Alzheimer’s disease at 25 years. 198. https://doi.org/10.1016/0022-3956
EMBO Mol Med 8:595–608. https://doi. (75)90026-6
org/10.15252/emmm.201606210 44. Rosen WG, Mohs RC, Davis KL (1984) A
35. Borroni B, Ferrari F, Galimberti D, new rating scale for Alzheimer’s disease. Am
Nacmias B, Barone C, Bagnoli S, J Psychiatry 141. https://doi.org/10.1176/
Fenoglio C, Piaceri I, Archetti S, AJP.141.11.1356
Bonvicini C, Gennarelli M, Turla M, 45. Nasreddine ZS, Phillips NA, Bédirian V,
Scarpini E, Sorbi S, Padovani A (2014) Het- Charbonneau S, Whitehead V, Collin I, Cum-
erozygous TREM2 mutations in frontotem- mings JL, Chertkow H (2005) The Montreal
poral dementia. Neurobiol Aging 35:934. Cognitive Assessment, MoCA: a brief screen-
e7–934.e10. https://doi.org/10.1016/J. ing tool for mild cognitive impairment. J Am
NEUROBIOLAGING.2013.09.017 Geriatr Soc 53:695–699. https://doi.
36. Chasioti D, Yan J, Nho K, Saykin AJ (2019) org/10.1111/J.1532-5415.2005.53221.X
Progress in polygenic composite scores in Alz- 46. Barnett JH, Blackwell AD, Sahakian BJ, Rob-
heimer’s and other complex diseases. Trends bins TW (2016) The Paired Associates
Genet 35:371. https://doi.org/10.1016/J. Learning (PAL) test: 30 years of CANTAB
TIG.2019.02.005 translational neuroscience from laboratory to
37. Altmann A, Scelsi MA, Shoai M, Silva ED, bedside in dementia research. Curr Top Behav
Aksman LM, Cash DM, Hardy J, Schott JM N e u r o s c i 2 8 : 4 4 9 – 4 7 4 . h t t p s : // d o i .
(2020) A comprehensive analysis of methods org/10.1007/7854_2015_5001
for assessing polygenic burden on Alzheimer’s 47. Brooker H, Williams G, Hampshire A,
disease pathology and risk beyond APOE. Corbett A, Aarsland D, Cummings J, Moli-
Brain Commun 2. https://doi.org/10. nuevo JL, Atri A, Ismail Z, Creese B,
1093/BRAINCOMMS/FCZ047 Fladby T, Thim-Hansen C, Wesnes K, Ballard
38. Stocker H, Möllers T, Perna L, Brenner H C (2020) Flame: a computerized neuropsy-
(2018) The genetic risk of Alzheimer’s disease chological composite for trials in early demen-
beyond APOE E4: systematic review of Alz- tia. Alzheimer’s Dementia 12. https://doi.
heimer’s genetic risk scores. Transl Psychiatry org/10.1002/DAD2.12098
8:1–9. https://doi.org/10.1038/s41398-01 48. Blennow K, Zetterberg H (2018) Biomarkers
8-0221-8 for Alzheimer’s disease: current status and
39. Lane CA, Barnes J, Nicholas JM, Sudre CH, prospects for the future. J Intern Med 284:
Cash DM, Parker TD, Malone IB, Lu K, 643–663. https://doi.org/10.1111/JOIM.
James SN, Keshavan A (2019) Associations 12816
between blood pressure across adulthood 49. Fourier A, Portelius E, Zetterberg H,
and late-life brain structure and pathology in Blennow K, Quadrio I, Perret-Liaudet A
the neuroscience substudy of the 1946 British (2015) Pre-analytical and analytical factors
birth cohort (Insight 46): an epidemiological influencing Alzheimer’s disease cerebrospinal
study. Lancet Neurol 18:942–952 fluid biomarker variability. Clin Chim Acta
40. McGrath ER, Beiser AS, DeCarli C, Plourde 449:9–15. https://doi.org/10.1016/J.
KL, Vasan RS, Greenberg SM, Seshadri S CCA.2015.05.024
(2017) Blood pressure from mid- to late life 50. Mattsson N, Andreasson U, Persson S, Car-
and risk of incident dementia. Neurology 89: rillo MC, Collins S, Chalbot S, Cutler N,
2447–2454. https://doi.org/10.1212/ Dufour-Rainfray D, Fagan AM, Heegaard
WNL.0000000000004741 NH, Hsiung GYR, Hyman B, Iqbal K,
838 Marc Modat et al.
Lachno DR, Lleó A, Lewczuk P, Molinuevo individuals from families with autosomal
JL, et al (2013) CSF biomarker variability in dominant Alzheimer’s disease: a longitudinal
the Alzheimer’s association quality control study. Lancet Neurol 17:241–250. https://
program. Alzheimer’s Dementia 9:251–261. doi.org/10.1016/S1474-4422(18)30028-0
h t t p s : // d o i . o r g / 1 0 . 1 0 1 6 / J . 56. Aizenstein HJ, Nebes RD, Saxton JA, Price
JALZ.2013.01.010 JC, Mathis CA, Tsopelas ND, Ziolko SK,
51. Nakamura A, Kaneko N, Villemagne VL, James JA, Snitz BE, Houck PR, Bi W, Cohen
Kato T, Doecke J, Doré V, Fowler C, Li QX, AD, Lopresti BJ, DeKosky ST, Halligan EM,
Martins R, Rowe C, Tomita T, Matsuzaki K, Klunk WE (2008) Frequent amyloid deposi-
Ishii K, Ishii K, Arahata Y, Iwamoto S, Ito K, tion without significant cognitive impairment
Machine Learning in ADRD 839
842 Marc Modat et al.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made. The images or other
third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Chapter 26
Abstract
Parkinson’s disease is a complex, heterogeneous neurodegenerative disorder characterized by the loss of dopaminergic neurons in the basal ganglia, resulting in many motor and non-motor symptoms. Although there is no cure to date, dopamine replacement therapy can improve motor symptoms and patients’ quality of life. The cardinal symptoms of this disorder are tremor, bradykinesia, and rigidity, collectively referred to as parkinsonism. Related disorders, such as dementia with Lewy bodies, multiple system atrophy, and progressive supranuclear palsy, share similar motor symptoms although they have different pathophysiologies and respond less well to dopamine replacement therapy. Machine learning can be of great utility for better understanding Parkinson’s disease and related disorders and for improving patient care. Many challenges remain open, including accurate early diagnosis, differential diagnosis, better understanding of the pathologies, symptom detection and quantification, prediction of individual disease progression, and personalized therapies. In this chapter, we review research on machine learning applied to Parkinson’s disease and related disorders.
Key words Clinical decision support, Deep learning, Disease understanding, Machine learning, Multiple system atrophy, Parkinson’s disease, Parkinsonian syndromes, Parkinsonism, Precision medicine, Progressive supranuclear palsy
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_26,
© The Author(s) 2023
848 Johann Faouzi et al.
of the disease have been described. More than 20 loci and asso-
ciated genes have been identified to be responsible for autosomal
dominant or recessive forms of the disease, and more than
90 genetic risk factors have been associated with sporadic PD
[3]. Although rare, genetic forms of the disease have brought
important insights on the causes and pathological mechanisms of
PD [4]. Among them, aggregation and spreading of misfolded α-synuclein, the protein enriched in Lewy bodies, is thought to play a key role in the pathophysiology of the disease.
The loss of dopamine innervation of the basal ganglia network
in the brain leads to the cardinal motor symptoms of the disease
(parkinsonism): rest tremor, akinesia, and rigidity [2]. However,
the spreading of the synucleinopathy (aggregation of α-synuclein
protein) and neuronal loss outside the dopaminergic pathway is
associated with other non-motor symptoms like anosmia, sleep
disorders, dysautonomia, and progressive cognitive decline. Some
of these symptoms, particularly anosmia, constipation, and sleep
disorders, can precede the motor phase during a long prodromal
phase [5].
There is no cure for PD. The therapeutic strategy relies on dopamine replacement therapy with levodopa or dopamine agonists, which alleviates motor symptoms. However, dopamine replacement therapy does not change the course of the disease, and its long-term benefit is hampered by motor complications (motor fluctuations and abnormal movements called dyskinesia), related both to the progression of the neuronal loss and to pre- and postsynaptic plasticity induced by the treatment. In addition, dopamine replacement therapy has no benefit on non-motor symptoms not related to the loss of dopaminergic neurons.
PD is the most frequent synucleinopathy. Other neurodegen-
erative diseases share some clinical and pathophysiological features
of PD. Multiple system atrophy (MSA) is a rare disease associated
with parkinsonism with poor response to levodopa, early dysautonomia, and/or cerebellar symptoms [6]. The synucleinopathy affects the substantia nigra, but also the striatum and the cerebellum, and α-synuclein aggregates are also observed in glial cells. There are two
variants of MSA: the parkinsonian variant (MSA-P) characterized
by parkinsonism and the cerebellar variant (MSA-C) characterized
by gait ataxia with cerebellar dysarthria. Dementia with Lewy bod-
ies (DLB), the second most common neurodegenerative dementia
after Alzheimer’s disease, is characterized by early cognitive decline,
hallucinations, and levodopa-responsive motor symptoms
[7]. However, whether DLB and PD with dementia are really two
distinct entities is still a matter of debate. There are also other rare atypical parkinsonian syndromes, not related to a synucleinopathy.
Progressive supranuclear palsy (PSP) is a tauopathy (aggregation of tau protein) characterized by a levodopa-unresponsive, axially predominant parkinsonism, early falls, supranuclear gaze palsy, and a frontal syndrome.
Machine Learning for Parkinson’s Disease and Related Disorders 849
2 Diagnosis
Table 1
Main characteristics of Parkinson’s disease and its related disorders
Table 2
Summary of the potential benefits of machine learning for Parkinson’s disease and related disorders
2.1 Parkinson’s Disease Diagnosis Compared to Healthy Subjects
Given the much larger prevalence of Parkinson’s disease compared to the atypical parkinsonian syndromes, gathering data from PD patients and HC is naturally easier, especially easy-to-collect data from sensors compared to clinical, imaging, or genetic data.
Digital technologies including wearable sensors, smartphone applications, and smart algorithms are receiving rapidly growing interest and are beginning to move toward medical applications, particularly in PD [9]. Two main types of sensor data are usually considered: voice data and motion data. Given that the cardinal symptoms of PD are motor, motion data is a natural choice, but speech also involves motor muscles. Dysarthria, a motor speech disorder in which the muscles involved in producing speech are damaged, paralyzed, or weakened, is a symptom of PD.
2.1.1 PD Diagnosis Using Motion Data
Several types of sensors have been investigated to collect motion data depending on the movements of interest.
Wahid and colleagues [10] investigated the discrimination
between PD patients and healthy controls using gait data collected
during self-selected walking. They extracted spatial-temporal fea-
tures, such as stride length, stance time, swing time, and step
length, from the signals and investigated different strategies of
data normalization using dimensionless equations and multiple
regression and different machine learning algorithms such as
naive Bayes (NB), k-nearest neighbors (kNN), support vector machines (SVM), and random forests (RF). They obtained the best predictive performance with the random forest trained on features normalized using multiple regression.

Table 3
Overview of the reviewed studies. Each entry lists: study; section and task; data; data set; subjects; machine learning methods.
Kostikis et al. [12]; Sect. 2.1.1, PD diagnosis (PD vs HC); motion data (hand movement); local; 25 PD, 20 HC; NB, LR, SVM, AdaBoost, DT, RF
Kotsavasiloglou et al. [13]; Sect. 2.1.1, PD diagnosis (PD vs HC); motion data (hand movement); local; 24 PD, 20 HC; NB, LR, SVM, AdaBoost
Amato et al. [14]; Sect. 2.1.2, PD diagnosis (PD vs HC); voice data; IPVC, local; 28 PD, 22 HC; 26 PD, 18 HC; NB, kNN, SVM, RF, AdaBoost, gradient boosting, bagging ensemble
Jeancolas et al. [15]; Sect. 2.1.2, PD diagnosis (PD vs HC); voice data; local; 121 PD, 100 HC; MFCC-GMM, X-vector
Jeancolas et al. [16]; Sect. 2.1.2, PD diagnosis (PD vs HC); voice data; local; 117 PD, 41 iRBD, 98 HC; SVM
Quan et al. [17]; Sect. 2.1.2, PD diagnosis (PD vs HC); voice data; local; 30 PD, 15 HC; DT, kNN, NB, SVM, MLP, LSTM
Adeli et al. [21]; Sect. 2.1.3, PD diagnosis (PD vs HC); imaging data (MRI); PPMI; 374 PD, 169 HC; JFSS + robust LDA
Solana-Lavalle and Rosas-Romero [22]; Sect. 2.1.3, PD diagnosis (PD vs HC); imaging data (MRI); PPMI; 114 PD, 49 HC; 234 PD, 110 HC; kNN, SVM, RF, NB, LR, MLP, BNN
Mudali et al. [23]; Sect. 2.1.3, PD diagnosis (PD vs HC) and differential diagnosis; imaging data (FDG-PET); local; 20 PD, 21 MSA, 17 PSP, 18 HC; DT
Huppertz et al. [26]; Sect. 2.2, differential diagnosis; imaging data (MRI); local; 204 PD, 106 PSP, 21 MSA-C, 60 MSA-P, 73 HC; SVM
Archer et al. [27]; Sect. 2.2, differential diagnosis; imaging data (MRI); local; 511 PD, 84 MSA, 129 PSP, 278 HC; SVM
Chougar et al. [28]; Sect. 2.2, differential diagnosis; imaging data (MRI); local; 63 PD, 21 PSP, 11 MSA-P, 12 MSA-C, 72 HC; 56 PD, 30 PSP, 24 MSA-P, 11 MSA-C, 22 HC; LR, SVM, RF
Shinde et al. [29]; Sect. 2.2, differential diagnosis; imaging data (MRI); local; 45 PD, 20 APS, 35 HC; CNN, gradient boosting
Jucaite et al. [30]; Sect. 2.2, differential diagnosis; imaging data (PET); local; 24 PD, 66 MSA; LDA
Khawaldeh et al. [31]; Sect. 2.3, Parkinson’s disease understanding (subthalamic nucleus activity); signal data (LFP); local; 18 PD; NB
Poston et al. [32]; Sect. 2.3, Parkinson’s disease understanding (cognition); imaging data (fMRI); local; 54 PD; SVM
Trezzi et al. [33]; Sect. 2.3, Parkinson’s disease understanding (metabolism); CSF; local; 44 PD, 43 HC; LR
Vanneste et al. [34]; Sect. 2.3, Parkinson’s disease understanding (thalamocortical dysrhythmia); signal data (rs-EEG); local; 31 PD, 264 HC, 153 tinnitus, 78 CP, 15 MDD; SVM
Latourelle et al. [57]; Sect. 4.2, disease progression (prediction of motor progression); clinical and genetic data; PPMI, local; 312 PD, 117 HC; 317 PD; REFS
Amara et al. [59]; Sect. 4.2, disease progression (prediction of excessive daytime sleepiness); clinical and imaging (DaTscan) data; PPMI; 423 PD, 196 HC; random survival forest
Couronné et al. [60]; Sect. 4.2, disease progression (prediction of motor and non-motor symptoms); clinical and imaging (DaTscan) data; PPMI; 362 PD; generative mixed effect model
Faouzi et al. [61]; Sect. 4.2, disease progression (prediction of impulse control disorders); clinical and genetic data; PPMI, local; 380 PD; 388 PD; LR, SVM, RF, gradient boosting, RNN
Yang et al. [63]; Sect. 5.1, treatment adjustment (motor symptoms); imaging data (fMRI); local; 38 PD; ridge, SVM, AdaBoost, gradient boosting
Kim et al. [64]; Sect. 5.1, treatment adjustment (motor symptoms); clinical and medication data; PPMI; 431 PD; Markov decision process
Boutet et al. [66]; Sect. 5.2, optimal deep brain stimulation parameters; imaging data (fMRI); local; 67 PD; LDA
Geraedts et al. [67]; Sect. 5.2, optimal deep brain stimulation parameters; signal data (EEG); local; 112 PD; RF
Phokaewvarangkul et al. [68]; Sect. 5.3, electrical muscle stimulation (resting tremor); motion data (hand tremor); local; 20 PD; LR, RF, SVM, MLP, LSTM
Panyakaew et al. [69]; Sect. 5.3, adverse event prevention (identification of modifiable risk factors of falls); clinical data; local; 305 PD; gradient boosting
Modalities: CSF cerebrospinal fluid, DaTscan dopamine transporter scan, EEG electroencephalography, FDG-PET [18F]-fluorodeoxyglucose positron emission tomography, fMRI functional magnetic resonance imaging, LFP local field potential, MRI magnetic resonance imaging, PET positron emission tomography, rs-EEG resting-state electroencephalography, rs-fMRI resting-state functional magnetic resonance imaging
Data sets: eGaIT embedded Gait analysis using Intelligent Technologies (https://www.mad.tf.fau.de/research/activitynet/digital-biobank/), IPVC Italian Parkinson’s Voice and Speech (https://ieee-dataport.org/open-access/italian-parkinsons-voice-and-speech), LRRK2 Cohort Consortium (https://www.michaeljfox.org/news/lrrk2-cohort-consortium), PPMI Parkinson’s Progression Markers Initiative (https://www.ppmi-info.org)
Subjects: APS atypical parkinsonian syndromes, CP chronic pain, FOG freezing of gait, HC healthy controls, iPD idiopathic Parkinson’s disease, iRBD idiopathic rapid eye movement sleep behavior disorder, MDD major depressive disorder, MSA multiple system atrophy, MSA-C cerebellar variant of multiple system atrophy, MSA-P parkinsonian variant of multiple system atrophy, PD Parkinson’s disease, PD-CN Parkinson’s disease with normal cognition, PD-D Parkinson’s disease with dementia, PD-HYI Parkinson’s disease with Hoehn and Yahr stage 1, PD-HYII Parkinson’s disease with Hoehn and Yahr stage 2, PD-HYIII Parkinson’s disease with Hoehn and Yahr stage 3, PD-LRRK2 Parkinson’s disease with LRRK2 mutation, PD-MCI Parkinson’s disease with mild cognitive impairment, PSP progressive supranuclear palsy
Methods: BNN Bayesian neural network, CNN convolutional neural network, DT decision tree, HMM hidden Markov model, JFSS joint feature-sample selection, KFD kernel Fisher discriminant, kNN k-nearest neighbors, LR logistic regression, LDA linear discriminant analysis, LSTM long short-term memory, MFCC-GMM Mel-frequency cepstral coefficients-Gaussian mixture model, MLP multi-layer perceptron, NB naive Bayes, REFS reverse engineering and forward simulation, RF random forest, RNN recurrent neural network, RUSBoost random under-sampling boosting, SVM support vector machine
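The kind of regression-based gait-feature normalization used by Wahid and colleagues [10] can be sketched in a few lines; the covariate, linear model, and numbers below are illustrative, not the exact procedure of the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one gait feature (stride length, m) confounded by leg length (m).
n = 50
leg_length = rng.uniform(0.75, 1.05, n)
stride = 1.4 * leg_length + rng.normal(0, 0.05, n)

# Multiple-regression normalization: regress the feature on the covariates
# (here one covariate plus an intercept), then divide the observed value by
# the value predicted from the covariates, yielding a dimensionless feature.
X = np.column_stack([np.ones(n), leg_length])
coef, *_ = np.linalg.lstsq(X, stride, rcond=None)
stride_norm = stride / (X @ coef)

# The normalized feature should barely correlate with the covariate anymore.
r_before = np.corrcoef(leg_length, stride)[0, 1]
r_after = np.corrcoef(leg_length, stride_norm)[0, 1]
print(round(abs(r_before), 2), round(abs(r_after), 2))
```

Dividing by the covariate-predicted value is one way to obtain a dimensionless quantity, in the spirit of the dimensionless-equation strategy mentioned above.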
Mirelman and colleagues [11] also investigated gait and mobil-
ity measures that are indicative of PD and PD stages. They gathered
data from sensors adhered to the participant’s lower back, bilateral
ankles, and wrists, during short walks, and extracted gait features.
They investigated several strategies to perform feature selection and used a random under-sampling boosting classification algorithm to
tackle class imbalance. When comparing PD patients with mild PD
severity (Hoehn and Yahr stage 1) to healthy controls, they
obtained good discriminative performance (84% sensitivity, 80%
specificity). Most discriminative features were extracted from the
upper limb sensors, with the remaining features extracted from the
trunk sensor, while the lower limb sensors did not contribute to
discrimination accuracy.
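The class-imbalance ingredient of random under-sampling boosting can be sketched as follows (a toy numpy version; the actual RUSBoost algorithm re-samples inside every boosting round rather than once up front):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_undersample(X, y, rng):
    """Drop random majority-class samples so all classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
         for c in classes]
    )
    keep.sort()
    return X[keep], y[keep]

# Imbalanced toy set: 90 controls (label 0) vs 10 patients (label 1).
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = random_undersample(X, y, rng)
print(np.bincount(y_bal))  # -> [10 10]
```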
Kostikis and colleagues [12] investigated upper limb tremor
using a smartphone-based tool. Signals from the phone’s accelerometer and gyroscope were recorded, from which features were extracted. They trained several machine learning algorithms,
including random forest, naive Bayes, logistic regression (LR),
and support vector machine, using these features as input and
obtained the highest discriminative performance between PD
patients and HC with the random forest model.
Kotsavasiloglou and colleagues [13] investigated the use of a
pen-and-tablet device to study the differences in hand movement
and muscle coordination between PD patients and HC. Data consisted of the trajectory of the pen’s tip on the pad’s surface from drawings of simple horizontal lines, from which they extracted features. They investigated several machine learning algorithms, such as naive Bayes, logistic regression, and support vector machines, and used nested cross-validation to perform feature selection. They obtained the highest discriminative performance with the naive Bayes model.
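Nested cross-validation of this kind can be sketched with scikit-learn; the toy data, feature counts, and hyperparameter grid below are illustrative, not those of ref. [13]:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in for the pen-trajectory features (44 subjects, 20 features).
X, y = make_classification(n_samples=44, n_features=20, n_informative=5,
                           random_state=0)

# Inner loop: feature selection and hyperparameter tuning inside a pipeline.
pipe = Pipeline([("select", SelectKBest(f_classif)), ("clf", SVC())])
grid = {"select__k": [5, 10], "clf__C": [0.1, 1, 10]}
inner = GridSearchCV(pipe, grid, cv=3)

# Outer loop: scores the whole procedure, so feature selection and tuning
# never see the outer test folds.
scores = cross_val_score(inner, X, y, cv=5)
print(round(scores.mean(), 2))
```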
2.1.2 PD Diagnosis Using Voice Data
Voice data is usually recorded with high-quality microphones or smartphones during specific vocal tasks focused on characteristics such as phonation and speech. Features are then extracted from the corresponding signals and used as input to machine learning classification algorithms.
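As a minimal illustration of this pipeline, with a synthetic "sustained phonation" signal and a deliberately simple feature set (not the features used in the studies below):

```python
import numpy as np

fs = 16_000  # sampling rate (Hz); value chosen for illustration
t = np.arange(0, 1.0, 1 / fs)
rng = np.random.default_rng(0)
# Toy phonation: a 220 Hz tone with a little recording noise.
signal = np.sin(2 * np.pi * 220 * t) + 0.01 * rng.normal(size=t.size)

# Magnitude spectrum of the recording.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

# Two classic spectral descriptors usable as classifier inputs.
centroid = np.sum(freqs * spectrum) / np.sum(spectrum)  # spectral centroid
peak = freqs[np.argmax(spectrum)]                       # dominant frequency

print(round(peak, 1))  # -> 220.0
```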
Amato and colleagues [14] analyzed specific phonetic groups in
native Italian speakers, extracted several spectral moments from the
signals, and trained an SVM algorithm on these extracted features to distinguish PD patients from HC. They first worked on a public data set called Italian Parkinson’s Voice and Speech (https://ieee-dataport.org/open-access/italian-parkinsons-voice-and-speech), with data
2.1.3 PD Diagnosis Using Imaging Data
The diagnosis of PD remains based on its clinical presentation [19]. The loss of dopaminergic terminals can be assessed using nuclear imaging, but this is not recommended in clinical routine and does not differentiate PD from other related disorders associated with dopamine neuron loss [20]. Standard brain magnetic resonance imaging (MRI) is normal in PD. However, several new markers have recently been investigated in several studies, with mixed results.
Adeli and colleagues [21] investigated the use of T1-weighted
anatomical MRI data to differentiate PD patients from HC. They
developed a joint feature-sample selection algorithm in order to
select an optimal subset of both features and samples from a training set, and a robust classification framework that performs denoising of the selected features and samples and then learns a classification model. They analyzed data from 374 PD patients and 169 HC from the Parkinson’s Progression Markers Initiative (PPMI, https://www.ppmi-info.org) cohort and
included white matter, gray matter, and cerebrospinal fluid mea-
surements from 98 regions of interest. The combination of the
proposed feature selection/extraction method and classifier
achieved the highest predictive accuracy (0.819), being significantly
better than almost every other combination of a feature selection/
extraction method and a classification algorithm.
Solana-Lavalle and Rosas-Romero [22] investigated the use of
voxel-based morphometry features extracted from T1-weighted
anatomical MRI to perform a PD vs HC classification task. Their
pipeline consisted of five stages: (i) identification of regions of
interest using voxel-based morphometry, (ii) analysis of these
regions for PD detection, (iii) feature extraction based on first-
and second-order statistics, (iv) feature selection based on principal
component analysis, and (v) classification with tenfold cross-
validation based on seven different algorithms (including
k-nearest neighbors, support vector machine, random forest,
naive Bayes, and logistic regression). They obtained excellent predictive performance for both male and female participants and for both 1.5 T and 3 T MRI scans (accuracy scores ranging from 0.93 to
0.99 for the best classification algorithms). However, cross-
validation was performed very late in their pipeline (after the feature
subset selection), which could lead to biased models and overly
optimistic predictive performances.
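The risk described above can be demonstrated on pure-noise data: selecting features once on the full data set before cross-validation ("leaky") produces far higher apparent accuracy than re-selecting features inside each training fold ("honest"). The experiment below is illustrative and unrelated to the authors' actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure noise: 40 "subjects", 2000 random features, arbitrary labels, so any
# honest accuracy estimate should hover around chance level (0.5).
n, p = 40, 2000
X = rng.normal(size=(n, p))
y = np.array([0, 1] * (n // 2))

def select_top(X, y, k=20):
    # Features most correlated (in absolute value) with the labels.
    yc = y - y.mean()
    r = np.abs((X - X.mean(0)).T @ yc)
    return np.argsort(r)[-k:]

def nearest_centroid_cv(X, y, cols_fn, n_folds=5):
    # 5-fold CV with a nearest-centroid classifier; cols_fn picks features.
    folds = np.array_split(np.arange(n), n_folds)
    correct = 0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        cols = cols_fn(train)
        mu0 = X[np.ix_(train[y[train] == 0], cols)].mean(0)
        mu1 = X[np.ix_(train[y[train] == 1], cols)].mean(0)
        for i in test:
            d0 = np.linalg.norm(X[i, cols] - mu0)
            d1 = np.linalg.norm(X[i, cols] - mu1)
            correct += (d0 > d1) == (y[i] == 1)
    return correct / n

# Leaky: features chosen once on ALL data, then cross-validated.
leaky_cols = select_top(X, y)
leaky_acc = nearest_centroid_cv(X, y, lambda tr: leaky_cols)
# Honest: features re-chosen inside each training fold.
honest_acc = nearest_centroid_cv(X, y, lambda tr: select_top(X[tr], y[tr]))
print(leaky_acc, honest_acc)
```

The leaky estimate looks impressive even though the data contain no signal at all, which is exactly the failure mode flagged above.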
Mudali and colleagues [23] investigated another modality,
[18F]-fluorodeoxyglucose positron emission tomography
(FDG-PET), to compare 20 PD patients and 18 HC. They applied
the subprofile model/principal component analysis method to
extract features from the images. They considered a decision tree (DT) algorithm and used leave-one-out cross-validation to evaluate the predictive performance.
2.2 Differential Diagnosis
The PD vs HC binary classification task has limited utility: even at the early stage of PD, patients have clinical symptoms strongly suggesting that they suffer from a movement disorder and are thus not healthy subjects. However, accurate early diagnosis of the parkinsonian syndromes, although difficult, is needed because the pathologies, and thus the appropriate care, differ. Although one study investigated the differential diagnosis using sensor-based gait analysis [25], most studies investigated it using imaging data, particularly diffusion MRI.
Huppertz and colleagues [26] investigated the differential
diagnosis with data from a relatively large cohort (73 HC,
204 PD, 106 PSP, 20 MSA-C, and 60 MSA-P). Using atlas-based
volumetry of brain MRI data, they extracted volumes in several
regions of interest and trained and evaluated a linear SVM algo-
rithm using leave-one-out cross-validation. They obtained good
predictive performance in most binary classification tasks and
showed that midbrain, basal ganglia, and cerebellar peduncles
were the most relevant regions.
A landmark study on this topic was published in 2019 by
Archer and colleagues [27], with diffusion-weighted MRI data
being collected for 1002 subjects from 17 MRI centers in Austria,
Germany, and the USA. They extracted 60 free-water and 60 free-
water-corrected fractional anisotropy values from diffusion-
weighted MRI data, and the other features consisted of the third
part of the Movement Disorder Society-Sponsored revision of the
Unified Parkinson’s Disease Rating Scale (MDS-UPDRS III), sex,
and age. They trained several SVM models and showed that the
model trained using MDS-UPDRS III (with sex and age also, for all
the models) performed poorly in most classification tasks, whereas
the model trained using DWI features had much higher predictive
performance (particularly for the MSA vs PSP task), and adding
MDS-UPDRS III to this model did not improve the performance.
More recently, Chougar and colleagues [28] investigated the
replication of such differential diagnosis models in clinical practice
on different MRI systems. Using MRI data from 119 PD, 51 PSP,
35 MSA-P, 23 MSA-C, and 94 HC, split into a training cohort
(n = 179) and a replication cohort (n = 143), they extracted
volumes and diffusion tensor imaging (DTI) features (fractional
anisotropy, mean diffusivity, axial diffusivity, and radial diffusivity)
in 13 regions of interest. They investigated two feature normalization strategies (one based on the data of all subjects in the training set and one based on the data of HC for each MRI system, to tackle the differences in feature distributions across MRI systems, in particular for DTI features) and four standard machine learning algorithms, including logistic regression, support vector machines, and random forest. They obtained high performance in the replication cohort for many binary classification tasks
(PD vs PSP, PD vs MSA-C, PSP vs MSA-C, PD vs atypical parkin-
sonism), but lower performances for other classification tasks
involving MSA-P patients (PD vs MSA-P, MSA-C vs MSA-P).
They showed that adding DTI features did not improve perfor-
mance compared to using volumes only and that the usual normali-
zation strategy worked best in this case.
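The control-based strategy can be sketched as per-site z-scoring against healthy controls (a simplified illustration; variable names and numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_by_controls(features, site, is_hc):
    """Z-score each feature using the healthy controls of the same MRI system."""
    out = np.empty_like(features, dtype=float)
    for s in np.unique(site):
        mask = site == s
        hc = features[mask & is_hc]
        mu, sd = hc.mean(0), hc.std(0)
        out[mask] = (features[mask] - mu) / sd
    return out

# Toy data: two MRI systems with a strong site offset in 3 imaging features.
site = np.array([0] * 30 + [1] * 30)
is_hc = np.tile([True] * 10 + [False] * 20, 2)
features = rng.normal(size=(60, 3)) + site[:, None] * 5.0  # site effect
z = normalize_by_controls(features, site, is_hc)

# After normalization, the HC of each site are centered at 0 with unit spread,
# so patient features from different systems become comparable.
print(np.allclose(z[is_hc & (site == 0)].mean(0), 0))
```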
Shinde and colleagues [29] investigated the automatic extrac-
tion of contrast ratios of the substantia nigra pars compacta from
neuromelanin-sensitive MRI using a convolutional neural network.
Based on the class activation maps, they identified that the left side
of substantia nigra pars compacta played a more important role in
the decision of the model compared to the right side, in agreement
with the concept of asymmetry in PD.
A recent study [30] investigated the use of positron emission
tomography of the translocator protein, expressed by glial cells, and
extracted normalized standardized uptake value images and nor-
malized total distribution volume images. Using a linear discrimi-
nant analysis algorithm with leave-one-subject-out cross-validation,
they obtained high discriminative performance between MSA and PD patients, with better results for normalized total distribution volume images.
2.3 Disease Understanding
Rather than focusing on the diagnosis of Parkinson’s disease itself, several studies were more focused on interpreting the trained machine learning models in order to better understand the mechanisms of Parkinson’s disease.
Khawaldeh and colleagues [31] investigated the task-related
modulation of local field potentials of the subthalamic nucleus
before and during voluntary upper and lower limb movements in
18 consecutive Parkinson’s disease patients undergoing deep brain
stimulation (DBS) surgery of the subthalamic nucleus in order to
improve motor symptoms. Using a naive Bayes classification algo-
rithm, they obtained chance-level performance at rest, but much
higher performance during the pre-cue, pre-movement onset, and
post-movement onset tasks. They showed that the presence of
bursts of local field potential activity in the alpha and, even more
so, in the beta frequency band significantly compromised the pre-
diction of the limb to be moved, concluding that low-frequency
bursts restrict the capacity of the basal ganglia system to encode
physiologically relevant information about intended actions.
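A Gaussian naive Bayes classifier of the kind used in this study is simple enough to write from scratch; the sketch below is a generic illustration on synthetic two-class "band-power" features, not the authors' code:

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian naive Bayes (features assumed independent per class)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(0) for c in self.classes])
        self.var = np.array([X[y == c].var(0) + 1e-9 for c in self.classes])
        self.logprior = np.log(np.array([(y == c).mean()
                                         for c in self.classes]))
        return self

    def predict(self, X):
        # log p(x | c) + log p(c) for every sample and class, then argmax.
        ll = -0.5 * (np.log(2 * np.pi * self.var)[None]
                     + (X[:, None, :] - self.mu[None]) ** 2 / self.var[None]
                     ).sum(-1) + self.logprior
        return self.classes[ll.argmax(1)]

rng = np.random.default_rng(0)
# Toy features (e.g., band powers) for two well-separated movement classes.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
acc = (GaussianNB().fit(X, y).predict(X) == y).mean()
print(acc > 0.9)  # well-separated classes -> near-perfect training accuracy
```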
3.1 Freezing of Gait
Freezing of gait (FOG) is a common motor symptom and is associated with life-threatening accidents such as falls. Prompt identification or prediction of freezing of gait episodes is thus needed.
Ahlrichs and colleagues [37] investigated freezing of gait in
20 PD patients (8 with FOG, 12 without FOG), split into a training
set (15 patients) and a test set (5 patients). They collected sensor
(accelerometer, gyroscope, and magnetometer) data during
scripted activities (e.g., walking around the apartment, carrying a
full glass of water from the kitchen to another room) and
non-scripted activities (e.g., answering the phone). Two recording
sessions were considered, one in “OFF” motor state and one in
“ON” motor state, and the data was labeled by experienced clin-
icians based on the corresponding video recordings. The task was a
binary classification task (FOG vs no FOG) for each window. They
extracted sub-signals from the whole signals using a sliding window and then extracted features in the time and frequency domains for each sub-signal. They trained two SVM algorithms (one with a linear kernel, one with a Gaussian kernel) and obtained high performance overall, with better results for the linear kernel.
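The sliding-window feature extraction step can be sketched as follows (toy sampling rate, window length, and feature set; not the exact configuration of ref. [37]):

```python
import numpy as np

def sliding_windows(signal, fs, win_s, hop_s):
    """Cut a 1-D signal into fixed-length windows (win_s seconds, hop_s hop)."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    starts = range(0, len(signal) - win + 1, hop)
    return np.stack([signal[s:s + win] for s in starts])

def window_features(w, fs):
    """Time- and frequency-domain descriptors for one window (illustrative)."""
    spectrum = np.abs(np.fft.rfft(w))
    freqs = np.fft.rfftfreq(w.size, d=1 / fs)
    return np.array([
        w.mean(), w.std(),                  # time domain
        freqs[np.argmax(spectrum[1:]) + 1]  # dominant non-DC frequency (Hz)
    ])

fs = 64  # accelerometer sampling rate (Hz), chosen for illustration
rng = np.random.default_rng(0)
t = np.arange(0, 10, 1 / fs)
sig = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.normal(size=t.size)  # 5 Hz "tremor"

windows = sliding_windows(sig, fs, win_s=2.0, hop_s=1.0)
feats = np.stack([window_features(w, fs) for w in windows])
print(windows.shape, feats.shape)  # -> (9, 128) (9, 3)
```

Each window then becomes one labeled sample (FOG vs no FOG) for the classifier.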
Aich and colleagues [38] gathered sensor data for 36 PD
patients with FOG and 15 PD patients without FOG from 2 wear-
able triaxial accelerometers during clinical experiments. They
extracted features, such as step time, stride time, step length, stride
length, and walking speed, from the signals. They trained several
classic machine learning classification algorithms (SVM, kNN, DT,
NB) and obtained good predictive performances with all of them,
although the SVM model had the highest mean accuracy on the test
sets of the cross-validation procedure.
Borzì and colleagues [39] collected data from 2 inertial sensors placed on each shin of 11 PD patients during the “timed up and go” test in order to investigate FOG and pre-FOG detection. They
extracted features in the time and frequency domains and trained
decision tree algorithms. They obtained good predictive performance for the detection of FOG episodes, but lower performance for the prediction of pre-FOG episodes, with performance decreasing further as the window length increased.
3.2 Bradykinesia and Tremor
Bradykinesia and tremor are two other motor symptoms that are frequently investigated for automatic assessment.
Park and colleagues [42] investigated automated rating for
resting tremor and bradykinesia from video clips of resting tremor
and finger tapping of the bilateral upper limbs. They extracted
several features from the video clips, including resting tremor
amplitude and finger tapping speed, amplitude, and fatigue, using
a pre-trained deep learning model. These features were used as
input of a SVM algorithm to predict the corresponding scores
from the MDS-UPDRS scale. For resting tremor, the automated approach showed excellent reliability with respect to the gold standard rating and higher performance than that of a non-trained human rater. For finger tapping, the automated approach showed good reliability with respect to the gold standard rating and performance similar to that of a non-trained human rater.
Kim and colleagues [43] performed a study in which they
investigated tremor severity using three-dimensional acceleration and gyroscope data obtained from a wearable device. They investigated a convolutional neural network to automatically extract features and perform classification, compared to extracting defined
features from the time and frequency domains and training stan-
dard machine learning algorithms (random forest, naive Bayes,
linear regression, support vector machines) using these features.
They obtained higher predictive performance with the deep learning approach than with the standard machine learning approach.
Eskofier and colleagues [44] obtained similar results using inertial
measurement units collected during motor tasks.
3.3 Cognition
Cognitive impairment is frequent in PD, with the point prevalence of PD dementia being around 30% and the cumulative prevalence for patients surviving more than 10 years being at least 75% [45]. Due to its high negative impact on the quality of life of PD patients and their caregivers, it is important to identify and quantify cognitive impairment. Several scales to assess cognition already exist.
3.4 Other Symptoms
Although less prevalent in the literature, studies also investigated other PD symptoms such as falls and motor severity.
An early study by Hannink and colleagues [51] was performed
to investigate gait parameter extraction from sensor data using
convolutional neural networks. Using 3d-accelerometer and
3d-gyroscope data from 99 geriatric patients, the objective was to
predict the stride length and width, the foot angle, and the heel and
toe contact times. They investigated two approaches to tackle this
multi-output regression task, either training a single convolutional
neural network to predict the five outcomes or training a convolu-
tional neural network for each outcome, and obtained better per-
formance on an independent test set with the latter approach.
Although the considered population was not parkinsonian, given the prevalence of gait symptoms in geriatric patients, the obtained results might still be relevant to better understand gait in parkinsonian populations. Lu and colleagues [52] investigated gait in PD, as measured
by MDS-UPDRS item 3.10, which does not include freezing of
gait. They collected video recordings of MDS-UPDRS exams from
55 participants which were scored by 3 different trained movement
disorder neurologists, and the ground truth score was defined
using majority voting among the 3 raters. They performed skeleton
extraction from the videos and trained a convolutional neural net-
work, with regularization using rater confusion estimation to tackle
noise in the labels, to predict gait severity. They obtained reasonable performance on the test set (72% accuracy against the majority-vote ground truth, 84% accuracy when the model predicted the score of at least one of the raters).
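The majority-voting scheme used to define the ground-truth label can be sketched in a few lines (the tie-breaking rule below is an arbitrary simplification, not necessarily the one used in the study):

```python
from collections import Counter

def majority_vote(ratings):
    """Return the most common rating; ties are broken by the lowest score."""
    counts = Counter(ratings)
    best = max(counts.values())
    return min(score for score, c in counts.items() if c == best)

# Three trained raters scoring MDS-UPDRS item 3.10 (gait) for one video.
print(majority_vote([1, 2, 1]))  # -> 1
print(majority_vote([0, 1, 2]))  # three-way tie -> 0 under this rule
```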
Gao and colleagues [53] investigated falls in two data sets
independently collected at two different sites. Using clinical scores
as input, they trained several classic machine learning classification
algorithms to differentiate fallers from non-fallers. They obtained
acceptable predictive performance in both data sets when training
and evaluating (using cross-validation) a model in each data set
independently. They also showed that the predictive performance
was lower when training the model on one data set and evaluating it
on the other data set; this is not surprising, but it is important to keep this possible issue in mind whenever a model is not evaluated on a different cohort.
4 Disease Progression
4.1 Disease Subtypes
Several studies focused on the identification of more specific disease subtypes than the two aforementioned well-known ones, one characterized by postural instability and gait difficulty and the other by tremor predominance.
Severson and colleagues [54] worked on the development of a
statistical progression model of Parkinson’s disease accounting for
intra-individual and inter-individual variability, as well as medica-
tion effects. They built a contrastive latent variable model followed
by a personalized input-output hidden Markov model to define
disease states and assessed the clinical significance of the states on
seven key motor or cognitive outcomes (mild cognitive
impairment, dementia, dyskinesia, presence of motor fluctuations,
functional impairment from motor fluctuations, Hoehn and Yahr
score, and death). They identified eight disease states that were
primarily differentiated by functional impairment, tremor, bradyki-
nesia, and neuropsychiatric measures. The terminal state had the
highest prevalence of key clinical outcomes, including almost every
recorded instance of dementia. The discovered states were
non-sequential, with overlapping disease progression trajectories,
supporting the use of non-deterministic disease progression mod-
els, and suggesting that static subtype assignment might be ineffec-
tive at capturing the full spectrum of PD progression.
Salmanpour and colleagues [55] performed a longitudinal clustering analysis and prediction of PD progression. They extracted
almost a thousand features, including motor, non-motor, and
radiomics features. They performed a cross-sectional clustering
analysis and identified three distinct progression trajectories, with
two trajectories being characterized by disease escalation and the
other trajectory by disease stability. They also investigated the
prediction of progression trajectories from early stage (baseline
and year 1) data and obtained the highest predictive performance
with a probabilistic neural network.
were among the most relevant features to predict Hoehn and Yahr
stage and MDS-UPDRS III total score, giving evidence that
peripheral cytokines may have utility for aiding prediction of PD
progression using machine learning models.
Amara and colleagues [59] investigated the prediction of incident excessive daytime sleepiness. They trained a random
survival forest using 33 baseline variables, including anxiety, depres-
sion, rapid eye movement sleep, cognitive scores, α-synuclein,
p-tau, t-tau, and ApoE ε4 status. The performance of the model was only marginally better than random guessing, but the strongest predictive features were p-tau and t-tau.
Couronné and colleagues [60] performed longitudinal data
analysis to predict patient-specific trajectories. They proposed to
use a generative mixed effect model that considers the progression
trajectories as curves on a Riemannian manifold and that can handle
missing values. They applied their model to PD progression with
joint modelling of two features (MDS-UPDRS III total score and striatal binding ratio in the right caudate). Interpretation of the model
revealed that patients with later onset progress significantly faster
and that α-synuclein mean level was correlated with PD onset.
Faouzi and colleagues [61] investigated the prediction of
future impulse control disorders (psychiatric disorders characterized
by the inability to resist an urge or an impulse, including
compulsive shopping, internet addiction, and hypersexuality,
among others) in Parkinson's
disease. The objective of their study was to predict the presence or
absence of these disorders at the next clinical visit of a given patient.
Using clinical and genetic data, they trained several machine
learning models on a training cohort and evaluated the models on
the training cohort (using cross-validation) and on an independent
replication cohort. They showed that a recurrent neural network
model achieved significantly better performance than a trivial
model (predicting the status at the next visit with the status at the
most recent visit), but the increase in performance was too small to
be deemed clinically relevant. Nevertheless, this proof-of-concept
study highlights the potential of machine learning for such
prediction.
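The "trivial model" baseline used in that comparison (predict the next-visit status with the most recent status) is easy to make explicit. A hedged sketch on synthetic visit sequences (data and model choices are illustrative, not the study's): because status is "sticky" across visits, the last-observation baseline is already strong, which is why a clinically relevant model must beat it by a clear margin.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(42)

# Synthetic longitudinal cohort: per-visit binary disorder status that is
# mostly carried over from one visit to the next.
n_patients, n_visits = 300, 5
status = np.zeros((n_patients, n_visits), dtype=int)
status[:, 0] = rng.random(n_patients) < 0.2
for t in range(1, n_visits):
    flip = rng.random(n_patients) < 0.1            # 10% chance of changing
    status[:, t] = np.where(flip, 1 - status[:, t - 1], status[:, t - 1])
# noisy clinical features observed at the first n_visits - 1 visits
features = status[:, :-1] + rng.normal(0.0, 0.5, (n_patients, n_visits - 1))

y_true = status[:, -1]                             # status at the next visit
y_trivial = status[:, -2]                          # carry last status forward

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(features[:150], y_true[:150])              # train on half the cohort
y_model = clf.predict(features[150:])

print("trivial baseline:", round(balanced_accuracy_score(y_true[150:], y_trivial[150:]), 2))
print("learned model:   ", round(balanced_accuracy_score(y_true[150:], y_model), 2))
```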
5.2 Deep Brain Stimulation
Deep brain stimulation (DBS) is a neurosurgical procedure that uses
implanted electrodes to deliver electrical stimulation and has proven
efficacy in advanced Parkinson's disease, decreasing motor fluctuations
and dyskinesia and improving quality of life [65]. The
most commonly stimulated region is the subthalamic nucleus, but
the globus pallidus is sometimes preferred. Although DBS usually
greatly improves the motor symptoms, it also has downsides, such
as the need for personalized stimulation parameters and potential
adverse events such as postoperative cognitive decline.
Boutet and colleagues [66] investigated the prediction of optimal
deep brain stimulation parameters from functional MRI data.
They extracted blood-oxygen-level-dependent (BOLD) signals in
16 motor and non-motor regions of interest for 67 PD patients,
of whom 62 underwent DBS of the subthalamic nucleus and
5 underwent DBS of the globus pallidus. They trained a linear
discriminant analysis algorithm on normalized BOLD changes
using fivefold cross-validation and obtained high performance in
classifying optimal vs non-optimal parameter settings, although the
performance was lower on two additional unseen data sets (one
with a priori clinically optimized parameters and one with
stimulation-naive patients).
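The core analysis (linear discriminant analysis with fivefold cross-validation on region-wise signals) can be sketched as follows; the data here are a synthetic stand-in, not the study's BOLD measurements:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for normalized BOLD changes in 16 regions of interest:
# "optimal" settings shift the signal in a few motor regions.
n_settings, n_rois = 120, 16
y = (rng.random(n_settings) < 0.5).astype(int)     # optimal vs non-optimal
X = rng.normal(0.0, 1.0, (n_settings, n_rois))
X[y == 1, :4] += 1.5                               # separable signal in 4 ROIs

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv)
print("fivefold accuracies:", np.round(scores, 2), "mean:", round(scores.mean(), 2))
```

Cross-validation on a single cohort estimates internal performance only; as the study's additional data sets show, performance on genuinely unseen cohorts is usually lower.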
Geraedts and colleagues [67] also investigated deep brain stim-
ulation in the context of cognitive function, as a downside of DBS
for PD is the potential deterioration of cognition postoperatively.
They extracted features from electroencephalograms, trained random
forest algorithms using tenfold cross-validation, and obtained
high discrimination between PD patients with the best and worst
cognitive performances. However, it should be noted that they only
included the best and worst cognitive performers (n = 20 per
group from 112 PD patients), which likely made the classification
task much easier than if it had been performed on all 112 PD patients;
their model would thus need to be evaluated on PD patients independently
of their cognitive performance. Nonetheless, their results
suggest the potential utility of electroencephalography for cogni-
tive profiling in DBS.
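The caveat about restricting the sample to the best and worst performers can be illustrated directly: on the same synthetic data, classifying the extremes of a continuous score is typically easier than classifying a whole-cohort median split. Everything below is simulated and only illustrates the selection effect:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic cohort: EEG-like features weakly related to a cognitive score.
n, d = 112, 20
X = rng.normal(0.0, 1.0, (n, d))
score = X[:, :3].sum(axis=1) + rng.normal(0.0, 1.0, n)

order = np.argsort(score)
extremes = np.r_[order[:20], order[-20:]]          # 20 worst + 20 best only
y_ext = np.r_[np.zeros(20), np.ones(20)].astype(int)
y_all = (score > np.median(score)).astype(int)     # whole-cohort median split

rf = RandomForestClassifier(n_estimators=300, random_state=0)
acc_ext = cross_val_score(rf, X[extremes], y_ext, cv=10).mean()
acc_all = cross_val_score(rf, X, y_all, cv=10).mean()
print(f"extreme groups only: {acc_ext:.2f}   whole cohort: {acc_all:.2f}")
```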
5.3 Others
Phokaewvarangkul and colleagues [68] explored the effect of electrical
muscle stimulation as an adjunctive treatment for resting
tremor during the "ON" period, with machine learning used to predict
the optimal stimulation level yielding the longest period of
tremor reduction or tremor reset time. They used sensor data from
a glove incorporating a three-axis gyroscope to measure tremor
signals. The stimulation levels were discretized into five ordinal
classes, with the objective of predicting the correct class from the
sensor data. They observed a significant reduction in tremor
Machine Learning for Parkinson’s Disease and Related Disorders 873
6 Conclusion
Acknowledgments
The authors would like to thank Jochen Klucken for his fruitful
remarks. This work was supported by the French government
under the management of Agence Nationale de la Recherche as
part of the “Investissements d’avenir” program, reference
ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and reference
ANR-10-IAIHU-06 (Agence Nationale de la Recherche-10-IA
Institut Hospitalo-Universitaire-6), by the European Union
H2020 Programme (grant number 826421, project TVB-Cloud),
and by the ERA PerMed EU-wide project DIGIPD (01KU2110).
References
11. Mirelman A, Ben Or Frank M, Melamed M et al (2021) Detecting sensitive mobility features for Parkinson's disease stages via machine learning. Mov Disord 36:2144–2155
12. Kostikis N, Hristu-Varsakelis D, Arnaoutoglou M et al (2015) A smartphone-based tool for assessing parkinsonian hand tremor. IEEE J Biomed Health Inform 19:1835–1842
13. Kotsavasiloglou C, Kostikis N, Hristu-Varsakelis D et al (2017) Machine learning-based classification of simple drawing movements in Parkinson's disease. Biomed Signal Process Control 31:174–180
14. Amato F, Borzì L, Olmo G et al (2021) Speech impairment in Parkinson's disease: acoustic analysis of unvoiced consonants in Italian native speakers. IEEE Access 9:166370–166381
15. Jeancolas L, Petrovska-Delacrétaz D, Mangone G et al (2021) X-vectors: new quantitative biomarkers for early Parkinson's disease detection from speech. Front Neuroinform 15:578369
16. Jeancolas L, Mangone G, Petrovska-Delacrétaz D et al (2022) Voice characteristics from isolated rapid eye movement sleep behavior disorder to early Parkinson's disease. Parkinsonism Relat Disord 95:86–91
17. Quan C, Ren K, Luo Z (2021) A deep learning based method for Parkinson's disease detection using dynamic features of speech. IEEE Access 9:10239–10252
18. Ozbolt AS, Moro-Velazquez L, Lina I et al (2022) Things to consider when automatically detecting Parkinson's disease using the phonation of sustained vowels: analysis of methodological issues. Appl Sci 12:991
19. Berg D, Adler CH, Bloem BR et al (2018) Movement Disorder Society criteria for clinically established early Parkinson's disease. Mov Disord 33:1643–1646
20. de la Fuente-Fernández R (2012) Role of DaTSCAN and clinical diagnosis in Parkinson disease. Neurology 78:696–701
21. Adeli E, Shi F, An L et al (2016) Joint feature-sample selection and robust diagnosis of Parkinson's disease from MRI data. NeuroImage 141:206–219
22. Solana-Lavalle G, Rosas-Romero R (2021) Classification of PPMI MRI scans with voxel-based morphometry and machine learning to assist in the diagnosis of Parkinson's disease. Comput Methods Prog Biomed 198:105793
23. Mudali D, Teune LK, Renken RJ et al (2015) Classification of parkinsonian syndromes from FDG-PET brain data using decision trees with SSM/PCA features. Comput Math Methods Med 2015:136921
24. Mitchell T, Lehéricy S, Chiu SY et al (2021) Emerging neuroimaging biomarkers across disease stage in Parkinson disease: a review. JAMA Neurol 78:1262–1272
25. Gaßner H, Raccagni C, Eskofier BM et al (2019) The diagnostic scope of sensor-based gait analysis in atypical parkinsonism: further observations. Front Neurol 10:5
26. Huppertz H-J, Möller L, Südmeyer M et al (2016) Differentiation of neurodegenerative parkinsonian syndromes by volumetric magnetic resonance imaging analysis and support vector machine classification. Mov Disord 31:1506–1517
27. Archer DB, Bricker JT, Chu WT et al (2019) Development and validation of the automated imaging differentiation in parkinsonism (AID-P): a multi-site machine learning study. Lancet Digit Health 1:e222–e231
28. Chougar L, Faouzi J, Pyatigorskaya N et al (2021) Automated categorization of parkinsonian syndromes using magnetic resonance imaging in a clinical setting. Mov Disord 36:460–470
29. Shinde S, Prasad S, Saboo Y et al (2019) Predictive markers for Parkinson's disease using deep neural nets on neuromelanin sensitive MRI. Neuroimage Clin 22:101748
30. Jucaite A, Cselényi Z, Kreisl WC et al (2022) Glia imaging differentiates multiple system atrophy from Parkinson's disease: a positron emission tomography study with [11C]PBR28 and machine learning analysis. Mov Disord 37:119–129
31. Khawaldeh S, Tinkhauser G, Shah SA et al (2020) Subthalamic nucleus activity dynamics and limb movement prediction in Parkinson's disease. Brain 143:582–596
32. Poston KL, YorkWilliams S, Zhang K et al (2016) Compensatory neural mechanisms in cognitively unimpaired Parkinson disease. Ann Neurol 79:448–463
33. Trezzi J-P, Galozzi S, Jaeger C et al (2017) Distinct metabolomic signature in cerebrospinal fluid in early Parkinson's disease. Mov Disord 32:1401–1408
34. Vanneste S, Song J-J, De Ridder D (2018) Thalamocortical dysrhythmia detected by machine learning. Nat Commun 9:1103
35. Goetz CG, Fahn S, Martinez-Martin P et al (2007) Movement Disorder Society-sponsored revision of the unified Parkinson's disease rating scale (MDS-UPDRS): process, format, and clinimetric testing plan. Mov Disord 22:41–47
36. Evers LJW, Krijthe JH, Meinders MJ et al (2019) Measuring Parkinson's disease over time: the real-world within-subject reliability of the MDS-UPDRS. Mov Disord 34:1480–1487
876 Johann Faouzi et al.
37. Ahlrichs C, Samà A, Lawo M et al (2016) Detecting freezing of gait with a tri-axial accelerometer in Parkinson's disease patients. Med Biol Eng Comput 54:223–233
38. Aich S, Pradhan PM, Park J et al (2018) A validation study of freezing of gait (FoG) detection and machine-learning-based FoG prediction using estimated gait characteristics with a wearable accelerometer. Sensors 18:3287
39. Borzì L, Mazzetta I, Zampogna A et al (2021) Prediction of freezing of gait in Parkinson's disease using wearables and machine learning. Sensors 21:614
40. Dvorani A, Waldheim V, Jochner MCE et al (2021) Real-time detection of freezing motions in Parkinson's patients for adaptive gait phase synchronous cueing. Front Neurol 12:720516
41. Shalin G, Pardoel S, Lemaire ED et al (2021) Prediction and detection of freezing of gait in Parkinson's disease from plantar pressure data using long short-term memory neural networks. J Neuroeng Rehabil 18:1–15
42. Park KW, Lee E-J, Lee JS et al (2021) Machine learning-based automatic rating for cardinal symptoms of Parkinson disease. Neurology 96:e1761–e1769
43. Kim HB, Lee WW, Kim A et al (2018) Wrist sensor-based tremor severity quantification in Parkinson's disease using convolutional neural network. Comput Biol Med 95:140–146
44. Eskofier BM, Lee SI, Daneault J-F et al (2016) Recent machine learning advancements in sensor-based mobility analysis: deep learning for Parkinson's disease assessment. In: 2016 38th annual international conference of the IEEE engineering in medicine and biology society (EMBC), pp 655–658
45. Litvan I, Aarsland D, Adler CH et al (2011) MDS task force on mild cognitive impairment in Parkinson's disease: critical review of PD-MCI. Mov Disord 26:1814–1824
46. Abós A, Baggio HC, Segura B et al (2017) Discriminating cognitive status in Parkinson's disease through functional connectomics and machine learning. Sci Rep 7:45347
47. Betrouni N, Delval A, Chaton L et al (2019) Electroencephalography-based machine learning for cognitive profiling in Parkinson's disease: preliminary results. Mov Disord 34:210–217
48. García AM, Arias-Vergara TC, Vasquez-Correa J et al (2021) Cognitive determinants of dysarthria in Parkinson's disease: an automated machine learning approach. Mov Disord 36:2862–2873
49. Morales DA, Vives-Gilabert Y, Gómez-Ansón B et al (2013) Predicting dementia development in Parkinson's disease using Bayesian network classifiers. Psychiatry Res 213:92–98
50. Shibata H, Uchida Y, Inui S et al (2022) Machine learning trained with quantitative susceptibility mapping to detect mild cognitive impairment in Parkinson's disease. Parkinsonism Relat Disord 94:104–110
51. Hannink J, Kautz T, Pasluosta CF et al (2017) Sensor-based gait parameter extraction with deep convolutional neural networks. IEEE J Biomed Health Inform 21:85–93
52. Lu M, Zhao Q, Poston KL et al (2021) Quantifying Parkinson's disease motor severity under uncertainty using MDS-UPDRS videos. Med Image Anal 73:102179
53. Gao C, Sun H, Wang T et al (2018) Model-based and model-free machine learning techniques for diagnostic prediction and classification of clinical outcomes in Parkinson's disease. Sci Rep 8:7129
54. Severson KA, Chahine LM, Smolensky LA et al (2021) Discovery of Parkinson's disease states and disease progression modelling: a longitudinal data study using machine learning. Lancet Digit Health 3:e555–e564
55. Salmanpour MR, Shamsaei M, Hajianfar G et al (2022) Longitudinal clustering analysis and prediction of Parkinson's disease progression using radiomics and hybrid machine learning. Quant Imaging Med Surg 12:906–919
56. Oxtoby NP, Leyland L-A, Aksman LM et al (2021) Sequence of clinical and neurodegeneration events in Parkinson's disease progression. Brain 144:975–988
57. Latourelle JC, Beste MT, Hadzi TC et al (2017) Large-scale identification of clinical and genetic predictors of motor progression in patients with newly diagnosed Parkinson's disease: a longitudinal cohort study and validation. Lancet Neurol 16:908–916
58. Ahmadi Rastegar D, Ho N, Halliday GM et al (2019) Parkinson's progression prediction using machine learning and serum cytokines. NPJ Park Dis 5:1–8
59. Amara AW, Chahine LM, Caspell-Garcia C et al (2017) Longitudinal assessment of excessive daytime sleepiness in early Parkinson's disease. J Neurol Neurosurg Psychiatry 88:653–662
60. Couronné R, Vidailhet M, Corvol JC et al (2019) Learning disease progression models with longitudinal data and missing values. In: 2019 IEEE 16th international symposium on
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 27
Machine Learning in Neuroimaging of Epilepsy
Abstract
Epilepsy is a prevalent chronic condition affecting about 50 million people worldwide. A third of patients
suffer from seizures unresponsive to medication. Uncontrolled seizures damage the brain, are associated
with cognitive decline, and have a negative impact on well-being. For these patients, the surgical resection of
the brain region that gives rise to seizures is the most effective treatment. In this context, due to its
unmatched spatial resolution and whole-brain coverage, magnetic resonance imaging (MRI) plays a central
role in detecting lesions. The last decade has witnessed an increasing use of machine learning applied to
multimodal MRI, which has allowed the design of tools for computer-aided diagnosis and prognosis. In this
chapter, we focus on automated algorithms for the detection of epileptogenic lesions and imaging-derived
prognostic markers, including response to anti-seizure medication, postsurgical seizure outcome, and
cognitive reserves. We also highlight advantages and limitations of these approaches and discuss future
directions toward person-centered care.
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_27,
© The Author(s) 2023
879
880 Hyo Min Lee et al.
2 Lesion Mapping
2.1 Mapping Hippocampal Sclerosis in Temporal Lobe Epilepsy
Temporal lobe epilepsy (TLE), the most common focal syndrome
in adults, is pathologically defined by varying degrees of neuronal
loss and gliosis in the hippocampus and adjacent structures [9]. On
MRI, marked hippocampal sclerosis (HS) appears as atrophy and
signal hyperintensity, generally more severe ipsilateral to the seizure
focus. Accurate identification of hippocampal atrophy as a marker
of HS is crucial for deciding the side of surgery. While volumetry
Machine Learning in Neuroimaging of Epilepsy 881
Fig. 1 Automated lateralization of hippocampal sclerosis. (a). In the training phase, an optimal region of
interest is defined for each modality to systematically sample features (T1-derived volume, T2-weighted
intensity, and FLAIR/T1 intensity) across individuals. To this purpose, in each patient paired t-tests compare
corresponding vertices of the left and right subfields, z-scored with respect to healthy controls. The resulting
group-level asymmetry t-map is then thresholded from 0 to the highest value and binarized; for each
threshold, the binarized t-map is overlaid on the asymmetry map of each individual to compute the average
across subfields. Then a linear discriminant classifier is trained for each threshold, and the model yielding the
highest lateralization accuracy (in this example, LDA model 3) is selected and applied to unseen cases. (b). Lateralization
prediction in a patient with MRI-negative left TLE. Coronal sections are shown together with the automatically
generated asymmetry maps for columnar volume, T2-weighted, and FLAIR/T1 intensities. On each map,
dotted line corresponds to the level of the coronal MRI section and the optimal ROI obtained during training is
outlined in black
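The threshold-sweep training loop of Fig. 1a can be sketched as follows. The data are a synthetic stand-in (random asymmetry maps with a focal signal); the real pipeline operates on subfield vertices with modality-specific maps:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Illustrative stand-ins: vertex-wise asymmetry maps (z-scored vs controls)
# for 60 patients and a group-level asymmetry t-map over the same vertices.
n_patients, n_vertices = 60, 500
y = np.r_[np.zeros(30), np.ones(30)].astype(int)   # 0 = LTLE, 1 = RTLE
asym = rng.normal(0.0, 1.0, (n_patients, n_vertices))
asym[y == 1, :50] += 1.0                           # focal asymmetry signal
t_map = asym[y == 1].mean(axis=0) - asym[y == 0].mean(axis=0)

best = (None, -np.inf)
for thr in np.linspace(0.0, t_map.max(), 20, endpoint=False):
    roi = t_map > thr                              # binarized t-map = candidate ROI
    if roi.sum() == 0:
        continue
    feat = asym[:, roi].mean(axis=1, keepdims=True)  # mean asymmetry in the ROI
    acc = cross_val_score(LinearDiscriminantAnalysis(), feat, y, cv=5).mean()
    if acc > best[1]:
        best = (thr, acc)                          # keep the best-performing model
print(f"selected threshold {best[0]:.2f}, CV accuracy {best[1]:.2f}")
```

The design choice mirrored here is that the threshold is itself a hyperparameter selected by lateralization accuracy, so the resulting ROI is data-driven rather than anatomically predefined.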
2.2 Automated Detection of Focal Cortical Dysplasia
On MRI, focal cortical dysplasia (FCD) presents with a visibility
spectrum encompassing variable degrees of gray matter (GM) and
white matter (WM) changes that can challenge visual identification.
Indeed, recent series indicate that up to 33% of FCD Type II, the
most common surgically amenable developmental malformation,
present with “unremarkable” routine MRI, even though typical
features are ultimately identified in the histopathology of the
resected tissue [49–51]. These so-called “MRI-negative” FCDs
represent a major diagnostic challenge. Indeed, to define the epi-
leptogenic area, patients undergo long and costly hospitalizations
for EEG monitoring with intracerebral electrodes, a procedure that
carries risks similar to surgery itself [52, 53]. Moreover, patients
without MRI evidence for FCD are less likely to undergo surgery
and consistently show worse seizure control compared to those
with visible lesions [4, 54, 55]. This clinical difficulty has motivated
the development of computer-aided methods aimed at optimizing
detection in vivo. Such techniques provide distinct information
through quantitative assessment without the cost of additional
scanning time.
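The early voxel-based strategy described below (z-scoring a patient's gray-matter map against healthy controls and thresholding at >1 SD) can be sketched with synthetic data; all numbers are illustrative. Note how many suprathreshold voxels fall outside the simulated lesion, which is the specificity problem the text goes on to address:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative voxel-based analysis: z-score a patient's gray-matter (GM)
# map against healthy controls and threshold at +1 SD.
n_controls, n_voxels = 40, 10_000
controls = rng.normal(0.5, 0.05, (n_controls, n_voxels))  # GM "concentration"
patient = rng.normal(0.5, 0.05, n_voxels)
lesion_mask = np.zeros(n_voxels, dtype=bool)
lesion_mask[200:230] = True
patient[lesion_mask] += 0.15               # conspicuous dysplasia-like excess

mu, sd = controls.mean(axis=0), controls.std(axis=0)
z = (patient - mu) / sd                    # voxel-wise z-score vs controls
detected = z > 1.0                         # ">1 SD" threshold

print("suprathreshold voxels inside lesion:", int(detected[lesion_mask].sum()), "/ 30")
print("false-positive rate elsewhere:", round(float(detected[~lesion_mask].mean()), 3))
```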
Early approaches used voxel-based methods to quantify
group-level structural abnormalities related to MRI-visible
dysplasias by thresholding GM concentration (e.g., >1 SD relative to the
mean in healthy controls). While such methods are sensitive (87–
100%) in detecting conspicuous malformations, they fail to identify
two-thirds of subtle, MRI-negative lesions [56–59]. To counter
the relative lack of specificity, our group introduced an original
approach to integrate key voxel-wise textures and morphological
modeling (i.e., cortical thickening, blurring of the GM-WM junc-
tion, and intensity alterations) derived from T1-weighted images
into a composite map [60, 61]. The clinical value of this computer-
aided visual identification was supported by its 88% sensitivity and
95% specificity, vastly outperforming conventional MRI. An alter-
native method quantifies blurring as voxels that belong neither to
GM nor WM [62]. Integrating morphological operators with
higher-order image texture features invisible to the human eye
into a fully automated classifier provided a sensitivity of 80%
[63, 64]. In contrast to voxel-based methods, surface-based mor-
phometry offers an anatomically plausible quantification of struc-
tural integrity that preserves cortical topology. Surface-based
modeling of cortical thickness, folding complexity, and sulcal
depth, together with intra- and subcortical mapping of MRI inten-
sities and textures, allow for a more sensitive description of FCD
pathology. Over the last decade, several such algorithms have been
developed, with detection rates up to 83% [65–71]. The addition of
FLAIR has contributed to further increase in sensitivity, particularly
for the detection of smaller lesions [66]. Notably, an integration of
surface-based methods into clinical workflow would be contingent
on careful verification of preprocessing steps, including manual
Fig. 2 Automated FCD detection using deep learning. (a). The training and testing workflow. In this cascaded
system, the output of the convolutional neural network 1 (CNN-1) serves as an input to CNN-2. CNN-1
maximizes the detection of lesional voxels; CNN-2 reduces the number of misclassified voxels, removing false
positives (FPs) while maintaining optimal sensitivity. The training procedure (dashed arrows) operating on
T1-weighted and FLAIR extracts 3D patches from lesional and non-lesional tissue to yield tCNN-1 (trained
model 1) and tCNN-2 with optimized weights (vertical dashed-dotted arrows). These models are then used for
subject-level inference. For each unseen subject, the inference pipeline (solid arrows) uses tCNN-1 and
generates a mean (μdropout) of 20 predictions (forward passes); the mean map is then thresholded voxel-wise
to discard improbable lesion candidates (μdropout > 0.1). The resulting binary mask serves to sample the input
patches for tCNN-2. Mean probability and uncertainty maps are obtained by collating 50 predictions;
uncertainty is transformed into confidence. The sampling strategy (identical for training and inference) is only
illustrated for testing. (b). Sagittal sections show the native T1-weighted MRI superimposed with the lesion
probability map. The bar plot shows the probability of the lesion (purple) and false-positive (FP, blue) clusters
sorted by their rank; the superimposed line indicates the degree of confidence for each cluster. In this
example, the lesion (cluster 1 in purple) has both the highest probability and confidence
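The Monte Carlo dropout aggregation step in Fig. 2 (collate stochastic forward passes, keep the mean probability map, threshold improbable candidates, and convert spread into confidence) can be sketched with simulated network outputs; the arrays below stand in for real CNN predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative aggregation of Monte Carlo dropout predictions.
T, n_voxels = 20, 1000
# stand-in for T stochastic forward passes of a segmentation network;
# the first 30 voxels mimic a lesion with high predicted probability
passes = np.clip(rng.normal(0.08, 0.05, (T, n_voxels)), 0.0, 1.0)
passes[:, :30] = np.clip(rng.normal(0.70, 0.15, (T, 30)), 0.0, 1.0)

mu_dropout = passes.mean(axis=0)           # mean prediction map
sigma = passes.std(axis=0)                 # voxel-wise uncertainty
candidates = mu_dropout > 0.1              # discard improbable candidates
confidence = 1.0 - sigma / sigma.max()     # map uncertainty to confidence

print("candidate voxels:", int(candidates.sum()))
print("mean confidence inside simulated lesion:", round(float(confidence[:30].mean()), 2))
```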
3.1 Disease Biotyping: Leveraging Individual Variability to Optimize Predictions
To date, most neuroimaging studies of epilepsy have been based on
"one-size-fits-all" group-level analytical approaches. While such
study designs can isolate reliable and consistent average group-level
differences, they merely decipher the common patterns without
modeling the inter-individual variations along the disease
spectrum [103]. Conversely, the conceptualization of epilepsy as a
heterogeneous disorder and explicit modeling of inter-individual
phenotypic variations may be exploited to predict individual-
specific clinical outcomes [104].
Over the past decades, FCD characterization has been driven
by histology, with the primary objective to establish subtype-
specific imaging signatures [105]. Although histological grading
is a well-defined framework, the current approach is based on
descriptive criteria that do not consider the severity of each feature,
thereby limiting neurobiological understanding. The ability to per-
form in vivo patient stratification is gaining relevance due to the
emergence of minimally invasive surgical procedures that do not
provide specimens for histological examination [106]. From a
neurobiological standpoint, whether the FCD IIB (dysmorphic neurons
and balloon cells) and IIA (dysmorphic neurons only) subtypes
represent etiologically distinct entities or a single spectrum remains a matter
of debate. Recent studies have shown significant cellular variability,
with anomalies that may vary across lesions within the same subtype
[73]. Moreover, multiple subtypes may coexist within the same
FCD, with the most severe phenotype determining the final diag-
nosis [74]. Furthermore, recent studies have identified regulatory
genes of the mTOR pathway that cause FCD via somatic muta-
tions, revealing a genetic continuum not linked to discrete FCD
subtypes [75]. Hence, assessing the intra- and inter-lesional varia-
bility on MRI may offer a novel basis to advance our understanding
of FCD neurobiology and improve lesion detection. Leveraging
hierarchical clustering to model connectivity from FCD tissue to
the rest of the cortex demonstrated that network dysfunction can
dissociate patients with excellent from those with suboptimal post-
surgical seizure outcomes [107]. Another recent work applying
consensus clustering to multi-contrast 3T MRI uncovered FCD
tissue classes with distinct structural profiles, variably expressed
within and across patients [108]. Importantly, these classes had
differential histopathological embeddings, and their clinical utility
was supported by a gain in performance of a lesion detection
algorithm trained on class-informed data compared to a class-naïve
paradigm.
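The consensus-clustering idea underlying such tissue-class discovery can be sketched generically: cluster many random subsamples, record how often each pair of subjects co-clusters, and cluster that co-association matrix for a stable partition. This is a minimal illustration on synthetic data, not the cited studies' pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Illustrative consensus clustering on synthetic multi-feature profiles.
n, d, k = 90, 6, 3
X = rng.normal(0.0, 1.0, (n, d))
X[30:60] += 3.0                            # three well-separated "classes"
X[60:] -= 3.0

coassoc = np.zeros((n, n))
counts = np.zeros((n, n))
for seed in range(50):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)   # random subsample
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx])
    same = labels[:, None] == labels[None, :]
    coassoc[np.ix_(idx, idx)] += same
    counts[np.ix_(idx, idx)] += 1
coassoc /= np.maximum(counts, 1)           # co-clustering frequencies in [0, 1]

# final consensus partition: cluster the rows of the co-association matrix
consensus = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coassoc)
print("consensus cluster sizes:", np.bincount(consensus))
```

The appeal of the co-association matrix is that partitions unstable across subsamples wash out, so the final clusters reflect structure that is reproducible rather than an artifact of one clustering run.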
In TLE, histopathological reports have shown substantial varia-
bility in the distribution and severity of mesiotemporal lobe sclero-
sis between patients [109, 110]. A modern approach combining
quantitative histology and unsupervised machine learning identi-
fied histological subtypes with differential severity and regional
signatures [111]. Motivated by these findings, recent studies have
Fig. 3 Latent disease factors in TLE. Multimodal MRI (T1w, FLAIR, T1w/FLAIR, diffusion-derived FA, and MD) is
combined with surface-based analysis to model the main features of TLE pathology (atrophy, gliosis,
demyelination, and microstructural damage), which are z-scored with respect to the analogous vertices of
healthy controls, ipsi- and contralateral to the seizure focus. Latent Dirichlet allocation uncovered four latent
relations (viz., disease factors) from these features (expressed as posterior probability) and quantified their
co-expression (ranging from 0 to 1), as shown in the patients' factor composition matrix. On the color scale
below, higher probability (darker red) on a disease factor map signifies a greater contribution of a given
feature to that factor, namely, the disease load (pFDR < 0.05)
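The latent Dirichlet allocation step can be sketched generically: feature-wise abnormality is discretized into non-negative counts and each patient is expressed as a mixture of latent factors. The discretization scheme and data below are illustrative assumptions, not the study's actual preprocessing:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Illustrative "disease factor" modelling: absolute z-scores are binned into
# counts, and latent Dirichlet allocation recovers mixtures of latent factors.
n_patients, n_features, n_factors = 80, 40, 4
z = np.abs(rng.normal(0.0, 1.0, (n_patients, n_features)))
z[:20, :10] += 4.0                         # two dominant abnormality patterns
z[20:40, 10:20] += 4.0
counts = np.floor(z).astype(int)           # discretized abnormality load

lda = LatentDirichletAllocation(n_components=n_factors, random_state=0)
factor_composition = lda.fit_transform(counts)   # patients x factors; rows sum to 1
print(np.round(factor_composition[:2], 2))
```

Because each row of the factor composition sums to 1, a patient is described by how strongly each factor is co-expressed, which is the quantity the figure's composition matrix displays.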
Fig. 4 Latent disease factors in TLE. Drug response, seizure outcome, verbal IQ, memory index, and motor
index are more accurately predicted when using latent disease factors than when relying on conventional
group-level features (pFDR <0.001). Data points indicate mean balanced accuracy for categorical data (drug-
response, seizure outcome) and Pearson correlation coefficients for numerical data (cognitive scores)
evaluated based on 100 repetitions of tenfold cross-validation
References
1. Wiebe S, Jette N (2012) Pharmacoresistance and the role of surgery in difficult to treat epilepsy. Nat Rev Neurol 8:669. https://doi.org/10.1038/nrneurol.2012.181
2. Caciagli L, Bernasconi A, Wiebe S, Koepp MJ, Bernasconi N, Bernhardt BC (2017) A meta-analysis on progressive atrophy in intractable temporal lobe epilepsy: time is brain? Neurology 89(5):506–516. https://doi.org/10.1212/wnl.0000000000004176
3. Keezer MR, Sisodiya SM, Sander JW (2016) Comorbidities of epilepsy: current concepts and future perspectives. Lancet Neurol 15(1):106–115
4. Jobst BC, Cascino GD (2015) Resective epilepsy surgery for drug-resistant focal epilepsy: a review. JAMA 313(3):285–293. https://doi.org/10.1001/jama.2014.17426
5. Téllez-Zenteno JF, Ronquillo LH, Moien-Afshari F, Wiebe S (2010) Surgical outcomes
Magnetization transfer imaging in focal epilepsy. Neurology 60(10):1638–1645
58. Rugg-Gunn F, Boulby P, Symms M, Barker G, Duncan J (2005) Whole-brain T2 mapping demonstrates occult abnormalities in focal epilepsy. Neurology 64(2):318–325
59. Salmenpera TM, Symms MR, Rugg-Gunn FJ, Boulby PA, Free SL, Barker GJ, Yousry TA, Duncan JS (2007) Evaluation of quantitative magnetic resonance imaging contrasts in MRI-negative refractory focal epilepsy. Epilepsia 48(2):229–237
60. Bernasconi A, Antel SB, Collins DL, Bernasconi N, Olivier A, Dubeau F, Pike GB, Andermann F, Arnold DL (2001) Texture analysis and morphological processing of magnetic resonance imaging assist detection of focal cortical dysplasia in extra-temporal partial epilepsy. Ann Neurol 49(6):770–775. https://doi.org/10.1002/ana.1013
61. Colliot O, Antel SB, Naessens VB, Bernasconi N, Bernasconi A (2006) In vivo profiling of focal cortical dysplasia on high-resolution MRI with computational models. Epilepsia 47(1):134–142. https://doi.org/10.1111/j.1528-1167.2006.00379.x
62. Huppertz HJ, Grimm C, Fauser S, Kassubek J, Mader I, Hochmuth A, Spreer J, Schulze-Bonhage A (2005) Enhanced visualization of blurred gray–white matter junctions in focal cortical dysplasia by voxel-based 3D MRI analysis. Epilepsy Res 67(1–2):35–50
63. Antel SB, Li LM, Cendes F, Collins DL, Kearney RE, Shinghal R, Arnold DL (2002) Predicting surgical outcome in temporal lobe epilepsy patients using MRI and MRSI. Neurology 58(10):1505–1512. https://doi.org/10.1212/wnl.58.10.1505
64. Antel SB, Collins DL, Bernasconi N, Andermann F, Shinghal R, Kearney RE, Arnold DL, Bernasconi A (2003) Automated detection of focal cortical dysplasia lesions using computational models of their MRI characteristics and texture analysis. Neuroimage 19(4):1748–1759. https://doi.org/10.1016/S1053-8119(03)00226-X
65. Adler S, Wagstyl K, Gunny R, Ronan L, Carmichael D, Cross JH, Fletcher PC, Baldeweg T (2016) Novel surface features for automated detection of focal cortical dysplasias in paediatric epilepsy. Neuroimage Clin 14:18–27. https://doi.org/10.1016/j.nicl.2016.12.030
66. Gill RS, Hong SJ, Fadaie F, Caldairou B, Bernhardt B, Bernasconi N, Bernasconi A (2017) Automated detection of epileptogenic cortical malformations using multimodal MRI. In: Deep learning in medical image analysis and multimodal learning for clinical decision support: third international workshop, DLMIA 2017, ML-CDS 2017. Lecture notes in computer science, vol 10553, pp 349–356. https://doi.org/10.1007/978-3-319-67558-9_40
67. Hong SJ, Kim H, Schrader D, Bernasconi N, Bernhardt B, Bernasconi A (2014) Automated detection of cortical dysplasia type II in MRI-negative epilepsy. Neurology 83(1):48–55. https://doi.org/10.1212/WNL.0000000000000543
68. Jin B, Krishnan B, Adler S, Wagstyl K, Hu W, Jones S, Najm I, Alexopoulos A, Zhang K, Zhang J (2018) Automated detection of focal cortical dysplasia type II with surface-based magnetic resonance imaging postprocessing and machine learning. Epilepsia 59(5):982–992
69. Tan YL, Kim H, Lee S, Tihan T, Ver Hoef L, Mueller SG, Barkovich AJ, Xu D, Knowlton R (2018) Quantitative surface analysis of combined MRI and PET enhances detection of focal cortical dysplasias. Neuroimage 166:10–18
70. Snyder K, Whitehead EP, Theodore WH, Zaghloul KA, Inati SJ, Inati SK (2021) Distinguishing type II focal cortical dysplasias from normal cortex: a novel normative modeling approach. Neuroimage Clin 30:102565
71. Kini LG, Gee JC, Litt B (2016) Computational analysis in epilepsy neuroimaging: a survey of features and methods. Neuroimage Clin 11:515–529
72. Spitzer H, Ripart M, Whitaker K, Napolitano A, De Palma L, De Benedictis A, Foldes S, Humphreys Z, Zhang K, Hu W, Mo J, Likeman M, Davies S, Guttler C, Lenge M, Cohen NT, Tang Y, Wang S, Chari A, Tisdall M, Bargallo N, Conde-Blanco E, Pariente JC, Pascual-Diaz S, Delgado-Martínez I, Pérez-Enríquez C, Lagorio I, Abela E, Mullatti N, O'Muircheartaigh J, Vecchiato K, Liu Y, Caligiuri M, Sinclair B, Vivash L, Willard A, Kandasamy J, McLellan A, Sokol D, Semmelroch M, Kloster A, Opheim G, Ribeiro L, Yasuda C, Rossi-Espagnet C, Zhang K, Hamandi K, Tietze A, Barba C, Guerrini R, Gaillard WD, You X, Wang I, González-Ortiz S, Severino M, Striano P, Tortora D, Kalviainen R, Gambardella A, Labate A, Desmond P, Lui E, O'Brien T, Shetty J, Jackson G, Duncan J, Winston G, Pinborg L, Cendes F, Theis FJ, Shinohara RT, Cross JH, Baldeweg T, Adler S, Wagstyl K (2021) Interpretable surface-based detection of focal cortical dysplasias: a MELD study.
896 Hyo Min Lee et al.
outcome prediction of temporal lobe epilepsy 104. Arbabshirani MR, Plis S, Sui J, Calhoun VD
surgery. PLoS One 8(4):e62819 (2017) Single subject prediction of brain dis-
95. Bernhardt B, Hong SJ, Bernasconi A, Bernas- orders in neuroimaging: promises and pitfalls.
coni N (2013) Imaging structural and func- Neuroimage 145:137–165. https://doi.
tional brain networks in temporal lobe org/10.1016/j.neuroimage.2016.02.079
epilepsy. Front Hum Neurosci 7(624). 105. Colombo N, Tassi L, Deleo F, Citterio A,
https://doi.org/10.3389/fnhum.2013. Bramerio M, Mai R, Sartori I, Cardinale F,
00624 Lo Russo G, Spreafico R (2012) Focal cortical
96. Caciagli L, Bernhardt BC, Hong SJ, dysplasia type IIa and IIb: MRI aspects in
Bernasconi A, Bernasconi N (2014) Func- 118 cases proven by histopathology. Neuro-
tional network alterations and their structural radiology 54(10):1065–1077. https://doi.
substrate in drug-resistant epilepsy. Front org/10.1007/s00234-012-1049-1
Neurosci 8:411 106. Gross RE, Stern MA, Willie JT, Fasano RE,
97. Munsell BC, Wee CY, Keller SS, Weber B, Saindane AM, Soares BP, Pedersen NP, Drane
Elger C, da Silva LAT, Nesland T, Styner M, DL (2018) Stereotactic laser amygdalohippo-
Shen D, Bonilha L (2015) Evaluation of campotomy for mesial temporal lobe epilepsy.
machine learning algorithms for treatment Ann Neurol 83(3):575–587. https://doi.
outcome prediction in patients with epilepsy org/10.1002/ana.25180
based on structural connectome data. Neuro- 107. Hong SJ, Lee HM, Gill R, Crane J, Sziklas V,
image 118:219–230 Bernhardt BC, Bernasconi N, Bernasconi A
98. Taylor PN, Sinha N, Wang Y, Vos SB, De (2019) A connectome-based mechanistic
Tisi J, Miserocchi A, McEvoy AW, Winston model of focal cortical dysplasia. Brain
GP, Duncan JS (2018) The impact of epilepsy 142(3):688–699. https://doi.org/10.1093/
surgery on the structural connectome and its brain/awz009
relation to outcome. Neuroimage Clin 18: 108. Lee HM, Gill RS, Fadaie F, Cho KH, Guiot
202–214 MC, Hong SJ, Bernasconi N, Bernasconi A
99. He X, Doucet GE, Pustina D, Sperling MR, (2020) Unsupervised machine learning
Sharan AD, Tracy JI (2017) Presurgical tha- reveals lesional variability in focal cortical dys-
lamic hubness predicts surgical outcome in plasia at mesoscopic scale. Neuroimage Clin
temporal lobe epilepsy. Neurology 88(24): 28:102438. https://doi.org/10.1016/j.
2285–2293 nicl.2020.102438
100. Larivière S, Weng Y, Vos de Wael R, Royer J, 109. Margerison J, Corsellis J (1966) Epilepsy and
Frauscher B, Wang Z, Bernasconi A, the temporal lobes: a clinical, electroencepha-
Bernasconi N, Schrader DV, Zhang Z, Bern- lographic and neuropathologic study of the
hardt BC (2020) Functional connectome brain in epilepsy, with particular reference to
contractions in temporal lobe epilepsy: micro- the temporal lobes. Brain 89(3):499–530.
structural underpinnings and predictors of https://doi.org/10.1093/brain/89.3.499
surgical outcome. Epilepsia 61(6): 110. De Lanerolle NC, Kim JH, Williamson A,
1221–1233. https://doi.org/10.1111/epi. Spencer SS, Zaveri HP, Eid T, Spencer DD
16540 (2003) A retrospective analysis of hippocam-
101. Gleichgerrcht E, Keller SS, Drane DL, Mun- pal pathology in human temporal lobe epi-
sell BC, Davis KA, Kaestner E, Weber B, lepsy: evidence for distinctive patient
Krantz S, Vandergrift WA, Edwards JC subcategories. Epilepsia 44(5):677–687
(2020) Temporal lobe epilepsy surgical out- 111. Blümcke I, Pauli E, Clusmann H, Schramm J,
comes can be inferred based on structural Becker A, Elger C, Merschhemke M,
connectome hubs: a machine learning study. Meencke HJ, Lehmann T, von Deimling A
Ann Neurol 88(5):970–983 (2007) A new clinico-pathological classifica-
102. Sinha N, Wang Y, da Silva NM, Miserocchi A, tion system for mesial temporal sclerosis. Acta
McEvoy AW, de Tisi J, Vos SB, Winston GP, Neuropathol 113(3):235–244
Duncan JS, Taylor PN (2021) Structural 112. Reyes A, Kaestner E, Bahrami N,
brain network abnormalities and the proba- Balachandra A, Hegde M, Paul BM,
bility of seizure recurrence after epilepsy sur- Hermann B, McDonald CR (2019) Cogni-
gery. Neurology 96(5):e758–e771 tive phenotypes in temporal lobe epilepsy are
103. Lo A, Chernoff H, Zheng T, Lo SH (2015) associated with distinct patterns of white mat-
Why significant variables aren’t automatically ter network abnormalities. Neurology
good predictors. Proc Natl Acad Sci USA 92(17):e1957–e1968. https://doi.org/10.
112(45):13892–13897. https://doi.org/10. 1212/wnl.0000000000007370
1073/pnas.1518285112
898 Hyo Min Lee et al.
113. Rodrı́guez-Cruces R, Bernhardt BC, Concha wide gene expression and neuroimaging
L (2020) Multidimensional associations data. Neuroimage 189:353–367. https://
between cognition and connectome organi- doi.org/10.1016/j.neuroimage.201
zation in temporal lobe epilepsy. Neuroimage 9.01.011
213:116706. https://doi.org/10.1016/j. 116. Markello RD, Arnatkeviciute A, Poline JB,
neuroimage.2020.116706 Fulcher BD, Fornito A, Misic B (2021) Stan-
114. Lee HM, Fadaie F, Gill R, Caldairou B, dardizing workflows in imaging transcrip-
Sziklas V, Crane J, Hong SJ, Bernhardt BC, tomics with the abagen toolbox. Elife 10:
Bernasconi A, Bernasconi N (2021) Decom- e72129
posing MRI phenotypic heterogeneity in epi- 117. Lipton ZC (2018) The mythos of model
lepsy: a step towards personalized interpretability: in machine learning, the con-
classification. Brain 145(3):897–908. cept of interpretability is both important and
https://doi.org/10.1093/brain/awab425 slippery. Queue 16(3):31–57
115. Arnatkevičiūtė A, Fulcher BD, Fornito A
(2019) A practical guide to linking brain-
Chapter 28

Machine Learning in Multiple Sclerosis

Bas Jasperse and Frederik Barkhof
Abstract
Multiple sclerosis (MS) is characterized by inflammatory activity and neurodegeneration, leading to accumulating damage to the central nervous system and, with it, accumulating disability. MRI depicts an important part of the pathology of this disease and therefore plays a key role in diagnosis and disease monitoring. Still, major challenges remain with regard to the differential diagnosis, adequate monitoring of disease progression, quantification of CNS damage, and prediction of disease progression. Machine learning techniques have been employed in an attempt to overcome these challenges. This chapter gives an overview of how machine learning techniques are employed in MS, with applications for diagnostic classification, lesion segmentation, improved visualization of relevant brain pathology, characterization of neurodegeneration, and prognostic subtyping.
Key words Multiple sclerosis, Machine learning, Artificial intelligence, Deep learning, Neuroimaging
1 Introduction to MS
1.1 Disease Characteristics

The most striking feature is the appearance of focal inflammatory lesions in the CNS, visible on MR imaging of the brain (Fig. 1) and/or spinal cord, that may give rise to partially reversible loss of motor, sensory, and cognitive function, depending on lesion location and the magnitude of damage to local nerve tissue. As the disease and the resulting damage to the central nervous system accumulate, irreversible disability progresses over time. Tempting as this explanation may be, not all of the accumulated disability can be explained by focal inflammatory lesions [2]. Diffuse neurodegeneration in the CNS is another histopathological feature deemed responsible for gradually accumulating disability, especially in the later stages of the disease. This neurodegeneration is thought to result from a process
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_28,
© The Author(s) 2023
899
900 Bas Jasperse and Frederik Barkhof
Fig. 1 PD-weighted MR images (top row) showing the occurrence of MS-typical T2/PD hyperintense lesions over time. T1-weighted images (bottom row) showing enlargement of sulci and ventricles over time, consistent with brain atrophy, and hypointensity of multiple lesions due to local tissue loss. (Figure kindly provided by Dr. Alex Rovira (Hospital Universitari Vall d'Hebron, Barcelona))
Table 1
Brief overview of sequences that are typically used in clinical practice for the diagnosis and monitoring of MS

Brain MRI
  Ax T1 (<3 mm 2D or 3D)
    Diagnosis: Optional. Detection of T1 hypointense lesions
    Monitoring: Optional. Detection of T1 hypointense lesions
  Ax T2 and PD (<3 mm)
    Diagnosis: Recommended. Detection and localization of lesions (dissemination in space)
    Monitoring: Recommended. Detection of new lesions
  FLAIR (preferably 3D with FS)
    Diagnosis: Recommended. Detection and localization of lesions (dissemination in space)
    Monitoring: Recommended. Detection of new lesions
  Ax T1 after contrast (<3 mm 2D or 3D)
    Diagnosis: Recommended. Detection of (in)active inflammation (dissemination in time)
    Monitoring: Optional. Detection of new active inflammation
  DIR
    Diagnosis: Optional. Improved detection of (juxta)cortical lesions
    Monitoring: Optional. Detection of new lesions
  Ax DWI
    Diagnosis: Optional. Characterization of lesions (differential diagnosis)
    Monitoring: Optional. Differentiation of MS versus PML lesions

Optic nerve MRI
  Ax/cor T2 FS or STIR (≤3 mm)
    Diagnosis: Optional. Detection of optic neuritis
    Monitoring: Not required
  Ax/cor T1 after contrast (≤3 mm)
    Diagnosis: Optional. Detection of active optic neuritis
    Monitoring: Not required

Spinal cord MRI
  Sag T2 and PD (≤3 mm)
    Diagnosis: Optional. Detection of spinal cord lesions (dissemination in space)
    Monitoring: Optional. Detection of new spinal cord lesions
  Sag T1 after contrast (≤3 mm)
    Diagnosis: Optional. Detection of active inflammation (dissemination in time)
    Monitoring: Optional. Detection of new active inflammation

For a detailed description see the 2021 MAGNIMS recommendations [17]. (Note: local preferences may vary)
Ax axial orientation, Sag sagittal orientation, Cor coronal orientation, FS fat suppression, DWI diffusion-weighted imaging, PML progressive multifocal leukoencephalopathy
1.5 Advanced MR Imaging Techniques

More advanced MRI techniques, like magnetization transfer ratio (MTR), diffusion tensor imaging (DTI), and resting-state functional MRI (rs-fMRI), are generally not used for the diagnosis and monitoring of MS patients in clinical practice, as clinically relevant changes are hard to determine due to considerable biological and
2.3 Future Considerations

Taken together, these studies show that machine learning has the capability to assist in the differential diagnosis of MS and can be especially helpful for radiologists who are not specialized in neuroradiology.

Most of the aforementioned ML models that could aid in differential diagnosis have focused on the differentiation between MS and NMOSD. Although interesting from a scientific point of view, this distinction is not the only diagnostic challenge from a clinical point of view. The main challenge for radiologists who are not experienced with these disease entities is the distinction between demyelinating lesions due to MS and vascular lesions; this distinction should be the focus of future studies.

Generalizability across patient populations and MRI scanners is clearly the most important hurdle before these models can be introduced in clinical practice. In addition, these studies are generally limited to a small subset of differential diagnoses, which could lead to tunnel vision when relying on these tools in clinical practice.
3.1 Cross-Sectional Lesion Segmentation

A large number of ML-based models have been developed that provide cross-sectional automated lesion segmentation in MS [29–38], using a variety of ML architecture designs. Critical evaluation and comparison of this large and increasing number of lesion segmentation methods is necessary to determine the best performing methods and their added value over existing methods, using the large test datasets made available in various challenges organized by the Medical Image Computing and Computer Assisted Intervention Society (MICCAI, http://www.miccai.org/) and the International Symposium on Biomedical Imaging (ISBI, https://biomedicalimaging.org/) [28, 39]. Previous MS lesion segmentation challenges showed that segmentation algorithms could attain an average Dice score of 0.59 and an average surface distance of 0.91 for the segmentation of cross-sectional images in the MICCAI 2016 challenge [28], and an average Dice score of 0.670 and an average symmetric surface distance of 2.16 for the segmentation of longitudinal MR images in the ISBI 2015 challenge [39]. Most of these algorithms required multiple input sequences, including T1, T2, PD, and/or FLAIR, whereas only three algorithms required a single FLAIR sequence as input.
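The Dice score used to rank these challenge entries quantifies voxel-wise overlap between a predicted and a reference lesion mask (1 = perfect overlap, 0 = no overlap). A minimal sketch, with an illustrative function name and toy 2D masks standing in for real 3D segmentations:

```python
import numpy as np

def dice_score(seg: np.ndarray, ref: np.ndarray) -> float:
    """Dice overlap between a predicted and a reference binary lesion mask."""
    seg, ref = seg.astype(bool), ref.astype(bool)
    denom = seg.sum() + ref.sum()
    if denom == 0:          # both masks empty: define as perfect agreement
        return 1.0
    return 2.0 * np.logical_and(seg, ref).sum() / denom

# Toy 2D "lesion masks" (real use: 3D masks loaded from NIfTI files)
ref = np.zeros((6, 6), dtype=int); ref[2:5, 2:5] = 1   # 9 voxels
seg = np.zeros((6, 6), dtype=int); seg[2:4, 2:5] = 1   # 6 voxels, all inside ref
print(round(dice_score(seg, ref), 3))  # 2*6 / (6+9) = 0.8
```

The surface distances also reported by these challenges additionally require extracting mask boundaries, which is why overlap and surface metrics can rank methods differently.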
Besides segmentation performance, these methods need to be validated in real-world scenarios with subjects scanned on MRI machines different from those used for the original training dataset. To achieve this, some level of adjustment/optimization prior to implementation on a given dataset is generally needed. Weeda et al. [40] have compared methods with and without local optimization for cross-sectional segmentation of MS lesions using several freely available tools, including LST [33], NicMSlesions [41], and BIANCA [31] (Fig. 3). Optimization to the local dataset improved performance for all these methods, while retraining with manually labelled representative MR images provided the best performance.
Machine Learning in Multiple Sclerosis 907
Fig. 3 Output examples of four lesion segmentation algorithms and manual segmentation, overlaid on FLAIR images of the brain. (Figure adapted from Weeda et al. [40], reprinted with permission from Elsevier)
3.2 Detection of New MS Lesions

The longitudinal detection of new lesions is a highly important clinical monitoring task: new lesions demonstrate new inflammatory activity in the CNS, which may prompt initiation or change of treatment for an individual MS patient. This task requires tedious and time-consuming visual comparison of FLAIR images, especially in patients with a high number of confluent lesions. Initially, the aim of any treatment was to have no evidence of disease activity (NEDA). This proved to be unrealistic, as a low number of new lesions over time could be observed in patients treated with various treatment modalities [42]. Additional studies have shown that long-term clinical disability does not increase with two or fewer new lesions within 1 year and no contrast-enhancing lesions (minimal evidence of disease activity, MEDA).

Various machine learning models have been created to detect new MS lesions on subsequent MR images, based either on fusion or subtraction of subsequent segmentation maps [33, 43–48] or on end-to-end training of a combined registration and segmentation network on serial MRI scans [48]. Evaluation of these and new machine learning tools is expected following the recent MICCAI challenge (https://portal.fli-iam.irisa.fr/msseg-2/).
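The subtraction-based approach can be sketched as a logical difference of co-registered baseline and follow-up lesion masks, followed by a connected-component size filter to suppress spurious single-voxel changes. This is a deliberately simplified illustration (the function name and size threshold are made up; real pipelines add registration, intensity normalization, and learned classifiers):

```python
import numpy as np
from scipy import ndimage

def new_lesion_candidates(baseline: np.ndarray, followup: np.ndarray,
                          min_size: int = 3) -> np.ndarray:
    """Subtraction-based new-lesion map: voxels lesional at follow-up but not
    at baseline (masks assumed co-registered), keeping only connected
    components of at least min_size voxels to suppress noise."""
    diff = np.logical_and(followup.astype(bool), ~baseline.astype(bool))
    labels, n = ndimage.label(diff)   # default: face-connectivity components
    keep = np.zeros_like(diff)
    for i in range(1, n + 1):
        comp = labels == i
        if comp.sum() >= min_size:
            keep |= comp
    return keep

base = np.zeros((8, 8), dtype=int); base[1:4, 1:4] = 1   # existing lesion
follow = base.copy()
follow[6, 1:6] = 1                                       # new 5-voxel lesion
follow[0, 7] = 1                                         # 1-voxel noise
new = new_lesion_candidates(base, follow, min_size=3)
print(int(new.sum()))  # 5: the isolated noise voxel is filtered out
```

End-to-end networks such as those cited above replace this hand-crafted subtraction with learned change detection, but the size filter remains a common post-processing step.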
4.1 Synthetic DIR Sequences

Cortical lesions are an important part of MS pathology: they are specific to the disease, associated with disease progression [50, 51], and have recently been included in the radiological diagnostic criteria [52]. These cortical lesions are generally inconspicuous on commonly used FLAIR and T2-/PD-weighted MR images. The double inversion recovery (DIR) MRI sequence is uniquely capable of visualizing these cortical lesions through combined suppression of the MR signal from cerebrospinal fluid and white matter [53]. DIR sequences are generally not used in daily clinical practice or clinical trials due to the long acquisition time and lack of availability on most MR systems. Models based on generative adversarial networks have been trained to generate synthetic DIR images from conventional and routinely acquired T1, T2, and FLAIR images [54] or from T1 and PD/T2 images [55]. These synthetic DIR images improved the detection of juxtacortical lesions (12.3 ± 10.8 vs 7.2 ± 5.6, P < 0.001) [54] and cortical lesions (N = 626 vs 696) [56] compared to conventional MRI sequences. Although not as sensitive as truly acquired DIR images, synthetic DIR images are sensitive enough to improve diagnosis and prognostication in routine clinical settings.
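At its core, DIR synthesis is an image-to-image regression problem: each synthetic DIR voxel is predicted from the co-registered conventional contrasts. The cited works use generative adversarial networks; as a deliberately simplified, hedged illustration of the underlying idea (all data below are synthetic, and a per-voxel linear fit stands in for the deep generator):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic toy slices: pretend DIR is an (unknown) linear mix of contrasts.
t1, t2, flair = (rng.random((16, 16)) for _ in range(3))
dir_true = 0.2 * t1 - 0.7 * t2 + 1.1 * flair   # hidden "ground truth" mix

# Fit per-voxel regression: dir ≈ a*t1 + b*t2 + c*flair + d
X = np.column_stack([t1.ravel(), t2.ravel(), flair.ravel(),
                     np.ones(t1.size)])
coef, *_ = np.linalg.lstsq(X, dir_true.ravel(), rcond=None)

dir_synth = (X @ coef).reshape(t1.shape)
print(round(float(np.abs(dir_synth - dir_true).max()), 6))  # 0.0
```

Real tissue contrast is of course not a linear mix of input sequences, which is precisely why adversarially trained convolutional generators, which model spatial context and nonlinearity, are used instead.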
4.2 Prediction of Contrast-Enhancing Lesions

Apart from a very low risk of nephrogenic systemic fibrosis [57], gadolinium-based contrast agents are generally safe when used for imaging purposes. However, gadolinium is known to accumulate in the brain after repeated intravenous administrations. Although no adverse effects have been demonstrated to date, this is a cause for concern in the medical community, as the long-term effects are still unknown. Because of this, prediction of the presence of active inflammatory contrast-enhancing lesions without the use of contrast agents is desirable. Using a large multicenter dataset, Narayana et al. developed a deep learning model capable of predicting contrast-enhancing lesions from T1, T2, and FLAIR images, with a sensitivity and specificity of 78% and 73%, respectively, for patient-wise detection of enhancement using fivefold cross-validation [58].
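The patient-wise sensitivity and specificity quoted for this model are simple confusion-matrix ratios. A hedged sketch with made-up toy labels (not the authors' data or code):

```python
def sensitivity_specificity(y_true, y_pred):
    """Patient-wise sensitivity and specificity from binary labels
    (1 = enhancing lesion present, 0 = absent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 10 patients, ground truth vs model prediction
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.8 0.6
```

In fivefold cross-validation, these ratios are computed on each held-out fold and then averaged, so no patient is evaluated by a model that saw their scan during training.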
Fig. 4 Examples of lesional myelin content changes, showing T1-weighted images (left column), the change in myelin content predicted by the MR-based model proposed by Wei et al. (middle column), and the ground truth change in myelin content based on [(11)C]PIB PET imaging (right column). Demyelinating (red) and remyelinating (blue) voxels are indicated on top of the lesion mask (white). (Figure adapted from Wei et al. [60], reprinted with permission from Elsevier)
5.1 Brain Age Determination from MR Images

Neurodegenerative processes are known to change the macroscopic structure of the brain with increasing age. Similar structural brain changes are observed as a result of various neurodegenerative brain diseases, including MS. Such MS-related atrophic changes occur at a faster pace than would be expected in normally aging individuals. This has given rise to the "brain age" paradigm, in which accelerated aging of the brain is considered a marker of MS-related neurodegeneration [62]. Machine learning models trained on large populations of healthy aging individuals have been developed to determine biological brain age from T1-weighted MR images of the brain [63–65]. Subtracting the actual calendar age from this predicted brain age yields the brain-predicted age difference (brain-PAD), or brain age gap (BAG), an indicator of premature aging of the brain. Key advantages of brain-PAD/BAG over brain volume measurement are that these measures incorporate image characteristics across the entire brain (not only the segmented brain tissue, as in brain volume measurement), provide an intuitive and easily interpretable metric, are more robust to acquisition-related image variations, and, most importantly, are specific to the individual patient by inherently adjusting for age. Initial studies on brain age in MS found that the estimated brain age is between 4 and 6 years higher than chronological age in comparison to healthy controls, and that a higher relative brain age is associated with a higher degree of disability [65, 66]. A large retrospective multicenter study of brain age in MS showed that brain age is approximately 10 years higher than chronological age, is increased in MS compared to HC, predicts current as well as future disability, and is mainly driven by brain atrophy [67] (Fig. 5). Recent work has extended brain age models to reliably predict brain age from FLAIR sequences instead of the usual T1-weighted sequences (Colman et al., 2021; ISMRM 2022), enabling flexible implementation in retrospective research settings and general clinical practice.
Further studies are needed to elucidate changes in brain age over time, the relationship of brain age with a wider range of measures of cognitive and physical disability, the influence of non-MS-related factors on brain age, the pathological substrate of brain age in MS, and the effect of treatment on brain age. In the future, these brain age models may provide a useful clinical tool to quantify and monitor neurodegeneration in routine clinical care of MS patients.

Fig. 5 Examples of increasing differences between brain-predicted age and chronological age (brain-PAD) in a healthy control, three RRMS patients with increasing disease durations, and a PPMS patient with a very high brain-PAD despite a relatively short time since diagnosis. (Figure adapted from Cole et al. [67], CC BY 4.0)
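The brain-PAD computation itself is straightforward once a brain age model exists: fit a regression from image-derived features to age on healthy controls, then subtract a patient's chronological age from the model's prediction. A toy sketch with synthetic features (the published models use deep networks on raw T1 images; all names and numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: volume-like features from 200 healthy controls
# whose brain structure tracks chronological age plus noise.
age_hc = rng.uniform(20, 80, 200)
feats_hc = np.column_stack([100 - 0.5 * age_hc + rng.normal(0, 2, 200),
                            50 - 0.2 * age_hc + rng.normal(0, 1, 200)])

# Least-squares "brain age" model: age ≈ X @ w (last column is the intercept)
X = np.column_stack([feats_hc, np.ones(len(age_hc))])
w, *_ = np.linalg.lstsq(X, age_hc, rcond=None)

def brain_pad(features: np.ndarray, chronological_age: float) -> float:
    """brain-PAD = predicted brain age minus chronological age."""
    predicted = np.append(features, 1.0) @ w
    return float(predicted - chronological_age)

# A 40-year-old patient whose features resemble those of a ~60-year-old:
patient_feats = np.array([100 - 0.5 * 60, 50 - 0.2 * 60])
print(brain_pad(patient_feats, 40.0))  # positive: brain appears older than 40
```

A positive brain-PAD of roughly 20 years here mirrors, in exaggerated form, the elevated brain age reported for MS patients relative to healthy controls.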
Fig. 6 Evolution of MRI abnormalities in each of the three MRI-based subtypes revealed by the SuStaIn
analysis by Eshaghi et al. For each subtype, the left two columns depict the probability of regional brain
atrophy, and the right column depicts the probability of lesion occurrence in the various stages of MRI
abnormality progression. (Figure adapted from Eshaghi et al. [81], CC BY 4.0)
7 Concluding Remarks
Acknowledgments
References
sclerosis. N Engl J Med 365:1293–1303. on the use of MRI in patients with multiple
https://doi.org/10.1056/NEJMoa1014656 sclerosis. Lancet Neurol 20:653–670. https://
7. Kappos L et al (2010) A placebo-controlled doi.org/10.1016/S1474-4422(21)00095-8
trial of oral fingolimod in relapsing multiple 18. Polman CH et al (2011) Diagnostic criteria for
sclerosis. N Engl J Med 362:387–401. multiple sclerosis: 2010 revisions to the McDo-
https://doi.org/10.1056/NEJMoa0909494 nald criteria. Ann Neurol 69:292–302.
8. Hauser SL et al (2017) Ocrelizumab versus https://doi.org/10.1002/ana.22366
interferon beta-1a in relapsing multiple sclero- 19. Swanton JK et al (2007) MRI criteria for mul-
sis. N Engl J Med 376:221–234. https://doi. tiple sclerosis in patients presenting with clini-
org/10.1056/NEJMoa1601277 cally isolated syndromes: a multicentre
9. Polman CH et al (2006) A randomized, retrospective study. Lancet Neurol 6:677–
placebo-controlled trial of natalizumab for 686. https://doi.org/10.1016/S1474-4422
relapsing multiple sclerosis. N Engl J Med (07)70176-X
354:899–910. https://doi.org/10.1056/ 20. Wingerchuk DM, Lennon VA, Lucchinetti CF,
NEJMoa044397 Pittock SJ, Weinshenker BG (2007) The spec-
10. Cohen JA et al (2012) Alemtuzumab versus trum of neuromyelitis optica. Lancet Neurol 6:
interferon beta 1a as first-line treatment for 805–815. https://doi.org/10.1016/S1474-
patients with relapsing-remitting multiple scle- 4422(07)70216-8
rosis: a randomised controlled phase 3 trial. 21. Clarke L et al (2021) MRI patterns distinguish
Lancet 380:1819–1828. https://doi.org/10. AQP4 antibody positive neuromyelitis optica
1016/S0140-6736(12)61769-3 spectrum disorder from multiple sclerosis.
11. Kappos L et al (2018) Siponimod versus pla- Front Neurol 12:722237. https://doi.org/
cebo in secondary progressive multiple sclerosis 10.3389/fneur.2021.722237
(EXPAND): a double-blind, randomised, 22. Huang J et al (2021) Multi-parametric MRI
phase 3 study. Lancet 391:1263–1273. phenotype with trustworthy machine learning
https://doi.org/10.1016/S0140-6736(18) for differentiating CNS demyelinating diseases.
30475-6 J Transl Med 19:377. https://doi.org/10.
12. Muehler A, Peelen E, Kohlhof H, Groppel M, 1186/s12967-021-03015-w
Vitt D (2020) Vidofludimus calcium, a next 23. Hagiwara A et al (2021) Differentiation
generation DHODH inhibitor for the treat- between multiple sclerosis and neuromyelitis
ment of relapsing-remitting multiple sclerosis. optica spectrum disorders by multiparametric
Mult Scler Relat Disord 43:102129. https:// quantitative MRI using convolutional neural
doi.org/10.1016/j.msard.2020.102129 network. J Clin Neurosci 87:55–58. https://
13. Reich DS et al (2021) Safety and efficacy of doi.org/10.1016/j.jocn.2021.02.018
tolebrutinib, an oral brain-penetrant BTK 24. Kim H et al (2020) Deep learning-based
inhibitor, in relapsing multiple sclerosis: a method to differentiate neuromyelitis optica
phase 2b, randomised, double-blind, placebo- spectrum disorder from multiple sclerosis.
controlled trial. Lancet Neurol 20:729–738. Front Neurol 11:599042. https://doi.org/
https://doi.org/10.1016/S1474-4422(21) 10.3389/fneur.2020.599042
00237-4 25. Liu Y et al (2019) Radiomics in multiple scle-
14. Poser CM et al (1983) New diagnostic criteria rosis and neuromyelitis optica spectrum disor-
for multiple sclerosis: guidelines for research der. Eur Radiol 29:4670–4677. https://doi.
protocols. Ann Neurol 13:227–231. https:// org/10.1007/s00330-019-06026-w
doi.org/10.1002/ana.410130302 26. Luo X et al (2022) Multi-lesion radiomics
15. Barkhof F et al (1997) Comparison of MRI model for discrimination of relapsing-remitting
criteria at first presentation to predict conver- multiple sclerosis and neuropsychiatric sys-
sion to clinically definite multiple sclerosis. temic lupus erythematosus. Eur Radiol 32:
Brain 120(Pt 11):2059–2069. https://doi. 5700. https://doi.org/10.1007/s00330-
org/10.1093/brain/120.11.2059 022-08653-2
16. McDonald WI et al (2001) Recommended 27. Rauschecker AM et al (2020) Artificial intelli-
diagnostic criteria for multiple sclerosis: guide- gence system approaching neuroradiologist-
lines from the International Panel on the diag- level differential diagnosis accuracy at brain
nosis of multiple sclerosis. Ann Neurol 50: MRI. Radiology 295:626–637. https://doi.
121–127. https://doi.org/10.1002/ana. org/10.1148/radiol.2020190283
1032 28. Commowick O et al (2018) Objective evalua-
17. Wattjes MP et al (2021) 2021 MAGNIMS- tion of multiple sclerosis lesion segmentation
CMSC-NAIMS consensus recommendations using a data management and processing
Machine Learning in Multiple Sclerosis 917
infrastructure. Sci Rep 8:13650. https://doi. 39. Carass A et al (2017) Longitudinal multiple
org/10.1038/s41598-018-31911-7 sclerosis lesion segmentation: resource and
29. de Oliveira M et al (2022) Lesion volume challenge. NeuroImage 148:77–102. https://
quantification using two convolutional neural doi.org/10.1016/j.neuroimage.2016.12.064
networks in MRIs of multiple sclerosis patients. 40. Weeda MM et al (2019) Comparing lesion
Diagnostics (Basel) 12. https://doi.org/10. segmentation methods in multiple sclerosis:
3390/diagnostics12020230 input from one manually delineated subject is
30. Gabr RE et al (2020) Brain and lesion segmen- sufficient for accurate lesion segmentation.
tation in multiple sclerosis using fully convolu- NeuroImage Clin 24:102074. https://doi.
tional neural networks: a large-scale study. org/10.1016/j.nicl.2019.102074
Mult Scler 26:1217–1226. https://doi.org/ 41. Valverde S et al (2019) One-shot domain adap-
10.1177/1352458519856843 tation in multiple sclerosis lesion segmentation
31. Griffanti L et al (2016) BIANCA (Brain Inten- using convolutional neural networks. Neuro-
sity AbNormality Classification Algorithm): a Image Clin 21:101638. https://doi.org/10.
new tool for automated segmentation of white 1016/j.nicl.2018.101638
matter hyperintensities. NeuroImage 141: 42. Gasperini C et al (2019) Unraveling treatment
191–205. https://doi.org/10.1016/j. response in multiple sclerosis: a clinical and
neuroimage.2016.07.018 MRI challenge. Neurology 92:180–192.
32. Hindsholm AM et al (2021) Assessment of h t t p s : // d o i . o r g / 1 0 . 1 2 1 2 / W N L .
artificial intelligence automatic multiple sclero- 0000000000006810
sis lesion delineation tool for clinical use. Clin 43. Cabezas M et al (2016) Improved automatic
Neuroradiol 32:643. https://doi.org/10. detection of new T2 lesions in multiple sclero-
1007/s00062-021-01089-z sis using deformation fields. AJNR Am J Neu-
33. Schmidt P et al (2019) Automated segmenta- roradiol 37:1816–1823. https://doi.org/10.
tion of changes in FLAIR-hyperintense white 3174/ajnr.A4829
matter lesions in multiple sclerosis on serial 44. Sweeney EM, Shinohara RT, Shea CD, Reich
magnetic resonance imaging. NeuroImage DS, Crainiceanu CM (2013) Automatic lesion
Clin 23:101849. https://doi.org/10.1016/j. incidence estimation and detection in multiple
nicl.2019.101849 sclerosis using multisequence longitudinal
34. Shiee N et al (2010) A topology-preserving MRI. AJNR Am J Neuroradiol 34:68–73.
approach to the segmentation of brain images https://doi.org/10.3174/ajnr.A3172
with multiple sclerosis lesions. NeuroImage 45. Krüger J et al (2020) Fully automated longitu-
49:1524–1535. https://doi.org/10.1016/j. dinal segmentation of new or enlarged multiple
neuroimage.2009.09.005 sclerosis lesions using 3D convolutional neural
35. Zhang H et al (2021) ALL-Net: anatomical networks. NeuroImage Clin 28:102445.
information lesion-wise loss function https://doi.org/10.1016/j.nicl.2020.102445
integrated into neural network for multiple 46. McKinley R et al (2020) Automatic detection
sclerosis lesion segmentation. NeuroImage of lesion load change in multiple sclerosis using
Clin 32:102854. https://doi.org/10.1016/j. convolutional neural networks with segmenta-
nicl.2021.102854 tion confidence. NeuroImage Clin 25:102104.
36. Zhang Y et al (2022) A deep learning algorithm https://doi.org/10.1016/j.nicl.2019.102104
for white matter hyperintensity lesion detec- 47. Salem M et al (2018) A supervised framework
tion and segmentation. Neuroradiology 64: with intensity subtraction and deformation
727–734. https://doi.org/10.1007/s00234- field features for the detection of new T2-w
021-02820-w lesions in multiple sclerosis. NeuroImage Clin
37. Rakić M et al (2021) icobrain ms 5.1: combin- 17:607–615. https://doi.org/10.1016/j.nicl.
ing unsupervised and supervised approaches 2017.11.015
for improving the detection of multiple sclero- 48. Salem M et al (2020) A fully convolutional
sis lesions. NeuroImage Clin 31:102707. neural network for new T2-w lesion detection
https://doi.org/10.1016/j.nicl.2021.102707 in multiple sclerosis. NeuroImage Clin 25:
38. Valverde S et al (2017) Improving automated 102149. https://doi.org/10.1016/j.nicl.
multiple sclerosis lesion segmentation with a 2019.102149
cascaded 3D convolutional neural network 49. Rovira A et al (2022) Assessment of automatic
approach. NeuroImage 155:159–168. decision-support systems for detecting active
https://doi.org/10.1016/j.neuroimage. T2 lesions in multiple sclerosis patients. Mult
2017.04.034 Scler 28:1209. https://doi.org/10.1177/
13524585211061339
918 Bas Jasperse and Frederik Barkhof
Machine Learning in Multiple Sclerosis 919
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 29
Machine Learning for Cerebrovascular Disorders
Yannan Yu and David Yen-Ting Chen
Abstract
Cerebrovascular disease refers to a group of conditions that affect blood flow and the blood vessels in the
brain. It is one of the leading causes of mortality and disability worldwide, imposing a significant socioeco-
nomic burden on society. Research on cerebrovascular diseases has progressed rapidly, leading to
improvements in the diagnosis and management of patients. Machine learning holds many
promises for further improving the clinical care of these disorders. In this chapter, we briefly introduce
general information regarding cerebrovascular disorders and summarize some of the most promising fields
in which machine learning is likely to prove valuable for research and patient care. More specifically, we
cover the following cerebrovascular disorders: stroke (both ischemic and hemorrhagic), cerebral micro-
bleeds, cerebral vascular malformations, intracranial aneurysms, and cerebral small vessel disease (white
matter hyperintensities, lacunes, perivascular spaces).
Key words Cerebrovascular disorders, Machine learning, Stroke, Cerebral microbleeds, Cerebral
vascular malformations, Intracranial aneurysms, Cerebral small vessel disease, White matter hyperin-
tensities, Lacunes, Perivascular spaces
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_29,
© The Author(s) 2023
922 Yannan Yu and David Yen-Ting Chen
2 Ischemic Stroke
Fig. 1 General pathway of stroke diagnosis and treatment. Solid line represents general practice; dashed line
represents optional pathway. EMS emergency medical service, CT computed tomography, MRI magnetic
resonance imaging
2.1 Diagnosing Large Vessel Occlusion (LVO)
Large vessel occlusions are defined as blockages of the proximal
intracranial arteries, accounting for approximately 24–46% of acute
ischemic strokes [9]. Diagnosing an LVO is an important step of
stroke diagnosis and treatment considerations; patients with LVO
are potential candidates for endovascular thrombectomy, which is
the most effective treatment available to recanalize an occluded
artery [8, 10, 11]. Endovascular thrombectomy is a highly
specialized procedure, and the personnel and equipment needed
for thrombectomy are not widely available. Patients often need to
be transferred from the hospital where they are initially evaluated to
Machine Learning for Cerebrovascular Disorders 925
Fig. 2 Common CT scans used in acute stroke. This example case showed left-sided stroke (on the right side
of the image) with occlusion of the middle cerebral artery (M1 segment). The NCCT had only very subtle
changes, and the CT perfusion showed large perfusion deficit (asymmetrically low measures in CBF and
asymmetrically high measures on Tmax and MTT) and small irreversible tissue injury. Penumbra/core
mismatch is the volume ratio between the prolonged-Tmax area and the decreased-CBF area; the summary image
from RAPID software showed a mismatch of 99.5 mL/3.6 mL (a ratio of about 28). CTA showed middle cerebral artery main
trunk (M1 segment) occlusion. The DSA image showed recanalization of the artery occlusion after throm-
bectomy. NCCT non-contrast computed tomography, CBV cerebral blood volume, CBF cerebral blood flow,
Tmax time to maximum of the tissue residue function, MTT mean transit time, CTA computed tomography
angiography, DSA digital subtraction angiography
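The penumbra/core mismatch described in the caption above is simple to compute once the perfusion maps are thresholded. The sketch below assumes the commonly used cutoffs (Tmax > 6 s for the perfusion lesion, relative CBF < 30% of the contralateral hemisphere for the core) and an illustrative voxel volume; `mismatch_ratio` is a hypothetical helper for illustration, not part of RAPID or any vendor software.

```python
import numpy as np

# Illustrative thresholds: Tmax > 6 s for the perfusion lesion and
# relative CBF < 30% for the ischemic core are common choices, but
# both thresholds and the voxel volume are assumptions here.
def mismatch_ratio(tmax, rcbf, voxel_volume_ml=0.008,
                   tmax_thresh=6.0, rcbf_thresh=0.30):
    """Return (penumbra mL, core mL, mismatch ratio) from perfusion maps.

    tmax : Tmax map in seconds; rcbf : CBF relative to the
    contralateral hemisphere (0-1); both numpy arrays of equal shape.
    """
    penumbra_ml = float(np.count_nonzero(tmax > tmax_thresh)) * voxel_volume_ml
    core_ml = float(np.count_nonzero(rcbf < rcbf_thresh)) * voxel_volume_ml
    ratio = penumbra_ml / core_ml if core_ml > 0 else float("inf")
    return penumbra_ml, core_ml, ratio
```

For the case in Fig. 2, volumes of 99.5 mL and 3.6 mL would give a ratio of about 28.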
Fig. 3 Common MRI sequences used in acute stroke. This example case showed left-sided stroke (on the right
side of the image) with occlusion of a middle cerebral artery branch (M2 segment, inferior division). The
DSC-PWI and ASL showed the perfusion deficit (asymmetrically low measures in CBF and asymmetrically high
measures on Tmax and MTT), which was much greater than the irreversible tissue injury on DWI, with a
mismatch ratio of 2.9. GRE did not show blooming effect (a common finding of acute intra-arterial thrombus),
MRA showed a left-sided large vessel occlusion (M2 segment, white arrow), and T2-FLAIR taken 24 h after the
stroke showed injured brain tissue after the stroke (white arrows). DSC-PWI dynamic susceptibility contrast
perfusion-weighted imaging. ASL arterial spin labeling, CBV cerebral blood volume, CBF cerebral blood
flow, Tmax: time to maximum of the tissue residue function, MTT mean transit time, DWI diffusion-weighted
imaging, ADC apparent diffusion coefficient, GRE gradient-recalled echo sequence, MRA magnetic resonance
angiography, T2-FLAIR T2-weighted fluid-attenuated inversion recovery
2.2 Predicting Stroke Onset Time
In 14–27% of strokes, the symptom onset time is not known [31–
33]. For those patients, identifying the likely onset time is crucial
for proper treatment. Indeed, it is key to know if one is still within
the treatment window for intravenous thrombolysis (within 4.5 h)
or endovascular therapy (within 6 h if presence of LVO plus no
extensive lesion on non-contrast CT or 24 h if presence of LVO and
target mismatch on perfusion imaging). MRI plays a key role in
estimating the duration of stroke. Studies have shown that fluid-
attenuated inversion recovery (FLAIR) usually detects ischemic
lesion after 3–6 h of stroke onset [34, 35], in contrast to
diffusion-weighted imaging (DWI), which detects ischemic lesions
within minutes of stroke. Therefore, the “mismatch” between
FLAIR and DWI may be used as a clock for determining stroke
onset time [36]. Lee et al. [37] captured 89 vector features from
DWI and FLAIR imaging and trained machine learning models
including logistic regression, support vector machine, and random
forest to classify if the stroke onset is within 4.5 h. They found the
machine learning models were more sensitive (75.8% vs 48.5%,
p = 0.01) but less specific (82.6% vs 91.3%, p = 0.15) compared
to human readers. Similar results were also achieved by other
research groups [38]. Until recently, perfusion MRI had not been
studied for determining stroke onset time. Ho et al. [39, 40]
extracted deep features using an autoencoder from perfusion MRI
to classify whether stroke onset time was within 4.5 h (the current
time window for intravenous tissue plasminogen activator [tPA]).
Using input DWI, apparent diffusion coefficient (ADC), FLAIR,
and perfusion-weighted images, they achieved a ROC AUC of
0.765. This approach outperformed DWI-FLAIR-based machine
learning methods (AUC of 0.669) and clinical methods (AUC of
0.58) in the same dataset. The use of imaging to determine the time
of stroke onset may increase the number of patients eligible for
time-limited stroke treatments, such as intravenous
thrombolysis [31].
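The DWI–FLAIR mismatch classifiers described above are conventional supervised models on handcrafted imaging features. As a minimal, self-contained sketch of the idea (synthetic data, a single invented feature, and plain logistic regression rather than the 89-feature models of Lee et al.):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a per-patient imaging feature: the FLAIR/DWI
# lesion intensity ratio tends to rise as the lesion ages, so early
# (<4.5 h) strokes cluster at lower values. All numbers are invented.
n = 200
early = rng.normal(1.05, 0.08, n)   # label 1: onset < 4.5 h
late = rng.normal(1.35, 0.10, n)    # label 0: onset >= 4.5 h
x = np.concatenate([early, late])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Standardize, add an intercept, and fit plain logistic regression
# by gradient descent (no external ML library needed).
x = (x - x.mean()) / x.std()
Xb = np.column_stack([np.ones_like(x), x])
w = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    w -= 0.5 * Xb.T @ (p - y) / len(y)

pred = (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5)
accuracy = (pred == y.astype(bool)).mean()
```

In practice one would evaluate with cross-validation and report sensitivity/specificity, as the studies above do, rather than raw training accuracy.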
2.3 Stroke Lesion Segmentation
Non-contrast-enhanced CT scan is the most common initial imaging
obtained for stroke patients. Therefore, CT datasets are usually
much more common and larger than MRI datasets. However, it is
generally more challenging to diagnose early stroke or predict final
stroke lesions on CT than MRI, as changes on CT related to early
hyperacute phase (<6 h) of ischemic stroke are very subtle, includ-
ing loss of gray and white matter differentiation, hypoattenuation
of deep nuclei, and cortical hypodensity with associated parenchy-
mal swelling and gyral effacement. The Alberta Stroke Program
2.6 Predicting Stroke Clinical Outcomes
Compared to predicting future stroke lesions on images, clinical
outcome prediction is more difficult for several reasons. The most
common scoring system, the modified Rankin score (mRS), is
nonlinear and subjective, and the unit of analysis is each patient
rather than each voxel (Table 1). The majority of the previously
published studies used non-imaging data as input to predict clinical
outcomes using simple statistical or more complex machine
learning models [84–89]. However, images may provide more
information such as the spatial location of infarct and hemorrhage
and the presence of brain atrophy. Osama et al. [90] proposed a
parallel multi-parametric feature-embedded Siamese neural net-
work [91] to classify 3-month mRS from 0 to 4 using the MRI
perfusion maps and clinical data from the ISLES 2017 challenge.
This model achieved an average accuracy of 37% on each class using
leave-one-out cross-validation testing. Nishi et al. [92] proposed a
U-Net with DWI as input and stroke lesion segmentation as out-
put. Then the bottleneck features of the U-Net were extracted to
predict whether the 3-month mRS would be greater than 2, a
common metric of good clinical outcome. This method achieved
a ROC AUC of 0.81, exceeding the performance of ASPECTS
Score (ROC AUC of 0.63) and ischemic core volume models
(ROC AUC of 0.64). These studies show promise that automated
imaging analysis might be helpful in the prediction of clinical out-
comes, but further study into these complex and ambitious predic-
tions is needed.
Table 1
Modified Rankin scale
0 No symptoms at all
1 No significant disability despite symptoms; able to carry out all usual duties and activities
2 Slight disability; unable to carry out all previous activities, but able to look after own affairs without
assistance
3 Moderate disability; requiring some help, but able to walk without assistance
4 Moderately severe disability; unable to walk without assistance and unable to attend to own bodily
needs without assistance
5 Severe disability; bedridden, incontinent, and requiring constant nursing care and attention
6 Dead
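The mRS categories in Table 1 are routinely dichotomized; a score of 2 or less is the "good outcome" threshold used by Nishi et al. [92]. A trivial encoding (descriptions abbreviated from the table):

```python
# Modified Rankin Scale (mRS) categories from Table 1, abbreviated.
MRS = {
    0: "No symptoms",
    1: "No significant disability despite symptoms",
    2: "Slight disability, independent for own affairs",
    3: "Moderate disability, walks without assistance",
    4: "Moderately severe disability, cannot walk unassisted",
    5: "Severe disability, bedridden",
    6: "Dead",
}

def good_outcome(mrs: int) -> bool:
    """mRS 0-2 is conventionally considered a good clinical outcome."""
    if mrs not in MRS:
        raise ValueError(f"mRS must be 0-6, got {mrs}")
    return mrs <= 2
```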
U-Net CNN based on the work of Guo et al. [93]. The study also
investigated several input combinations (MRI + PET vs MRI only)
to determine whether baseline O-15 PET CBF information is
required. Using a ground truth of O-15 PET rΔCBF in a cohort
of Moyamoya disease patients (a condition with chronic narrowing
of brain arteries leading to increased stroke risk), they showed that
using the baseline MRI alone resulted in better performance at
predicting regions with compromised CVR than the current clinical
method using ASL before and after acetazolamide injection. Such a
method may find use in estimating CVR from routine MRI scans
acquired as part of clinical practice, obviating the need for either
PET or acetazolamide.
5 Intracranial Aneurysms
5.1 Difficulty in Aneurysm Detection
Because of the small size of IAs and the complexity of intracranial
vessels, aneurysm detection can be time-consuming and requires
subspecialty training. This poses two challenges. First, there is a
suboptimal inter-observer agreement (kappa = 0.67–0.73) in the
detection of IA from CTA and MRA [108]. The interpretation may
vary depending on the level of expertise. Therefore, the sensitivity
of detecting IA in CTA and MRA can range from 60% for a resident
to 80% for a neuroradiologist [109]. Second, there is a high false-
negative rate in detecting small aneurysms with diameter less than
5 mm. It has been reported that the sensitivity of detecting IAs of
less than 5 mm is 57–70% [108, 110] for CTA and 35–58% for
MRA [109, 110]. In comparison, the sensitivity of detecting IAs
larger than 5 mm is 94% for CTA and 86% for MRA. Given these
difficulties, there is a clinical need for high-performance
computer-assisted diagnosis (CAD) tools to aid detection, increase
efficiency, and reduce disagreement among observers, which may
ultimately improve the clinical care of patients.
5.1.1 AI Algorithms for Intracranial Aneurysm Detection
Several studies have shown that CAD programs can automatically
detect IAs in MRA or CTA. The conventional CAD
systems, based on manually designed imaging features, such as
vessel curvature, thresholding, or a region-growing algorithm,
have shown good performance in detecting IA [111, 112]. How-
ever, these conventional methods were developed on very small
datasets and had to be modified manually when applied to new
images. New deep learning-based methods directly learn the most
predictive features from a large dataset of labeled images. They have
better performance and greater generalizability than conventional
methods. Deep learning has also been used for IA detection in
MRA and CTA, and several studies have shown decent results
[113–116].
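Of the conventional hand-crafted approaches mentioned above, region growing is the easiest to illustrate. A minimal 6-connected, intensity-threshold region grower on a toy volume (an illustration of the technique, not a clinical algorithm):

```python
import numpy as np
from collections import deque

def region_grow(volume, seed, threshold):
    """Grow a 6-connected region from `seed`, keeping voxels whose
    intensity exceeds `threshold` — the kind of hand-designed rule
    (combined with thresholding) that conventional aneurysm CAD
    systems used to extract candidate vessel structures."""
    grown = np.zeros(volume.shape, dtype=bool)
    if volume[seed] <= threshold:
        return grown                      # seed itself fails the rule
    grown[seed] = True
    queue = deque([seed])
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                 (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while queue:
        z, y, x = queue.popleft()
        for dz, dy, dx in neighbors:
            nz, ny, nx = z + dz, y + dy, x + dx
            if (0 <= nz < volume.shape[0] and 0 <= ny < volume.shape[1]
                    and 0 <= nx < volume.shape[2]
                    and not grown[nz, ny, nx]
                    and volume[nz, ny, nx] > threshold):
                grown[nz, ny, nx] = True
                queue.append((nz, ny, nx))
    return grown
```

Its brittleness is apparent: the threshold and connectivity must be re-tuned per scanner and contrast protocol, which is exactly the limitation that motivated the shift to learned features.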
6.1 Imaging Features of cSVD
Neuroimaging plays a pivotal role in the diagnosis and evaluation of
cSVD [143]. According to the STandards for ReportIng Vascular
changes on nEuroimaging (STRIVE), the imaging features of
cSVD include recent small subcortical infarcts, white matter hyper-
intensities (WMH) of presumed vascular origin, lacunes, enlarged
perivascular spaces (PVS), and cerebral microbleeds (CMBs)
(Fig. 4) [144]. These imaging findings, either individually or in
combination, are associated with cognitive impairment, dementia,
depression, mobility problems, increased risk of stroke, and worse
outcomes after stroke [146, 151–153]. The quantification of cSVD
imaging features is important for disease severity evaluation and
clinical prognostication [154, 155]. However, these lesions are
generally small and widespread in the brain, rendering manual
inspection and segmentation laborious and prone to error. Machine
learning algorithms have great potential in the automatic
Fig. 4 MR imaging features for cerebral small vessel disease. Clinical images (upper) and illustrations
(middle) of MRI features for cerebral small vessel disease, with a summary of imaging characteristics (lower)
for individual features. DWI, diffusion-weighted imaging. FLAIR, fluid-attenuated inversion recovery. SWI,
susceptibility-weighted imaging. ↑, increased signal. ↓, decreased signal. ↔, iso-intense signal. (The figure is
reproduced based on Wardlaw et al. [145])
Fig. 5 The Fazekas visual rating scale for white matter hyperintensity. A four-grade scale depending on the
size and confluence of lesions is given in the periventricular (upper) and deep white matter (lower) regions,
respectively
Fig. 6 Example of white matter hyperintensity (WMH) segmentation and quantification. (a) The original T2
FLAIR image. (b) Automatic WMH segmentation (pink areas) and volume quantification can be achieved by
deep learning algorithm which provides a more precise estimation of WMH burden in the brain than the
Fazekas scale. WMH white matter hyperintensity
6.2.1 Deep Learning-Based Methods for WMH Segmentation
The WMH Segmentation Challenge at the Medical Image Computing
and Computer Assisted Intervention Society (MICCAI)
2017 (https://wmh.isi.uu.nl/) provided a standardized assessment
of automatic methods for WMH segmentation. The multi-center/
multi-scanner dataset comprised images from patients with various
degrees of age-related degenerative and vascular pathologies. The
training dataset included 60 images from 3 scanners, with manual
WMH segmentation by 2 experts as the ground truth. The testing
dataset included 110 images obtained from 5 MR scanners, includ-
ing data from 2 scanners not used in the training set, to evaluate the
generalizability of segmentation methods on unseen scanners.
Five evaluation metrics, including DSC, modified Hausdorff dis-
tance, volume difference, sensitivity, and F1 for detecting individual
lesion, were used to rank the methods. Among the 20 participants,
all the top 10 participants applied deep learning methods
[172]. The top-ranking methods performed similarly or better
than the two independent human observers, who did not serve as
the raters of the ground truth, suggesting the potential of auto-
matic methods to replace human raters (Fig. 6). Li et al. [173], the
winner, achieved a DSC of 0.8 and a recall of 0.84 by utilizing an
ensemble of three fully convolutional neural networks similar to
U-Net with different initializations. Of note, they removed the
WMH predictions in the first and last 1/8 of slices, where false
positives frequently occurred, as a post-processing step.
Andermatt et al. [174], in second place, utilized a network based
on multi-dimensional gated recurrent units (GRU), trained on 3D
patches, to achieve a DSC of 0.78 and a recall of 0.83. Ghafoorian
et al. [175], in third place, constructed a multi-scale 2D CNN
trained with tenfold cross-validation, selecting the three best-performing
checkpoints on the training data, to achieve a DSC of 0.77, a recall of
0.73, and the highest F1 score of 0.78. Valverde et al. [176], in
fourth, constructed a cascade framework of three 3D CNNs, with
the first model to identify candidate lesion voxels, the second to
reduce false-positive detections, and the third to perform final
WMH segmentation. Overall, challenge results indicate that
ensemble methods and strategies for false-positive reduction,
including selective sampling WMH mimics, removing slices prone
to false positives, and adding false-positive reduction model, are
advantageous. The top-ranking models generally had very few false
positives in normal areas that are hyperintense on FLAIR but are
not WMH (e.g., the septum pellucidum), a fault of many lower-
ranking methods. Although the top four models also led the
inter-scanner robustness ranking, some higher-ranking deep
learning-based methods were less robust across scanners than
lower-ranking rule-based methods, suggesting that data-driven
approaches may not always generalize well to unseen scanners.
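Several of the challenge's evaluation metrics are straightforward to compute from binary masks. A sketch of the voxel-level ones (lesion-wise F1 and the modified Hausdorff distance are omitted, since they require connected-component and surface-distance analysis):

```python
import numpy as np

def voxel_metrics(pred, truth):
    """Voxel-level DSC, absolute volume difference (as a percentage
    of the reference volume), and sensitivity for two binary
    segmentation masks of equal shape."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.count_nonzero(pred & truth)
    n_pred = np.count_nonzero(pred)
    n_truth = np.count_nonzero(truth)
    dsc = 2.0 * tp / (n_pred + n_truth)
    vol_diff_pct = 100.0 * abs(n_pred - n_truth) / n_truth
    sensitivity = tp / n_truth
    return dsc, vol_diff_pct, sensitivity
```

Note that a method can score a high DSC while missing many small lesions, which is why the challenge also ranked lesion-wise detection (F1).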
The WMH Segmentation Challenge remains open for new and
updated submissions. Zhang et al. [177] designed a dual-path
U-Net segmentation model that used an attention mechanism to
combine FLAIR sequences and a brain atlas (for location informa-
tion) inputs to achieve higher performance than the previously
mentioned methods. Park et al. [178] proposed a U-Net with
multi-scale highlighting foregrounds, which was designed to
improve the detection of the WMH voxels with partial volume
effects, and achieved a record high of DSC (0.81) and F1 score
(0.79).
Although deep learning methods are gaining popularity and
have shown great performance in the WMH Segmentation Chal-
lenge, a recent systematic review [179] of automatic WMH segmen-
tation methods developed from 2015 to July 2020 showed no
evidence to favor deep learning methods in clinical research over
the k-NN algorithm [180, 181], linear regression [182, 183], or
unsupervised methods (e.g., fuzzy c-means algorithm [184, 185],
Gaussian mixture model [186], statistical definition [187]), in
terms of spatial agreement with reference segmentations (i.e.,
DSC). Non-deep learning methods, such as k-NN and linear
regression methods, have the advantage of simplicity, can be easier
to train, and may be less susceptible to overfitting when dealing
6.3 Cerebral Microbleed (CMB) Detection
CMBs are radiological manifestations of cerebral small vessel disease,
usually defined as small (≤10 mm) areas of signal void on
T2-weighted gradient-recalled echo (GRE) or susceptibility-
weighted images (SWI). CMBs are frequently seen in patients
with spontaneous intracranial hemorrhage [189] or cognitive
impairment [189] and are associated with a higher risk of hemor-
rhage after IV thrombolysis or therapeutic anticoagulation
[190, 191]. CMBs are highly associated with underlying uncon-
trolled hypertension (particularly when located in deep and/or
posterior fossa structures) [192] and/or cerebral amyloid angio-
pathy (especially when seen in cortical locations) [193]. Detecting
CMBs can be clinically important to assess the benefits and risks in
treatment planning for stroke patients.
Greenberg et al. [189] published a detailed field guide to CMB
detection. The small size of CMBs and the existence of several
CMB mimics (e.g., small veins, calcifications, cavernous malforma-
tions, iron deposition in deep nucleus, and flow voids) lead to
limited inter-observer agreement, long scan interpretation time,
and increased error rate by manual inspection, especially for
patients with heavy CMB load.
Automatic CMB detection methods might improve the effi-
ciency and accuracy of CMB identification. Radiomic-based and
traditional machine learning automatic detection methods have
been investigated. Van den Heuvel et al. [194] used morphological
features based on the dark and spherical nature of CMBs and
a random forest classifier to achieve a sensitivity of 89.1% with 25.9
false positives per subject on CMB detection. Several studies have
applied deep learning models to improve CMB detection [195–
198]. Dou et al. [198] utilized a two-step cascade framework, first
with a 3D fully convolutional network for the screening of CMB
candidates, followed by a 3D CNN discriminator for the exclusion
of CMB mimics, to achieve a sensitivity of 93.16%, precision of
44.31%, and 2.74 false positives per subject for the detection of
CMB on SWI. Liu et al. [196] used a two-stage 3D CNN architec-
ture, while adding phase images to SWI as model inputs. The phase
images enabled the differentiation of diamagnetic calcifications
from paramagnetic CMB, which is not a distinction radiologists
can make solely on SWI. Their model successfully reduced false-
positive detection and achieved a sensitivity of 95.8%, precision of
70.9%, and 1.6 false positives per subject. Rashid et al. further
added quantitative susceptibility mapping (QSM) to SWI as inputs
to construct a multi-class U-Net CNN method to differentiate
CMBs and non-hemorrhage iron deposits, which was not
6.4 Lacune Lesion Detection
Lacunes of presumed vascular origin are sequelae of chronic small
subcortical infarcts or hemorrhages located in deep gray and white
matter in the territory of a perforating arteriole [144]. They are
associated with an increased risk of stroke, dementia, and gait
impairment [143, 144]. In neuroimaging, lacunes appear as
round or ovoid, subcortical, fluid-filled cavities, measuring between
3 and 15 mm, typically showing a surrounding hyperintense gliotic
rim on T2 FLAIR images [144]. Longitudinal spatial mapping
studies show new WMH forming around small subcortical infarcts
[199] and new lacunes forming at the margin of WMH [200],
suggesting a strong association and vicinity between the two types
of lesions. Therefore, automatic applications that can not only
segment WMH but also detect lacunes are desired. However, few
studies have proposed automatic methods for lacune detection.
Uchiyama et al. [201] developed an algorithm that first used
top-hat transformation and multiple-phase binarization techniques
to detect potential candidates of lacune and then used rule-based
schemes and a support vector machine to eliminate the false posi-
tives to achieve a sensitivity of 96.8% with 0.76 false positives per
slice. Wang et al. [169] applied a multi-step algorithm to detect
WMH, cortical infarcts, and lacunes. The steps included extraction
of brain tissue, segmentation of hyperintense lesions from brain
tissue using Gaussian mixture model, separation of WMH and
cortical infarct based on anatomical location and morphological
operation, and segmentation of lacunes based on location and
intensity threshold. They achieved a sensitivity of 83.3% with 0.06
false positives per subject for lacune detection. Ghafoorian et al.
[202] used a two-stage deep learning method, which included a
fully convolutional neural network for candidate detection and a
3D multi-scale location-aware CNN for false-positive reduction.
The method achieved a sensitivity of 97.4% with 0.13 false positives
per slice.
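The Gaussian-mixture intensity model used by Wang et al. [169] to segment hyperintense lesions can be sketched with a two-component 1D EM fit. This is a toy version on raw intensities; the published pipeline adds brain extraction, anatomical location rules, and morphological operations.

```python
import numpy as np

def fit_gmm_1d(x, n_iter=100):
    """Two-component 1D Gaussian mixture fitted by EM — a minimal
    version of the intensity model used to separate hyperintense
    lesion voxels from normal-appearing brain tissue."""
    mu = np.array([x.min(), x.max()], dtype=float)   # crude init
    var = np.array([x.var(), x.var()], dtype=float)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each voxel
        d = -0.5 * (x[:, None] - mu) ** 2 / var
        r = pi * np.exp(d) / np.sqrt(2 * np.pi * var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```

Voxels with high responsibility for the brighter component would then become lesion candidates, to be filtered by the location and size rules described above.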
6.5 Perivascular Space Quantification
Perivascular spaces (PVS), also known as Virchow-Robin spaces, are
extensions of extracerebral fluid spaces that surround the
penetrating vessels of the brain [144]. They were recently recog-
nized as parts of the glymphatic system, which is a brain-wide
perivascular fluid transport system responsible for the clearance of
waste in the brain [203]. Normal PVS are not typically seen on
conventional MRI, while enlarged PVS are associated with progres-
sion of subcortical infarcts, WMH, CMBs, and cognitive decline
and are considered a biomarker for cSVD [204]. In neuroimaging,
PVS appear as round or ovoid cavities with diameters less than
6.6 Total Small Vessel Disease Burden
cSVD is considered a dynamic, whole brain disorder with a wide
spectrum of clinical presentations and diffuse imaging manifesta-
tions in the brain while sharing common microvascular pathologies
[209]. A multifactorial approach that combines all imaging features
may better represent the burden and disease status of cSVD. Several
visual scoring systems of total cSVD burden have been introduced
[154, 205]. Staals et al. [154] proposed a four-point score in which
one point is given in the presence of each of the cSVD imaging
feature: (1) more than one lacune, (2) more than one microbleed,
(3) moderate to severe (more than 11) PVS in basal ganglia, and
(4) periventricular WMH Fazekas score of 3 and/or deep WMH
Fazekas score of 2–3. Although these semiquantitative scoring
systems are pragmatic and simple for clinical use, they have several
limitations. First, they may not be sensitive enough to represent the
severity of the disease, as the accumulation of cSVD burden forms a
948 Yannan Yu and David Yen-Ting Chen
Fig. 7 AI applications for cerebral small vessel disease (cSVD). AI algorithms have great potential to perform automatic quantification of individual cSVD imaging features. By combining these burdens, a "total cSVD burden" could be quantified, which might facilitate clinical assessment, treatment monitoring, and outcome prediction in patients with cSVD
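Because the Staals score is a simple additive rule, it can be transcribed directly into code. The sketch below mirrors the four criteria exactly as they are worded above (consult Staals et al. [154] for the authoritative cut-offs); the function and variable names are illustrative:

```python
# Four-point total cSVD burden score (one point per imaging feature),
# transcribed from the criteria listed above. Names are illustrative.
def total_csvd_burden(n_lacunes, n_microbleeds, basal_ganglia_pvs,
                      periventricular_fazekas, deep_fazekas):
    points = 0
    if n_lacunes > 1:            # (1) more than one lacune
        points += 1
    if n_microbleeds > 1:        # (2) more than one microbleed
        points += 1
    if basal_ganglia_pvs > 11:   # (3) moderate to severe PVS in basal ganglia
        points += 1
    if periventricular_fazekas == 3 or deep_fazekas >= 2:
        points += 1              # (4) WMH Fazekas criteria (deep score 2-3)
    return points

print(total_csvd_burden(2, 0, 15, 3, 1))  # prints 3 (criteria 1, 3, 4 met)
```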
7 Conclusion
Acknowledgments
References
1. Virani SS, Alonso A, Aparicio HJ, Benjamin EJ, Bittencourt MS, Callaway CW et al (2021) Heart disease and stroke statistics-2021 update: a report from the American Heart Association. Circulation 143(8):e254–e743. https://doi.org/10.1161/CIR.0000000000000950
2. Collaborators GBDS (2021) Global, regional, and national burden of stroke and its risk factors, 1990-2019: a systematic analysis for the global burden of disease study 2019. Lancet Neurol 20(10):795–820. https://doi.org/10.1016/S1474-4422(21)00252-0
3. Cannistraro RJ, Badi M, Eidelman BH, Dickson DW, Middlebrooks EH, Meschia JF (2019) CNS small vessel disease: a clinical review. Neurology 92(24):1146–1156. https://doi.org/10.1212/WNL.0000000000007654
4. Ingall T, Asplund K, Mähönen M, Bonita R (2000) A multinational comparison of subarachnoid hemorrhage epidemiology in the WHO MONICA stroke study. Stroke 31(5):1054–1061. https://doi.org/10.1161/01.str.31.5.1054
5. van Gijn J, Kerr RS, Rinkel GJE (2007) Subarachnoid haemorrhage. Lancet (London, England) 369(9558):306–318. https://doi.org/10.1016/S0140-6736(07)60153-6
6. Hop JW, Rinkel GJ, Algra A, van Gijn J (1997) Case-fatality rates and functional outcome after subarachnoid hemorrhage: a systematic review. Stroke 28(3):660–664. https://doi.org/10.1161/01.str.28.3.660
7. Saver JL, Goyal M, van der Lugt A, Menon BK, Majoie CB, Dippel DW et al (2016) Time to treatment with endovascular thrombectomy and outcomes from ischemic stroke: a meta-analysis. JAMA 316(12):1279–1288. https://doi.org/10.1001/jama.2016.13647
8. Powers WJ, Rabinstein AA, Ackerson T, Adeoye OM, Bambakidis NC, Becker K et al (2019) Guidelines for the early management of patients with acute ischemic stroke: 2019
update to the 2018 guidelines for the early management of acute ischemic stroke: a guideline for healthcare professionals from the American Heart Association/American Stroke Association. Stroke 50(12):e344–e418. https://doi.org/10.1161/STR.0000000000000211
9. Rennert RC, Wali AR, Steinberg JA, Santiago-Dieppa DR, Olson SE, Pannell JS et al (2019) Epidemiology, natural history, and clinical presentation of large vessel ischemic stroke. Neurosurgery 85(suppl_1):S4–S8. https://doi.org/10.1093/neuros/nyz042
10. Goyal M, Menon BK, van Zwam WH, Dippel DW, Mitchell PJ, Demchuk AM et al (2016) Endovascular thrombectomy after large-vessel ischaemic stroke: a meta-analysis of individual patient data from five randomised trials. Lancet 387(10029):1723–1731. https://doi.org/10.1016/S0140-6736(16)00163-X
11. Badhiwala JH, Nassiri F, Alhazzani W, Selim MH, Farrokhyar F, Spears J et al (2015) Endovascular thrombectomy for acute ischemic stroke: a meta-analysis. JAMA 314(17):1832–1843. https://doi.org/10.1001/jama.2015.13767
12. Hassan AE, Ringheanu VM, Rabah RR, Preston L, Tekle WG, Qureshi AI (2020) Early experience utilizing artificial intelligence shows significant reduction in transfer times and length of stay in a hub and spoke model. Interv Neuroradiol 26(5):615–622. https://doi.org/10.1177/1591019920953055
13. Yahav-Dovrat A, Saban M, Merhav G, Lankri I, Abergel E, Eran A et al (2021) Evaluation of artificial intelligence-powered identification of large-vessel occlusions in a comprehensive stroke center. AJNR Am J Neuroradiol 42(2):247–254. https://doi.org/10.3174/ajnr.A6923
14. Stib MT, Vasquez J, Dong MP, Kim YH, Subzwari SS, Triedman HJ et al (2020) Detecting large vessel occlusion at multiphase CT angiography by using a deep convolutional neural network. Radiology 297(3):640–649. https://doi.org/10.1148/radiol.2020200334
15. Dehkharghani S, Lansberg M, Venkatsubramanian C, Cereda C, Lima F, Coelho H et al (2021) High-performance automated anterior circulation CT angiographic clot detection in acute stroke: a multi-reader comparison. Radiology 298(3):665–670. https://doi.org/10.1148/radiol.2021202734
16. You J, Yu PLH, Tsang ACO, Tsui ELH, Woo PPS, Leung GKK (2019) Automated computer evaluation of acute ischemic stroke and large vessel occlusion. arXiv preprint arXiv:1906.08059
17. You J, Tsang ACO, Yu PLH, Tsui ELH, Woo PPS, Lui CSM et al (2020) Automated hierarchy evaluation system of large vessel occlusion in acute ischemia stroke. Front Neuroinform 14:13. https://doi.org/10.3389/fninf.2020.00013
18. Olive-Gadea M, Crespo C, Granes C, Hernandez-Perez M, Perez de la Ossa N, Laredo C et al (2020) Deep learning based software to identify large vessel occlusion on noncontrast computed tomography. Stroke 51(10):3133–3137. https://doi.org/10.1161/STROKEAHA.120.030326
19. Gaha M, Roy C, Estrade L, Gevry G, Weill A, Roy D et al (2014) Inter- and intraobserver agreement in scoring angiographic results of intra-arterial stroke therapy. AJNR Am J Neuroradiol 35(6):1163–1169. https://doi.org/10.3174/ajnr.A3828
20. Volny O, Cimflova P, Szeder V (2017) Inter-rater reliability for thrombolysis in cerebral infarction with TICI 2c category. J Stroke Cerebrovasc Dis 26(5):992–994. https://doi.org/10.1016/j.jstrokecerebrovasdis.2016.11.008
21. Ueda D, Katayama Y, Yamamoto A, Ichinose T, Arima H, Watanabe Y et al (2021) Deep learning-based angiogram generation model for cerebral angiography without misregistration artifacts. Radiology 299(3):675–681. https://doi.org/10.1148/radiol.2021203692
22. Zhang M, Zhang C, Wu X, Cao X, Young GS, Chen H et al (2020) A neural network approach to segment brain blood vessels in digital subtraction angiography. Comput Methods Prog Biomed 185:105159. https://doi.org/10.1016/j.cmpb.2019.105159
23. Bhurwani MMS, Snyder KV, Waqas M, Mokin M, Rava RA, Podgorsak AR et al (2021) Use of biplane quantitative angiographic imaging with ensemble neural networks to assess reperfusion status during mechanical thrombectomy. Proc SPIE Int Soc Opt Eng 11597:115971F
24. Su R, Cornelissen SAP, van der Sluijs M, van Es A, van Zwam WH, Dippel DWJ et al (2021) autoTICI: automatic brain tissue reperfusion scoring on 2D DSA images of acute ischemic stroke patients. IEEE Trans Med Imaging 40(9):2380–2391. https://doi.org/10.1109/TMI.2021.3077113
25. Su R, van der Sluijs M, Cornelissen SAP, Lycklama G, Hofmeijer J, Majoie C et al (2022) Spatio-temporal deep learning for automatic detection of intracranial vessel perforation in digital subtraction angiography during endovascular thrombectomy. Med Image Anal 77:102377. https://doi.org/10.1016/j.media.2022.102377
26. Wang J, Zhang J, Gong X, Zhang W, Zhou Y, Lou M (2022) Prediction of large vessel occlusion for ischaemic stroke by using the machine learning model random forests. Stroke Vasc Neurol 7(2):94–100. https://doi.org/10.1136/svn-2021-001096
27. Chen Z, Zhang R, Xu F, Gong X, Shi F, Zhang M et al (2018) Novel prehospital prediction model of large vessel occlusion using artificial neural network. Front Aging Neurosci 10:181. https://doi.org/10.3389/fnagi.2018.00181
28. Tarkanyi G, Tenyi A, Hollos R, Kalmar PJ, Szapary L (2022) Optimization of large vessel occlusion detection in acute ischemic stroke using machine learning methods. Life (Basel) 12(2):230. https://doi.org/10.3390/life12020230
29. Uchida K, Kouno J, Yoshimura S, Kinjo N, Sakakibara F, Araki H et al (2022) Development of machine learning models to predict probabilities and types of stroke at prehospital stage: the Japan urgent stroke triage score using machine learning (JUST-ML). Transl Stroke Res 13(3):370–381. https://doi.org/10.1007/s12975-021-00937-x
30. Hayashi Y, Shimada T, Hattori N, Shimazui T, Yoshida Y, Miura RE et al (2021) A prehospital diagnostic algorithm for strokes using machine learning: a prospective observational study. Sci Rep 11(1):20519. https://doi.org/10.1038/s41598-021-99828-2
31. Thomalla G, Simonsen CZ, Boutitie F, Andersen G, Berthezene Y, Cheng B et al (2018) MRI-guided thrombolysis for stroke with unknown time of onset. N Engl J Med 379(7):611–622. https://doi.org/10.1056/NEJMoa1804355
32. Mackey J, Kleindorfer D, Sucharew H, Moomaw CJ, Kissela BM, Alwell K et al (2011) Population-based study of wake-up strokes. Neurology 76(19):1662–1667. https://doi.org/10.1212/WNL.0b013e318219fb30
33. Fink JN, Kumar S, Horkan C, Linfante I, Selim MH, Caplan LR et al (2002) The stroke patient who woke up: clinical and radiological features, including diffusion and perfusion MRI. Stroke 33(4):988–993. https://doi.org/10.1161/01.str.0000014585.17714.67
34. Xu XQ, Zu QQ, Lu SS, Cheng QG, Yu J, Sheng Y et al (2014) Use of FLAIR imaging to identify onset time of cerebral ischemia in a canine model. AJNR Am J Neuroradiol 35(2):311–316. https://doi.org/10.3174/ajnr.A3689
35. Thomalla G, Rossbach P, Rosenkranz M, Siemonsen S, Krutzelmann A, Fiehler J et al (2009) Negative fluid-attenuated inversion recovery imaging identifies acute ischemic stroke at 3 hours or less. Ann Neurol 65(6):724–732. https://doi.org/10.1002/ana.21651
36. Petkova M, Rodrigo S, Lamy C, Oppenheim G, Touze E, Mas JL et al (2010) MR imaging helps predict time from symptom onset in patients with acute stroke: implications for patients with unknown onset time. Radiology 257(3):782–792. https://doi.org/10.1148/radiol.10100461
37. Lee H, Lee EJ, Ham S, Lee HB, Lee JS, Kwon SU et al (2020) Machine learning approach to identify stroke within 4.5 hours. Stroke 51(3):860–866. https://doi.org/10.1161/STROKEAHA.119.027611
38. Zhu H, Jiang L, Zhang H, Luo L, Chen Y, Chen Y (2021) An automatic machine learning approach for ischemic stroke onset time identification based on DWI and FLAIR imaging. Neuroimage Clin 31:102744. https://doi.org/10.1016/j.nicl.2021.102744
39. Ho KC, Speier W, Zhang H, Scalzo F, El-Saden S, Arnold CW (2019) A machine learning approach for classifying ischemic stroke onset time from imaging. IEEE Trans Med Imaging 38(7):1666–1676. https://doi.org/10.1109/TMI.2019.2901445
40. Ho KC, Speier W, El-Saden S, Arnold CW (2017) Classifying acute ischemic stroke onset time using deep imaging features. AMIA Annu Symp Proc 2017:892–901
41. Pexman JH, Barber PA, Hill MD, Sevick RJ, Demchuk AM, Hudon ME et al (2001) Use of the Alberta stroke program early CT score (ASPECTS) for assessing CT scans in patients with acute stroke. AJNR Am J Neuroradiol 22(8):1534–1542
42. Yoshimura S, Sakai N, Yamagami H, Uchida K, Beppu M, Toyoda K et al (2022) Endovascular therapy for acute stroke with a large ischemic region. N Engl J Med. https://doi.org/10.1056/NEJMoa2118191
43. Chen L, Bentley P, Rueckert D (2017) Fully automatic acute ischemic lesion segmentation in DWI using convolutional neural networks. Neuroimage Clin 15:633–643. https://doi.org/10.1016/j.nicl.2017.06.016
44. Wu O, Winzeck S, Giese AK, Hancock BL, Etherton MR, Bouts M et al (2019) Big data approaches to phenotyping acute ischemic stroke using automated lesion segmentation of multi-center magnetic resonance imaging data. Stroke 50(7):1734–1741. https://doi.org/10.1161/STROKEAHA.119.025373
45. Maier O, Menze BH, von der Gablentz J, Hani L, Heinrich MP, Liebrand M et al (2017) ISLES 2015 - a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. Med Image Anal 35:250–269. https://doi.org/10.1016/j.media.2016.07.009
46. Praveen GB, Agrawal A, Sundaram P, Sardesai S (2018) Ischemic stroke lesion segmentation using stacked sparse autoencoder. Comput Biol Med 99:38–52. https://doi.org/10.1016/j.compbiomed.2018.05.027
47. Tomita N, Jiang S, Maeder ME, Hassanpour S (2020) Automatic post-stroke lesion segmentation on MR images using 3D residual convolutional neural network, vol 27, p 102276
48. Liu L, Kurgan L, Wu FX, Wang J (2020) Attention convolutional neural network for accurate segmentation and quantification of lesions in ischemic stroke disease. Med Image Anal 65:101791. https://doi.org/10.1016/j.media.2020.101791
49. Karthik R, Gupta U, Jha A, Rajalakshmi R, Menaka R (2019) A deep supervised approach for ischemic lesion segmentation from multimodal MRI using fully convolutional network. Appl Soft Comput 84:105685. https://doi.org/10.1016/j.asoc.2019.105685
50. Liu L, Chen S, Zhang F, Wu F, Pan Y, Wang J (2020) Deep convolutional neural network for automatically segmenting acute ischemic stroke lesion in multi-modality MRI. Neural Comput Appl 32:6545
51. Zhang L, Song R, Wang Y, Zhu C, Liu J, Yang J et al (2020) Ischemic stroke lesion segmentation using multi-plane information fusion. IEEE Access 8:45715–45725. https://doi.org/10.1109/ACCESS.2020.2977415
52. Zhao B, Ding S, Wu H, Liu G, Cao C, Jin S et al (2019) Automatic acute ischemic stroke lesion segmentation using semi-supervised learning. https://doi.org/10.48550/arXiv.1908.03735
53. Federau C, Christensen S, Scherrer N, Ospel JM, Schulze-Zachau V, Schmidt N et al (2020) Improved segmentation and detection sensitivity of diffusion-weighted stroke lesions with synthetically enhanced deep learning. Radiol Artif Intell 2(5):e190217. https://doi.org/10.1148/ryai.2020190217
54. Woo I, Lee A, Jung SC, Lee H, Kim N, Cho SJ et al (2019) Fully automatic segmentation of acute ischemic lesions on diffusion-weighted imaging using convolutional neural networks: comparison with conventional algorithms. Korean J Radiol 20(8):1275–1284. https://doi.org/10.3348/kjr.2018.0615
55. Winzeck S, Mocking SJT, Bezerra R, Bouts M, McIntosh EC, Diwan I et al (2019) Ensemble of convolutional neural networks improves automated segmentation of acute ischemic lesions using multiparametric diffusion-weighted MRI. AJNR Am J Neuroradiol 40(6):938–945. https://doi.org/10.3174/ajnr.A6077
56. Do LN, Baek BH, Kim SK, Yang HJ, Park I, Yoon W (2020) Automatic assessment of ASPECTS using diffusion-weighted imaging in acute ischemic stroke using recurrent residual convolutional neural network. Diagnostics (Basel) 10(10):803. https://doi.org/10.3390/diagnostics10100803
57. Qiu W, Kuang H, Teleg E, Ospel JM, Sohn SI, Almekhlafi M et al (2020) Machine learning for detecting early infarction in acute stroke with non-contrast-enhanced CT. Radiology 294(3):638–644. https://doi.org/10.1148/radiol.2020191193
58. Kuang H, Najm M, Chakraborty D, Maraj N, Sohn SI, Goyal M et al (2019) Automated ASPECTS on noncontrast CT scans in patients with acute ischemic stroke using machine learning. AJNR Am J Neuroradiol 40(1):33–38. https://doi.org/10.3174/ajnr.A5889
59. Albers GW, Wald MJ, Mlynash M, Endres J, Bammer R, Straka M et al (2019) Automated calculation of Alberta stroke program early CT score: validation in patients with large hemispheric infarct. Stroke 50(11):3277–3279. https://doi.org/10.1161/STROKEAHA.119.026430
60. Maegerlein C, Fischer J, Monch S, Berndt M, Wunderlich S, Seifert CL et al (2019) Automated calculation of the Alberta stroke program early CT score: feasibility and reliability. Radiology 291(1):141–148. https://doi.org/10.1148/radiol.2019181228
61. Brinjikji W, Abbasi M, Arnold C, Benson JC, Braksick SA, Campeau N et al (2021) e-ASPECTS software improves interobserver agreement and accuracy of interpretation of ASPECTS score. Interv Neuroradiol 27(6):781–787. https://doi.org/10.1177/15910199211011861
62. Neuhaus A, Seyedsaadat SM, Mihal D, Benson J, Mark I, Kallmes DF et al (2020)
116. Dai X, Huang L, Qian Y, Xia S, Chong W, Liu J et al (2020) Deep learning for automated cerebral aneurysm detection on computed tomography images. Int J Comput Assist Radiol Surg 15(4):715–723. https://doi.org/10.1007/s11548-020-02121-2
117. Zeng Y, Liu X, Xiao N, Li Y, Jiang Y, Feng J et al (2019) Automatic diagnosis based on spatial information fusion feature for intracranial aneurysm. IEEE Trans Med Imaging. https://doi.org/10.1109/TMI.2019.2951439
118. Duan H, Huang Y, Liu L, Dai H, Chen L, Zhou L (2019) Automatic detection on intracranial aneurysm from digital subtraction angiography with cascade convolutional neural networks. Biomed Eng Online 18(1):110. https://doi.org/10.1186/s12938-019-0726-2
119. Sichtermann T, Faron A, Sijben R, Teichert N, Freiherr J, Wiesmann M (2019) Deep learning-based detection of intracranial aneurysms in 3D TOF-MRA. AJNR Am J Neuroradiol 40(1):25–32. https://doi.org/10.3174/ajnr.A5911
120. Faron A, Sichtermann T, Teichert N, Luetkens JA, Keulers A, Nikoubashman O et al (2019) Performance of a deep-learning neural network to detect intracranial aneurysms from 3D TOF-MRA compared to human readers. Clin Neuroradiol. https://doi.org/10.1007/s00062-019-00809-w
121. Investigators UJ, Morita A, Kirino T, Hashi K, Aoki N, Fukuhara S et al (2012) The natural course of unruptured cerebral aneurysms in a Japanese cohort. N Engl J Med 366(26):2474–2482. https://doi.org/10.1056/NEJMoa1113260
122. Naggara ON, Lecler A, Oppenheim C, Meder J-F, Raymond J (2012) Endovascular treatment of intracranial unruptured aneurysms: a systematic review of the literature on safety with emphasis on subgroup analyses. Radiology 263(3):828–835. https://doi.org/10.1148/radiol.12112114
123. Thompson BG, Brown RD, Amin-Hanjani S, Broderick JP, Cockroft KM, Connolly ES et al (2015) Guidelines for the management of patients with unruptured intracranial aneurysms: a guideline for healthcare professionals from the American Heart Association/American Stroke Association. Stroke 46(8):2368–2400. https://doi.org/10.1161/STR.0000000000000070
124. Wiebers DO, Whisnant JP, Huston J, Meissner I, Brown RD, Piepgras DG et al (2003) Unruptured intracranial aneurysms: natural history, clinical outcome, and risks of surgical and endovascular treatment. Lancet (London, England) 362(9378):103–110. https://doi.org/10.1016/s0140-6736(03)13860-3
125. Murayama Y, Fujimura S, Suzuki T, Takao H (2019) Computational fluid dynamics as a risk assessment tool for aneurysm rupture. Neurosurg Focus 47(1):E12. https://doi.org/10.3171/2019.4.FOCUS19189
126. Chien A, Sayre J (2014) Morphologic and hemodynamic risk factors in ruptured aneurysms imaged before and after rupture. AJNR Am J Neuroradiol 35(11):2130–2135. https://doi.org/10.3174/ajnr.A4016
127. Cornelissen BMW, Schneiders JJ, Potters WV, van den Berg R, Velthuis BK, Rinkel GJE et al (2015) Hemodynamic differences in intracranial aneurysms before and after rupture. AJNR Am J Neuroradiol 36(10):1927–1933. https://doi.org/10.3174/ajnr.A4385
128. Sugiyama S-I, Endo H, Omodaka S, Endo T, Niizuma K, Rashad S et al (2016) Daughter sac formation related to blood inflow jet in an intracranial aneurysm. World Neurosurg 96:396–402. https://doi.org/10.1016/j.wneu.2016.09.040
129. Stember JN, Chang P, Stember DM, Liu M, Grinband J, Filippi CG et al (2019) Convolutional neural networks for the detection and measurement of cerebral aneurysms on magnetic resonance angiography. J Digit Imaging 32(5):808–815. https://doi.org/10.1007/s10278-018-0162-z
130. Podgorsak AR, Rava RA, Shiraz Bhurwani MM, Chandra AR, Davies JM, Siddiqui AH et al (2019) Automatic radiomic feature extraction using deep learning for angiographic parametric imaging of intracranial aneurysms. J Neurointerv Surg. https://doi.org/10.1136/neurintsurg-2019-015214
131. Zhao X, Gold N, Fang Y, Xu S, Zhang Y, Liu J et al (2018) Vertebral artery fusiform aneurysm geometry in predicting rupture risk. R Soc Open Sci 5(10):180780. https://doi.org/10.1098/rsos.180780
132. Liu Q, Jiang P, Jiang Y, Ge H, Li S, Jin H et al (2019) Prediction of aneurysm stability using a machine learning model based on PyRadiomics-derived morphological features. Stroke 50(9):2314–2321. https://doi.org/10.1161/STROKEAHA.119.025777
133. Kim HC, Rhim JK, Ahn JH, Park JJ, Moon JU, Hong EP et al (2019) Machine learning application for rupture risk assessment in small-sized intracranial aneurysm. J Clin Med 8(5):683. https://doi.org/10.3390/jcm8050683
134. Paliwal N, Jaiswal P, Tutino VM, Shallwani H, Davies JM, Siddiqui AH et al (2018) Outcome prediction of intracranial aneurysm treatment by flow diverters using machine learning. Neurosurg Focus 45(5):E7. https://doi.org/10.3171/2018.8.FOCUS18332
135. Liu J, Chen Y, Lan L, Lin B, Chen W, Wang M et al (2018) Prediction of rupture risk in anterior communicating artery aneurysms with a feed-forward artificial neural network. Eur Radiol 28(8):3268–3275. https://doi.org/10.1007/s00330-017-5300-3
136. Varble N, Tutino VM, Yu J, Sonig A, Siddiqui AH, Davies JM et al (2018) Shared and distinct rupture discriminants of small and large intracranial aneurysms. Stroke 49(4):856–864. https://doi.org/10.1161/STROKEAHA.117.019929
137. Detmer FJ, Luckehe D, Mut F, Slawski M, Hirsch S, Bijlenga P et al (2019) Comparison of statistical learning approaches for cerebral aneurysm rupture assessment. Int J Comput Assist Radiol Surg. https://doi.org/10.1007/s11548-019-02065-2
138. Tanioka S, Ishida F, Yamamoto A, Shimizu S, Sakaida H, Toyoda M et al (2020) Machine learning classification of cerebral aneurysm rupture status with morphologic variables and hemodynamic parameters. Radiol Artif Intell 2(1):e190077. https://doi.org/10.1148/ryai.2019190077
139. Shi Z, Chen GZ, Mao L, Li XL, Zhou CS, Xia S et al (2021) Machine learning-based prediction of small intracranial aneurysm rupture status using CTA-derived hemodynamics: a multicenter study. AJNR Am J Neuroradiol 42(4):648–654. https://doi.org/10.3174/ajnr.A7034
140. Kim KH, Koo HW, Lee BJ, Sohn MJ (2021) Analysis of risk factors correlated with angiographic vasospasm in patients with aneurysmal subarachnoid hemorrhage using explainable predictive modeling. J Clin Neurosci 91:334–342. https://doi.org/10.1016/j.jocn.2021.07.028
141. Ramos LA, van der Steen WE, Sales Barros R, Majoie C, van den Berg R, Verbaan D et al (2019) Machine learning improves prediction of delayed cerebral ischemia in patients with subarachnoid hemorrhage. J Neurointerv Surg 11(5):497–502. https://doi.org/10.1136/neurintsurg-2018-014258
142. Rubbert C, Patil KR, Beseoglu K, Mathys C, May R, Kaschner MG et al (2018) Prediction of outcome after aneurysmal subarachnoid haemorrhage using data from patient admission. Eur Radiol 28(12):4949–4958. https://doi.org/10.1007/s00330-018-5505-0
143. Wardlaw JM, Smith C, Dichgans M (2013) Mechanisms of sporadic cerebral small vessel disease: insights from neuroimaging. Lancet Neurol 12(5):483–497. https://doi.org/10.1016/S1474-4422(13)70060-7
144. Wardlaw JM, Smith EE, Biessels GJ, Cordonnier C, Fazekas F, Frayne R et al (2013) Neuroimaging standards for research into small vessel disease and its contribution to ageing and neurodegeneration. Lancet Neurol 12(8):822–838
145. Wardlaw JM, Smith C, Dichgans M (2019) Small vessel disease: mechanisms and clinical implications. Lancet Neurol 18(7):684–696. https://doi.org/10.1016/S1474-4422(19)30079-1
146. Debette S, Schilling S, Duperron M-G, Larsson SC, Markus HS (2019) Clinical significance of magnetic resonance imaging markers of vascular brain injury: a systematic review and meta-analysis. JAMA Neurol 76(1):81–94. https://doi.org/10.1001/jamaneurol.2018.3122
147. Kwon HM, Lynn MJ, Turan TN, Derdeyn CP, Fiorella D, Lane BF et al (2016) Frequency, risk factors, and outcome of coexistent small vessel disease and intracranial arterial stenosis: results from the Stenting and Aggressive Medical Management for Preventing Recurrent Stroke in Intracranial Stenosis (SAMMPRIS) Trial. JAMA Neurol 73(1):36–42. https://doi.org/10.1001/jamaneurol.2015.3145
148. Pasi M, Cordonnier C (2020) Clinical relevance of cerebral small vessel diseases. Stroke 51(1):47–53. https://doi.org/10.1161/STROKEAHA.119.024148
149. Kapasi A, DeCarli C, Schneider JA (2017) Impact of multiple pathologies on the threshold for clinically overt dementia. Acta Neuropathol 134(2):171–186. https://doi.org/10.1007/s00401-017-1717-7
150. Bos D, Wolters FJ, Darweesh SKL, Vernooij MW, Wolf F, Ikram MA et al (2018) Cerebral small vessel disease and the risk of dementia: a systematic review and meta-analysis of population-based evidence. Alzheimers Dement 14(11):1482–1492. https://doi.org/10.1016/j.jalz.2018.04.007
151. Akoudad S, Wolters FJ, Viswanathan A, Bruijn RF, Lugt A, Hofman A et al (2016) Association of cerebral microbleeds with cognitive decline and dementia. JAMA Neurol 73(8):934–943. https://doi.org/10.1001/jamaneurol.2016.1017
152. Brown R, Benveniste H, Black SE, Charpak S, Dichgans M, Joutel A et al (2018) Understanding the role of the perivascular space in cerebral small vessel disease. Cardiovasc Res 114(11):1462–1473. https://doi.org/10.1093/cvr/cvy113
153. Georgakis MK, Duering M, Wardlaw JM, Dichgans M (2019) WMH and long-term outcomes in ischemic stroke: a systematic review and meta-analysis. Neurology 92(12):e1298–e1308. https://doi.org/10.1212/WNL.0000000000007142
154. Staals J, Makin SD, Doubal FN, Dennis MS, Wardlaw JM (2014) Stroke subtype, vascular risk factors, and total MRI brain small-vessel disease burden. Neurology 83(14):1228–1234. https://doi.org/10.1212/WNL.0000000000000837
155. Xu X, Hilal S, Collinson SL, Chong EJY, Ikram MK, Venketasubramanian N et al (2015) Association of magnetic resonance imaging markers of cerebrovascular disease burden and cognition. Stroke 46(10):2808–2814. https://doi.org/10.1161/STROKEAHA.115.010700
156. Prins ND, Scheltens P (2015) White matter hyperintensities, cognitive impairment and dementia: an update. Nat Rev Neurol 11(3):157–165. https://doi.org/10.1038/nrneurol.2015.10
157. Fazekas F, Chawluk JB, Alavi A, Hurtig HI, Zimmerman RA (1987) MR signal abnormalities at 1.5 T in Alzheimer's dementia and normal aging. AJR Am J Roentgenol 149(2):351–356. https://doi.org/10.2214/ajr.149.2.351
158. Heuvel DMJ, Dam VH, Craen AJM, Admiraal-Behloul F, Es ACGM, Palm WM et al (2006) Measuring longitudinal white matter changes: comparison of a visual rating scale with a volumetric measurement. AJNR Am J Neuroradiol 27(4):875–878
159. Straaten ECW, Fazekas F, Rostrup E, Scheltens P, Schmidt R, Pantoni L et al (2006) Impact of white matter hyperintensities scoring method on correlations with clinical data: the LADIS study. Stroke 37(3):836–840. https://doi.org/10.1161/01.STR.0000202585.26325.74
160. Mäntylä R, Erkinjuntti T, Salonen O, Aronen HJ, Peltonen T, Pohjasvaara T et al (1997) Variable agreement between visual rating scales for white matter hyperintensities on MRI. Comparison of 13 rating scales in a poststroke cohort. Stroke 28(8):1614–1623. https://doi.org/10.1161/01.str.28.8.1614
161. Anbeek P, Vincken KL, Osch MJP, Bisschops RHC, Grond J (2004) Probabilistic segmentation of white matter lesions in MR imaging. NeuroImage 21(3):1037–1044. https://doi.org/10.1016/j.neuroimage.2003.10.012
162. Lao Z, Shen D, Liu D, Jawad AF, Melhem ER, Launer LJ et al (2008) Computer-assisted segmentation of white matter lesions in 3D MR images using support vector machine. Acad Radiol 15(3):300–313. https://doi.org/10.1016/j.acra.2007.10.012
163. Herskovits EH, Bryan RN, Yang F (2008) Automated Bayesian segmentation of microvascular white-matter lesions in the ACCORD-MIND study. Adv Med Sci 53(2):182–190. https://doi.org/10.2478/v10039-008-0039-3
164. Beare R, Srikanth V, Chen J, Phan TG, Stapleton J, Lipshut R et al (2009) Development and validation of morphological segmentation of age-related cerebral white matter hyperintensities. NeuroImage 47(1):199–203. https://doi.org/10.1016/j.neuroimage.2009.03.055
165. Dyrby TB, Rostrup E, Baaré WFC, Straaten ECW, Barkhof F, Vrenken H et al (2008) Segmentation of age-related white matter changes in a clinical multi-center study. NeuroImage 41(2):335–345. https://doi.org/10.1016/j.neuroimage.2008.02.024
166. Jack CR, O'Brien PC, Rettman DW, Shiung MM, Xu Y, Muthupillai R et al (2001) FLAIR histogram segmentation for measurement of leukoaraiosis volume. J Magn Reson Imaging 14(6):668–676. https://doi.org/10.1002/jmri.10011
167. Admiraal-Behloul F, Heuvel DMJ, Olofsen H, Osch MJP, Grond J, Buchem MA et al (2005) Fully automatic segmentation of white matter hyperintensities in MR images of the elderly. NeuroImage 28(3):607–617. https://doi.org/10.1016/j.neuroimage.2005.06.061
168. Seghier ML, Ramlackhansingh A, Crinion J, Leff AP, Price CJ (2008) Lesion identification using unified segmentation-normalisation models and fuzzy clustering. NeuroImage 41(4):1253–1266. https://doi.org/10.1016/j.neuroimage.2008.03.028
169. Wang Y, Catindig JA, Hilal S, Soon HW, Ting E, Wong TY et al (2012) Multi-stage segmentation of white matter hyperintensity, cortical and lacunar infarcts. NeuroImage 60(4):2379–2388. https://doi.org/10.1016/j.neuroimage.2012.02.034
170. Jeon S, Yoon U, Park J-S, Seo SW, Kim J-H, Kim ST et al (2011) Fully automated pipeline for quantification and localization of white
matter hyperintensity in brain magnetic resonance image. Int J Imaging Syst Technol 21(2):193–200. https://doi.org/10.1002/ima.20277
171. Caligiuri ME, Perrotta P, Augimeri A, Rocca F, Quattrone A, Cherubini A (2015) Automatic detection of white matter hyperintensities in healthy aging and pathology using magnetic resonance imaging: a review. Neuroinformatics 13(3):261–276. https://doi.org/10.1007/s12021-015-9260-y
172. Kuijf HJ, Biesbroek JM, De Bresser J, Heinen R, Andermatt S, Bento M et al (2019) Standardized assessment of automatic segmentation of white matter hyperintensities and results of the WMH segmentation challenge. IEEE Trans Med Imaging 38(11):2556–2568. https://doi.org/10.1109/TMI.2019.2905770
173. Li H, Jiang G, Zhang J, Wang R, Wang Z, Zheng W-S et al (2018) Fully convolutional network ensembles for white matter hyperintensities segmentation in MR images. NeuroImage 183:650–665. https://doi.org/10.1016/j.neuroimage.2018.07.005
174. Andermatt S, Pezold S, Cattin P (2016) Multi-dimensional gated recurrent units for the segmentation of biomedical 3D-data. https://doi.org/10.1007/978-3-319-46976-8_15
175. Ghafoorian M, Karssemeijer N, Heskes T, Uden IWM, Sanchez CI, Litjens G et al (2017) Location sensitive deep convolutional neural networks for segmentation of white matter hyperintensities. Sci Rep 7(1):5110. https://doi.org/10.1038/s41598-017-05300-5
176. Valverde S, Cabezas M, Roura E, González-Villà S, Pareto D, Vilanova JC et al (2017) Improving automated multiple sclerosis lesion segmentation with a cascaded 3D convolutional neural network approach. NeuroImage 155:159–168. https://doi.org/10.1016/j.neuroimage.2017.04.034
177. Zhang Z, Powell K, Yin C, Cao S, Gonzalez D, Hannawi Y et al (2021) Brain atlas guided attention U-net for white matter hyperintensity segmentation. AMIA Jt Summits Transl Sci Proc 2021:663–671
178. Park G, Hong J, Duffy BA, Lee J-M, Kim H (2021) White matter hyperintensities segmentation using the ensemble U-net with
179. … magnetic resonance images in the era of deep learning and big data – a systematic review. Comput Med Imaging Graph 88:101867. https://doi.org/10.1016/j.compmedimag.2021.101867
180. Jiang J, Liu T, Zhu W, Koncz R, Liu H, Lee T et al (2018) UBO detector – a cluster-based, fully automated pipeline for extracting white matter hyperintensities. NeuroImage 174:539–549. https://doi.org/10.1016/j.neuroimage.2018.03.050
181. Sundaresan V, Zamboni G, Le Heron C, Rothwell PM, Husain M, Battaglini M et al (2019) Automated lesion segmentation with BIANCA: impact of population-level features, classification algorithm and locally adaptive thresholding. NeuroImage 202:116056. https://doi.org/10.1016/j.neuroimage.2019.116056
182. Zhan T, Yu R, Zheng Y, Zhan Y, Xiao L, Wei Z (2017) Multimodal spatial-based segmentation framework for white matter lesions in multi-sequence magnetic resonance images. Biomed Signal Process Control 31:52–62. https://doi.org/10.1016/j.bspc.2016.06.016
183. Ding T, Cohen AD, O'Connor EE, Karim HT, Crainiceanu A, Muschelli J et al (2020) An improved algorithm of white matter hyperintensity detection in elderly adults. Neuroimage Clin 25:102151. https://doi.org/10.1016/j.nicl.2019.102151
184. Zhan T, Zhan Y, Liu Z, Xiao L, Wei Z (2015) Automatic method for white matter lesion segmentation based on T1-fluid-attenuated inversion recovery images. IET Comput Vis 9(4):447–455. https://doi.org/10.1049/iet-cvi.2014.0121
185. Valverde S, Oliver A, Roura E, González-Villà S, Pareto D, Vilanova JC et al (2017) Automated tissue segmentation of MR brain images in the presence of white matter lesions. Med Image Anal 35:446–457. https://doi.org/10.1016/j.media.2016.08.014
186. Fiford CM, Sudre CH, Pemberton H, Walsh P, Manning E, Malone IB et al (2020) Automated white matter hyperintensity segmentation using Bayesian model selection: assessment and correlations with cognitive change. Neuroinformatics 18(3):429–449. https://doi.org/10.1007/s12021-019-09439-6
multi-scale highlighting foregrounds. Neuro- 187. Damangir S, Westman E, Simmons A,
Image 237:118140. https://doi.org/10. Vrenken H, Wahlund L-O, Spulber G
1016/j.neuroimage.2021.118140 (2017) Reproducible segmentation of white
179. Balakrishnan R, Valdés Hernández MC, Far- matter hyperintensities using a new statistical
rall AJ (2021) Automatic segmentation of definition. MAGMA 30(3):227–237.
white matter hyperintensities from brain
960 Yannan Yu and David Yen-Ting Chen
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 30
Abstract
Diagnostic imaging is widely used to assess, characterize, and monitor brain tumors. However, there remain
several challenges in each of these categories due to the heterogeneous nature of these tumors. This may
include variations in tumor biology that relate to variable degrees of cellular proliferation, invasion, and
necrosis, which in turn have different imaging manifestations. These variations have created challenges for
tumor assessment, including segmentation, surveillance, and molecular characterization. Although several
rule-based approaches have been implemented that relate to tumor size and appearance, these methods
inherently distill the rich tumor imaging data into a limited number of variables. Approaches in
artificial intelligence, machine learning, and deep learning have been increasingly applied to computer
vision tasks, including tumor imaging, given their effectiveness for solving image-based challenges. The
objective of this chapter is to summarize some of these advances in the field of tumor imaging.
Key words Brain tumors, Radiogenomics, Tumor segmentation, Response Assessment in Neuro-
Oncology (RANO), Response Evaluation Criteria in Solid Tumors (RECIST)
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_30,
© The Author(s) 2023
964 Jennifer Soun et al.
[Figure residue: relative frequencies of primary brain tumor types. Malignant: astrocytomas 20–25%; oligodendrogliomas 1–2%; ependymal tumors <2%; other 8%. Non-malignant: meningiomas 37%; pituitary 16%; nerve sheath 8%; other 7%.]
Fig. 1 Head tilt affecting designation. Patient with glioblastoma after resection. Simulation of tilting the patient’s head up results in apparent progression of disease (a), while routine positioning demonstrates stable disease (b) and tilting downward results in partial response (c)
Fig. 2 Pseudoprogression. Example of a 45-year-old female with GBM. Axial post-contrast images immediately after resection show minimal enhancing disease (a). Follow-up MRI at 1 month demonstrates new thick enhancement (b) that subsequently decreased on images obtained 12 months later (c)
3.3.3 MRI Biomarkers of Tumor Biology and Genetic Heterogeneity
Both spatial and temporal variations in genetic expression result in alterations in tumor biology, including changes in apoptosis, cellular proliferation, cellular invasion, and angiogenesis [30]. In turn,
these biologic changes manifest in the heterogeneous imaging
features of brain tumors, resulting in varying degrees of enhance-
ment and edema. For example, imaging changes on contrast-
enhanced MRI result from the breakdown of the blood-brain
barrier and can demonstrate areas of necrosis as a marker for
apoptosis. Additionally, MRI sequences based on physiology such
as apparent diffusion coefficient (ADC) and perfusion imaging have
been shown to relate to tumoral cellularity and angiogenesis,
respectively. Furthermore, promising efforts have shown that
tumors with lower cerebral blood volume (CBV) on perfusion are
more likely to be IDH mutants and have longer overall survival
(OS) [31, 32]. Other reports have used enhancement patterns and
ADC to predict IDH status with some success [33, 34]. Currently,
efforts to provide molecular classification for brain tumors based on
these MRI features have had mixed results. For example, classifica-
tion of IDH and MGMT mutant status has had some success;
however, methods for 1p19q and EGFR have demonstrated less
reproducibility [35–37]. Different mutations may have similar MRI
AI’s Role in Neuro-Oncology Imaging 969
Fig. 3 Example of automated glioma segmentation using deep learning showing FLAIR edema segmentation
(left) as well as segmentation of enhancing tissue (right). (Courtesy Peter Chang, MD)
4.2 Surveillance
As described previously, psPD cases are not reliably distinguished
from true progression using RANO criteria with a recent meta-
analysis suggesting that upward of 36% are underdiagnosed
[54]. In fact, the only accepted methods to distinguish true PD
from treatment-related psPD are invasive tissue sampling and short
interval clinical follow-up with imaging, which may delay and com-
promise disease management in an aggressive tumor [16, 17].
Traditional machine learning models have been previously uti-
lized for psPD characterization from radiologic imaging. Hu et al.’s
[55] SVM approach examining multi-parametric MRI data yielded
an optimized classifier for psPD with a sensitivity of 89.9% and
specificity of 93.7%. Though deep learning methods have been
leveraged less frequently, they are showing promise for
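The sensitivity and specificity figures quoted above reduce a classifier's confusion matrix to two ratios. A minimal sketch follows; the counts are invented for illustration, not the study's data:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Made-up confusion-matrix counts for a binary psPD-vs-progression
# classifier (illustration only, not data from any cited study).
sens, spec = sensitivity_specificity(tp=89, fn=11, tn=94, fp=6)
print(sens, spec)  # 0.89 0.94
```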
Fig. 4 MRI features separating gliomas by MGMT methylation status. Features predictive of MGMT promoter unmethylated status include thick enhancement with central necrosis (a) and infiltrative edema patterns (b). In contrast, features predictive of MGMT promoter methylated status include nodular and heterogeneous enhancement (c) with masslike FLAIR edema (d). (Copyright American Journal of Neuroradiology, adapted, with permission, from reference [35])
5 Summary
Acknowledgments
References
…malignant glioma can mimic tumor progression. Neurology 63(3):535–537
16. Brandsma D, Stalpers L, Taal W, Sminia P, van den Bent MJ (2008) Clinical features, mechanisms, and management of pseudoprogression in malignant gliomas. Lancet Oncol 9(5):453–461. https://doi.org/10.1016/S1470-2045(08)70125-6
17. Hygino da Cruz LC Jr, Rodriguez I, Domingues RC, Gasparetto EL, Sorensen AG (2011) Pseudoprogression and pseudoresponse: imaging challenges in the assessment of posttreatment glioma. AJNR Am J Neuroradiol 32(11):1978–1985. https://doi.org/10.3174/ajnr.A2397
18. Brandes AA, Franceschi E, Tosoni A et al (2008) MGMT promoter methylation status can predict the incidence and outcome of pseudoprogression after concomitant radiochemotherapy in newly diagnosed glioblastoma patients. J Clin Oncol 26(13):2192–2197. https://doi.org/10.1200/JCO.2007.14.8163
19. Wen PY, Macdonald DR, Reardon DA et al (2010) Updated response assessment criteria for high-grade gliomas: response assessment in neuro-oncology working group. J Clin Oncol 28(11):1963–1972. https://doi.org/10.1200/JCO.2009.26.3541
20. Huang RY, Wen PY (2016) Response assessment in neuro-oncology criteria and clinical endpoints. Magn Reson Imaging Clin N Am 24(4):705–718. https://doi.org/10.1016/j.mric.2016.06.003
21. Huang RY, Neagu MR, Reardon DA, Wen PY (2015) Pitfalls in the neuroimaging of glioblastoma in the era of antiangiogenic and immuno/targeted therapy - detecting illusive disease, defining response. Front Neurol 6:33. https://doi.org/10.3389/fneur.2015.00033
22. Okada H, Weller M, Huang R et al (2015) Immunotherapy response assessment in neuro-oncology: a report of the RANO working group. Lancet Oncol 16(15):e534–e542. https://doi.org/10.1016/S1470-2045(15)00088-1
23. Yan H, Parsons DW, Jin G et al (2009) IDH1 and IDH2 mutations in gliomas. N Engl J Med 360(8):765–773. https://doi.org/10.1056/NEJMoa0808710
24. Louis DN, Perry A, Reifenberger G et al (2016) The 2016 World Health Organization classification of tumors of the central nervous system: a summary. Acta Neuropathol 131(6):803–820. https://doi.org/10.1007/s00401-016-1545-1
25. Bartek J Jr, Ng K, Bartek J, Fischer W, Carter B, Chen CC (2012) Key concepts in glioblastoma therapy. J Neurol Neurosurg Psychiatry 83(7):753–760. https://doi.org/10.1136/jnnp-2011-300709
26. Hegi ME, Diserens AC, Gorlia T et al (2005) MGMT gene silencing and benefit from temozolomide in glioblastoma. N Engl J Med 352(10):997–1003. https://doi.org/10.1056/NEJMoa043331
27. Auffinger B, Thaci B, Nigam P, Rincon E, Cheng Y, Lesniak MS (2012) New therapeutic approaches for malignant glioma: in search of the Rosetta stone. F1000 Med Rep 4:18. https://doi.org/10.3410/M4-18
28. Patel AP, Tirosh I, Trombetta JJ et al (2014) Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344(6190):1396–1401. https://doi.org/10.1126/science.1254257
29. Sottoriva A, Spiteri I, Piccirillo SG et al (2013) Intratumor heterogeneity in human glioblastoma reflects cancer evolutionary dynamics. Proc Natl Acad Sci U S A 110(10):4009–4014. https://doi.org/10.1073/pnas.1219747110
30. Belden CJ, Valdes PA, Ran C et al (2011) Genetics of glioblastoma: a window into its imaging and histopathologic variability. Radiographics 31(6):1717–1740. https://doi.org/10.1148/rg.316115512
31. Kickingereder P, Sahm F, Radbruch A et al (2015) IDH mutation status is associated with a distinct hypoxia/angiogenesis transcriptome signature which is non-invasively predictable with rCBV imaging in human glioma. Sci Rep 5:16238. https://doi.org/10.1038/srep16238
32. Law M, Young RJ, Babb JS et al (2008) Gliomas: predicting time to progression or survival with cerebral blood volume measurements at dynamic susceptibility-weighted contrast-enhanced perfusion MR imaging. Radiology 247(2):490–498. https://doi.org/10.1148/radiol.2472070898
33. Price SJ, Allinson K, Liu H et al (2017) Less invasive phenotype found in isocitrate dehydrogenase-mutated glioblastomas than in isocitrate dehydrogenase wild-type glioblastomas: a diffusion-tensor imaging study. Radiology 283(1):215–221. https://doi.org/10.1148/radiol.2016152679
34. Xiong J, Tan W, Wen J et al (2016) Combination of diffusion tensor imaging and conventional MRI correlates with isocitrate dehydrogenase 1/2 mutations but not
55. Hu X, Wong KK, Young GS, Guo L, Wong ST (2011) Support vector machine multiparametric MRI identification of pseudoprogression from tumor recurrence in patients with resected glioblastoma. J Magn Reson Imaging 33(2):296–305
56. Jang B-S, Jeon SH, Kim IH, Kim IA (2018) Prediction of pseudoprogression versus progression using machine learning algorithm in glioblastoma. Sci Rep 8(1):12516
57. Jang BS, Park AJ, Jeon SH et al (2020) Machine learning model to predict pseudoprogression versus progression in glioblastoma using MRI: a multi-institutional study (KROG 18–07). Cancers (Basel) 12(9):2706. https://doi.org/10.3390/cancers12092706
58. Lee J, Wang N, Turk S et al (2020) Discriminating pseudoprogression and true progression in diffuse infiltrating glioma using multi-parametric MRI data through deep learning. Sci Rep 10(1):20331. https://doi.org/10.1038/s41598-020-77389-0
59. Trivizakis E, Papadakis GZ, Souglakos I et al (2020) Artificial intelligence radiogenomics for advancing precision and effectiveness in oncologic care (review). Int J Oncol 57(1):43–53. https://doi.org/10.3892/ijo.2020.5063
60. Levner I, Drabycz S, Roldan G, De Robles P, Cairncross JG, Mitchell R (2009) Predicting MGMT methylation status of glioblastomas from MRI texture. Med Image Comput Comput Assist Interv 12(Pt 2):522–530. https://doi.org/10.1007/978-3-642-04271-3_64
61. Korfiatis P, Kline TL, Lachance DH, Parney IF, Buckner JC, Erickson BJ (2017) Residual deep convolutional neural network predicts MGMT methylation status. J Digit Imaging 30(5):622–628. https://doi.org/10.1007/s10278-017-0009-z
62. Ryu YJ, Choi SH, Park SJ, Yun TJ, Kim JH, Sohn CH (2014) Glioma: application of whole-tumor texture analysis of diffusion-weighted imaging for the evaluation of tumor heterogeneity. PLoS One 9(9):e108335. https://doi.org/10.1371/journal.pone.0108335
63. Drabycz S, Roldán G, de Robles P et al (2010) An analysis of image texture, tumor location, and MGMT promoter methylation in glioblastoma using magnetic resonance imaging. Neuroimage 49(2):1398–1405. https://doi.org/10.1016/j.neuroimage.2009.09.049
Chapter 31
Abstract
Neurodevelopmental disorders (NDDs) constitute a major health issue, with >10% of the general worldwide population affected by at least one of these conditions—such as autism spectrum disorders (ASD) and attention deficit hyperactivity disorder (ADHD). Each NDD is particularly complex to dissect for several reasons, including a high prevalence of comorbidities and a substantial heterogeneity of the clinical presentation. At the genetic level, several thousand genes have been identified (polygenicity), a number of which are also involved in other psychiatric conditions (pleiotropy). Given these multiple sources of variance, gathering sufficient data for the proper application and evaluation of machine learning (ML) techniques is essential but challenging. In this chapter, we offer an overview of the ML methods most widely used to tackle NDDs’ complexity—from stratification techniques to diagnosis prediction. We point out challenges specific to NDDs, such as early diagnosis, that can benefit from recent advances in the ML field. These techniques also have the potential to delineate homogeneous subgroups of patients, which would enable a refined understanding of the underlying physiopathology. We finally survey a selection of recent papers that we consider particularly representative of the opportunities offered by contemporary ML techniques applied to large open datasets, or that illustrate the challenges to be addressed by current approaches in the near future.
Key words Neurodevelopmental disorders, Autism spectrum disorders, Attention deficit hyperactiv-
ity disorders, Machine learning, Pattern recognition, Classification, Clustering, Stratification
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_31,
© The Author(s) 2023
978 Clara Moreau et al.
Fig. 1 Left: As introduced in Subheading 1, the complexity of NDDs comes from the combination of multiple
sources of heterogeneity acting at different levels and that overlap across conditions as illustrated here with
ASD, ADHD, and intellectual disability (ID). Right: As described in Subheading 2, ML approaches are
instrumental in characterizing and overcoming the heterogeneity at each level with dedicated techniques
2 What Are the Main Challenges in These Conditions That Can Be Addressed Using
Machine Learning?
2.1 The Classical Analysis Approach Failed to Reach Consensus
Historically, the classical analysis approach consisted of designing a study starting from the definition of an “atypical” population of interest, based on particular clinical scores selected among the behavioral assessments used for diagnosis. This population of interest is compared to a group of control subjects on a feature defined a priori, such as “the volume of a specific cortical region
estimated from anatomical MRI.” As extensively described in, e.g.,
[37–39], this corresponds to statistically testing the hypothesis:
Does the atypical population differ, on average, from controls in
the selected feature? Statistically speaking, this amounts to a case-
control study using univariate hypothesis testing for one or a few
features. The large literature of early studies following this approach helped refine the characterization of the different sources of heterogeneity presented above and shed light on the lack of biological validity of categorical representations of NDDs, a lack that manifests in the evolution of the nosology, for instance, in the move from “autism” to “autism spectrum disorders” [3]. However, as
we progressed in our understanding of the interactions between
genetics, biological brain, and behavior, the limits of the group
statistics and univariate approaches became obvious.
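The case-control univariate test described above can be sketched in a few lines. The data are synthetic and stand in for the a-priori feature (e.g., the volume of one cortical region); Welch's t statistic is one common choice, used here purely as an illustration and not the pipeline of any cited study:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic regional volumes (arbitrary units) for controls and an
# "atypical" group; means and spreads are invented for illustration.
controls = rng.normal(10.0, 1.0, 60)
patients = rng.normal(9.5, 1.2, 60)

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

# One univariate statistic summarizing the mean group difference
t = welch_t(controls, patients)
print(round(float(t), 2))
```

Everything the critique above targets is visible here: a single feature, a single group-mean contrast, and no account of correlations with other regions or of within-group variability.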
2.1.1 Limitations of Classical Univariate Analysis Techniques
The univariate approach is prevalent in the literature for historical reasons. It relies on the implicit assumption that different brain regions and/or different features are independent, while more and
more evidence supports the opposite view: effects are spread across
several brain regions, possibly located far from each other. Knowing
the various sources of variance in NDDs’ data described earlier, it is
unlikely that a single feature may capture a large portion of that
variation and thus be interpreted in terms of underlying biological
processes. It is thus not surprising that the effect sizes reported in
meta-analyses remain small. In addition to potentially reduced
2.1.2 Limitations of Group Statistics
As extensively discussed in [41], group statistics all focus on first-order statistics (group means), thereby seeking a pattern of atypicality that is consistent across the population (i.e., the “average patient”). Indeed, mean group differences may reflect a systematic
shift in the distribution of the clinical group and thus provide useful
information on altered processes in that population. However,
those differences do not delineate variability within groups
[38]. In addition, the evolution of the DSM by regrouping condi-
tions that were considered in previous versions as distinct (e.g.,
Asperger and pervasive developmental disorders not otherwise spe-
cified) induced an increase in the heterogeneity of the populations
included in studies on ASD [37]. Group comparisons based on
diagnosis thus present the major caveat of ignoring psychiatric
comorbidities, which are common in NDDs. It becomes
obvious that group statistics applied to populations defined based
on diagnostic categories are inadequate. Indeed, categorical diag-
noses from the DSM are increasingly found to be incongruent with
emerging neuroscientific evidence that points toward shared neu-
robiological dysfunction underlying NDDs [42]. See, e.g., [39] for
extensive discussions on the limitations of the diagnostic-first
approach in comparison to the alternative strategy that begins at
the level of molecular factors enabling the study of mechanisms
related to biological risk, irrespective of diagnoses or clinical
manifestations.
The combination of univariate statistics and mean group difference analysis applied to heterogeneous populations with small sample sizes resulted in highly inconsistent findings: most published findings were not replicated.
The recent challenge [43] further illustrates the intrinsic limitation
of the group statistics framework, but also that state-of-the-art ML
techniques do not systematically outperform classical approaches in
such a binary classification task. In this context, deep learning
techniques were prone to overfitting with poor generalization to unseen datasets, while simpler approaches had a stable prediction
performance when applied to new data. It is important to stress that
several limitations from this early literature do fully apply to more
advanced ML techniques and/or multivariate data analysis strate-
gies. While the problem of inflated false discovery rate in univariate
analysis framework has been extensively discussed [40], the pro-
blems related to the improper evaluation and validation of ML
Neurodevelopmental Disorders 985
2.2 Promises of ML in NDDs
The rise of big data and the sustained advances in ML enable in principle the integration of various and heterogeneous characteristics such as behavioral profiles, imaging phenotypes, and genomics. The extraction and manual construction of features from each
data type, also termed feature engineering, has undergone continuous progress in tight relation with innovations in the acquisition
processes. As an illustration, the imaging phenotype today covers a
wide range of features extracted mainly from MRI data. For
instance, a variety of measures can be extracted from diffusion-
weighted imaging [53], from basic estimation in each voxel such
as the fractional anisotropy to higher-level connectivity measures in
each anatomically defined fiber tract, or even connections between
distant anatomical regions (structural connectivity). On the genet-
ics side, polygenic risk scores (PRS) are additive models developed
to estimate the aggregate effects of thousands of common variants
with very small individual effects. They can be computed for any
individual to estimate the risk/probability for a particular trait
conferred by common variants [54]. Feature engineering is a cru-
cial step in the analysis since the biological relevance of the features
directly impacts the interpretation, and the strategy used to manage
potential interaction across different features might determine the
performance of the analysis procedure more than the ML algorithm
itself. In parallel, the increase in the size of the available data enables
the training of more complex algorithms, making it possible to
investigate central questions related to the dynamics of normal
and abnormal development by means of advanced ML techniques.
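The additive PRS model mentioned above can be sketched in a few lines: a weighted sum of allele dosages (0, 1, or 2 copies of the effect allele) across common variants, with weights taken from GWAS effect sizes. The variant IDs, effect sizes, and genotypes below are invented for illustration:

```python
# Invented GWAS effect sizes (log-odds per effect allele) for three
# hypothetical variants; a real PRS sums over thousands of variants.
effect_sizes = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.03}

def polygenic_score(dosages):
    """Additive PRS: sum of effect sizes weighted by allele dosage."""
    return sum(beta * dosages.get(snp, 0) for snp, beta in effect_sizes.items())

# One individual's allele dosages at the scored variants
score = polygenic_score({"rs0001": 2, "rs0002": 1, "rs0003": 0})
print(round(score, 2))  # 0.19
```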
2.4 Latent Space Decomposition and Clustering: Unsupervised Learning for NDDs
Following the progressive confirmation of the inadequacy of mutually exclusive diagnostic categories, behavioral assessments for quantifying ASD traits in any given individual were introduced, such as the autism spectrum quotient questionnaire [61] and the Social Responsiveness Scale (SRS) [62]. A number of studies used
these scores to demonstrate that ASD traits are also present in the
typically developing population as well as in other NDDs such as
ADHD [63]. These studies supported the view of a continuum
across NDDs and emphasized the need for novel approaches to
identify general psychopathology dimensions that cut through
diagnostic boundaries. Such data-driven dimensions would ulti-
mately enable the identification of new targets for treatment devel-
opment and to stratify the NDDs in subgroups more appropriate
for treatment selection [58, 59, 64]. Uncovering the hidden intrin-
sic structure in the data is a well-known ML problem that has been
formulated as unsupervised learning in opposition to supervised
learning tasks such as classification where the algorithm learns to
predict a label based on a training set for which the true label is
known (see Chapters 1 and 2 or, e.g., [65]). Unsupervised ML
techniques consist in fitting a statistical model to the data by
implementing specific assumptions regarding the relationships
between the input features and on the supposed hidden structure.
An assumption common to all unsupervised techniques is that there exists a non-negligible degree of correlation across some of the
features in the actual data, which justifies the search for a more
compact optimal representation. Depending on the assumptions
regarding the hidden structure to discover, unsupervised techni-
ques can be divided into two classes: latent space decomposition
and clustering. Latent space decomposition techniques aim at pro-
jecting the data onto a new feature space of lower dimension in
which a large portion of the variance can be explained by a few
factors. The underlying assumption is that the projected features
vary continuously along the axes of this compact subspace. In
contrast, clustering techniques seek to partition the data into dis-
tinct groups (often termed as population stratification) so that the
observations within each group are similar to each other, while
observations in different groups differ from each other. The under-
lying assumption is thus that a categorical representation is more
appropriate than in the case of the latent space decomposition
approach. In contrast with the classification task, the algorithm is
2.5 Normative Modeling for NDDs
Normative modeling recently gained great interest in the context of psychiatry, and the first applications to NDDs confirm the particular relevance of this approach. Marquand et al. [75]
introduced normative modeling as an alternative to clustering for
parsing heterogeneity across the full range of population variation,
i.e., spanning both clinical and healthy cohorts. In the approach
We refer to the recent reviews [31, 32, 34, 51, 55, 69, 71], for a
complete overview of the literature of the field. Here, we survey a
selection of very recent works that we consider particularly relevant
with respect to the opportunities offered by recent ML techniques
applied to large open datasets, or that illustrate the challenges faced
by current approaches, to be addressed in the near future.
3.2 Latent Space Decomposition and Subtyping Approaches Applied to NDDs
Complementary works aim to address clinical and biological heterogeneity in NDDs using a subtyping approach based on imaging data. Using hierarchical clustering methods on neuroanatomical data, Hong and colleagues [95] identified three distinct
morphometric subtypes in ASD: ASD-I characterized by cortical
thickening, increased surface area, and tissue blurring; ASD-II with
3.3 Normative Modeling
In [104], the authors applied normative modeling to a large sample of ASD and control males covering a wide age range (5–40 years).
They investigated the potential of age-related effects on cortical
thickness to serve as an individualized metric of atypicality in indi-
viduals with ASD. They reported that only a small subgroup of
patients showed age-atypical cortical thickness. By comparing with
conventional case-control analyses, they observed that most case-
control differences were driven by a small subgroup of patients with
high atypicality for their age. Highly consistent results were
obtained in another application of normative modeling to a differ-
ent ASD cohort [105], despite important variations across these
studies. The population of the second work was composed of both
males and females, and sex was included as a factor in the normative
model. In addition, the normative models were estimated using
different approaches (non-parametric regression in [104], Gaussian
process regression in [105]). The overall consistent results despite
the methodological differences support the relevance of the nor-
mative modeling approach for NDDs. In a follow-up study, [106]
applied the spectral clustering technique to the atypicality maps
computed at the individual level as deviation in the cortical thick-
ness with respect to the normative model estimated in [105]. They
identified five subtypes of individuals with ASD and assessed their
separability using a multi-class linear SVM. Each subpopulation was
then characterized in terms of demographic and clinical measures as
well as association with polygenic scores for seven traits (autism,
ADHD, epilepsy, full IQ, neuroticism, schizophrenia, and cross-
disorder risk for psychiatric disorders). Importantly, they observed
striking differences in the spatial patterns of cortical thickness atyp-
icality maps between subtypes: three clusters showed reduced cor-
tical thickness relative to the normative pattern, whereas two
clusters showed an increased cortical thickness. These distinct and
opposing atypicalities across different subtypes could explain the
inconsistency in the previous case-control analyses. A final study
applied normative modeling to an adult population of patients with
ADHD [107]. The authors estimated a normative model predicting
regional gray and white matter volumes across the brain from
age and sex. They observed deviations shared across patients in gray
matter of the cerebellum, temporal regions, and the hippocampus.
They also quantified the inter-individual variation among patients
with ADHD, reporting extreme deviations in specific
regions in more than 2% of the participants. Overall, these results
highlighted the relevance of the normative modeling approach to
understanding the heterogeneity in NDDs.
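The normative modeling workflow shared by these studies can be sketched in a few lines. This is a deliberately minimal illustration using a linear fit on simulated data, not the non-parametric or Gaussian process regressions used in [104, 105]; all values and variable names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated reference cohort: age (years) vs. mean cortical thickness (mm).
age = rng.uniform(5, 40, size=500)
thickness = 3.0 - 0.01 * age + rng.normal(0.0, 0.1, size=500)

# Normative model: expected thickness as a function of age,
# here a simple linear fit (published models are far richer).
slope, intercept = np.polyfit(age, thickness, deg=1)
residual_sd = np.std(thickness - (slope * age + intercept))

def atypicality(subject_age, subject_thickness):
    """Deviation from the normative prediction, in standard-deviation units."""
    expected = slope * subject_age + intercept
    return (subject_thickness - expected) / residual_sd

# A subject whose cortical thickness lies far below the normative curve
# is flagged as showing an extreme deviation (|z| > 2).
z = atypicality(20.0, 2.4)
print(f"z = {z:.1f}, extreme = {abs(z) > 2.0}")
```

In [106], spectral clustering was then applied to such per-region deviation maps rather than to raw measures; the sketch above only illustrates how a single deviation score is obtained.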
presence of a genetic risk factor for NDDs allows for the investiga-
tion of pathways related to a particular biological risk for psychiatric
symptoms (bottom-up approach). Clinical routine with genomic
microarrays revealed that copy number variants are present in
10–15% of children with neurodevelopmental conditions
[108]. Genetic-first approaches can however only be applied to a
few recurrent pathogenic mutations frequent enough to establish a
case-control study design. Thus, the effect of the vast majority of
rare deleterious risk variants remains undocumented. Because a
highly diverse landscape of rare variants confers a higher risk to a
spectrum of NDDs, studies focusing on individual mutations will
not be able to properly disentangle the relationship between muta-
tions, molecular mechanisms, and diagnoses. Huguet and collea-
gues [109] speculated that large effect size pathogenic deletions
may be attributable to the sum of individual effects of genes
encompassed in each copy number variation. They introduced a
new framework to estimate the effect of any pathogenic deletion on
intelligence quotient (IQ). Using several types of functional anno-
tations of rare genetic deletions associated with NDDs, the pro-
posed framework predicted their impact on IQ with 76% accuracy
[109]. They showed that haploinsufficiency scores—probability of
being loss of function intolerant (pLI)—best explain the cognitive
deficits. Follow-up works specifically on ASD confirmed that this
score was the best predictor of IQ deficit and autism risk (odds
ratio) [110, 111]. Each deleted point of pLI was associated with a
decrease of 2.6 IQ points in autism.
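The additive logic of this framework can be illustrated with a toy calculation. The gene names and pLI values below are invented; only the slope of 2.6 IQ points per point of deleted pLI comes from the findings cited above:

```python
# Invented deletion spanning three genes, each with a hypothetical pLI score.
deleted_genes = {"GENE_A": 0.95, "GENE_B": 0.10, "GENE_C": 0.80}

IQ_POINTS_PER_PLI = 2.6  # slope reported for autism [110, 111]

# The framework of [109] models a deletion's impact as the sum of the
# effects of the individual genes it encompasses.
total_pli = sum(deleted_genes.values())
predicted_iq_loss = total_pli * IQ_POINTS_PER_PLI

print(f"summed pLI = {total_pli:.2f}")                        # 1.85
print(f"predicted IQ loss = {predicted_iq_loss:.1f} points")  # 4.8
```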
3.5 Deep Learning Applied to NDDs

A deep learning-based framework has recently been introduced to
predict the regulatory contribution of non-coding mutations to
autism [112]. The authors constructed a deep convolutional network
to model the functional impact of each individual mutation (single
nucleotide polymorphism). They first identified that ASD probands
(n = 1700 families) were carriers of a higher rate of transcriptional
and post-transcriptional regulation disrupting de novo mutations
compared with their siblings. They also revealed a convergent
pattern of coding and non-coding mutations.
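The idea of scoring individual variants with a convolutional model can be conveyed by a toy example. The filter, motif, and sequences below are invented, and a single hand-set filter stands in for the thousands of learned filters in a trained network such as the one in [112]:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a 4 x L one-hot matrix."""
    mat = np.zeros((4, len(seq)))
    for i, base in enumerate(seq):
        mat[BASES.index(base), i] = 1.0
    return mat

# One invented convolutional filter acting as a motif detector (width 3).
motif_filter = one_hot("TAT")  # responds maximally to the motif "TAT"

def conv_score(seq):
    """Max response of the filter along the sequence (valid convolution)."""
    x = one_hot(seq)
    width = motif_filter.shape[1]
    return max(float(np.sum(x[:, i:i + width] * motif_filter))
               for i in range(x.shape[1] - width + 1))

# The effect of a single-nucleotide mutation is the change in predicted
# activity between the reference and mutated sequences (both invented).
ref, alt = "GGCACG", "GGTATG"
delta = conv_score(alt) - conv_score(ref)
print(conv_score(ref), conv_score(alt), delta)  # 1.0 3.0 2.0
```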
In [113], the authors analyzed resting-state fMRI (RS-fMRI)
data from 260 subjects with ADHD and 343 healthy controls from
the ADHD-200 database. They proposed to represent RS-fMRI
data from each individual as a graph that integrates both temporal
and spatial correlation of regional time-series signals. An original
graph convolutional neural network architecture was introduced to
characterize the brain functional connectome. The model also
included seven non-imaging variables (age, gender, handedness,
IQ measurement, and three Wechsler Intelligence Scale evaluation
IQ variables) and was trained to distinguish ADHD patients from
HC. Several experiments showed a performance gain compared to
previous methods including SVM, logistic regression, and conven-
tional graph convolutional networks. The proposed method
3.6 Discussion

The review of selected recent studies presented above demonstrates
that the application of ML to NDDs is a very active field of
research with encouraging perspectives. The field benefits
directly from initiatives to openly share data [114], which have
increased the sample sizes involved across studies and favored the
engagement of ML scientists. The paradigm shift from diagnosis-
first to genetic-first approaches, and from one diagnosis at a time to
cross-diagnosis analyses, is underway, with a clear rise of large-scale
studies based on normative modeling and deep learning. Methodological
works continue to introduce innovative ML
approaches specifically designed to address the central tasks in
NDDs. Importantly, the adoption of best practices for the valida-
tion and replication of the results across independent datasets as
stated in [46] is clearly encouraged by the recent reviews [31, 32,
34, 51, 55, 69, 71]. However, the validation is limited by insuffi-
cient access to large enough datasets combining multiscale data
(genetics, transcriptomic, proteomic, metabolomic, neuroimaging
features, phenomics). There is no open dataset so far offering that
level of granularity. Indeed, the imaging field is just reaching the
sample size allowing for running modern ML techniques for some
but not all modalities. For instance, large-scale studies involving
diffusion-weighted imaging are clearly lacking in NDDs, probably
due to insufficient access to appropriate data. The genomic field is
not ready yet, and several domains remain relatively new (e.g., the
first genome was sequenced in 2000, and next-generation sequencing
techniques emerged around 2010) and expensive (e.g., RNA-Seq data)
[115, 116]. In the near future, such data will offer massive
potential for accurate classification and appropriate validation.
4.1 Potential Bias in Data and Processing Pipelines

Despite the large number of new approaches released in the recent
literature, some potential biases in analysis pipelines should be
mentioned. For instance, the analysis of functional networks computed
from RS-fMRI relies on a complex succession of processing
steps. Several of these processing steps actually correspond to
implementing assumptions regarding the data. However, the valid-
ity of these assumptions and their influence on the subsequent
results are not sufficiently discussed in the literature. See, for
instance, [117] for a quantitative evaluation of the impact of the
brain parcellation procedure on functional connectivity analyses.
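The sensitivity to parcellation choice can be made concrete with synthetic data. In this sketch (simulated signals, invented parcel boundaries), two parcellations of the same six "voxels" yield very different connectivity estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated time series for 6 "voxels" over 200 time points:
# voxels 0-2 share one latent signal, voxels 3-5 another.
latent = rng.normal(size=(2, 200))
voxels = np.vstack([latent[0]] * 3 + [latent[1]] * 3) \
         + 0.5 * rng.normal(size=(6, 200))

def connectivity(parcel_a, parcel_b):
    """Correlation between the mean signals of two parcels of voxel indices."""
    a = voxels[parcel_a].mean(axis=0)
    b = voxels[parcel_b].mean(axis=0)
    return float(np.corrcoef(a, b)[0, 1])

# Parcellation A respects the true structure; parcellation B straddles it,
# mixing both latent signals into each parcel.
r_aligned = connectivity([0, 1, 2], [3, 4, 5])
r_straddling = connectivity([0, 1, 3], [2, 4, 5])
print(round(r_aligned, 2), round(r_straddling, 2))
```

The straddling parcellation reports a strong "connection" that is purely an artifact of where the parcel boundaries were drawn, which is the kind of effect quantified at scale in [117].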
Another major barrier to reproducibility is the lack of compatibility
among programming languages, software versions, and operating
systems as illustrated in [118]. This report highlights the challenges
and potential solutions to be implemented at both the individual
researcher and community levels in order to enable the appropriate
reuse of published methods.
On the data side, the limitations related to the absence of
recording of potentially influencing factors are not sufficiently
investigated and acknowledged. As pointed out, e.g., in [119]: “The
extent of brain differences in disease may depend critically on a
patient’s age, duration of illness, course of treatment, as well as
adherence to the treatment, polypharmacy and other unmeasured
factors. Differences in ancestral background, as determined based
on genotype, are strongly related to systematic differences in brain
shape. Any realistic understanding of the brain imaging measures
must take all these into account, as well as acknowledge the exis-
tence of causal factors perhaps not yet known or even imagined.” As
a concrete illustration, [120, 121] recently reported significant
alterations in brain morphometry induced by prematurity, a factor
that was not considered by any of the studies we reviewed here.
Such uncontrolled factors might introduce considerable bias in the
learning process. The ML research field has identified this pitfall,
and several solutions to prevent unexpected implications in clinical
applications are actively debated [122–124].
4.2 Interpretability and Biological Substrates

Even in the absence of bias, the interpretation of the outcome of
any ML algorithm in the context of clinical application represents a
critical challenge. Beyond the level of raw performance, the level
of expertise required from medical doctors in (1) the recording and
(2) the analysis of the data compared to “expertise-free” raw data is
Acknowledgments
References
1. American Psychiatric Association (2013) The diagnostic and statistical manual of mental disorders: DSM 5. American Psychiatric Publishing, Arlington, VA
2. Jacob S, Wolff JJ, Steinbach MS, Doyle CB, Kumar V, Elison JT (2019) Neurodevelopmental heterogeneity and computational approaches for understanding autism. Transl Psychiatry 9(1):1–12. https://doi.org/10.1038/s41398-019-0390-0
3. Lombardo MV, Lai M-C, Baron-Cohen S (2019) Big data approaches to decomposing heterogeneity across the autism spectrum. Mol Psychiatry 24(10):1435. https://doi.org/10.1038/s41380-018-0321-0
4. Hyman SE (2007) Can neuroscience be integrated into the DSM-V? Nat Rev Neurosci 8(9):725–732. https://doi.org/10.1038/nrn2218
5. Bourgeron T (2015) What do we know about early onset neurodevelopmental disorders? https://doi.org/10.7551/mitpress/9780262029865.003.0005
6. Joshi G et al (2010) The heavy burden of psychiatric comorbidity in youth with autism spectrum disorders: a large comparative study of a psychiatrically referred population. J Autism Dev Disord 40(11):1361–1370. https://doi.org/10.1007/s10803-010-0996-9
7. Lai M-C, Lombardo MV, Baron-Cohen S (2014) Autism. Lancet 383(9920):896–910. https://doi.org/10.1016/S0140-6736(13)61539-1
8. Anttila V et al (2018) Analysis of shared heritability in common disorders of the brain. Science 360(6395):eaap8757. https://doi.org/10.1126/science.aap8757
9. Siugzdaite R, Bathelt J, Holmes J, Astle DE (2020) Transdiagnostic brain mapping in developmental disorders. Curr Biol 30(7):1245–1257.e4. https://doi.org/10.1016/j.cub.2020.01.078
10. Leblond CS et al (2021) Operative list of genes associated with autism and neurodevelopmental disorders based on database review. Mol Cell Neurosci 113:103623. https://doi.org/10.1016/j.mcn.2021.103623
11. Mills KL et al (2021) Inter-individual variability in structural brain development from late childhood to young adulthood. NeuroImage 242:118450. https://doi.org/10.1016/j.neuroimage.2021.118450
12. Brown TT (2017) Individual differences in human brain development. Wiley Interdiscip Rev Cogn Sci 8(1–2):e1389. https://doi.org/10.1002/wcs.1389
13. Brown TT et al (2012) Neuroanatomical assessment of biological maturity. Curr Biol 22(18):1693–1698. https://doi.org/10.1016/j.cub.2012.07.002
14. Thompson DK et al (2020) Tracking regional brain growth up to age 13 in children born term and very preterm. Nat Commun 11(1):696. https://doi.org/10.1038/s41467-020-14334-9
15. Witvliet D et al (2021) Connectomes across development reveal principles of brain maturation. Nature 596(7871):257–261. https://doi.org/10.1038/s41586-021-03778-8
16. Fjell AM et al (2019) Continuity and discontinuity in human cortical development and change from embryonic stages to old age. Cereb Cortex 29(9):3879–3890. https://doi.org/10.1093/cercor/bhy266
17. Reh RK et al (2020) Critical period regulation across multiple timescales. Proc Natl Acad Sci 117(38):23242–23251. https://doi.org/10.1073/pnas.1820836117
18. Rudel RG (1981) Residual effects of childhood reading disabilities. Bull Orton Soc 31:89–102
19. Patel Y et al (2020) Virtual histology of cortical thickness and shared neurobiology in 6 psychiatric disorders. JAMA Psychiatry. https://doi.org/10.1001/jamapsychiatry.2020.2694
20. Casey BJ et al (2018) The Adolescent Brain Cognitive Development (ABCD) study: imaging acquisition across 21 sites. Dev Cogn Neurosci 32(January):43–54. https://doi.org/10.1016/j.dcn.2018.03.001
21. Di Martino A et al (2014) The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol Psychiatry 19(April):659–667. https://doi.org/10.1038/mp.2013.78
22. Loth E et al (2017) The EU-AIMS longitudinal European Autism Project (LEAP): design and methodologies to identify and validate stratification biomarkers for autism spectrum disorders. Mol Autism 8(1):24. https://doi.org/10.1186/s13229-017-0146-8
23. Bellec P, Chu C, Chouinard-Decorte F, Benhajali Y, Margulies DS, Craddock RC (2017) The neuro bureau ADHD-200 preprocessed repository. NeuroImage 144:275–286. https://doi.org/10.1016/j.neuroimage.2016.06.034
24. May T et al (2018) Trends in the overlap of autism spectrum disorder and attention deficit hyperactivity disorder: prevalence, clinical management, language and genetics. Curr Dev Disord Rep 5(1):49–57. https://doi.org/10.1007/s40474-018-0131-8
25. Mansour R, Dovi AT, Lane DM, Loveland KA, Pearson DA (2017) ADHD severity as it relates to comorbid psychiatric symptomatology in children with autism spectrum disorders (ASD). Res Dev Disabil 60:52–64. https://doi.org/10.1016/j.ridd.2016.11.009
26. Hoogman M et al (2022) Consortium neuroscience of attention deficit/hyperactivity disorder and autism spectrum disorder: the ENIGMA adventure. Hum Brain Mapp 43(1):37. https://doi.org/10.1002/hbm.25029
27. Fombonne E (1999) The epidemiology of autism: a review. Psychol Med 29(4):769–786. https://doi.org/10.1017/S0033291799008508
28. Bourgeron T (2015) From the genetic architecture to synaptic plasticity in autism spectrum disorder. Nat Rev Neurosci 16(9):551–563. https://doi.org/10.1038/nrn3992
29. Lord C et al (2020) Autism spectrum disorder. Nat Rev Dis Primer 6(1). https://doi.org/10.1038/s41572-019-0138-4
30. Thapar A, Cooper M (2016) Attention deficit hyperactivity disorder. Lancet 387(10024):1240–1250. https://doi.org/10.1016/S0140-6736(15)00238-X
31. Wolfers T et al (2019) From pattern classification to stratification: towards conceptualizing the heterogeneity of autism spectrum disorder. Neurosci Biobehav Rev 104(April):240–254. https://doi.org/10.1016/j.neubiorev.2019.07.010
32. Xu M, Calhoun V, Jiang R, Yan W, Sui J (2021) Brain imaging-based machine learning in autism spectrum disorder: methods and applications. J Neurosci Methods 361:109271. https://doi.org/10.1016/j.jneumeth.2021.109271
33. Hiremath CS et al (2021) Emerging behavioral and neuroimaging biomarkers for early and accurate characterization of autism spectrum disorders: a systematic review. Transl Psychiatry 11(1):1–12. https://doi.org/10.1038/s41398-020-01178-6
34. Bzdok D, Meyer-Lindenberg A (2018) Machine learning for precision psychiatry: opportunities and challenges. Biol Psychiatry Cogn Neurosci Neuroimaging 3(3):223–230. https://doi.org/10.1016/j.bpsc.2017.11.007
35. Hyde KK et al (2019) Applications of supervised machine learning in autism spectrum disorder research: a review. Rev J Autism Dev Disord 6(2):128–146. https://doi.org/10.1007/s40489-019-00158-x
36. Eslami T, Almuqhim F, Raiker JS, Saeed F (2021) Machine learning methods for diagnosing autism spectrum disorder and attention-deficit/hyperactivity disorder using functional and structural MRI: a survey. Front Neuroinform 14. Accessed 21 Jan 2022. [Online]. Available: https://www.frontiersin.org/article/10.3389/fninf.2020.575999
37. Mottron L, Bzdok D (2020) Autism spectrum heterogeneity: fact or artifact? Mol Psychiatry 25(12):3178–3185. https://doi.org/10.1038/s41380-020-0748-y
38. Loth E et al (2021) The meaning of significant mean group differences for biomarker discovery. PLoS Comput Biol 17(11):e1009477. https://doi.org/10.1371/journal.pcbi.1009477
39. Moreau CA, Raznahan A, Bellec P, Chakravarty M, Thompson PM, Jacquemont S (2021) Dissecting autism and schizophrenia through neuroimaging genomics. Brain 144:1943. https://doi.org/10.1093/brain/awab096
40. Simmons JP, Nelson LD, Simonsohn U (2011) False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci 22(11):1359–1366. https://doi.org/10.1177/0956797611417632
106. Zabihi M et al (2020) Fractionating autism based on neuroanatomical normative modeling. Transl Psychiatry 10(1):384. https://doi.org/10.1038/s41398-020-01057-0
107. Wolfers T, Beckmann CF, Hoogman M, Buitelaar JK, Franke B, Marquand AF (2020) Individual differences v. the average patient: mapping the heterogeneity in ADHD using normative models. Psychol Med 50(2):314–323. https://doi.org/10.1017/S0033291719000084
108. Kearney HM, Thorland EC, Brown KK, Quintero-Rivera F, South ST (2011) American College of Medical Genetics standards and guidelines for interpretation and reporting of postnatal constitutional copy number variants. Genet Med 13(7):680–685. https://doi.org/10.1097/GIM.0b013e3182217a3a
109. Huguet G et al (2018) Measuring and estimating the effect sizes of copy number variants on general intelligence in community-based samples. JAMA Psychiatry 75(5):447–457. https://doi.org/10.1001/jamapsychiatry.2018.0039
110. Douard E et al (2020) Effect sizes of deletions and duplications on autism risk across the genome. Am J Psychiatry 178(1):87–98. https://doi.org/10.1176/appi.ajp.2020.19080834
111. Huguet G et al (2021) Genome-wide analysis of gene dosage in 24,092 individuals estimates that 10,000 genes modulate cognitive ability. Mol Psychiatry 26(6):2663. https://doi.org/10.1038/s41380-020-00985-z
112. Zhou J et al (2019) Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat Genet 51(6):973. https://doi.org/10.1038/s41588-019-0420-0
113. Zhao K, Duka B, Xie H, Oathes DJ, Calhoun V, Zhang Y (2022) A dynamic graph convolutional neural network framework reveals new insights into connectome dysfunctions in ADHD. NeuroImage 246:118774. https://doi.org/10.1016/j.neuroimage.2021.118774
114. Milham MP et al (2018) Assessment of the impact of shared brain imaging data on the scientific literature. Nat Commun 9(1):2818. https://doi.org/10.1038/s41467-018-04976-1
115. Tărlungeanu DC, Novarino G (2018) Genomics in neurodevelopmental disorders: an avenue to personalized medicine. Exp Mol Med 50(8):1. https://doi.org/10.1038/s12276-018-0129-7
116. Dias CM, Walsh CA (2020) Recent advances in understanding the genetic architecture of autism. Annu Rev Genomics Hum Genet 21(1):289–304. https://doi.org/10.1146/annurev-genom-121219-082309
117. Bryce NV et al (2021) Brain parcellation selection: an overlooked decision point with meaningful effects on individual differences in resting-state functional connectivity. NeuroImage 243:118487. https://doi.org/10.1016/j.neuroimage.2021.118487
118. Kim Y-M, Poline J-B, Dumas G (2018) Experimenting with reproducibility: a case study of robustness in bioinformatics. GigaScience 7(7):giy077. https://doi.org/10.1093/gigascience/giy077
119. Thompson PM et al (2017) ENIGMA and the individual: predicting factors that affect the brain in 35 countries worldwide. NeuroImage 145:389–408. https://doi.org/10.1016/j.neuroimage.2015.11.057
120. Dimitrova R et al (2020) Phenotyping the preterm brain: characterising individual deviations from normative volumetric development in two large infant cohorts. Neuroscience, preprint. https://doi.org/10.1101/2020.08.05.228700
121. Dimitrova R et al (2021) Preterm birth alters the development of cortical microstructure and morphology at term-equivalent age. NeuroImage 243:118488. https://doi.org/10.1016/j.neuroimage.2021.118488
122. Caton S, Haas C (2020) Fairness in machine learning: a survey. ArXiv201004053 Cs Stat. Accessed 07 Mar 2022. [Online]. Available: http://arxiv.org/abs/2010.04053
123. Mhasawade V, Zhao Y, Chunara R (2021) Machine learning and algorithmic fairness in public and population health. Nat Mach Intell 3(8):659. https://doi.org/10.1038/s42256-021-00373-4
124. Lee EE et al (2021) Artificial intelligence for mental health care: clinical applications, barriers, facilitators, and artificial wisdom. Biol Psychiatry Cogn Neurosci Neuroimaging 6(9):856–864. https://doi.org/10.1016/j.bpsc.2021.02.001
125. Beaulieu-Jones BK et al (2021) Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians? Npj Digit Med 4(1):1–6. https://doi.org/10.1038/s41746-021-00426-3
126. Boscolo Galazzo I et al (2022) Explainable artificial intelligence for magnetic resonance imaging aging brainprints: grounds and challenges. IEEE Signal Process Mag 39(2):
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Chapter 32
Abstract
Psychiatric disorders include a broad panel of heterogeneous conditions. Among the most severe psychiatric
diseases, in intensity and incidence, depression will affect 15–20% of the population in their lifetime,
schizophrenia 0.7–1%, and bipolar disorder 1–2.5%. Today, the diagnosis is solely based on clinical evalua-
tion, causing major issues since it is subjective and as different diseases can present similar symptoms. These
limitations in diagnosis lead to limitations in the classification of psychiatric diseases and treatments. There
is therefore a great need for new biomarkers, usable at an individual level. Among them, magnetic resonance
imaging (MRI) allows to measure potential brain abnormalities in patients with psychiatric disorders. This
creates datasets with high dimensionality and very subtle variations between healthy subjects and patients,
making machine and statistical learning ideal tools to extract biomarkers from these data. Machine learning
offers several approaches that could help tackle these issues. On the one hand, supervised learning can
support automated classification between different psychiatric conditions. On the other hand, unsupervised
learning could allow the identification of new homogeneous subgroups of patients, refining our under-
standing of the classification of these disorders. In this chapter, we will review current research applying
machine learning tools to brain imaging in psychiatry, and we will discuss its interest, limitations, and future
applications.
Key words Psychiatry, Depression, Schizophrenia, Bipolar disorder, Machine learning, Artificial
intelligence, Neuroimaging, MRI, Clustering, Classification
1 Introduction
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_32,
© The Author(s) 2023
1010 Ivan Brossollet et al.
1.1 Major Depressive Disorder

Major depressive disorder (MDD) is defined by the occurrence of
one or more major depressive episodes without any manic or
hypomanic episodes in the lifetime. Its prevalence can vary significantly
according to the studies, but exceeds 15% of the population during
their lifetime [1], and affects twice as many women as men. Depression
can affect people at any time during their life [2]. Nowadays, the
diagnosis is based on structured interviews, and the clinical criteria
are given by, among others, two classification manuals: the Interna-
tional Classification of Diseases [1] and the fifth edition of the
Diagnostic and Statistical Manual of Mental Disorders (DSM-5)
[3]. According to the DSM-5, to meet the criteria for a major
depressive episode, five of the nine following symptoms must be
present over a 2-week period: depressed mood or anhedonia (loss
of interest or pleasure), change in weight or appetite, sleep distur-
bances (insomnia or hypersomnia), psychomotor retardation or
restlessness, loss of energy or fatigue, low self-esteem or guilt,
difficulty in concentrating or indecisiveness, and thoughts of
death or suicidal thoughts. Patients with MDD are at an increased
risk of other comorbid disorders. Most commonly, they may pres-
ent alcohol abuse or dependence, anxiety disorders such as panic
disorder, obsessive-compulsive disorder, and generalized anxiety
disorder. Treatment options for MDD include a variable combina-
tion of pharmacotherapy (antidepressants such as serotonin selec-
tive reuptake inhibitors or tricyclics) and psychotherapy (cognitive
behavioral therapy, interpersonal therapy, etc.). Despite consider-
able progress in its diagnosis and treatment, MDD remains under-
diagnosed and underestimated and remains a challenge for
healthcare institutions, especially since one of the main risks of
mood disorders (BD or MDD) is suicidal behavior.
1.2 Bipolar Disorder

Bipolar disorder (BD) is defined as a chronic mood disorder
characterized by episodes of depression and episodes of abnormal
excitation (mania, hypomania), separated by periods of “euthymia”
(without any symptoms of major mood episode) [3]. This mood
disorder affects around 1% of the world’s adult population [4],
regardless of continent, socioeconomic status, or ethnicity. The
course of BD is lifelong, but is heterogeneous in terms of number
of episodes, relapses, polarity (i.e., higher number of manic or
depressive episodes), and response to treatment. The impact of
the disease on cognitive function and quality of life can be major
[4]. Diagnosis, treatment, health, and social care are major goals in
the management of BD.
Manic episodes are defined by a period lasting at least 1 week,
during which patients exhibit elevated mood and increased motor
activity. The intensity of these symptoms defines the manic or
hypomanic nature of the episode. During a manic episode, patients
may experience psychotic symptoms such as hallucinations, delu-
sions, disorganized thinking, and sleep disturbances. The delusions
may be consistent with the manic mood, with individuals displaying
Machine Learning for Psychiatric Disorders 1011
1.3 Schizophrenia

The annual incidence of schizophrenia is 0.2–0.4 per 1000, with a
lifetime prevalence of about 0.8% [5], which can slightly vary
between countries and cultural groups [6]. These differences are
reduced when stricter diagnostic criteria are used for schizophrenia,
such as the ones of the DSM-5. Research conducted by the WHO
has further confirmed this observation by showing that
schizophrenic disorder prevalence is similar across a wide range of
cultures and countries, including developed and developing
countries [6]. Its sex ratio is around 1:1.
Schizophrenia is characterized by three main types of symp-
toms, namely, positive symptoms, negative symptoms, and cogni-
tive impairment [7]. Positive symptoms involve a loss of contact
with reality; the patient has false beliefs (delusions) and perceptual
experiences not shared with others (hallucinations) and may exhibit
behavioral oddities. People with schizophrenia can experience dif-
ferent kinds of hallucinations: auditory, visual, olfactory, gustatory,
or tactile. Regarding delusions, patients with schizophrenia may have
persecutory delusions, control delusions (e.g., belief in telepathy),
grandiose delusions (e.g., belief in being a god), and somatic
delusions (e.g., belief that one’s body is rotting from the inside)
[8]. Negative symptoms are characterized by a deficit state during
which basic emotional and behavioral processes are diminished or
absent. The most common negative symptoms are blunted affect,
anhedonia, avolition, apathy, and alogia (i.e., reduction in the
amount or content of speech). Negative symptoms are more fre-
quent and less fluctuating over time than positive symptoms
[9]. They are also strongly associated with poor psychosocial func-
tioning [10]. Cognitive impairments in schizophrenia include
deficits in attention and concentration, psychomotor speed,
learning and memory, and executive function. A decline in cogni-
tive abilities from premorbid functioning is present in most of the
patients, with cognitive functioning after the onset of the illness
being relatively stable over time [10]. Despite this decline, cogni-
tive functioning in some patients can remain in the normal range. As
for the negative symptoms, cognitive impairment is strongly asso-
ciated with poor psychosocial functioning, particularly with regard
to social and professional lives.
The etiology of schizophrenia is complex and multifactorial.
Genetic and environmental factors seem to play a major role. The
risk of developing schizophrenia is higher in patients’ relatives than
in the general population [11, 12]. Adoption and twin studies have
shown that this increased risk is genetic, with the risk being
increased by the presence of an affected first-degree relative
[12]. There are two main approaches to the treatment of schizo-
phrenia: pharmacological and psychosocial treatments [13]. Anti-
psychotics constitute the main medication, with major effects on
reducing positive symptoms and preventing relapses. First-
generation antipsychotics include molecules such as chlorproma-
zine or haloperidol. Second-generation antipsychotics were devel-
oped to decrease the neurological and cognitive side effects. They
are the most used molecules nowadays (quetiapine, aripiprazole,
risperidone, clozapine, etc.). In contrast, their effects on negative
symptoms and cognitive impairment are much more moderate
[14]. Psychosocial interventions improve the management of
schizophrenia, e.g., through symptom management or relapse pre-
vention. Other specific interventions that can improve the outcome
of schizophrenia include family psychoeducation, supported
employment, social skills training, psychoeducation, cognitive
behavioral therapy, and integrated treatment of comorbid sub-
stance abuse [8].
The remainder of this chapter is organized as follows: We first
describe the challenges in psychiatry that can potentially be
addressed with machine learning. We then provide a
non-exhaustive state of the art of machine learning with magnetic
resonance imaging in psychiatry. We finally highlight the limitations
of current approaches and propose perspectives for the field. Stud-
ies reviewed in this chapter are summarized in Table 1.
Machine Learning for Psychiatric Disorders 1013
2.1 Improving the Diagnosis of Psychiatric Disorders

In the early stages of research on machine learning and psychiatric disorders, researchers wanted to explore whether different diagnoses could be predicted using machine learning algorithms
applied to neuroimaging features. They mainly applied machine
learning on structural MRI (sMRI) and functional MRI (fMRI)
data (during tasks or at rest) [53]. Recent efforts have been made to
apply machine learning on diffusion MRI [15], mostly in combina-
tion with other modalities [53, 54], and to explore whether adding
modalities improves the diagnosis. Classification using machine
learning in neuroimaging initially focused on major psychiatric
disorders, such as MDD [55], schizophrenia [56], and bipolar
disorder [54]. In a second phase, research broadened to a wider spectrum of psychiatric disorders, such as anxiety disorders [23],
anorexia [20], substance abuse [57], specific phobia [19], and
autism spectrum disorders [58]. Machine learning using EEG has
also been investigated for schizophrenia classification [59], as EEG is an affordable functional imaging method with a better temporal resolution than fMRI. While many machine learning studies in psychiatry have focused on neuroimaging data, other fields of research are increasingly interested in other modalities, such as proteomic, metabolomic [22], and genetic [24] data.
Machine learning also opens perspectives for identifying the features (i.e., the measured variables) that are most relevant for diagnosis. Using interpretable models such as support vector machines (SVM) or decision trees lets researchers investigate which features drive the classification.
1014 Ivan Brossollet et al.
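To make the idea of inspecting a linear model concrete, here is a toy sketch (entirely synthetic data; hypothetical region names; a least-squares linear classifier standing in for a linear SVM) that ranks features by the magnitude of their learned weights. In real neuroimaging data such weights must be interpreted with caution, since they reflect noise correlations as well as signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regional features; only the first two carry signal.
regions = ["hippocampus", "amygdala", "inf_frontal_gyrus", "occipital", "cerebellum"]
n = 200
y = np.repeat([1.0, -1.0], n // 2)           # +1 = patient, -1 = control
X = rng.normal(size=(n, len(regions)))
X[:, 0] += 0.8 * y                           # informative features
X[:, 1] += 0.5 * y

# Least-squares linear classifier (stand-in for a linear SVM).
Xb = np.hstack([X, np.ones((n, 1))])         # add an intercept column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
weights = w[:-1]

# Rank regions by absolute weight: a crude proxy for feature relevance.
ranking = [regions[i] for i in np.argsort(-np.abs(weights))]
print(ranking)
```

In this synthetic setting the two informative regions dominate the ranking; on real data, a permutation test or a proper weight-transformation step would be needed before any anatomical claim.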
2.2 Refining the Classification of Psychiatric Disorders

Since there is significant overlap in the clinical symptoms of different psychiatric disorders, many patients face a long diagnostic delay, often after a potentially harmful period of diagnostic uncertainty. For instance, patients with BD wait on
average 10 years before receiving an accurate diagnosis [62] and
are often misdiagnosed with unipolar depression for years. As for
MDD, it is often underdiagnosed even though fast and accurate
diagnosis could avoid long-term cognitive impairment in under-
treated patients [63]. For all these reasons, making the right diag-
nosis as early as possible is a major public health challenge.
Machine learning may be a useful tool to discriminate between
different diagnoses. Indeed, the value of machine learning lies not only in distinguishing a patient with a psychiatric disorder from a healthy subject, which is not the most difficult task for the clinician, but in assisting the clinician when the diagnosis is harder, e.g., distinguishing bipolar depression from unipolar depression [18] or identifying a patient at risk of psychosis [24].
As studies investigate new biomarkers to differentiate between
different conditions, our current classification of psychiatric disor-
ders appears to be limited. There are numerous different classifica-
tion criteria to describe psychopathology, and theoretical
frameworks are evolving rapidly [52]. For instance, Wolfers et al. [70] showed that deviations from a normative model of gray matter volume were frequent in both SZ and
BD but highly heterogeneous. However, these models also introduce an asymmetry, as they assume that controls are homogeneous, which is debatable in practice. Nevertheless, it appears that
subtyping leads to increased predictive accuracy in identifying indi-
viduals with mental illnesses compared with healthy controls, even
though results are mixed [71]. This approach could gain attention
with the development of new tools such as longitudinal normative
brain charts that cover the whole lifespan [72].
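The normative-model logic described above can be sketched in a few lines: fit a norm (here a simple linear age trend, a deliberate oversimplification of lifespan brain charts) on healthy controls only, then score each new subject by its deviation from that norm. All numbers below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic controls: gray matter volume declining linearly with age.
age_hc = rng.uniform(20, 70, size=300)
gm_hc = 800.0 - 2.0 * age_hc + rng.normal(0, 10, size=300)

# Fit the normative curve on controls only.
coeffs = np.polyfit(age_hc, gm_hc, deg=1)
resid = gm_hc - np.polyval(coeffs, age_hc)
sigma = resid.std()

def deviation_z(age, gm):
    """z-score of a subject's measure relative to the control norm."""
    return (gm - np.polyval(coeffs, age)) / sigma

# A patient whose volume is far below the age-expected norm (~700 at age 50).
z_patient = deviation_z(50.0, 650.0)
print(round(z_patient, 1))
```

Each patient thus gets an individual deviation score rather than a group-level label, which is what makes the approach attractive for heterogeneous disorders.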
3.1 Classification Versus Healthy Controls

Classification of patients with a psychiatric disorder vs. healthy controls is a widely studied area of research. Even though most studies fail to reach the 80% accuracy needed for clinical relevance, they yield promising results and give important methodological insights.
Regarding MDD, using sMRI, machine learning studies [55]
found accuracies ranging from 67.6% to 90.3%. These results
should be taken with great caution since they are usually obtained
from small samples. For example, Mwangi et al. [39] obtained an
accuracy of 90.3% using relevance vector machines and a sample of
60 subjects. They also showed that the brain regions identified during the feature selection process were consistent with those of
previous studies that reported gray matter reductions in patients
with MDD, which were mostly located in the frontal lobe, the
orbitofrontal and cingulate cortex, the middle frontal gyrus, and
the inferior and superior gyri [76]. As for fMRI studies, Gao et al.
[55] found accuracies ranging from 56% to 99%; Ramasubbu et al.
[36] found a significant accuracy of 66% for very severe depression
using resting-state fMRI in 19 control subjects vs. 45 patients with
different intensities of depression; and Fu et al. [21] obtained an
accuracy of 86% in a sample of 19 patients with MDD and 19 HC
who were processing sad faces during fMRI scanning.
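The call for caution with small samples can be quantified: an accuracy estimated on n test subjects carries a binomial uncertainty of roughly ±1.96·sqrt(acc·(1−acc)/n). A quick sketch (normal approximation; 38 subjects as in the Fu et al. sample, vs. a hypothetical test set of 3000):

```python
import math

def accuracy_ci(acc, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy estimated on n subjects."""
    half = z * math.sqrt(acc * (1.0 - acc) / n)
    return max(0.0, acc - half), min(1.0, acc + half)

small = accuracy_ci(0.86, 38)     # interval spans roughly 0.75-0.97
large = accuracy_ci(0.86, 3000)   # interval shrinks to a few points
print(small, large)
```

An 86% accuracy measured on 38 subjects is therefore statistically compatible with anything from mediocre to excellent performance, which is why the small-sample results above should be read as proofs of concept rather than clinical benchmarks.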
Regarding bipolar disorder (BD), a recent literature review
counted 25 studies using machine learning with different MRI
modalities to classify BD vs. HC [54]. They found a median accu-
racy of 66% for BD vs. HC classification. Even though most studies
used samples of less than 100 subjects, a study stood out by the
number of samples. Using 3040 subjects, sMRI, and a linear SVM,
Nunes et al. [43] obtained an accuracy of 65.23% using aggregate
subject-level analyses and an accuracy of 58.67% when testing on
left out sites. Their results, which highlighted the importance of
regions such as the hippocampus, the amygdala, and the inferior
frontal gyrus for the classification, were in good accordance with
previous MRI studies in BD [76–78]. Regarding fMRI, the review of Claude et al. [54] reported machine learning accuracies ranging from 37.5% to 83.5%. The
minimum accuracy was 37.5% for the classification of bipolar
depression vs. HC, during angry face processing using a Gaussian
process classifier (GPC) [37]. DWI has been much less investigated: in the review of Claude et al. [54], only two DWI studies were referenced. Achalia et al. [15] used DWI and machine learning on 60 subjects and obtained an accuracy of 74% for DWI alone.
3.2 Inter-Illness Classification and Clustering

One major challenge for machine learning studies using MRI is to correctly distinguish patients suffering from different disorders. Several studies have focused on the classification
between BD and SZ. In their review, Claude et al. [54] found
that three studies used sMRI in combination with machine learning
algorithms to discriminate between BD and SZ with an accuracy
ranging between 58% and 66%. Specifically, Schnack et al. [44] showed good classification performance on an independent dataset, with an average classification accuracy of 66%. Mothi et al. [45] applied unsupervised K-means clustering to clinical measures, eye movement tracking, cognitive assessments, and sMRI in patients with SZ, schizoaffective disorder, and BD, identifying three clusters: one with a majority of SZ, one with a majority of BD, and one distributed across the three disorders.
3.3 Treatment Response and Illness Prediction

Another perspective is the use of MRI and machine learning algorithms to predict treatment response. For instance, Liu et al. [47] tested the sensitivity to antidepressants in patients with MDD; the study included 17 treatment-resistant (TRD) patients, 17 treatment-sensitive (TSD) patients, and 17 healthy controls.
Table 1
Summary of reviewed studies

| Study | Data | Task | Sample | Method | Results |
| Fu et al. [21] | fMRI (sad face processing) | Classification of MDD vs. HC | 19 MDD, 19 HC | SVM | Accuracy 86% (sensitivity 84%, specificity 89%) |
| Costafreda et al. [41] | sMRI | Classification of MDD vs. HC and prognosis prediction of CBT and fluoxetine treatment | 37 MDD, 37 HC | SVM, leave-one-out cross-validation | Whole-brain structural neuroanatomy predicted 88.9% of the clinical response; accuracy as a diagnostic marker: 67.6% |
| Xiao et al. [42] | sMRI features | Classification of first-episode psychosis patients vs. HC | 163 drug-naive first-episode psychosis patients, 163 HC | SVM, 10-fold cross-validation | Surface area: accuracy 85.0% (specificity 87.0%, sensitivity 83.0%); cortical thickness: accuracy 81.8% (specificity 85.0%, sensitivity 76.9%) |
| Nunes et al. for the ENIGMA Bipolar Disorders Working Group [43] | sMRI features | Classification of BD vs. HC | 853 BD, 2167 HC | SVM, 30-fold cross-validation or leave-one-site-out | Aggregate subject analysis: accuracy 65.23%; leave-one-site-out cross-validation: accuracy 58.67% |
| Schnack et al. [44] | sMRI features (GMD) | Classification of BD vs. SZ vs. HC | 112 SZ, 113 BD, 109 HC | 3 SVM models (HC vs. BD, HC vs. SZ, BD vs. SZ) tested on a validation sample | On the validation sample: ROC AUC 66% for BD vs. SZ, 55.5% for BD vs. HC, 79.2% for SZ vs. HC |
| Mothi et al. [45] | Clinical measures, eye movement tracking, cognitive assessment, sMRI | Unsupervised clustering of BD, SZ, and schizoaffective patients | 251 SZ, 164 SZA, 195 BD | K-means on the 10 first components of a PCA run on clinical measures, eye movement, and cognitive assessment | 3 clusters: one with a majority of SZ, one with a majority of BD, one distributed across the 3 disorders |
| Vai et al. [46] | sMRI and dMRI features (FA) | Classification of BD vs. MDD | 74 BD, 74 MDD | MVPA, 10-fold cross-validation | Balanced accuracy 73.65% (sensitivity for BD 74.32%, specificity for MDD 72.97%) |
| Liu et al. [47] | sMRI (VBM of GM or WM) | Classification of treatment-resistant vs. treatment-sensitive depression | 17 TRD, 17 TSD, 17 HC | MVPA, leave-one-out cross-validation | GM: TRD vs. TSD 82.9%, TRD vs. HC 85.7%, TSD vs. HC 82.4%; WM: TRD vs. TSD 82.9%, TRD vs. HC 85.7%, TSD vs. HC 91.2% |
| Hajek et al. [48] | sMRI (GM and WM) | Classification of unaffected high-risk offspring vs. controls | 45 UHRO, 45 C | SVM, GPC, leave-2-out | SVM with WM … |
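The accuracy, sensitivity, and specificity figures collected in Table 1 all derive from the four cells of a confusion matrix. A minimal helper, with counts chosen (hypothetically) to roughly reproduce the figures reported by Fu et al. [21] on 19 patients and 19 controls:

```python
def metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, and balanced accuracy from counts."""
    sens = tp / (tp + fn)                   # true positive rate
    spec = tn / (tn + fp)                   # true negative rate
    acc = (tp + tn) / (tp + fn + tn + fp)
    bal = (sens + spec) / 2.0
    return acc, sens, spec, bal

# 16/19 patients and 17/19 controls correct gives approximately the figures
# reported by Fu et al. [21] (86% accuracy, 84% sensitivity, 89% specificity).
acc, sens, spec, bal = metrics(tp=16, fn=3, tn=17, fp=2)
print(acc, sens, spec, bal)
```

Balanced accuracy (as reported by Vai et al. [46]) is preferable to raw accuracy when, unlike here, the two groups are of unequal size.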
collective efforts for data sharing are undertaken. This issue deepens when looking at more specific subsets of patients. The field therefore needs more and larger datasets. Such datasets are starting to be collected, e.g., the UK Biobank dataset (~40,000 subjects). Even though it is not focused on psychiatric disorders, it is interesting because it is a multimodal dataset, with genetic, clinical, and MRI data, and some participants will develop psychiatric syndromes during follow-up. Recent efforts have specifically targeted psychiatric disorders, e.g., the ENIGMA Consortium, a multisite and multimodal project including several working groups focused on different diseases, such as bipolar disorder, schizophrenia, autism, and ADHD.
Larger datasets are often multisite, and this brings its own challenges. Since the MRI scanners used across studies differ in magnetic field strength, vendor, coils, etc., there are large site effects that need to be accounted for. These site effects are particularly important for DWI and fMRI, but they appear even for sMRI [82], the most robust imaging method. A second source of site effects lies in the preprocessing of the data, which may vary between sites and protocols. The preprocessing steps are of major importance and need to be homogenized, since different software packages can lead to different results [83]. The remaining "site effects" can be partially corrected, thanks
to different methods. Statistics-based methods include adjusted
residualizations or ComBat [84, 85], a method originally proposed
to remove batch effects in genomics [86] and then adapted for
DWI and then for sMRI [87]. Other methods are more specific to
MRI, such as RAVEL [88], which aims at capturing the sites’
variability using the signal from the CSF, with mixed results for
now. Since the efficacy of these corrections is still debated [89], we must account for site effects in our models and use validation schemes such as leave-one-site-out validation to evaluate the reproducibility of our approaches.
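A minimal sketch of leave-one-site-out validation on synthetic multisite data. Per-site mean-centering is used as a crude stand-in for ComBat-style harmonization (real ComBat additionally pools variance estimates with empirical Bayes); everything here, including the site offsets and the nearest-centroid classifier, is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic multisite data: 3 sites of 100 subjects with additive scanner offsets.
sites = np.repeat([0, 1, 2], 100)
y = np.tile(np.repeat([0, 1], 50), 3)              # 0 = HC, 1 = patient
X = rng.normal(size=(300, 4))
X[y == 1, 0] += 1.0                                 # subtle disease effect
X += np.array([0.0, 2.0, -2.0])[sites][:, None]     # large site effect

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    """Accuracy of a nearest-centroid classifier trained on (Xtr, ytr)."""
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    pred = np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)
    return (pred.astype(int) == yte).mean()

# Leave-one-site-out: remove each site's mean (crude harmonization),
# train on two sites, test on the held-out one.
Xc = X.copy()
for t in [0, 1, 2]:
    Xc[sites == t] -= Xc[sites == t].mean(0)

accs = []
for s in [0, 1, 2]:
    tr, te = sites != s, sites == s
    accs.append(nearest_centroid_acc(Xc[tr], y[tr], Xc[te], y[te]))
print([round(a, 2) for a in accs])
```

Reporting the per-site accuracies, rather than a single pooled figure, makes site-dependent failures visible, which is the point of this validation scheme.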
The site effect highlights a deeper and more fundamental limi-
tation of our studies, the signal-to-noise ratio. That issue, which is
faced by all imaging studies, is particularly present in neuroimaging
for psychiatric diseases as the changes that we are looking for are
subtle and probably not the main sources of variation in our datasets (e.g., one major source of variance is age, which produces substantial variations in gray and white matter density [72]). We therefore need to be vigilant and make specific efforts when interpreting the results of machine learning algorithms, as they can learn information that is irrelevant to psychiatric disorders.
Nevertheless, it is possible to improve this signal-to-noise ratio.
One way to do so is to improve the signal; the second is to diminish
the noise. Larger datasets improve the statistical power of the algorithms but may introduce noise (such as multisite noise).
Acknowledgments
References
1. Kupfer DJ, Frank E, Phillips ML (2012) Major depressive disorder: new clinical, neurobiological, and treatment perspectives. Lancet 379(9820):1045–1055. https://doi.org/10.1016/S0140-6736(11)60602-8
2. Kessler RC, Angermeyer M, Anthony JC et al (2007) Lifetime prevalence and age-of-onset distributions of mental disorders in the World Health Organization's world mental health survey initiative. World Psychiatry 6(3):168–176
3. Diagnostic and statistical manual of mental disorders: DSM-5™, 5th edn (2013) American Psychiatric Publishing, Inc., p xliv, 947. https://doi.org/10.1176/appi.books.9780890425596
4. Grande I, Berk M, Birmaher B, Vieta E (2016) Bipolar disorder. Lancet 387(10027):1561–1572. https://doi.org/10.1016/S0140-6736(15)00241-X
5. Jablensky A (1997) The 100-year epidemiology of schizophrenia. Schizophr Res 28(2–3):111–125. https://doi.org/10.1016/S0920-9964(97)85354-6
6. Institute of Medicine (US) Committee on Nervous System Disorders in Developing Countries (2001) Neurological, psychiatric, and developmental disorders: meeting the challenge in the developing world. National Academies Press (US). Accessed September 10, 2021. http://www.ncbi.nlm.nih.gov/books/NBK223475/
7. Mueser KT, McGurk SR (2004) Schizophrenia. Lancet 363(9426):2063–2072. https://doi.org/10.1016/S0140-6736(04)16458-1
8. Fenton WS (1991) Natural history of schizophrenia subtypes: II. Positive and negative symptoms and long-term course. Arch Gen Psychiatry 48(11):978. https://doi.org/10.1001/archpsyc.1991.01810350018003
9. Sayers SL, Curran PJ, Mueser KT (1996) Factor structure and construct validity of the scale for the assessment of negative symptoms. Psychol Assess 8(3):269–280. https://doi.org/10.1037/1040-3590.8.3.269
10. Green MF, Kern RS, Braff DL, Mintz J (2000) Neurocognitive deficits and functional outcome in schizophrenia: are we measuring the "right stuff"? Schizophr Bull 26(1):119–136. https://doi.org/10.1093/oxfordjournals.schbul.a033430
11. McGuffin P, Owen MJ, Farmer AE (1995) Genetic basis of schizophrenia. Lancet 346(8976):678–682. https://doi.org/10.1016/S0140-6736(95)92285-7
12. Cardno AG, Marshall EJ, Coid B et al (1999) Heritability estimates for psychotic disorders: the Maudsley twin psychosis series. Arch Gen Psychiatry 56(2):162. https://doi.org/10.1001/archpsyc.56.2.162
13. Davis JM, Chen N, Glick ID (2003) A meta-analysis of the efficacy of second-generation antipsychotics. Arch Gen Psychiatry 60(6):553. https://doi.org/10.1001/archpsyc.60.6.553
14. Insel TR, Cuthbert BN (2015) Brain disorders? Precisely. Science 348(6234):499–500. https://doi.org/10.1126/science.aab2358
15. Achalia R, Sinha A, Jacob A et al (2020) A proof of concept machine learning analysis using multimodal neuroimaging and neurocognitive measures as predictive biomarker in bipolar disorder. Asian J Psychiatr 50:101984. https://doi.org/10.1016/j.ajp.2020.101984
16. Lin E, Lin CH, Lai YL, Huang CH, Huang YJ, Lane HY (2018) Combination of G72 genetic variation and G72 protein level to detect schizophrenia: machine learning approaches. Front Psych 9:566. https://doi.org/10.3389/fpsyt.2018.00566
17. Yamada Y, Matsumoto M, Iijima K, Sumiyoshi T (2020) Specificity and continuity of schizophrenia and bipolar disorder: relation to biomarkers. Curr Pharm Des 26(2):191–200. https://doi.org/10.2174/1381612825666191216153508
18. Grotegerd D, Suslow T, Bauer J et al (2013) Discriminating unipolar and bipolar depression by means of fMRI and pattern classification: a pilot study. Eur Arch Psychiatry Clin Neurosci 263(2):119–131. https://doi.org/10.1007/s00406-012-0329-4
19. Visser RM, Haver P, Zwitser RJ, Scholte HS, Kindt M (2016) First steps in using multi-voxel pattern analysis to disentangle neural processes underlying generalization of spider fear. Front Hum Neurosci 10. https://doi.org/10.3389/fnhum.2016.00222
20. Lavagnino L, Amianto F, Mwangi B et al (2015) Identifying neuroanatomical signatures of anorexia nervosa: a multivariate machine learning approach. Psychol Med 45(13):2805–2812. https://doi.org/10.1017/S0033291715000768
21. Fu CHY, Mourao-Miranda J, Costafreda SG et al (2008) Pattern classification of sad facial processing: toward the development of neurobiological markers in depression. Biol Psychiatry 63(7):656–662. https://doi.org/10.1016/j.biopsych.2007.08.020
60. Bzdok D, Ioannidis JPA (2019) Exploration, inference, and prediction in neuroscience and biomedicine. Trends Neurosci 42(4):251–262. https://doi.org/10.1016/j.tins.2019.02.001
61. Haufe S, Meinecke F, Görgen K et al (2014) On the interpretation of weight vectors of linear models in multivariate neuroimaging. NeuroImage 87:96–110. https://doi.org/10.1016/j.neuroimage.2013.10.067
62. Bowden CL (2001) Strategies to reduce misdiagnosis of bipolar depression. Psychiatr Serv 52(1):51–55. https://doi.org/10.1176/appi.ps.52.1.51
63. Galimberti C, Bosi MF, Volontè M, Giordano F, Dell'Osso B, Viganò CA (2020) Duration of untreated illness and depression severity are associated with cognitive impairment in mood disorders. Int J Psychiatry Clin Pract 24(3):227–235. https://doi.org/10.1080/13651501.2020.1757116
64. Arnedo J, Svrakic DM, del Val C et al (2015) Uncovering the hidden risk architecture of the schizophrenias: confirmation in three independent genome-wide association studies. Am J Psychiatry 172(2):139–153. https://doi.org/10.1176/appi.ajp.2014.14040435
65. Schnack HG (2019) Improving individual predictions: machine learning approaches for detecting and attacking heterogeneity in schizophrenia (and other psychiatric diseases). Schizophr Res 214:34–42. https://doi.org/10.1016/j.schres.2017.10.023
66. Varol E, Sotiras A, Davatzikos C (2017) HYDRA: revealing heterogeneity of imaging and genetic patterns through a multiple max-margin discriminative analysis framework. NeuroImage 145:346–364. https://doi.org/10.1016/j.neuroimage.2016.02.041
67. Schulz MA, Chapman-Rounds M, Verma M, Bzdok D, Georgatzis K (2020) Inferring disease subtypes from clusters in explanation space. Sci Rep 10(1):12900. https://doi.org/10.1038/s41598-020-68858-7
68. Yang T, Frangou S, Lam RW et al (2021) Probing the clinical and brain structural boundaries of bipolar and major depressive disorder. Transl Psychiatry 11:48. https://doi.org/10.1038/s41398-020-01169-7
69. Louiset R, Gori P, Dufumier B, Houenou J, Grigis A, Duchesnay E (2021) UCSL: a machine learning expectation-maximization framework for unsupervised clustering driven by supervised learning. arXiv:2107.01988 [cs, stat]. Accessed March 1, 2022. http://arxiv.org/abs/2107.01988
70. Wolfers T, Doan NT, Kaufmann T et al (2018) Mapping the heterogeneous phenotype of schizophrenia and bipolar disorder using normative models. JAMA Psychiatry 75(11):1146. https://doi.org/10.1001/jamapsychiatry.2018.2467
71. Marquand AF, Rezek I, Buitelaar J, Beckmann CF (2016) Understanding heterogeneity in clinical cohorts using normative models: beyond case-control studies. Biol Psychiatry 80(7):552–561. https://doi.org/10.1016/j.biopsych.2015.12.023
72. Bethlehem RAI, Seidlitz J, White SR et al (2022) Brain charts for the human lifespan. Nature 604(7906):525–533. https://doi.org/10.1038/s41586-022-04554-y
73. Kessler RC, van Loo HM, Wardenaar KJ et al (2016) Testing a machine-learning algorithm to predict the persistence and severity of major depressive disorder from baseline self-reports. Mol Psychiatry 21(10):1366–1371. https://doi.org/10.1038/mp.2015.198
74. Suvisaari J, Mantere O, Keinänen J et al (2018) Is it possible to predict the future in first-episode psychosis? Front Psych 9:580. https://doi.org/10.3389/fpsyt.2018.00580
75. Sun H, Jiang R, Qi S et al (2019) Preliminary prediction of individual response to electroconvulsive therapy using whole-brain functional magnetic resonance imaging data. Neuroimage Clin 26:102080. https://doi.org/10.1016/j.nicl.2019.102080
76. Hajek T, Cullis J, Novak T et al (2013) Brain structural signature of familial predisposition for bipolar disorder: replicable evidence for involvement of the right inferior frontal gyrus. Biol Psychiatry 73(2):144–152. https://doi.org/10.1016/j.biopsych.2012.06.015
77. Hajek T, Kopecek M, Kozeny J, Gunde E, Alda M, Höschl C (2009) Amygdala volumes in mood disorders – meta-analysis of magnetic resonance volumetry studies. J Affect Disord 115(3):395–410. https://doi.org/10.1016/j.jad.2008.10.007
78. Ganzola R, Duchesne S (2017) Voxel-based morphometry meta-analysis of gray and white matter finds significant areas of differences in bipolar patients from healthy controls. Bipolar Disord 19(2):74–83. https://doi.org/10.1111/bdi.12488
79. Pinaya WHL, Gadelha A, Doyle OM et al (2016) Using deep belief network modelling to characterize differences in brain morphometry in schizophrenia. Sci Rep 6(1):38897. https://doi.org/10.1038/srep38897
80. Hibar DP, Westlye LT, Doan NT et al (2018) Cortical abnormalities in bipolar disorder: an MRI analysis of 6503 individuals from the ENIGMA bipolar disorder working group.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
Disclosure Statement of the Editor
Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9,
© The Editor(s) (if applicable) and The Author(s) 2023
Diffusion tensor imaging (DTI) ........................ 267, 468, Evoked activity .............................................287–288, 299
573, 829, 861, 862, 887, 903, 904, 909
MACHINE LEARNING FOR BRAIN DISORDERS
1042 Index
F

FCD, see Focal cortical dysplasia (FCD)
Feature extraction .......... 172, 299–302, 366, 377, 403, 446–447, 639, 647, 682, 698, 707, 722, 754, 860, 928, 991
Features .......... 5, 26, 77, 118, 164, 199, 236, 254, 285, 319, 332, 358, 392, 435, 464, 493, 512, 534, 575, 614, 639, 655, 707, 754, 808, 848, 880, 899, 927, 963, 978, 1011
Feature selection .......... 32, 34, 299–302, 366, 618, 619, 657, 721, 722, 858, 860, 866, 936, 988, 1018, 1021, 1025
FLAIR, see Fluid attenuated inversion recovery (FLAIR)
Fluid attenuated inversion recovery (FLAIR) .......... 263, 265, 412, 426, 427, 469, 471, 555, 755, 758, 760, 761, 769, 818, 820, 825, 882–886, 890, 902, 903, 906, 907, 909, 911, 922, 926, 929, 930, 941, 943, 944, 946, 969, 970, 972
fMRI, see Functional magnetic resonance imaging (fMRI)
Focal cortical dysplasia (FCD) .......... 879, 884–888
Focal loss .......... 399
Fronto-temporal lobar degeneration (FTLD) .......... 463, 468, 469
Functional magnetic resonance imaging (fMRI) .......... 267, 268, 358, 502, 573, 574, 576, 581, 680, 681, 755, 759, 769, 821, 853, 856, 857, 993, 994, 1013, 1017–1025, 1028, 1030, 1031

G

GAN, see Generative adversarial network (GAN)
Gated recurrent unit (GRU) .......... 118, 124–125, 932, 944
Gaussian mixture model (GMM) .......... 26, 60, 63–66, 147–149, 424, 497, 824, 825, 857, 859, 942, 944, 946, 948, 988
Gaussian process .......... 512, 520, 521, 523, 689, 826, 990, 1016, 1028
GBM, see Glioblastoma (GBM)
GDL, see Generalized Dice loss (GDL)
Generalization .......... 78, 92–96, 400, 495, 496, 550, 574, 602, 615, 617, 619, 620, 622–625, 825, 984, 1000, 1021
Generalized Dice loss (GDL) .......... 398–400
Generative adversarial network (GAN) .......... vii, 112, 139–188, 209, 212, 404, 424–425, 448, 496, 497, 534, 550, 659, 909, 928
Generative models .......... vii, 112, 134, 139–188, 212, 376, 496, 524, 577, 579, 591, 659
Genetic FTD initiative (GENFI) .......... 822, 828
GENFI, see Genetic FTD initiative (GENFI)
Genome-wide association study (GWAS) .......... 326, 755, 764, 783, 789, 813, 989, 994
Genomics .......... 315–320, 322, 324–327, 474, 504, 593, 756–758, 760, 762–764, 766, 768, 770, 780–783, 785–789, 887, 889, 989, 997, 998, 1030
Glioblastoma (GBM) .......... 247, 467, 471, 476, 504, 964–969, 971, 972
Glioma .......... 176, 275, 463, 464, 467, 469–471, 474, 552, 553, 555–557, 963–972
Global positioning system (GPS) .......... 367, 371, 372, 378, 793
GMM, see Gaussian mixture model (GMM)
GPS, see Global positioning system (GPS)
GPU, see Graphical processing unit (GPU)
Grad-CAM .......... 551, 561, 660, 666, 667, 670, 679–681, 685, 686, 694–696, 698
Gradient descent .......... 19, 20, 35, 78, 79, 89–91, 93, 96, 132, 153, 588, 700
Graph-based .......... 259, 260, 265, 267, 274, 276
Graphical processing unit (GPU) .......... 10, 78, 91, 180, 183, 408, 638, 640, 790
Gray matter (GM) .......... 264, 265, 405, 468, 502, 554, 684, 688, 690, 769, 817–821, 824, 860, 884, 889, 912–914, 979, 1016, 1018, 1027–1029
GRU, see Gated recurrent unit (GRU)
GWAS, see Genome-wide association study (GWAS)
Gyroscope .......... 361–366, 795, 858, 864, 865, 867, 872

H

HAR, see Human activity recognition (HAR)
HGG, see High-grade glioma (HGG)
Hierarchical clustering .......... 492–494, 500, 502, 503, 993
High-grade glioma (HGG) .......... 467, 469, 553, 972
High-throughput .......... 315, 317–319, 321, 491, 785
Hippocampal sclerosis (HS) .......... 809, 880–883
Hippocampus .......... 264, 425, 473, 631, 686–688, 808, 811, 824–826, 866, 880, 881, 996, 1018
Histology .......... 474, 504, 536, 540, 553, 555, 562, 563, 681, 888, 964
Histopathology .......... 109, 175, 176, 474, 540, 541, 543, 549, 551, 558, 560–563, 884, 889
HS, see Hippocampal sclerosis (HS)
Human activity recognition (HAR) .......... 356, 360–367, 375

I

IA, see Intracranial aneurysm (IA)
ID, see Intellectual disabilities (ID)
Idiopathic rapid eye movement sleep behavior disorder (iRBD) .......... 852, 857, 859
Inertial systems .......... 357, 360–362
Intellectual disabilities (ID) .......... 341, 786, 788, 977, 979, 981
Intensity rescaling .......... 263–264
Interpretability .......... viii, 38, 333, 477, 479, 480, 559, 574, 575, 592, 655–701, 744, 988, 992, 999–1000
Intracranial aneurysm (IA) .......... 922, 923, 935, 937–940
Inverse problem .......... 298–299
iRBD, see Idiopathic rapid eye movement sleep behavior disorder (iRBD)
Ischemic stroke lesion segmentation (ISLES) .......... 417, 930, 931, 933

K

KDE, see Kernel density estimate (KDE)
Kernel .......... 9, 26, 39, 41, 42, 44–47, 70–73, 100–106, 108, 393, 406, 409, 472, 512, 582–583, 591, 661, 662, 864, 1020, 1026
Kernel density estimate (KDE) .......... 405–406, 512, 515, 527
k-means .......... 13, 26, 60–63, 65, 66, 69, 492–494, 500–502, 524, 551, 555, 855, 906, 931, 988, 1020, 1023, 1027
k-nearest neighbors (kNN) .......... 28, 336, 721, 851, 852, 854, 855, 857, 859, 860, 864, 866, 942
kNN, see k-nearest neighbors (kNN)
Kullback-Leibler divergence (KL/KLD) .......... 111, 154, 156–160, 162, 163, 166, 416, 423, 589

L

Lacunes .......... 809, 810, 819, 825, 940, 941, 946–948
Large vessel occlusion (LVO) .......... 464, 737, 923–929
LASSO, see Least absolute shrinkage and selection operator (LASSO)
LATE, see Limbic-predominant age-related TDP-43 encephalopathy (LATE)
Latent variable .......... 86, 89, 134, 146, 154, 186, 208, 575, 577–591, 691, 868
Layer-wise relevance propagation (LRP) .......... 660, 668, 669, 679–681, 685, 686, 693–695, 697, 698
LDA, see Linear discriminant analysis (LDA)
Least absolute shrinkage and selection operator (LASSO) .......... 657, 904, 1021, 1024
Lesion .......... viii, 11, 235, 265, 374, 391, 462, 543, 576, 686, 711, 769, 809, 880, 899, 922, 969
LGG, see Low-grade glioma (LGG)
Limbic-predominant age-related TDP-43 encephalopathy (LATE) .......... 809
Linear discriminant analysis (LDA) .......... 26, 55, 69–71, 852–854, 856, 857, 883, 1023
Linear regression .......... 9, 14, 19, 20, 25, 26, 30–32, 35, 38, 40, 42, 523, 578, 579, 583, 585, 657, 829, 865, 944, 1021
Logistic regression (LR) .......... 4, 13, 25, 26, 32–35, 37, 42, 48–51, 86, 258, 471, 472, 555, 607, 621, 852–858, 860, 862, 863, 866, 873, 904, 929, 986, 993, 997, 998, 1021–1025
Logopenic progressive aphasia (LPA) .......... 809
Longitudinal .......... vii, 136, 262, 336, 373, 392, 414, 440, 473, 503, 512–514, 519–525, 619, 692, 757–768, 770, 771, 777, 786, 791, 828, 829, 831, 868–870, 881, 906, 908, 941, 946, 1016
Long short-term memory (LSTM) .......... 118, 122–125, 133–136, 187, 194, 202, 676, 852, 857, 859, 873, 933
Loss .......... 16, 27, 79, 119, 139, 205, 265, 318, 333, 367, 397, 445, 465, 497, 511, 581, 639, 664, 718, 808, 848, 880, 899, 923, 965, 997, 1010
Low-grade glioma (LGG) .......... 467, 469, 474, 553, 972
LPA, see Logopenic progressive aphasia (LPA)
LR, see Logistic regression (LR)
LRP, see Layer-wise relevance propagation (LRP)
LSTM, see Long short-term memory (LSTM)
LVO, see Large vessel occlusion (LVO)

M

Machine learning (ML) .......... 3, 25, 77, 348, 355, 437, 460, 491, 511, 536, 575, 601, 631, 655, 705, 755, 789, 822, 847, 880, 905, 949, 963, 979, 1009
Magnetic resonance imaging (MRI) .......... vii, 19, 144–149, 176, 254, 256–270, 274, 277, 298, 396, 404, 405, 412, 413, 425, 426, 447, 464, 468, 472, 474, 477, 479, 555–558, 561, 590, 621, 632, 645, 660, 681, 686, 700, 707, 726, 754, 756–769, 809, 880, 922, 924, 1028
Magnetoencephalography (MEG) .......... viii, 285–307, 460, 755, 757–762, 766, 768, 774, 775
Major depressive disorder (MDD) .......... 500, 502, 758, 760, 763, 771, 853, 857, 1010, 1013, 1014, 1016, 1018, 1020–1028
Major depressive episode (MDE) .......... 1010
Malignant .......... 248, 464, 723, 745, 767, 787, 791, 964
Markov chain Monte Carlo (MCMC) .......... 150, 515
MDD, see Major depressive disorder (MDD)
MDE, see Major depressive episode (MDE)
Mean squared error (MSE) .......... 48, 58, 88, 107, 154, 416, 422, 441, 449, 700, 718, 719
MEDA, see Minimal evidence of disease activity (MEDA)
Medical segmentation decathlon (MSD) .......... 425
MEG, see Magnetoencephalography (MEG)
Meningioma .......... 247, 471, 729, 965
Messenger RNA (mRNA) .......... 315, 321, 324, 325
Metabolomics .......... vii, 315, 317, 318, 320, 323, 324, 327, 758, 761–763, 998, 1013
MI, see Mutual information (MI)
Microbleed .......... ix, 818, 921, 935, 940, 945, 947
Minimal evidence of disease activity (MEDA) .......... 907
ML, see Machine learning (ML)
MLP, see Multi-layer perceptron (MLP)
MND, see Motor neuron disease (MND)
Mobile devices .......... vii, 355–383
Morphometry .......... 453, 454, 468, 552, 860, 889, 999, 1019, 1025, 1028
Motor .......... 235, 237, 238, 249, 250, 264, 276, 288, 304, 305, 371, 373, 382, 701, 759, 774–776, 793, 810–812, 848–851, 862–865, 867–869, 871, 872, 889, 891, 899, 965, 977
Motor neuron disease (MND) .......... 371, 768, 776, 811
MRI, see Magnetic resonance imaging (MRI)
mRNA, see Messenger RNA (mRNA)
MS, see Multiple sclerosis (MS)
MSA, see Multiple system atrophy (MSA)
MSD, see Medical segmentation decathlon (MSD)
MSE, see Mean squared error (MSE)
Multi-layer perceptron (MLP) .......... 81–83, 102, 104, 126, 199–204, 217–219, 222, 398, 852, 854, 857
Multimodal .......... viii, 196, 197, 212–216, 225, 256, 372, 392, 413, 425, 444, 468, 475, 550, 562, 573–594, 769, 770, 823, 830, 882, 889, 890, 969, 986, 995, 1030, 1031
Multiple sclerosis (MS) .......... ix, 247, 249, 263, 265, 274, 324, 342, 343, 370, 392, 410, 411, 414, 417, 425, 427, 428, 462, 465, 466, 473, 474, 476, 536, 680, 686, 701, 726, 768, 769, 773, 792, 899–915
Multiple system atrophy (MSA) .......... ix, 848, 850, 857
Mutual information (MI) .......... 443–444

N

National Institutes of Health (NIH) .......... 332, 771, 777, 783, 787, 927
Natural language processing (NLP) .......... vii, 15, 136, 193, 199, 202, 203, 225, 236, 337, 338, 346, 360
NAWM, see Normal appearing white matter (NAWM)
NCC, see Normalized cross correlation (NCC)
NDDs, see Neurodevelopmental disorders (NDDs)
Neural network (NN) .......... 8–11, 78, 82–95, 99, 102, 106, 109, 110, 118, 130, 132, 142, 154, 155, 169, 177, 186, 270, 377, 412, 421, 423, 446, 468, 551, 558, 588, 590, 591, 593, 594, 620, 659, 668, 699, 722, 824, 873, 883, 890, 928, 931, 938, 971, 986, 991
Neurodevelopmental disorders (NDDs) .......... 499, 701, 977–1000
Neuroimaging .......... vii, 240, 253–277, 371, 376, 377, 380, 391, 392, 403, 412, 460, 462, 463, 478, 491–505, 521, 575–577, 581, 592, 593, 648, 655–701, 754, 769–774, 780, 822, 879–891, 940, 946, 981, 988, 989, 991–993, 998, 1013, 1016, 1018, 1019, 1030
Neurology .......... vii, 241, 246, 249, 250, 275, 307, 512, 525, 691, 761, 763, 1017
Neuromyelitis optica spectrum disorder (NMOSD) .......... 904, 905
Neuropsychology .......... 240, 241
NIH, see National Institutes of Health (NIH)
NLP, see Natural language processing (NLP)
NMOSD, see Neuromyelitis optica spectrum disorder (NMOSD)
NN, see Neural network (NN)
Normal appearing white matter (NAWM) .......... 405, 914
Normalized cross correlation (NCC) .......... 442, 447

O

Obsessive-compulsive disorder (OCD) .......... 500, 758, 771, 978, 994, 1009, 1010
OCD, see Obsessive-compulsive disorder (OCD)
Oligodendroglioma .......... 247, 555, 965
Optimization .......... 4, 9, 17, 31, 35, 38–42, 78, 88–98, 126, 132, 146, 159, 160, 163, 172, 323, 358, 419, 438, 441, 442, 444, 445, 452, 467, 480, 498, 499, 548, 553, 578, 580, 581, 583–585, 588–593, 620, 621, 663, 672, 711, 719, 722, 906, 939
Oscillatory activity .......... 287–289, 299, 302, 303
Overfitting .......... 17, 25, 32, 34–37, 40, 58, 78, 92–96, 445, 556, 642, 714, 723, 944, 984–986, 988, 990, 993, 998

P

Parkinson .......... ix, 144, 235, 238, 246, 247, 249, 276, 370, 372, 374, 391, 478, 514, 516, 534–536, 577, 645, 681, 689, 701, 772, 781, 791, 794, 795, 812, 847–874
Parkinsonian variant of multiple system atrophy (MSA-P) .......... 848, 853, 857, 861, 862
Parkinson’s disease (PD) .......... ix, 144, 238, 246, 247, 249, 276, 370–374, 391, 478, 514, 516, 534–536, 577, 645, 681, 689, 701, 772, 781, 791, 794, 795, 812, 847–874
Parkinson’s progression markers initiative (PPMI) .......... 276, 478, 680, 681, 772, 852, 856, 860, 869
Partial least squares (PLS) .......... 472, 575, 579–585, 588
Partial volume .......... 936, 944
Patch .......... 185, 188, 203, 204, 206–208, 210, 215, 217, 220–222, 408–410, 446, 447, 449, 549–553, 555, 556, 558–561, 577, 670, 672, 688, 690, 691, 698, 824, 826, 882, 885, 886, 928, 930, 938, 944
PCA, see Posterior cortical atrophy (PCA); Principal component analysis (PCA)
PD, see Parkinson’s disease (PD)
Performance assessment .......... viii, 171, 705–747
Performance metrics .......... viii, 171, 478, 479, 602–613, 625, 712, 718, 730, 732, 733, 736, 738, 741, 745, 746
Perivascular spaces (PVS) .......... 819, 825, 940, 941, 946–948
PET, see Positron emission tomography (PET)
p-hacking .......... 633, 649
Pick’s disease (PiD) .......... 557
PiD, see Pick’s disease (PiD)
PLS, see Partial least squares (PLS)
PML, see Progressive multifocal leukoencephalopathy (PML)
Positioning systems .......... 357, 360–362, 367
Positron emission tomography (PET) .......... vii, 165, 253, 254, 256, 257, 259, 271–277, 392, 463, 468, 472, 473, 558, 573, 574, 581, 619, 700, 707, 726, 755, 757, 758, 772–774, 778, 811, 815–817, 819, 822, 826–828, 860, 909, 910, 934, 935
Posterior cortical atrophy (PCA) .......... 809, 828
PPA, see Primary progressive aphasia (PPA)
PPMI, see Parkinson’s progression markers initiative (PPMI)
PPMS, see Primary progressive multiple sclerosis (PPMS)
Prediction .......... 9, 27, 79, 118, 151, 206, 265, 401, 446, 459, 521, 551, 577, 601, 657, 717, 775, 823, 849, 880, 905, 923, 963, 980, 1014
Preregistration .......... 633, 642, 1000
Primary progressive aphasia (PPA) .......... 246, 811
Primary progressive multiple sclerosis (PPMS) .......... 343, 900, 912
Principal component analysis (PCA) .......... 26, 66–67, 300, 366, 551, 555, 582
Progressive multifocal leukoencephalopathy (PML) .......... 901–903
Progressive supranuclear palsy (PSP) .......... ix, 286, 287, 557, 811, 848, 850
Proteomics .......... vii, 315–318, 320, 323–327, 757, 758
PSP, see Progressive supranuclear palsy (PSP)
Psychiatry .......... vii, 376, 758, 777, 781, 985, 989, 1012–1017, 1031

Q

QTAB, see Queensland twin adolescent brain (QTAB)
QTIM, see Queensland twin imaging (QTIM)
Queensland twin adolescent brain (QTAB) .......... 760, 772, 776, 793, 794
Queensland twin imaging (QTIM) .......... 761, 772, 776, 777

R

Random forest .......... 13, 26, 58–59, 460, 471–473, 722, 826, 858, 860, 862, 865, 866, 869, 873, 929, 930, 932, 940, 945, 947, 1019, 1022, 1025
Rapid eye movement (REM) .......... 811, 859, 870
Reader study .......... 707, 726, 740–744, 747
Rectified linear unit (ReLU) .......... 85–87, 91, 100, 101, 107, 108, 392, 393, 395, 665–668, 695, 698
Recurrent neural networks (RNNs) .......... vii, 78, 117–137, 149, 194, 202, 460, 551, 691, 692, 826, 856, 857, 859, 870
Registration .......... viii, 263–265, 403, 406, 427, 435–454, 558, 642, 679, 708, 777, 794, 824, 907, 928, 1000
Regression .......... 9, 25, 88, 151, 258, 336, 446, 460, 513, 578, 602, 657, 851, 904, 929, 983, 1019
Regulatory .......... viii, 250, 376, 378, 381, 517, 549, 614, 637, 649, 705–747, 808, 997
Reinforcement learning .......... 12–14, 551, 872
Relapsing remitting multiple sclerosis (RRMS) .......... 343, 900, 901, 905, 912, 913
Reliability .......... 243, 246, 301, 305, 479, 635, 659, 661, 677, 687, 696, 777, 865
ReLU, see Rectified linear unit (ReLU)
REM, see Rapid eye movement (REM)
Repeatability .......... 534, 632, 635
Replicability .......... 632, 635
Representativeness .......... 350, 715–717
Reproducibility .......... viii, 319, 382, 475, 479, 501, 549, 631–649, 690, 719, 720, 728, 747, 753, 786, 827, 830, 968, 969, 989, 993, 999, 1030
Residual neural network (ResNet) .......... 109, 140, 215, 448, 553, 557, 561, 682, 935, 938
Ribonucleic acid (RNA) .......... 313–316, 318, 319, 323–326, 563, 766, 787, 788, 821, 968
Ridge regression .......... 26, 39, 41, 72
Rigidity .......... 237, 238, 848
Rigid registration .......... 264, 439, 440
RMSE, see Root mean square error (RMSE)
RNA, see Ribonucleic acid (RNA)
RNNs, see Recurrent neural networks (RNNs)
Root mean square error (RMSE) .......... 612, 613
S

Sagittal .......... 256, 257, 428, 886, 903, 931
Saliency .......... 659, 665, 696
Schizophrenia .......... ix, 246, 268, 371, 491, 495, 501, 534, 535, 726, 763, 771, 979, 994, 996, 1009, 1011–1017, 1019, 1028, 1030, 1031
Secondary progressive multiple sclerosis (SPMS) .......... 343, 900
Segmentation .......... 13, 88, 140, 207, 263, 363, 391, 435, 470, 534, 602, 645, 682, 707, 824, 881, 906, 923, 966
Self-attention .......... 128, 196–199, 201–203, 208, 209, 218, 219
Semi-supervised learning (SSL) .......... 415–417, 419
Sensors .......... vii, 289–292, 296–298, 300, 355–383, 460, 757–760, 762, 764, 768, 774, 775, 793–795, 851, 858, 861, 864, 865, 867, 872, 874
Similarity function .......... 437, 447
Single-cell .......... 316, 323–324, 326, 757, 787, 821, 968
Single-nucleotide polymorphism (SNP) .......... 324, 326, 574, 592–594, 761, 780, 784, 979, 997, 1014
Single-photon emission computed tomography (SPECT) .......... vii, 253, 254, 256, 259, 275–277, 680, 681, 689, 701, 707, 811
Skull stripping .......... 264, 403, 404, 427
Smartphone .......... 349, 356, 360, 361, 366, 367, 369–371, 373–375, 460, 756, 758, 760, 762, 764, 766, 768, 793–795, 832, 851, 858, 1031
SNP, see Single-nucleotide polymorphism (SNP)
Source reconstruction .......... 295, 297
Sparsity .......... 39, 111, 498, 593
Spatial normalization .......... 274, 276, 406, 453, 829
SPECT, see Single-photon emission computed tomography (SPECT)
SPMS, see Secondary progressive multiple sclerosis (SPMS)
SSD, see Sum of squared difference (SSD)
SSL, see Semi-supervised learning (SSL)
Standardized uptake value ratio (SUVR) .......... 274, 817, 862
Statistical testing .......... 619–622, 624
STR, see Swedish twin registry (STR)
Stratification .......... 492, 500, 514, 552, 557, 559, 576, 613–617, 684, 685, 717, 886–888, 913–915, 983, 987, 988, 1025
Stroke .......... 143, 237, 270, 303, 333, 392, 459, 536, 576, 737, 792, 810, 879, 921
StyleGAN .......... 178–180
Substance use disorder (SUD) .......... 535, 758, 763, 1009
Subtyping .......... 267, 277, 464, 491–495, 497, 499–501, 503–505, 512, 513, 517, 524, 526, 527, 993, 1016
SUD, see Substance use disorder (SUD)
Sum of squared difference (SSD) .......... 441, 442
Supervised learning .......... 12–16, 20, 25, 26, 60, 110, 376, 406, 415, 420, 985
Support vector machine (SVM) .......... 9, 13, 25, 26, 42–48, 460, 468, 471, 472, 474, 496, 498, 553, 556, 621, 721, 852–861, 863–866, 882, 883, 905, 913, 919, 929, 933, 946, 969, 970, 986, 1013, 1028
SUVR, see Standardized uptake value ratio (SUVR)
SVM, see Support vector machine (SVM)
Swedish twin registry (STR) .......... 761, 776
Synucleinopathy .......... 848

T

Tau .......... 273, 463, 472, 473, 542, 553, 554, 557, 558, 700, 808, 810, 812, 815–817, 848, 850
Temporal lobe epilepsy (TLE) .......... 880–883, 887–891
Testing set/test set .......... 92, 93, 400, 612, 614–622, 624, 625, 633, 641, 692, 696, 714, 715, 718, 859, 864, 867, 935, 936, 939, 949, 985
TLE, see Temporal lobe epilepsy (TLE)
Training set .......... 12–14, 16, 17, 36, 79, 87–89, 91, 92, 111, 399, 400, 426, 591, 614, 615, 618, 619, 633, 687, 714, 715, 718, 832, 883, 943, 987
Transcriptomics .......... vii, 315–319, 322, 324, 325, 327, 504, 549, 558, 563, 758, 762, 763, 998
Transformation model .......... 437–441, 446, 449, 452
Transformer .......... vii, 112, 128–129, 140, 184, 185, 187, 188, 193–225, 396–398, 411, 449–451
Tremor .......... 238, 239, 246, 247, 276, 373, 701, 847, 848, 854, 858, 865, 868, 872, 873
Tumor .......... 11, 104, 176, 248, 254, 326, 394, 459, 463, 500, 533, 603, 644, 701, 707, 773, 815, 879, 963
Tversky loss .......... 400

U

UKB, see UK biobank (UKB)
UK biobank (UKB) .......... 754, 757, 761, 769, 771, 778–781, 784, 793, 794, 822, 1030
Underfitting .......... 35, 36, 39, 94
U-Net .......... 111, 151, 170, 171, 173–176, 392–398, 410–412, 427, 447, 449, 451, 551, 553, 554, 556, 558, 560, 561, 824, 825, 905, 927, 928, 930–935, 938, 943–945
United States Food and Drug Administration (FDA) .......... 534, 705–708
Unsupervised learning .......... 12–14, 26, 60, 110, 415, 445, 512, 534, 551, 987–989
V

VaD, see Vascular dementia (VaD)
VAE, see Variational autoencoder (VAE)
Validation .......... 92, 171, 250, 305, 333, 377, 400, 468, 501, 552, 601, 633, 713, 824, 859, 882, 914, 927, 984, 1030
Validation set .......... 92, 93, 400, 469, 617–619, 622, 623, 641, 714
Variational autoencoder (VAE) .......... 13, 134, 152–160, 184–186, 412, 423–424, 452, 588, 589, 659
Vascular cognitive impairment (VCI) .......... 463, 810
Vascular dementia (VaD) .......... ix, 809–810, 812, 821, 940
VCI, see Vascular cognitive impairment (VCI)
Vertex-based .......... 258, 259, 265, 268, 274, 276, 1019
VETSA, see Vietnam era twin study of aging (VETSA)
Vietnam era twin study of aging (VETSA) .......... 761, 772, 776, 777
V-Net .......... 392, 393, 936
Voxel-based .......... 258, 259, 264, 265, 267, 268, 270, 274, 276, 453, 468, 860, 884, 948, 1019

W

Wasserstein .......... 166–170, 172
Wearables .......... 291, 356, 357, 367, 370, 373, 374, 378–380, 383, 463, 757, 770, 793, 821, 832, 851, 865
WGS, see Whole genome sequence (WGS)
White matter hyperintensity (WMH) .......... 941–943
White matter (WM) .......... viii, 264, 265, 267, 270, 273, 343, 405, 541, 576, 682, 684, 700, 701, 769, 777, 809, 817–820, 824, 825, 860, 869, 884, 909, 913, 914, 929, 940–943, 996, 1015, 1017, 1019, 1025, 1028–1030
WHO, see World Health Organization (WHO)
Whole genome sequence (WGS) .......... 758, 760, 779, 780, 783–785, 787, 788
Whole slide image (WSI) .......... 533–535, 537, 541, 542, 544–559, 561, 563
WM, see White matter (WM)
WMH, see White matter hyperintensity (WMH)
World Health Organization (WHO) .......... 326, 341, 342, 464, 967
WSI, see Whole slide image (WSI)